jaketf commented on a change in pull request #11596:
URL: https://github.com/apache/beam/pull/11596#discussion_r421807059
##########
File path:
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java
##########
@@ -472,24 +548,120 @@ public void initClient() throws IOException {
this.client = new HttpHealthcareApiClient();
}
+ @GetInitialRestriction
+ public OrderedTimeRange getEarliestToLatestRestriction(@Element String hl7v2Store)
+ throws IOException {
+ from = this.client.getEarliestHL7v2SendTime(hl7v2Store, this.filter);
+ // filters are [from, to) to match logic of OffsetRangeTracker but need latest element to be
+ // included in results set to add an extra ms to the upper bound.
+ to = this.client.getLatestHL7v2SendTime(hl7v2Store, this.filter).plus(1);
+ return new OrderedTimeRange(from, to);
+ }
+
+ @NewTracker
+ public OrderedTimeRangeTracker newTracker(@Restriction OrderedTimeRange timeRange) {
+ return timeRange.newTracker();
+ }
+
+ @SplitRestriction
+ public void split(
+ @Restriction OrderedTimeRange timeRange, OutputReceiver<OrderedTimeRange> out) {
+ // TODO(jaketf) How to pick optimal values for desiredNumOffsetsPerSplit?
Review comment:
Unfortunately, in this use case dynamic splitting would be crucial
because we can't know the distribution of data in the restriction dimension
(sendTime).
For example, a hospital might be much busier during the daytime and on
weekdays than at night and on weekends (though never dormant, due to ICU and
emergency services). "Daytime" changes based on hospital location, weekdays
are subject to holidays, etc.
The data distribution in sendTime may also show significant spikes if one of
the upstream systems populating sendTime has to backfill after a maintenance
period and, instead of responsibly setting this field to event time, sets all
of the sendTimes to a short range of backfill processing time (this is
suboptimal behavior on the part of that system, but sometimes a reality).
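
To make the skew concern concrete, here is a minimal sketch in plain Java with no Beam dependency. Everything in it is a hypothetical illustration rather than code from this PR: the class name `StaticSplitSkew`, the helper methods, and the synthetic traffic (1 message per hour at night, 10 per hour during the day, plus a 500-message backfill burst stamped within a few seconds). It only shows how equal-width static splits of a [from, to) sendTime range can end up with very uneven message counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of HL7v2IO or the Beam SDK.
final class StaticSplitSkew {
  static final long HOUR_MS = 3_600_000L;

  // Boundaries of numSplits equal-width sub-ranges of [fromMs, toMs).
  static long[] equalWidthBoundaries(long fromMs, long toMs, int numSplits) {
    long[] boundaries = new long[numSplits + 1];
    long width = (toMs - fromMs) / numSplits;
    for (int i = 0; i < numSplits; i++) {
      boundaries[i] = fromMs + i * width;
    }
    boundaries[numSplits] = toMs; // last range absorbs any rounding remainder
    return boundaries;
  }

  // Counts how many sendTimes fall into each [boundaries[i], boundaries[i + 1]).
  static int[] countPerSplit(long[] sendTimesMs, long[] boundaries) {
    int[] counts = new int[boundaries.length - 1];
    for (long t : sendTimesMs) {
      for (int i = 0; i < counts.length; i++) {
        if (t >= boundaries[i] && t < boundaries[i + 1]) {
          counts[i]++;
          break;
        }
      }
    }
    return counts;
  }

  // One synthetic day: assumed quiet nights, busier days, and a backfill burst.
  static long[] syntheticSendTimes() {
    List<Long> times = new ArrayList<>();
    for (int hour = 0; hour < 24; hour++) {
      int perHour = (hour >= 8 && hour < 20) ? 10 : 1;
      for (int k = 0; k < perHour; k++) {
        times.add(hour * HOUR_MS + k * (HOUR_MS / perHour));
      }
    }
    // Backfill: 500 messages whose sendTime was set to processing time at ~03:00.
    for (int k = 0; k < 500; k++) {
      times.add(3 * HOUR_MS + k * 100L);
    }
    return times.stream().mapToLong(Long::longValue).toArray();
  }

  public static void main(String[] args) {
    long[] boundaries = equalWidthBoundaries(0L, 24 * HOUR_MS, 4);
    int[] counts = countPerSplit(syntheticSendTimes(), boundaries);
    System.out.println("Messages per 6-hour split: " + Arrays.toString(counts));
  }
}
```

With these made-up numbers, the first 6-hour split holds 506 of the 632 messages, so any fixed split size chosen up front leaves one worker with most of the work.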