I have been trying to figure out the potential efficiency of sliding windows. 
Looking at the TrafficRoutes example
https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples/blob/master/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java
it seems that the GatherStats class explicitly sorts its data (in 
event-time order) within every window for every key, by calling 
Collections.sort(infoList). 
Is this necessary? If the data for each key arrives in event-time order and 
that order is maintained as the data flows through the pipeline, then the data 
within each window should already be sorted. For large sliding windows with a 
small slide period, each element falls into many overlapping windows, so 
re-sorting every window is going to be very inefficient. Or is it the case in 
Beam/Dataflow that even if the underlying data stream is ordered, there are no 
guarantees about the ordering of the data after a window transform or 
GroupByKey has been applied? 
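
To make the question concrete, here is a minimal sketch in plain Java (no 
Beam/Dataflow classes) of the per-key, per-window step I have in mind. The 
names WindowSortSketch, Reading and inEventTimeOrder are hypothetical and are 
not taken from TrafficRoutes. The sketch assumes that the grouped values may 
arrive in any order, and it sorts only when a single linear pass finds them 
out of order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class WindowSortSketch {
    /** Hypothetical stand-in for one record; only the event timestamp matters here. */
    static final class Reading implements Comparable<Reading> {
        final long timestampMillis;
        final double speed;

        Reading(long timestampMillis, double speed) {
            this.timestampMillis = timestampMillis;
            this.speed = speed;
        }

        @Override
        public int compareTo(Reading other) {
            return Long.compare(timestampMillis, other.timestampMillis);
        }
    }

    /** Returns true if the readings are in non-decreasing event-time order (one O(n) pass). */
    static boolean isSorted(List<Reading> readings) {
        for (int i = 1; i < readings.size(); i++) {
            if (readings.get(i - 1).compareTo(readings.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    /** Copies the values grouped for one key and window, sorting only if they are out of order. */
    static List<Reading> inEventTimeOrder(Iterable<Reading> grouped) {
        List<Reading> list = new ArrayList<>();
        for (Reading r : grouped) {
            list.add(r);
        }
        if (!isSorted(list)) {
            Collections.sort(list);
        }
        return list;
    }

    public static void main(String[] args) {
        // Values for one key in one window, deliberately out of event-time order.
        List<Reading> grouped = Arrays.asList(
                new Reading(3L, 50.0), new Reading(1L, 40.0), new Reading(2L, 45.0));
        for (Reading r : inEventTimeOrder(grouped)) {
            System.out.println(r.timestampMillis + " " + r.speed);
        }
    }
}
```

If ordering were guaranteed, the check would always succeed and the sort could 
be dropped entirely; if not, the check at least costs only one pass over 
windows that happen to arrive in order.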
Thanks,
Bill.
