GroupByKey and SortValues Transformation

Juan Carlos Garcia Thu, 18 Apr 2019 23:05:21 -0700

Hi Folks,

I have question regarding *GroupBy* and *SortValue* (via SecondaryKey),


The pipeline looks like:

...source+initialTransformations
.apply(Window.Into(FixedWindows.of(Duration.standardMinutes(10))))
.apply("GroupByKey", GroupByKey.create())  *// using my primaryKey (userId)*
.apply("SortValues", SortValues.create(BufferedExternalSorter.options())) *//
My Secondary Key is the timestamp of the incoming event*
.apply("Enrichment", ParDo.of(new *BusinessEnrichment*())) *// i receives a
KV<userId, Iterable<KV<timestamp, String>>>*
..SinkToBigQuery:

1. Is it guarantee that my *BusinessEnrichment *will hold all the data
grouped, sorted in on single *machine* and that bundles of the same keys
will not be parallelize during auto scaling (i am running on top of
Dataflow) ?

I expect a parallel computation on different keys (users), but the bundles
within the grouped key (userId) to be treated sequentially inside
*BusinessEnrichment*, is that correct?

I am asking due to an observation of mixed results which could be due to a
bug on my code or parallel bundles computations within the same key, my
expectation is to have a sequential processing after *grouping + sorting*,
for example having a userId with 1000 sorted events i expect the processing
as the following sequence:

     userId(1) with bundleA (0..250) -> userId(1) with bundleB (251..500)
-> userId(1) with bundleC (501..750) -> userId(1) with bundleD (751..1000)

Please advise if my assumption is wrong.

Thanks and regards

-- 

JC

GroupByKey and SortValues Transformation

Reply via email to