At the moment, portability has a GroupByKey transform. Most data
processing frameworks, such as Hadoop MapReduce and Apache Spark, have a
concept of secondary sorting during the shuffle phase. The Dataflow worker
code has it under the name BatchViewOverrides.GroupByKeyAndSortValuesOnly
[1]; it is a PTransform<PCollection<KV<K1, KV<K2, V>>>, PCollection<KV<K1,
Iterable<KV<K2, V>>>>>. It shards by K1 and sorts by K2 within each
shard.
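
To make the semantics concrete, here is a minimal in-memory sketch in plain
Java. It is not Beam code and not how Dataflow implements it; the class
name GroupByKeyAndSortSketch and the helpers record and groupByKeyAndSort
are hypothetical names used only for illustration. Map.Entry stands in for
KV, and a List stands in for the Iterable of sorted values.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: models KV<K1, KV<K2, V>> with Map.Entry.
final class GroupByKeyAndSortSketch {

    // Builds one (k1, (k2, v)) input record.
    static <K1, K2, V> Map.Entry<K1, Map.Entry<K2, V>> record(K1 k1, K2 k2, V v) {
        Map.Entry<K2, V> inner = new SimpleEntry<>(k2, v);
        return new SimpleEntry<>(k1, inner);
    }

    // Groups records by the primary key K1, then sorts each group's
    // values by the secondary key K2.
    static <K1 extends Comparable<K1>, K2 extends Comparable<K2>, V>
            Map<K1, List<Map.Entry<K2, V>>> groupByKeyAndSort(
                    List<Map.Entry<K1, Map.Entry<K2, V>>> input) {
        // TreeMap is used only to get a deterministic iteration order here;
        // a real shuffle makes no promise about the order of K1 groups.
        Map<K1, List<Map.Entry<K2, V>>> grouped = new TreeMap<>();
        for (Map.Entry<K1, Map.Entry<K2, V>> rec : input) {
            grouped.computeIfAbsent(rec.getKey(), k -> new ArrayList<>())
                   .add(rec.getValue());
        }
        // Secondary sort: order the values within each group by K2.
        for (List<Map.Entry<K2, V>> values : grouped.values()) {
            values.sort(Map.Entry.comparingByKey());
        }
        return grouped;
    }
}
```

This sketch materializes and sorts every group in memory. A runner that
treats the transform as a primitive could instead push the K2 ordering into
the shuffle itself, which is the efficiency argument made here.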

I see a lot of value in adding GroupByKeyAndSort to the list of built-in
transforms so that runners can efficiently override it. It's possible to
define GroupByKeyAndSort as GroupByKey+SortValues [2]; however, having it
as a primitive would open up the possibility of a more efficient
implementation. What could be the potential drawbacks? I haven't thought
much about how it could work for non-batch pipelines.

Gleb

[1]:
https://github.com/spotify/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/BatchViewOverrides.java#L1246
[2]:
https://github.com/apache/beam/blob/master/sdks/java/extensions/sorter/src/main/java/org/apache/beam/sdk/extensions/sorter/SortValues.java
