David Moravek created BEAM-5392: ----------------------------------- Summary: GroupByKey: All values for a single key need to fit in-memory at once Key: BEAM-5392 URL: https://issues.apache.org/jira/browse/BEAM-5392 Project: Beam Issue Type: Bug Components: runner-spark Affects Versions: 2.6.0 Reporter: David Moravek Assignee: David Moravek
Currently, when using GroupByKey, all values for a single key need to fit in-memory at once. There are following issues, that need to be addressed: a) We can not use Spark's _groupByKey_, because it requires all values to fit in memory for a single key (it is implemented as "list combiner") b) _ReduceFnRunner_ iterates over values multiple times in order to group also by window Solution: In Dataflow Worker code, there are optimized versions of ReduceFnRunner, that can take advantage of having elements for a single key sorted by timestamp. We can use Spark's `{{repartitionAndSortWithinPartitions}}` in order to meet this constraint. For non-merging windows, we can put window itself into the key resulting in smaller groupings. This approach was already tested in ~100TB input scale on Spark 2.3.x. (custom Spark runner). I'll submit a patch once the Dataflow Worker code donation is complete. -- This message was sent by Atlassian JIRA (v7.6.3#76005)