[ https://issues.apache.org/jira/browse/BEAM-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismaël Mejía reassigned BEAM-3022:
----------------------------------

    Assignee:     (was: Amit Sela)

> Enable the ability to grow partition count in the underlying Spark RDD
> ----------------------------------------------------------------------
>
>                 Key: BEAM-3022
>                 URL: https://issues.apache.org/jira/browse/BEAM-3022
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Tim Robertson
>            Priority: Major
>
> When using a {{HadoopInputFormatIO}}, the number of splits appears to be controlled by the underlying {{InputFormat}}, which in turn determines the number of partitions and therefore the degree of parallelism when running on Spark. It is possible to {{Reshuffle}} the data to compensate for data skew, but there _appears_ to be no way to grow the number of partitions. {{GroupCombineFunctions.reshuffle}} seems to be the only place that calls the Spark {{repartition}}, and it uses the number of partitions from the original RDD.
> Scenarios that would benefit from this:
> # Increasing parallelism for computationally heavy stages
> # ETLs where the input partitions are dictated by the source, while you wish to optimise the partitions for fast loading into the target sink
> # Zip files (my case), which are read in a single-threaded manner with a custom HadoopInputFormat and therefore get a single task for all stages
> (It would be nice if a user could supply a partitioner too, to help dictate data locality.)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
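The behaviour described above can be sketched as a simplified model (this is illustrative Python, not Beam's actual API; the {{target_partitions}} parameter is the hypothetical hook the issue asks for): today the reshuffle effectively reuses the parent RDD's partition count, so a single-partition source such as a zip file stays at one task, whereas a caller-supplied target would let it grow.

```python
def reshuffle_partitions(parent_partitions, target_partitions=None):
    """Model of the partition count used after a reshuffle.

    Current behaviour corresponds to target_partitions=None: the
    repartition reuses the parent RDD's count, so there is no way
    to grow it. The proposed improvement would let callers pass a
    larger target (a hypothetical parameter, not an existing API).
    """
    if target_partitions is None:
        return parent_partitions  # current behaviour: count is inherited
    return target_partitions      # proposed: caller can grow the count

# A single-threaded zip-file read yields one partition and keeps it:
assert reshuffle_partitions(1) == 1
# With the proposed hook, parallelism could be grown for heavy stages:
assert reshuffle_partitions(1, target_partitions=64) == 64
```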