[ https://issues.apache.org/jira/browse/BEAM-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismaël Mejía reassigned BEAM-3022:
----------------------------------

    Assignee:     (was: Amit Sela)

> Enable the ability to grow partition count in the underlying Spark RDD
> ----------------------------------------------------------------------
>
>                 Key: BEAM-3022
>                 URL: https://issues.apache.org/jira/browse/BEAM-3022
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-spark
>            Reporter: Tim Robertson
>            Priority: Major
>
> When using a {{HadoopInputFormatIO}}, the number of splits appears to be controlled by the underlying {{InputFormat}}, which in turn determines the number of partitions and therefore the degree of parallelism when running on Spark. It is possible to {{Reshuffle}} the data to compensate for data skew, but there _appears_ to be no way to grow the number of partitions. {{GroupCombineFunctions.reshuffle}} seems to be the only place that calls the Spark {{repartition}}, and it uses the number of partitions from the original RDD.
> Scenarios that would benefit from this:
> # Increasing parallelism for computationally heavy stages
> # ETLs where the input partitions are dictated by the source, while you wish to optimise the partitions for fast loading into the target sink
> # Zip files (my case), which are read in a single-threaded manner with a custom HadoopInputFormat and therefore get a single task for all stages
> (It would be nice if a user could supply a partitioner too, to help dictate data locality.)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
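The behaviour described above can be sketched as a simplified model (this is illustrative Python, not Beam's actual API; the {{target_partitions}} parameter is the hypothetical hook the issue asks for): today the reshuffle effectively reuses the parent RDD's partition count, so a single-partition source such as a zip file stays at one task, whereas a caller-supplied target would let it grow.

```python
def reshuffle_partitions(parent_partitions, target_partitions=None):
    """Model of the partition count used after a reshuffle.

    Current behaviour corresponds to target_partitions=None: the
    repartition reuses the parent RDD's count, so there is no way
    to grow it. The proposed improvement would let callers pass a
    larger target (a hypothetical parameter, not an existing API).
    """
    if target_partitions is None:
        return parent_partitions  # current behaviour: count is inherited
    return target_partitions      # proposed: caller can grow the count

# A single-threaded zip-file read yields one partition and keeps it:
assert reshuffle_partitions(1) == 1
# With the proposed hook, parallelism could be grown for heavy stages:
assert reshuffle_partitions(1, target_partitions=64) == 64
```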