Thanks Holden, this would work for Spark, but Flink doesn't have such a mechanism, so I am looking for a general solution on the Beam side.
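[For reference, the AQE knobs Holden mentions below are ordinary Spark SQL configs. A minimal, illustrative sketch of enabling them on a SparkSession follows; the class and app name are hypothetical, and when submitting through spark-submit the same keys can instead be passed with --conf.]

import org.apache.spark.sql.SparkSession;

public class AqeConfigSketch {
  public static void main(String[] args) {
    // Enable Adaptive Query Execution so Spark coalesces shuffle
    // partitions at runtime based on the observed data size.
    SparkSession spark = SparkSession.builder()
        .appName("aqe-config-sketch")  // hypothetical app name
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        .getOrCreate();

    // ... run the job with this session/configuration ...

    spark.stop();
  }
}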
On Mon, Apr 17, 2023 at 10:08 AM Holden Karau <hol...@pigscanfly.ca> wrote:

> To a (small) degree Spark's “new” AQE might be able to help, depending on
> what kind of operations Beam is compiling it down to.
>
> Have you tried setting spark.sql.adaptive.enabled &
> spark.sql.adaptive.coalescePartitions.enabled?
>
> On Mon, Apr 17, 2023 at 10:34 AM Reuven Lax via user <user@beam.apache.org>
> wrote:
>
>> I see. Robert - what is the story for parallelism controls on GBK with
>> the Spark or Flink runners?
>>
>> On Sun, Apr 16, 2023 at 6:24 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> No, I don't use Dataflow, I use Spark & Flink.
>>>
>>> On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax <re...@google.com> wrote:
>>>
>>>> Are you running on the Dataflow runner? If so, Dataflow - unlike Spark
>>>> and Flink - dynamically modifies the parallelism as the operator runs, so
>>>> there is no need to have such controls. In fact, these specific controls
>>>> wouldn't make much sense for the way Dataflow implements these operators.
>>>>
>>>> On Sun, Apr 16, 2023 at 12:25 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> Just for performance tuning, like in Spark and Flink.
>>>>>
>>>>> On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <
>>>>> user@beam.apache.org> wrote:
>>>>>
>>>>>> What are you trying to achieve by setting the parallelism?
>>>>>>
>>>>>> On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Reuven, what I mean is setting the parallelism at the operator
>>>>>>> level, and the input size of the operator is unknown at compile time
>>>>>>> if it is not a source operator.
>>>>>>>
>>>>>>> Here's an example from Flink:
>>>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
>>>>>>> Spark also supports setting operator-level parallelism (see groupByKey
>>>>>>> and reduceByKey):
>>>>>>> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>>>>>>>
>>>>>>> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <
>>>>>>> user@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> The maximum parallelism is always determined by the parallelism of
>>>>>>>> your data. If you do a GroupByKey, for example, the number of keys in
>>>>>>>> your data determines the maximum parallelism.
>>>>>>>>
>>>>>>>> Beyond the limitations in your data, it depends on your execution
>>>>>>>> engine. If you're using Dataflow, Dataflow is designed to automatically
>>>>>>>> determine the parallelism (e.g. work will be dynamically split and moved
>>>>>>>> around between workers, the number of workers will autoscale, etc.), so
>>>>>>>> there's no need to explicitly set the parallelism of the execution.
>>>>>>>>
>>>>>>>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Besides the global parallelism of a Beam job, is there any way to
>>>>>>>>> set parallelism for individual operators like group-by and join? I
>>>>>>>>> understand that the parallelism setting depends on the underlying
>>>>>>>>> execution engine, but it is very common to set parallelism for
>>>>>>>>> group-by and join in both Spark & Flink.
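[For context, a minimal sketch of the operator-level control referenced above, using the stock Flink DataStream API; the element values, key/sum fields, and parallelism numbers are only illustrative. Spark's RDD API exposes the analogous knob as an explicit partition count, e.g. groupByKey(numPartitions) or reduceByKey(func, numPartitions).]

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorParallelismSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(4);  // pipeline-wide default parallelism

    DataStream<Tuple2<String, Integer>> counts = env
        .fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3))
        .keyBy(value -> value.f0)  // group by the String key
        .sum(1)
        .setParallelism(8);  // per-operator override, as in the linked Flink docs

    counts.print();
    env.execute("operator-parallelism-sketch");
  }
}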
-- 
Best Regards

Jeff Zhang