I see. Robert - what is the story for parallelism controls on GBK with the Spark or Flink runners?
On Sun, Apr 16, 2023 at 6:24 PM Jeff Zhang <zjf...@gmail.com> wrote:

> No, I don't use Dataflow; I use Spark & Flink.
>
> On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax <re...@google.com> wrote:
>
>> Are you running on the Dataflow runner? If so, Dataflow - unlike Spark
>> and Flink - dynamically modifies the parallelism as the operator runs, so
>> there is no need for such controls. In fact, these specific controls
>> wouldn't make much sense for the way Dataflow implements these operators.
>>
>> On Sun, Apr 16, 2023 at 12:25 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> Just for performance tuning, as in Spark and Flink.
>>>
>>> On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <user@beam.apache.org> wrote:
>>>
>>>> What are you trying to achieve by setting the parallelism?
>>>>
>>>> On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> Thanks Reuven. What I mean is setting the parallelism at the operator
>>>>> level, and the input size of an operator is unknown at compile time
>>>>> unless it is a source operator.
>>>>>
>>>>> Here's an example from Flink:
>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
>>>>> Spark also supports setting operator-level parallelism (see groupByKey
>>>>> and reduceByKey):
>>>>> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>>>>>
>>>>> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <user@beam.apache.org> wrote:
>>>>>
>>>>>> The maximum parallelism is always determined by the parallelism of
>>>>>> your data. If you do a GroupByKey, for example, the number of keys in
>>>>>> your data determines the maximum parallelism.
>>>>>>
>>>>>> Beyond the limitations in your data, it depends on your execution
>>>>>> engine. If you're using Dataflow, Dataflow is designed to automatically
>>>>>> determine the parallelism (e.g. work will be dynamically split and moved
>>>>>> around between workers, the number of workers will autoscale, etc.), so
>>>>>> there's no need to explicitly set the parallelism of the execution.
>>>>>>
>>>>>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>
>>>>>>> Besides the global parallelism of a Beam job, is there any way to set
>>>>>>> the parallelism for individual operators like group-by and join? I
>>>>>>> understand the parallelism setting depends on the underlying execution
>>>>>>> engine, but it is very common to set parallelism for operators like
>>>>>>> group-by and join in both Spark & Flink.
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards
>>>>>>>
>>>>>>> Jeff Zhang