What are you trying to achieve by setting the parallelism?

On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <[email protected]> wrote:
> Thanks Reuven, what I mean is setting the parallelism at the operator
> level. And the input size of an operator is unknown at compile time if
> it is not a source operator.
>
> Here's an example from Flink:
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
>
> Spark also supports setting operator-level parallelism (see groupByKey
> and reduceByKey):
> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>
> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <[email protected]> wrote:
>
>> The maximum parallelism is always determined by the parallelism of your
>> data. If you do a GroupByKey, for example, the number of keys in your
>> data determines the maximum parallelism.
>>
>> Beyond the limitations in your data, it depends on your execution
>> engine. If you're using Dataflow, Dataflow is designed to determine the
>> parallelism automatically (e.g. work will be dynamically split and moved
>> around between workers, and the number of workers will autoscale), so
>> there's no need to explicitly set the parallelism of the execution.
>>
>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <[email protected]> wrote:
>>
>>> Besides the global parallelism of a Beam job, is there any way to set
>>> the parallelism of individual operators like group-by and join? I
>>> understand that the parallelism setting depends on the underlying
>>> execution engine, but it is very common to set the parallelism of
>>> operators like group-by and join in both Spark and Flink.
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>
> --
> Best Regards
>
> Jeff Zhang
>
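For concreteness, the Flink operator-level knob Jeff links to looks roughly like this in Java. This is a minimal sketch; the elements and the parallelism value are illustrative, not from the thread:

    // Flink DataStream API: setParallelism() applies to one operator only.
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkOperatorParallelism {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "a")
            // This map operator runs with 4 parallel subtasks, independent
            // of the environment-wide default parallelism.
            .map(s -> s.toUpperCase()).setParallelism(4)
            .print();
        env.execute("operator-level parallelism");
      }
    }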
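The Spark equivalent Jeff mentions is the numPartitions argument on shuffle operations such as reduceByKey, which controls the parallelism of that one operation. Again a minimal sketch with made-up data:

    // Spark Java API: per-operation parallelism via numPartitions.
    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkOperatorParallelism {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "parallelism-demo");
        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
            Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));
        // The second argument is numPartitions: this reduce shuffles into
        // 8 partitions regardless of the input's partitioning.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b, 8);
        System.out.println(counts.collect());
        sc.stop();
      }
    }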
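By contrast, and per Reuven's point, the Beam API itself has no setParallelism: after a GroupByKey the number of distinct keys is the upper bound on parallelism, and the runner decides everything else. A minimal Beam Java sketch of that situation (pipeline contents are illustrative):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class BeamGroupByKeyParallelism {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        // Only two distinct keys ("a" and "b"), so downstream of this
        // GroupByKey at most two groups can be processed in parallel,
        // no matter how many workers the runner provides.
        PCollection<KV<String, Iterable<Integer>>> grouped =
            p.apply(Create.of(KV.of("a", 1), KV.of("b", 2), KV.of("a", 3)))
             .apply(GroupByKey.create());
        p.run().waitUntilFinish();
      }
    }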
