Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-15 Thread Robert Bradshaw via user
What are you trying to achieve by setting the parallelism?

On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang  wrote:

> Thanks Reuven, what I mean is to set the parallelism at the operator
> level. And the input size of the operator is unknown at compile time if
> it is not a source operator.
>
> Here's an example from Flink:
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
> Spark also supports setting operator-level parallelism (see groupByKey and
> reduceByKey):
> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>
>
> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user wrote:
>
>> The maximum parallelism is always determined by the parallelism of your
>> data. If you do a GroupByKey for example, the number of keys in your data
>> determines the maximum parallelism.
>>
>> Beyond the limitations in your data, it depends on your execution engine.
>> If you're using Dataflow, Dataflow is designed to automatically determine
>> the parallelism (e.g. work will be dynamically split and moved around
>> between workers, the number of workers will autoscale, etc.), so there's no
>> need to explicitly set the parallelism of the execution.
>>
>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang  wrote:
>>
>>> Besides the global parallelism of a Beam job, is there any way to set
>>> the parallelism for individual operators like group by and join? I
>>> understand that the parallelism setting depends on the underlying
>>> execution engine, but it is very common to set the parallelism of group
>>> by and join in both Spark & Flink.
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-15 Thread Jeff Zhang
Thanks Reuven, what I mean is to set the parallelism at the operator level.
And the input size of the operator is unknown at compile time if it is not
a source operator.

Here's an example from Flink:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
Spark also supports setting operator-level parallelism (see groupByKey and
reduceByKey):
https://spark.apache.org/docs/latest/rdd-programming-guide.html
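For concreteness, here is roughly what the Flink operator-level setting
looks like (a minimal Java DataStream sketch; the pipeline and names are
illustrative, not taken from the linked docs):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorParallelismSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(4); // default parallelism for the whole job

    env.fromElements("a", "b", "a", "c")
        .map(word -> Tuple2.of(word, 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(t -> t.f0)
        .sum(1)
        .setParallelism(8) // operator-level override for this aggregation only
        .print();

    env.execute("operator-parallelism-sketch");
  }
}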


On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user wrote:

> The maximum parallelism is always determined by the parallelism of your
> data. If you do a GroupByKey for example, the number of keys in your data
> determines the maximum parallelism.
>
> Beyond the limitations in your data, it depends on your execution engine.
> If you're using Dataflow, Dataflow is designed to automatically determine
> the parallelism (e.g. work will be dynamically split and moved around
> between workers, the number of workers will autoscale, etc.), so there's no
> need to explicitly set the parallelism of the execution.
>
> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang  wrote:
>
>> Besides the global parallelism of a Beam job, is there any way to set
>> the parallelism for individual operators like group by and join? I
>> understand that the parallelism setting depends on the underlying
>> execution engine, but it is very common to set the parallelism of group
>> by and join in both Spark & Flink.
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>

-- 
Best Regards

Jeff Zhang


Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-15 Thread Reuven Lax via user
The maximum parallelism is always determined by the parallelism of your
data. If you do a GroupByKey for example, the number of keys in your data
determines the maximum parallelism.

Beyond the limitations in your data, it depends on your execution engine.
If you're using Dataflow, Dataflow is designed to automatically determine
the parallelism (e.g. work will be dynamically split and moved around
between workers, the number of workers will autoscale, etc.), so there's no
need to explicitly set the parallelism of the execution.
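To make the first point concrete, here is a minimal Beam Java sketch (the
data and names are illustrative; it assumes a runner such as the direct
runner is on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class GbkParallelismSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    PCollection<KV<String, Iterable<Integer>>> grouped =
        p.apply(Create.of(KV.of("a", 1), KV.of("a", 2), KV.of("b", 3)))
         .apply(GroupByKey.create());

    // Only two distinct keys ("a" and "b") exist, so there are at most two
    // parallel units of work immediately after the GroupByKey, regardless
    // of how many workers the runner has. How much of that available
    // parallelism is actually used is the runner's decision.
    p.run().waitUntilFinish();
  }
}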

On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang  wrote:

> Besides the global parallelism of a Beam job, is there any way to set
> the parallelism for individual operators like group by and join? I
> understand that the parallelism setting depends on the underlying
> execution engine, but it is very common to set the parallelism of group
> by and join in both Spark & Flink.
>
> --
> Best Regards
>
> Jeff Zhang
>


Is there any way to set the parallelism of operators like group by, join?

2023-04-15 Thread Jeff Zhang
Besides the global parallelism of a Beam job, is there any way to set the
parallelism for individual operators like group by and join? I understand
that the parallelism setting depends on the underlying execution engine,
but it is very common to set the parallelism of group by and join in both
Spark & Flink.
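For reference, the Spark version of this is the optional numPartitions
argument on shuffle operations such as reduceByKey; a minimal Java sketch
(the data and the local master are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ReduceByKeyParallelismSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("parallelism-sketch").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaPairRDD<String, Integer> pairs =
          sc.parallelize(Arrays.asList("a", "b", "a"))
            .mapToPair(word -> new Tuple2<>(word, 1));
      // The second argument sets the number of partitions (and thus the
      // parallelism) for this reduce only, independent of the default.
      JavaPairRDD<String, Integer> counts = pairs.reduceByKey(Integer::sum, 8);
      System.out.println(counts.collect());
    }
  }
}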

-- 
Best Regards

Jeff Zhang