Just for performance tuning, the same way it is done in Spark and Flink.
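
To make the request concrete, here is a rough plain-Python sketch of what a per-operator partition count does for a group-by, and why the number of distinct keys still caps it, as Reuven noted. It is not Beam, Spark, or Flink code; `group_by_key` and `effective_parallelism` are made-up names, and integer keys are assumed so the hashing is deterministic.

```python
from collections import defaultdict


def group_by_key(pairs, num_partitions):
    """Hash-partition (key, value) pairs, then group values per key.

    num_partitions plays the role of an operator-level parallelism
    setting (hypothetical; it mirrors the idea, not any engine's API).
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        # Every value for a given key lands in the same partition.
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions


def effective_parallelism(partitions):
    """Count the partitions that actually received data."""
    return sum(1 for p in partitions if p)


# Three distinct integer keys (hash(n) == n for small ints in CPython).
pairs = [(0, "a"), (1, "b"), (2, "c"), (0, "d")]

wide = group_by_key(pairs, num_partitions=8)
narrow = group_by_key(pairs, num_partitions=2)

print(effective_parallelism(wide))    # busy partitions when 8 are requested
print(effective_parallelism(narrow))  # busy partitions when 2 are requested
```

Asking for 8 partitions cannot keep more partitions busy than there are distinct keys, while asking for 2 forces several keys to share a partition. That second case is the knob I am after for tuning.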

On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <
user@beam.apache.org> wrote:

> What are you trying to achieve by setting the parallelism?
>
> On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Thanks Reuven. What I mean is setting the parallelism at the operator
>> level. The input size of an operator is unknown at compile time unless it
>> is a source operator.
>>
>> Here's an example from Flink:
>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
>> Spark also supports setting operator-level parallelism (see groupByKey and
>> reduceByKey):
>> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>>
>>
>> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <user@beam.apache.org>
>> wrote:
>>
>>> The maximum parallelism is always determined by the parallelism of your
>>> data. If you do a GroupByKey for example, the number of keys in your data
>>> determines the maximum parallelism.
>>>
>>> Beyond the limitations in your data, it depends on your execution
>>> engine. If you're using Dataflow, Dataflow is designed to automatically
>>> determine the parallelism (e.g. work will be dynamically split and moved
>>> around between workers, the number of workers will autoscale, etc.), so
>>> there's no need to explicitly set the parallelism of the execution.
>>>
>>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> Besides the global parallelism of a Beam job, is there any way to set
>>>> the parallelism for individual operators such as group by and join? I
>>>> understand that the parallelism setting depends on the underlying
>>>> execution engine, but it is very common to set parallelism for operators
>>>> like group by and join in both Spark and Flink.
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>

-- 
Best Regards

Jeff Zhang
