Just for performance tuning like in Spark and Flink.
On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <user@beam.apache.org> wrote:

> What are you trying to achieve by setting the parallelism?
>
> On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Thanks Reuven. What I mean is setting the parallelism at the operator
>> level. The input size of an operator is unknown at compile time if it is
>> not a source operator.
>>
>> Here's an example from Flink:
>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
>> Spark also supports setting operator-level parallelism (see groupByKey
>> and reduceByKey):
>> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>>
>> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <user@beam.apache.org> wrote:
>>
>>> The maximum parallelism is always determined by the parallelism of your
>>> data. If you do a GroupByKey, for example, the number of keys in your
>>> data determines the maximum parallelism.
>>>
>>> Beyond the limitations in your data, it depends on your execution
>>> engine. If you're using Dataflow, Dataflow is designed to automatically
>>> determine the parallelism (e.g. work will be dynamically split and moved
>>> around between workers, the number of workers will autoscale, etc.), so
>>> there's no need to explicitly set the parallelism of the execution.
>>>
>>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>>
>>>> Besides the global parallelism of a Beam job, is there any way to set
>>>> parallelism for individual operators like group-by and join? I
>>>> understand the parallelism setting depends on the underlying execution
>>>> engine, but it is very common to set parallelism for group-by and join
>>>> in both Spark and Flink.

--
Best Regards

Jeff Zhang
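Reuven's point that the data itself bounds parallelism can be sketched with a toy stand-in for GroupByKey. This is plain Python for illustration only, not the Beam API; the function name and the sample records are made up:

```python
from collections import defaultdict

def group_by_key(records):
    """Toy stand-in for a GroupByKey: collect all values per key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]
groups = group_by_key(records)

# Each key's group must be processed as a single unit, so the number of
# distinct keys is an upper bound on the useful parallelism downstream,
# no matter what parallelism you request from the runner.
max_useful_parallelism = len(groups)
print(max_useful_parallelism)  # 3 distinct keys: "a", "b", "c"
```

With only three distinct keys, asking a runner for, say, 100-way parallelism after this grouping cannot help: at most three workers can have work to do, which is why runners like Dataflow infer parallelism from the data rather than taking it as an operator setting.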