I see. Robert - what is the story for parallelism controls on GBK with the Spark or Flink runners?
On Sun, Apr 16, 2023 at 6:24 PM Jeff Zhang <zjf...@gmail.com> wrote:

> No, I don't use Dataflow; I use Spark & Flink.
>
> On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax <re...@google.com> wrote:
>
>> Are you running on the Dataflow runner? If so, Dataflow - unlike Spark
>> and Flink - dynamically modifies the parallelism as the operator runs, so
>> there is no need for such controls. In fact, these specific controls
>> wouldn't make much sense for the way Dataflow implements these operators.
>>
>> On Sun, Apr 16, 2023 at 12:25 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> Just for performance tuning, as in Spark and Flink.
>>>
>>> On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <user@beam.apache.org> wrote:
>>>
>>>> What are you trying to achieve by setting the parallelism?
>>>>
>>>> On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> Thanks Reuven. What I mean is setting the parallelism at the operator
>>>>> level, and the input size of an operator is unknown at compile time
>>>>> unless it is a source operator.
>>>>>
>>>>> Here's an example from Flink:
>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
>>>>> Spark also supports setting operator-level parallelism (see groupByKey
>>>>> and reduceByKey):
>>>>> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>>>>>
>>>>> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <user@beam.apache.org> wrote:
>>>>>
>>>>>> The maximum parallelism is always determined by the parallelism of
>>>>>> your data. If you do a GroupByKey, for example, the number of keys in
>>>>>> your data determines the maximum parallelism.
>>>>>>
>>>>>> Beyond the limitations in your data, it depends on your execution
>>>>>> engine. If you're using Dataflow, Dataflow is designed to automatically
>>>>>> determine the parallelism (e.g. work will be dynamically split and moved
>>>>>> around between workers, the number of workers will autoscale, etc.), so
>>>>>> there's no need to explicitly set the parallelism of the execution.
>>>>>>
>>>>>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>
>>>>>>> Besides the global parallelism of a Beam job, is there any way to set
>>>>>>> the parallelism for individual operators like group-by and join? I
>>>>>>> understand the parallelism setting depends on the underlying execution
>>>>>>> engine, but it is very common to set parallelism for operators like
>>>>>>> group-by and join in both Spark & Flink.
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards
>>>>>>>
>>>>>>> Jeff Zhang