No, I don't use Dataflow; I use Spark & Flink.
On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax <[email protected]> wrote:

> Are you running on the Dataflow runner? If so, Dataflow - unlike Spark and
> Flink - dynamically modifies the parallelism as the operator runs, so there
> is no need for such controls. In fact, these specific controls wouldn't
> make much sense for the way Dataflow implements these operators.
>
> On Sun, Apr 16, 2023 at 12:25 AM Jeff Zhang <[email protected]> wrote:
>
>> Just for performance tuning, like in Spark and Flink.
>>
>> On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <[email protected]> wrote:
>>
>>> What are you trying to achieve by setting the parallelism?
>>>
>>> On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <[email protected]> wrote:
>>>
>>>> Thanks Reuven. What I mean is setting the parallelism at the operator
>>>> level, and the input size of an operator is unknown at compile time if
>>>> it is not a source operator.
>>>>
>>>> Here's an example from Flink:
>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
>>>> Spark also supports setting operator-level parallelism (see groupByKey
>>>> and reduceByKey):
>>>> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>>>>
>>>> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <[email protected]> wrote:
>>>>
>>>>> The maximum parallelism is always determined by the parallelism of
>>>>> your data. If you do a GroupByKey, for example, the number of keys in
>>>>> your data determines the maximum parallelism.
>>>>>
>>>>> Beyond the limitations in your data, it depends on your execution
>>>>> engine. If you're using Dataflow, Dataflow is designed to automatically
>>>>> determine the parallelism (e.g. work will be dynamically split and moved
>>>>> around between workers, the number of workers will autoscale, etc.), so
>>>>> there's no need to explicitly set the parallelism of the execution.
>>>>>
>>>>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <[email protected]> wrote:
>>>>>
>>>>>> Besides the global parallelism of a Beam job, is there any way to set
>>>>>> the parallelism of individual operators like group-by and join? I
>>>>>> understand that the parallelism setting depends on the underlying
>>>>>> execution engine, but it is very common to set the parallelism of
>>>>>> operators like group-by and join in both Spark & Flink.
>>>>>>
>>>>>> --
>>>>>> Best Regards
>>>>>>
>>>>>> Jeff Zhang

--
Best Regards

Jeff Zhang
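For reference, the operator-level knobs discussed above are Spark's optional `numPartitions` argument (e.g. `rdd.reduceByKey(func, numPartitions)`) and Flink's per-operator `setParallelism(n)`. As a dependency-free illustration of what that number controls, here is a minimal sketch in plain Python of a hash-partitioned reduce-by-key, where `num_partitions` plays the role of the operator's parallelism. This is not how any engine actually implements the shuffle; the function and variable names are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func, num_partitions):
    """Hash-partition (key, value) pairs into num_partitions buckets
    (the 'operator-level parallelism'), then reduce each key inside
    its bucket. Mirrors the shape of Spark's
    rdd.reduceByKey(func, numPartitions); illustrative only."""
    # Shuffle phase: route every occurrence of a key to one partition.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    # Reduce phase: each partition is independent, so in a real engine
    # up to num_partitions workers could run this loop concurrently.
    result = {}
    for bucket in partitions:
        for key, values in bucket.items():
            result[key] = reduce(func, values)
    return result

if __name__ == "__main__":
    pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    print(reduce_by_key(pairs, lambda x, y: x + y, num_partitions=2))
```

Note that, as Reuven points out above, `num_partitions` only bounds the concurrency of the reduce phase; the number of distinct keys is the true maximum parallelism regardless of how many partitions you configure.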
