What are you trying to achieve by setting the parallelism?

On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <[email protected]> wrote:
> Thanks Reuven, what I mean is setting the parallelism at the operator
> level. And the input size of an operator is unknown at compile time if
> it is not a source operator.
>
> Here's an example from Flink:
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
>
> Spark also supports setting operator-level parallelism (see groupByKey
> and reduceByKey):
> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>
> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <[email protected]> wrote:
>
>> The maximum parallelism is always determined by the parallelism of your
>> data. If you do a GroupByKey, for example, the number of keys in your
>> data determines the maximum parallelism.
>>
>> Beyond the limitations in your data, it depends on your execution
>> engine. If you're using Dataflow, Dataflow is designed to determine the
>> parallelism automatically (e.g. work will be dynamically split and moved
>> around between workers, and the number of workers will autoscale), so
>> there's no need to explicitly set the parallelism of the execution.
>>
>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <[email protected]> wrote:
>>
>>> Besides the global parallelism of a Beam job, is there any way to set
>>> the parallelism of individual operators like group-by and join? I
>>> understand that the parallelism setting depends on the underlying
>>> execution engine, but it is very common to set the parallelism of
>>> operators like group-by and join in both Spark and Flink.
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>
> --
> Best Regards
>
> Jeff Zhang
>
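For concreteness, the Flink operator-level knob Jeff links to looks roughly like this in Java. This is a minimal sketch; the elements and the parallelism value are illustrative, not from the thread:

    // Flink DataStream API: setParallelism() applies to one operator only.
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkOperatorParallelism {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "a")
            // This map operator runs with 4 parallel subtasks, independent
            // of the environment-wide default parallelism.
            .map(s -> s.toUpperCase()).setParallelism(4)
            .print();
        env.execute("operator-level parallelism");
      }
    }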
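The Spark equivalent Jeff mentions is the numPartitions argument on shuffle operations such as reduceByKey, which controls the parallelism of that one operation. Again a minimal sketch with made-up data:

    // Spark Java API: per-operation parallelism via numPartitions.
    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkOperatorParallelism {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "parallelism-demo");
        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
            Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)));
        // The second argument is numPartitions: this reduce shuffles into
        // 8 partitions regardless of the input's partitioning.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b, 8);
        System.out.println(counts.collect());
        sc.stop();
      }
    }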
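By contrast, and per Reuven's point, the Beam API itself has no setParallelism: after a GroupByKey the number of distinct keys is the upper bound on parallelism, and the runner decides everything else. A minimal Beam Java sketch of that situation (pipeline contents are illustrative):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    public class BeamGroupByKeyParallelism {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        // Only two distinct keys ("a" and "b"), so downstream of this
        // GroupByKey at most two groups can be processed in parallel,
        // no matter how many workers the runner provides.
        PCollection<KV<String, Iterable<Integer>>> grouped =
            p.apply(Create.of(KV.of("a", 1), KV.of("b", 2), KV.of("a", 3)))
             .apply(GroupByKey.create());
        p.run().waitUntilFinish();
      }
    }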
