Thanks Holden, this would work for Spark, but Flink doesn't have such a mechanism, so I am looking for a general solution on the Beam side.
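[For reference, the AQE knobs Holden mentions below are ordinary Spark SQL configs. A minimal, illustrative sketch of enabling them on a SparkSession follows; the class and app name are hypothetical, and when submitting through spark-submit the same keys can instead be passed with --conf.]

import org.apache.spark.sql.SparkSession;

public class AqeConfigSketch {
  public static void main(String[] args) {
    // Enable Adaptive Query Execution so Spark coalesces shuffle
    // partitions at runtime based on the observed data size.
    SparkSession spark = SparkSession.builder()
        .appName("aqe-config-sketch")  // hypothetical app name
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        .getOrCreate();

    // ... run the job with this session/configuration ...

    spark.stop();
  }
}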
On Mon, Apr 17, 2023 at 10:08 AM Holden Karau <hol...@pigscanfly.ca> wrote:

> To a (small) degree Spark's “new” AQE might be able to help, depending on
> what kind of operations Beam is compiling it down to.
>
> Have you tried setting spark.sql.adaptive.enabled &
> spark.sql.adaptive.coalescePartitions.enabled?
>
> On Mon, Apr 17, 2023 at 10:34 AM Reuven Lax via user <user@beam.apache.org>
> wrote:
>
>> I see. Robert - what is the story for parallelism controls on GBK with
>> the Spark or Flink runners?
>>
>> On Sun, Apr 16, 2023 at 6:24 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> No, I don't use Dataflow, I use Spark & Flink.
>>>
>>> On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax <re...@google.com> wrote:
>>>
>>>> Are you running on the Dataflow runner? If so, Dataflow - unlike Spark
>>>> and Flink - dynamically modifies the parallelism as the operator runs, so
>>>> there is no need to have such controls. In fact, these specific controls
>>>> wouldn't make much sense for the way Dataflow implements these operators.
>>>>
>>>> On Sun, Apr 16, 2023 at 12:25 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>
>>>>> Just for performance tuning, like in Spark and Flink.
>>>>>
>>>>> On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <
>>>>> user@beam.apache.org> wrote:
>>>>>
>>>>>> What are you trying to achieve by setting the parallelism?
>>>>>>
>>>>>> On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Reuven, what I mean is setting the parallelism at the operator
>>>>>>> level, and the input size of the operator is unknown at compile time
>>>>>>> if it is not a source operator.
>>>>>>>
>>>>>>> Here's an example from Flink:
>>>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
>>>>>>> Spark also supports setting operator-level parallelism (see groupByKey
>>>>>>> and reduceByKey):
>>>>>>> https://spark.apache.org/docs/latest/rdd-programming-guide.html
>>>>>>>
>>>>>>> On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <
>>>>>>> user@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> The maximum parallelism is always determined by the parallelism of
>>>>>>>> your data. If you do a GroupByKey, for example, the number of keys in
>>>>>>>> your data determines the maximum parallelism.
>>>>>>>>
>>>>>>>> Beyond the limitations in your data, it depends on your execution
>>>>>>>> engine. If you're using Dataflow, Dataflow is designed to automatically
>>>>>>>> determine the parallelism (e.g. work will be dynamically split and moved
>>>>>>>> around between workers, the number of workers will autoscale, etc.), so
>>>>>>>> there's no need to explicitly set the parallelism of the execution.
>>>>>>>>
>>>>>>>> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Besides the global parallelism of a Beam job, is there any way to
>>>>>>>>> set parallelism for individual operators like group-by and join? I
>>>>>>>>> understand that the parallelism setting depends on the underlying
>>>>>>>>> execution engine, but it is very common to set parallelism for
>>>>>>>>> group-by and join in both Spark & Flink.
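[For context, a minimal sketch of the operator-level control referenced above, using the stock Flink DataStream API; the element values, key/sum fields, and parallelism numbers are only illustrative. Spark's RDD API exposes the analogous knob as an explicit partition count, e.g. groupByKey(numPartitions) or reduceByKey(func, numPartitions).]

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorParallelismSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(4);  // pipeline-wide default parallelism

    DataStream<Tuple2<String, Integer>> counts = env
        .fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3))
        .keyBy(value -> value.f0)  // group by the String key
        .sum(1)
        .setParallelism(8);  // per-operator override, as in the linked Flink docs

    counts.print();
    env.execute("operator-parallelism-sketch");
  }
}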
-- 
Best Regards

Jeff Zhang