To a (small) degree, Spark's "new" AQE (adaptive query execution) might be able to help, depending on what kind of operations Beam compiles the pipeline down to.
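The two settings in question are Spark SQL properties (AQE shipped in Spark 3.0 and, as far as I recall, is on by default since 3.2). A minimal sketch of enabling them explicitly on a PySpark session — app name here is arbitrary, and this assumes a Spark 3.x installation:

```python
from pyspark.sql import SparkSession

# Sketch: enable adaptive query execution so Spark can re-plan
# shuffles at runtime, and let it coalesce small post-shuffle
# partitions into fewer, larger ones.
spark = (
    SparkSession.builder
    .appName("aqe-sketch")  # hypothetical app name
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)
```

Note this only affects the Spark SQL/DataFrame execution path, which is why it depends on what Beam compiles the pipeline down to.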
Have you tried setting spark.sql.adaptive.enabled & spark.sql.adaptive.coalescePartitions.enabled?

On Mon, Apr 17, 2023 at 10:34 AM Reuven Lax via user <user@beam.apache.org> wrote:

> I see. Robert - what is the story for parallelism controls on GBK with the Spark or Flink runners?

On Sun, Apr 16, 2023 at 6:24 PM Jeff Zhang <zjf...@gmail.com> wrote:

> No, I don't use Dataflow; I use Spark & Flink.

On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax <re...@google.com> wrote:

> Are you running on the Dataflow runner? If so, Dataflow - unlike Spark and Flink - dynamically modifies the parallelism as the operator runs, so there is no need for such controls. In fact, these specific controls wouldn't make much sense for the way Dataflow implements these operators.

On Sun, Apr 16, 2023 at 12:25 AM Jeff Zhang <zjf...@gmail.com> wrote:

> Just for performance tuning, as in Spark and Flink.

On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <user@beam.apache.org> wrote:

> What are you trying to achieve by setting the parallelism?

On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <zjf...@gmail.com> wrote:

> Thanks Reuven. What I mean is setting the parallelism at the operator level, and the input size of an operator is unknown at compile time unless it is a source operator.
>
> Here's an example from Flink:
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
> Spark also supports setting operator-level parallelism (see groupByKey and reduceByKey):
> https://spark.apache.org/docs/latest/rdd-programming-guide.html

On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <user@beam.apache.org> wrote:

> The maximum parallelism is always determined by the parallelism of your data.
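To make that last point concrete: after a GroupByKey-style shuffle, the number of distinct keys is a hard ceiling on useful parallelism. A toy illustration in plain Python (no Beam or Spark involved; all names here are made up for the example):

```python
from collections import defaultdict

def group_by_key(pairs):
    """Group (key, value) pairs; each resulting group is the unit of
    downstream parallel work after a GroupByKey-style shuffle."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
groups = group_by_key(records)

# Only 2 distinct keys -> at most 2 groups can be processed in
# parallel downstream, regardless of how many workers exist.
max_parallelism = len(groups)
```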
> If you do a GroupByKey, for example, the number of keys in your data determines the maximum parallelism.
>
> Beyond the limitations in your data, it depends on your execution engine. If you're using Dataflow, Dataflow is designed to automatically determine the parallelism (e.g. work will be dynamically split and moved around between workers, the number of workers will autoscale, etc.), so there's no need to explicitly set the parallelism of the execution.

On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:

> Besides the global parallelism of a Beam job, is there any way to set the parallelism of individual operators, such as group-by and join? I understand that the parallelism setting depends on the underlying execution engine, but it is very common to set the parallelism of operations like group-by and join in both Spark and Flink.
>
> --
> Best Regards
>
> Jeff Zhang

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
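For reference on the operator-level controls linked up-thread: in the Spark RDD API, shuffle operations accept an optional numPartitions argument, and Flink's DataStream API exposes a per-operator setParallelism(). A minimal PySpark sketch (this assumes a local Spark installation; the master string, app name, and partition count are arbitrary):

```python
from operator import add
from pyspark import SparkContext

# Hypothetical local 4-core run; master and app name are arbitrary.
sc = SparkContext("local[4]", "operator-parallelism-sketch")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# reduceByKey accepts an optional numPartitions argument that sets
# the parallelism of this one shuffle, independent of the default.
counts = pairs.reduceByKey(add, numPartitions=8)

assert counts.getNumPartitions() == 8
sc.stop()
```

This is exactly the kind of per-operator knob the Beam model deliberately leaves to the runner.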