Ideally this should be handled by the underlying data source, so that the input data arrives as a reasonably partitioned RDD. However, if we already have a poorly partitioned RDD at hand and want to repartition it properly, I think an extra shuffle is required so that we can first learn the partition sizes.
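To make the sizing step concrete, here is a minimal sketch of the arithmetic involved in choosing a partition count once the total data size is known. The `targetPartitions` helper and the class name are illustrative, not Spark code; the 64 MB threshold is an assumption matching the default of Spark's real `spark.sql.adaptive.advisoryPartitionSizeInBytes` config, which AQE uses when coalescing shuffle partitions.

```java
// Illustrative sketch only -- not Spark's actual implementation.
// Mirrors the idea behind AQE partition coalescing: divide the observed
// shuffle size by an advisory per-partition size (64 MB by default via
// spark.sql.adaptive.advisoryPartitionSizeInBytes) to pick a partition count.
public class PartitionSizing {
    static final long ADVISORY_PARTITION_BYTES = 64L * 1024 * 1024; // assumed 64 MB

    static int targetPartitions(long totalShuffleBytes) {
        if (totalShuffleBytes <= 0) {
            return 1; // always keep at least one partition
        }
        // Ceiling division: enough partitions that none exceeds the advisory size.
        long n = (totalShuffleBytes + ADVISORY_PARTITION_BYTES - 1) / ADVISORY_PARTITION_BYTES;
        return (int) Math.max(1, n);
    }

    public static void main(String[] args) {
        System.out.println(targetPartitions(10L * 1024 * 1024));  // 10 MB fits in one partition
        System.out.println(targetPartitions(700L * 1024 * 1024)); // 700 MB / 64 MB, rounded up
    }
}
```

The point is simply that this computation needs the total size as an input, which is why an already-materialized but poorly partitioned RDD cannot be fixed without first shuffling (or otherwise scanning) the data.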
That said, I think calling `.repartition()` with no args is indeed a good solution for this problem.

On Sat, May 22, 2021 at 1:12 AM mhawes <hawes.i...@gmail.com> wrote:

> Adding /another/ update to say that I'm currently planning on using a
> recently introduced feature whereby calling `.repartition()` with no args
> will cause the dataset to be optimised by AQE. This actually suits our
> use-case perfectly!
>
> Example:
>
> sparkSession.conf().set("spark.sql.adaptive.enabled", "true");
> Dataset<Long> dataset = sparkSession.range(1, 4, 1, 4).repartition();
>
> assertThat(dataset.rdd().collectPartitions().length).isEqualTo(1); // true
>
> Relevant PRs/Issues:
> [SPARK-31220][SQL] Repartition obeys initialPartitionNum when
> adaptiveExecutionEnabled: https://github.com/apache/spark/pull/27986
> [SPARK-32056][SQL] Coalesce partitions for repartition by expressions when
> AQE is enabled: https://github.com/apache/spark/pull/28900
> [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartition hint and
> SQL when AQE is enabled: https://github.com/apache/spark/pull/28952