Ideally this should be handled by the underlying data source, so that the
input data arrives as a reasonably partitioned RDD. However, if we already
have a poorly partitioned RDD at hand and want to repartition it properly, I
think an extra shuffle is required so that the partition sizes are known
first.
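To make that concrete, here is a minimal sketch (assuming Spark 3.x and the Java API; the class name and session settings are made up for illustration). An explicit `repartition(n)` on an already-materialised Dataset always inserts an Exchange (shuffle) into the physical plan, which is the extra shuffle referred to above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class RepartitionSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("repartition-sketch")
                .getOrCreate();

        // A "poorly partitioned" Dataset: 8 rows crammed into 1 partition.
        Dataset<Long> skewed = spark.range(0, 8, 1, 1);

        // Explicit repartition shuffles the data into the requested number
        // of partitions; the physical plan printed by explain() contains an
        // Exchange node for the shuffle.
        Dataset<Long> balanced = skewed.repartition(4);
        balanced.explain();

        spark.stop();
    }
}
```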

That said, I think calling `.repartition()` with no args is indeed a good
solution for this problem.
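For reference, the AQE behaviour that makes the no-arg `.repartition()` work is governed by a few settings (the values shown are the Spark 3.0 defaults as I understand them, not recommendations):

```
spark.sql.adaptive.enabled                       false  # AQE master switch; must be set to true
spark.sql.adaptive.coalescePartitions.enabled    true   # merge small shuffle partitions after the exchange
spark.sql.adaptive.advisoryPartitionSizeInBytes  64MB   # target size per coalesced partition
```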

On Sat, May 22, 2021 at 1:12 AM mhawes <hawes.i...@gmail.com> wrote:

> Adding /another/ update to say that I'm currently planning on using a
> recently introduced feature whereby calling `.repartition()` with no args
> will cause the dataset to be optimised by AQE. This actually suits our
> use-case perfectly!
>
> Example:
>
>         sparkSession.conf().set("spark.sql.adaptive.enabled", "true");
>         Dataset<Long> dataset = sparkSession.range(1, 4, 1, 4).repartition();
>
>         assertThat(dataset.rdd().collectPartitions().length).isEqualTo(1); // true
>
>
> Relevant PRs/Issues:
> [SPARK-31220][SQL] repartition obeys initialPartitionNum when
> adaptiveExecutionEnabled https://github.com/apache/spark/pull/27986
> [SPARK-32056][SQL] Coalesce partitions for repartition by expressions when
> AQE is enabled https://github.com/apache/spark/pull/28900
> [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartition hint and
> sql when AQE is enabled https://github.com/apache/spark/pull/28952
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
