As far as I know, without AQE repartition() simply creates as many
partitions as spark.sql.shuffle.partitions specifies (200 by default).
With AQE enabled, Spark coalesces those partitions into a reasonable
number based on their size. Note that you need to set
spark.sql.shuffle.partitions high enough, because AQE cannot increase the
number of partitions; it can only coalesce them.
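
As a rough illustration of why the initial partition count matters, here
is a toy model in plain Java. It is not Spark's implementation: the class
and method names (CoalesceSketch, coalesce) are made up for this message,
and the greedy merge of adjacent partitions up to an advisory size is only
an assumption about the general idea behind
spark.sql.adaptive.advisoryPartitionSizeInBytes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only, NOT Spark's code: merges adjacent shuffle partitions
// greedily and never splits one, so the count can only stay or shrink.
final class CoalesceSketch {

    // Returns the sizes (in bytes) of the partitions after merging.
    static List<Long> coalesce(long[] partitionBytes, long advisoryBytes) {
        List<Long> merged = new ArrayList<>();
        long current = 0L;
        boolean open = false; // true once the current group holds a partition
        for (long size : partitionBytes) {
            // Close the current group if adding this partition would overshoot.
            if (open && current + size > advisoryBytes) {
                merged.add(current);
                current = 0L;
            }
            current += size;
            open = true;
        }
        if (open) {
            merged.add(current);
        }
        return merged;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] small = {10 * mb, 20 * mb, 30 * mb, 40 * mb, 5 * mb};
        long[] large = {500 * mb, 500 * mb};
        System.out.println(coalesce(small, 64 * mb).size()); // 2
        System.out.println(coalesce(large, 64 * mb).size()); // 2
    }
}
```

In this model the five small partitions collapse into two, while the two
500 MB partitions stay as they are. Starting with too few shuffle
partitions therefore cannot be repaired afterwards by coalescing.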

On Tue, May 25, 2021 at 2:35 AM Tom Graves <tgraves...@yahoo.com> wrote:

> So would repartition() look at some other config (
> spark.sql.adaptive.advisoryPartitionSizeInBytes) to decide the partition
> size to target?  Does it require AQE?  If so, what does a repartition()
> call do if AQE is not enabled?  This is essentially a new API, so would
> repartitionBySize or something similar be less confusing to users who
> already use repartition(num_partitions)?
>
> Tom
>
> On Monday, May 24, 2021, 12:30:20 PM CDT, Wenchen Fan <cloud0...@gmail.com>
> wrote:
>
>
> Ideally this should be handled by the underlying data source to produce a
> reasonably partitioned RDD as the input data. However if we already have a
> poorly partitioned RDD at hand and want to repartition it properly, I think
> an extra shuffle is required so that we can know the partition size first.
>
> That said, I think calling `.repartition()` with no args is indeed a good
> solution for this problem.
>
> On Sat, May 22, 2021 at 1:12 AM mhawes <hawes.i...@gmail.com> wrote:
>
> Adding /another/ update to say that I'm currently planning on using a
> recently introduced feature whereby calling `.repartition()` with no args
> will cause the dataset to be optimised by AQE. This actually suits our
> use-case perfectly!
>
> Example:
>
>         sparkSession.conf().set("spark.sql.adaptive.enabled", "true");
>         Dataset<Long> dataset = sparkSession.range(1, 4, 1, 4).repartition();
>
>         assertThat(dataset.rdd().collectPartitions().length).isEqualTo(1); // true
>
>
> Relevant PRs/Issues:
> [SPARK-31220][SQL] repartition obeys initialPartitionNum when
> adaptiveExecutionEnabled https://github.com/apache/spark/pull/27986
> [SPARK-32056][SQL] Coalesce partitions for repartition by expressions when
> AQE is enabled https://github.com/apache/spark/pull/28900
> [SPARK-32056][SQL][Follow-up] Coalesce partitions for repartition hint and
> sql when AQE is enabled https://github.com/apache/spark/pull/28952
