koert kuipers created SPARK-31841:
-------------------------------------

             Summary: Dataset.repartition leverage adaptive execution
                 Key: SPARK-31841
                 URL: https://issues.apache.org/jira/browse/SPARK-31841
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
         Environment: Spark branch-3.0 from May 1 this year
            Reporter: koert kuipers


Hello,

We are very happy users of adaptive query execution. It's a great feature: we no longer have to think about and tune the number of partitions in a shuffle.

I noticed that Dataset.groupBy consistently uses adaptive execution when it is enabled (e.g. I don't see the default 200 partitions), but when I do Dataset.repartition it seems I am back to a hardcoded number of partitions.

Should adaptive execution also be used for repartition? It would be nice to be able to repartition without having to think about the optimal number of partitions.

An example:
{code:java}
$ spark-shell --conf spark.sql.adaptive.enabled=true --conf spark.sql.adaptive.advisoryPartitionSizeInBytes=100000
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val x = (1 to 1000000).toDF
x: org.apache.spark.sql.DataFrame = [value: int]

scala> x.rdd.getNumPartitions
res0: Int = 2

scala> x.repartition($"value").rdd.getNumPartitions
res1: Int = 200

scala> x.groupBy("value").count.rdd.getNumPartitions
res2: Int = 67
{code}
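
For reference, here is a self-contained sketch of the same comparison outside the shell. It is only an illustration under the configuration above (the object name, master, and app name are made up for the example, and it assumes Spark 3.0 on the classpath). Presumably AQE arrived at 67 partitions because the shuffle wrote roughly 67 x ~100 KB of data, while repartition fell back to the spark.sql.shuffle.partitions default of 200.
{code:java}
// Hypothetical standalone reproduction; names like RepartitionVsAqe
// are invented for this sketch.
import org.apache.spark.sql.SparkSession

object RepartitionVsAqe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("repartition-vs-aqe")
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "100000")
      .getOrCreate()
    import spark.implicits._

    val x = (1 to 1000000).toDF

    // Explicit repartition: the partition count comes from the fixed
    // spark.sql.shuffle.partitions setting (200 by default).
    println(x.repartition($"value").rdd.getNumPartitions)

    // Aggregation shuffle: AQE coalesces post-shuffle partitions
    // toward the advisory size instead of using the fixed default.
    println(x.groupBy("value").count().rdd.getNumPartitions)

    spark.stop()
  }
}
{code}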


