koert kuipers created SPARK-31841:
-------------------------------------

             Summary: Dataset.repartition leverage adaptive execution
                 Key: SPARK-31841
                 URL: https://issues.apache.org/jira/browse/SPARK-31841
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
         Environment: spark branch-3.0 from may 1 this year
            Reporter: koert kuipers
Hello, we are very happy users of adaptive query execution. It's a great feature: we no longer have to think about and tune the number of partitions in a shuffle. I noticed that Dataset.groupBy consistently uses adaptive execution when it's enabled (e.g. I don't see the default 200 partitions), but when I do Dataset.repartition it seems I am back to a hardcoded number of partitions.

Should adaptive execution also be used for repartition? It would be nice to be able to repartition without having to think about the optimal number of partitions.

An example:

{code:java}
$ spark-shell --conf spark.sql.adaptive.enabled=true --conf spark.sql.adaptive.advisoryPartitionSizeInBytes=100000
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val x = (1 to 1000000).toDF
x: org.apache.spark.sql.DataFrame = [value: int]

scala> x.rdd.getNumPartitions
res0: Int = 2

scala> x.repartition($"value").rdd.getNumPartitions
res1: Int = 200

scala> x.groupBy("value").count.rdd.getNumPartitions
res2: Int = 67
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
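The contrast in the session above can be summarized in a short sketch (run inside a spark-shell with spark.sql.adaptive.enabled=true; the exact partition counts are illustrative and depend on data size and advisoryPartitionSizeInBytes):

{code:java}
// With adaptive query execution enabled:
val x = (1 to 1000000).toDF

// groupBy goes through an adaptive shuffle: the number of post-shuffle
// partitions is chosen based on spark.sql.adaptive.advisoryPartitionSizeInBytes,
// not taken from spark.sql.shuffle.partitions.
x.groupBy("value").count.rdd.getNumPartitions   // data-dependent, e.g. 67

// repartition by expression does not: it falls back to the static
// spark.sql.shuffle.partitions value (default 200).
x.repartition($"value").rdd.getNumPartitions    // 200

// Passing an explicit count is currently the only way to control it.
x.repartition(50, $"value").rdd.getNumPartitions // 50
{code}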