[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597894#comment-17597894 ]

Apache Spark commented on SPARK-39915:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37730

> Dataset.repartition(N) may not create N partitions
> --------------------------------------------------
>
>                 Key: SPARK-39915
>                 URL: https://issues.apache.org/jira/browse/SPARK-39915
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Shixiong Zhu
>            Priority: Major
>
> Looks like there is a behavior change in Dataset.repartition in 3.3.0. For
> example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5
> in Spark 3.2.0, but 0 in Spark 3.3.0.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17597138#comment-17597138 ]

Apache Spark commented on SPARK-39915:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37706
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582864#comment-17582864 ]

Apache Spark commented on SPARK-39915:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37612
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582707#comment-17582707 ]

XiDuo You commented on SPARK-39915:
-----------------------------------

We may need a stricter mechanism to guarantee the number of output partitions of repartition.
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582703#comment-17582703 ]

XiDuo You commented on SPARK-39915:
-----------------------------------

Thank you [~yumwang] for pinging me. I see the issue. It is not limited to the empty relation optimization; it also affects any other unary node that sits on top of the repartition, e.g.:
{code:java}
import org.apache.spark.sql.functions.col

val df1 = spark.range(1).selectExpr("id as c1")
val df2 = spark.range(1).selectExpr("id as c2")
// Requests 200 partitions, but the resulting RDD reports only 1.
df1.join(df2, col("c1") === col("c2")).repartition(200, col("c1")).rdd.getNumPartitions
// output: 1
{code}
The `.rdd` call on a Dataset injects a unary node, `DeserializeToObject`, so the current AQE protection for repartition does not work; see `AQEUtils`. The protection also does not retain `RoundRobinPartitioning`, which makes this issue more complex.
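Since the comment above attributes the lost partition count to AQE's handling of the repartition, one way to check this on a given build is to run the same query with AQE off and on and compare the counts. The following is only a minimal sketch: the object name `RepartitionAqeCheck`, the helper `partitionCount`, and the `local[2]` master are assumptions made for illustration, and it relies on `spark.sql.adaptive.enabled` being settable at runtime. No particular result is claimed for the AQE-enabled run; the value 1 quoted above is what the commenter reported.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark.
object RepartitionAqeCheck {
  // Runs the join-then-repartition query from the comment above and
  // returns the partition count of the RDD produced by `.rdd`.
  def partitionCount(spark: SparkSession, aqeEnabled: Boolean): Int = {
    spark.conf.set("spark.sql.adaptive.enabled", aqeEnabled.toString)
    val df1 = spark.range(1).selectExpr("id as c1")
    val df2 = spark.range(1).selectExpr("id as c2")
    df1.join(df2, col("c1") === col("c2"))
      .repartition(200, col("c1"))
      .rdd
      .getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to observe the behavior.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-39915-check")
      .getOrCreate()
    try {
      println(s"AQE disabled: ${partitionCount(spark, aqeEnabled = false)}")
      println(s"AQE enabled:  ${partitionCount(spark, aqeEnabled = true)}")
    } finally {
      spark.stop()
    }
  }
}
```

If the two counts differ, that would be consistent with the explanation that the AQE-side protection is bypassed once `DeserializeToObject` sits on top of the repartition.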
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582457#comment-17582457 ]

Shixiong Zhu commented on SPARK-39915:
--------------------------------------

Yeah. I would consider this a bug, since the doc of `repartition` explicitly says:
{code:java}
Returns a new Dataset that has exactly `numPartitions` partitions.
{code}
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582402#comment-17582402 ]

Yuming Wang commented on SPARK-39915:
-------------------------------------

The reason is that, since SPARK-35455, the plan is rewritten to an empty local relation:
https://github.com/apache/spark/blob/a077701d4cc36a9a6ce898ddd3b4e5fd506f6162/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelation.scala#L129-L130

cc [~ulysses]
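To make the effect described above concrete, here is a toy model in plain Scala of how collapsing a repartition over an empty input loses the requested partition count. This is not Spark's `PropagateEmptyRelation` code: the `Plan` types, `propagateEmpty`, and `numPartitions` are all hypothetical names invented for this sketch, and treating an empty relation as having 0 partitions is an assumption chosen to mirror the 0 reported in this issue.

```scala
// Toy model only; none of these types exist in Spark.
object EmptyRelationSketch {
  sealed trait Plan
  case object EmptyRelation extends Plan
  final case class Scan(rows: Long) extends Plan
  final case class Repartition(numPartitions: Int, child: Plan) extends Plan

  // Toy rule: a scan with no rows becomes the empty relation, and a
  // repartition whose child is empty collapses to the empty relation too,
  // discarding the requested partition count.
  def propagateEmpty(plan: Plan): Plan = plan match {
    case Scan(0L) => EmptyRelation
    case Repartition(n, child) =>
      propagateEmpty(child) match {
        case EmptyRelation => EmptyRelation
        case other         => Repartition(n, other)
      }
    case other => other
  }

  // Partition count of a plan in this model (assumption: empty => 0,
  // a bare scan => 1, a repartition => the requested count).
  def numPartitions(plan: Plan): Int = plan match {
    case EmptyRelation     => 0
    case Scan(_)           => 1
    case Repartition(n, _) => n
  }

  def main(args: Array[String]): Unit = {
    val plan = Repartition(5, Scan(0L))
    println(numPartitions(plan))                 // prints 5
    println(numPartitions(propagateEmpty(plan))) // prints 0
  }
}
```

In this model, the only way to keep the contract "exactly `numPartitions` partitions" is for the rule to leave the `Repartition` node in place even when its child is empty.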
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576319#comment-17576319 ]

Shixiong Zhu commented on SPARK-39915:
--------------------------------------

{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.range(10, 0).repartition(5).rdd.getNumPartitions
res0: Int = 0
{code}
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17576282#comment-17576282 ]

Pablo Langa Blanco commented on SPARK-39915:
--------------------------------------------

Hi [~zsxwing], I can't reproduce it. Do you have a typo in the range?
{code:java}
scala> spark.range(0, 10).repartition(5).rdd.getNumPartitions
res53: Int = 5
{code}
[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions
[ https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572685#comment-17572685 ]

Shixiong Zhu commented on SPARK-39915:
--------------------------------------

cc [~cloud_fan]