Julien Peloton created SPARK-26024:
--------------------------------------

             Summary: Dataset API: repartitionByRange(...) has inconsistent behaviour
                 Key: SPARK-26024
                 URL: https://issues.apache.org/jira/browse/SPARK-26024
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.2
         Environment: Spark version 2.3.2
            Reporter: Julien Peloton
Hi,

I recently played with the {{repartitionByRange}} method for DataFrames, introduced in SPARK-22614. For DataFrames larger than the one used in the unit test (which has only 10 elements), repeated calls on the same data return different partition sizes.

To demonstrate the inconsistent behaviour, I start from the unit test for {{repartitionByRange}} ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352]), but I increase the size of the initial array to 1001 elements, repartition into 3 partitions, and count the number of elements per partition:

{code}
import scala.util.Random
import org.apache.spark.sql.functions.col
// assumes spark.implicits._ is in scope (e.g. in spark-shell)

// Shuffle the numbers 0 to 1000 (inclusive) and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")

// Repartition into 3 range partitions, count the elements in each
// partition, and collect the counts. Repeat ten times.
for (i <- 0 to 9) {
  val counts = df.repartitionByRange(3, col("val"))
    .mapPartitions(part => Iterator(part.size))
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies from run to run...
{code}

I do not know whether this is expected (I will dig further into the code), but it looks like a bug. Or am I misinterpreting what {{repartitionByRange}} is for?

Any ideas?

Thanks!
Julien

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
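PS: my guess (still to be verified against the Spark source) is that the range boundaries are estimated from a random sample of the data rather than from the full column. The following pure-Scala sketch of that mechanism is entirely hypothetical ({{sampleBoundaries}} and {{partitionSizes}} are my own illustrative helpers, not Spark APIs), but it reproduces the symptom: boundaries derived from a fresh sample shift between runs, so partition sizes vary even though every element is always assigned to exactly one partition.

```scala
import scala.util.Random

// Hypothetical sketch: estimate numPartitions - 1 cut points from a
// random sample of the data, as a sampling-based range partitioner might.
def sampleBoundaries(data: Seq[Int], numPartitions: Int, sampleSize: Int): Seq[Int] = {
  val sample = Random.shuffle(data).take(sampleSize).sorted
  (1 until numPartitions).map(i => sample(i * sample.size / numPartitions))
}

// Count how many elements fall into each range delimited by the boundaries.
def partitionSizes(data: Seq[Int], boundaries: Seq[Int]): Seq[Int] = {
  val counts = Array.fill(boundaries.size + 1)(0)
  data.foreach { v =>
    val idx = boundaries.indexWhere(v <= _) match {
      case -1 => boundaries.size // larger than every boundary: last partition
      case i  => i
    }
    counts(idx) += 1
  }
  counts.toSeq
}

val data = 0 to 1000
for (_ <- 0 until 3) {
  val b = sampleBoundaries(data, 3, 100)
  // Sizes differ between iterations, but they always sum to 1001.
  println(partitionSizes(data, b))
}
```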