[ https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Peloton updated SPARK-26024:
-----------------------------------
    Affects Version/s: 2.3.0
                       2.3.1

> Dataset API: repartitionByRange(...) has inconsistent behaviour
> ---------------------------------------------------------------
>
>                 Key: SPARK-26024
>                 URL: https://issues.apache.org/jira/browse/SPARK-26024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2
>         Environment: Spark version 2.3.2
>            Reporter: Julien Peloton
>            Priority: Major
>              Labels: dataFrame, partitioning, repartition, spark-sql
>
> Hi,
> I recently played with the {{repartitionByRange}} method for DataFrames, introduced in SPARK-22614. For DataFrames larger than the one used in the unit test (which has only 10 elements), the partition sizes it produces vary randomly from run to run.
> To demonstrate the inconsistent behaviour, I start from the unit test for {{repartitionByRange}}
> ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352]),
> but I increase the initial array to the numbers 0 to 1000, repartition it into 3 partitions, and count the number of elements per partition:
>
> {code}
> import scala.util.Random
> import org.apache.spark.sql.functions.col
> import spark.implicits._ // in spark-shell `spark` and these implicits are already available
>
> // Shuffle the numbers from 0 to 1000 and make a DataFrame out of them.
> val df = Random.shuffle(0.to(1000)).toDF("val")
>
> // Repartition into 3 partitions by range, count the number of elements
> // in each partition, and collect the counts. Repeat several times.
> for (i <- 0 to 9) {
>   val counts = df.repartitionByRange(3, col("val"))
>     .mapPartitions(part => Iterator(part.size))
>     .collect()
>   println(counts.toList)
> }
> // -> the number of elements in each partition varies from run to run...
> {code}
> I do not know whether this is expected (I will dig further into the code), but it looks like a bug.
> Or am I just misinterpreting what {{repartitionByRange}} is for?
> Any ideas?
> Thanks!
> Julien
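>
> PS: to make "inconsistent" more concrete, here is a quick sketch (it assumes the same spark-shell session and the same {{df}} as above; the per-partition (size, min, max) triple is just for illustration). It should show that every run still covers all 1001 values and each partition still holds a contiguous range of values; only the range boundaries, and hence the partition sizes, move between runs:
>
> {code}
> // For each run, print (size, min, max) per partition plus the total count.
> // The total and the range property stay stable; only the sizes change.
> for (i <- 0 to 9) {
>   val stats = df.repartitionByRange(3, col("val"))
>     .mapPartitions { part =>
>       val vals = part.map(_.getInt(0)).toArray
>       // (-1, -1) marks an empty partition, which sampling can produce
>       Iterator((vals.length,
>                 if (vals.isEmpty) -1 else vals.min,
>                 if (vals.isEmpty) -1 else vals.max))
>     }
>     .collect()
>   println(s"${stats.toList} total=${stats.map(_._1).sum}")
> }
> {code}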