Julien Peloton created SPARK-26024:
--------------------------------------

             Summary: Dataset API: repartitionByRange(...) has inconsistent behaviour
                 Key: SPARK-26024
                 URL: https://issues.apache.org/jira/browse/SPARK-26024
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.2
         Environment: Spark version 2.3.2
            Reporter: Julien Peloton
Hi,

I recently played with the {{repartitionByRange}} method for DataFrames, introduced in SPARK-22614. For DataFrames larger than the one used in the unit test (which has only 10 elements), repeated calls on the same data return different partition sizes.

To demonstrate the inconsistent behaviour, I start from the unit test for {{repartitionByRange}} ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352]), but I increase the size of the initial array to 1001 elements, repartition into 3 partitions, and count the number of elements per partition:

{code}
import scala.util.Random
import org.apache.spark.sql.functions.col
// assumes spark.implicits._ is in scope (e.g. in spark-shell)

// Shuffle the numbers 0 to 1000 (inclusive) and make a DataFrame
val df = Random.shuffle(0.to(1000)).toDF("val")

// Repartition into 3 range partitions, count the elements in each
// partition, and collect the counts. Repeat ten times.
for (i <- 0 to 9) {
  val counts = df.repartitionByRange(3, col("val"))
    .mapPartitions(part => Iterator(part.size))
    .collect()
  println(counts.toList)
}
// -> the number of elements in each partition varies from run to run...
{code}

I do not know whether this is expected (I will dig further into the code), but it looks like a bug. Or am I misinterpreting what {{repartitionByRange}} is for?

Any ideas?

Thanks!
Julien

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
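PS: my guess (still to be verified against the Spark source) is that the range boundaries are estimated from a random sample of the data rather than from the full column. The following pure-Scala sketch of that mechanism is entirely hypothetical ({{sampleBoundaries}} and {{partitionSizes}} are my own illustrative helpers, not Spark APIs), but it reproduces the symptom: boundaries derived from a fresh sample shift between runs, so partition sizes vary even though every element is always assigned to exactly one partition.

```scala
import scala.util.Random

// Hypothetical sketch: estimate numPartitions - 1 cut points from a
// random sample of the data, as a sampling-based range partitioner might.
def sampleBoundaries(data: Seq[Int], numPartitions: Int, sampleSize: Int): Seq[Int] = {
  val sample = Random.shuffle(data).take(sampleSize).sorted
  (1 until numPartitions).map(i => sample(i * sample.size / numPartitions))
}

// Count how many elements fall into each range delimited by the boundaries.
def partitionSizes(data: Seq[Int], boundaries: Seq[Int]): Seq[Int] = {
  val counts = Array.fill(boundaries.size + 1)(0)
  data.foreach { v =>
    val idx = boundaries.indexWhere(v <= _) match {
      case -1 => boundaries.size // larger than every boundary: last partition
      case i  => i
    }
    counts(idx) += 1
  }
  counts.toSeq
}

val data = 0 to 1000
for (_ <- 0 until 3) {
  val b = sampleBoundaries(data, 3, 100)
  // Sizes differ between iterations, but they always sum to 1001.
  println(partitionSizes(data, b))
}
```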