Herman van Hövell created SPARK-50525:
-----------------------------------------

             Summary: Do not allow repartition by map
                 Key: SPARK-50525
                 URL: https://issues.apache.org/jira/browse/SPARK-50525
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Herman van Hövell


 

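We allow users to repartition by a map column. This leads to incorrect results: maps that are equal but were built with a different insertion order can end up in different partitions.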
{code:java}
// Create a sequence of maps that all have the same elements, but a different insertion order.
import scala.util.Random
// A map with 4 elements uses scala.collection.immutable.Map$Map4, which retains the insertion order.
val elements = Seq.tabulate(4)(i => i -> s"v$i")
val maps = Seq.fill(10)(Random.shuffle(elements).toMap)

// Check if they are all the same in scala land.
assert(maps.distinct.size == 1)

// This fails with an AnalysisException, which is good: DISTINCT is not supported on map columns.
maps.toDF.distinct.show()

// This should return a single partition. However, it returns multiple partitions.
maps.toDF.repartition(4, $"value").groupBy(spark_partition_id()).count().show()

// +--------------------+-----+
// |SPARK_PARTITION_ID()|count|
// +--------------------+-----+
// |                   0|    2|
// |                   1|    4|
// |                   2|    2|
// |                   3|    2|
// +--------------------+-----+{code}
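
One plausible reading of the result above, not confirmed in this report, is that the partitioning hash walks the map entries in their stored order, so equal maps with different insertion orders hash differently. The sketch below illustrates that effect in plain Scala without Spark. The object `MapHashOrderSketch` and both hash functions are hypothetical stand-ins built on `scala.util.hashing.MurmurHash3`; they are not Spark's actual hash implementation.

```scala
import scala.util.hashing.MurmurHash3

object MapHashOrderSketch {
  // Hypothetical seed; any fixed value works for this illustration.
  private val Seed = 42

  // Order-sensitive: folds the entries in iteration order, so the result
  // depends on the order in which the entries were inserted.
  def orderSensitiveHash(m: Map[Int, String]): Int =
    m.foldLeft(Seed) { case (h, (k, v)) =>
      MurmurHash3.mix(MurmurHash3.mix(h, k.##), v.##)
    }

  // Order-insensitive: sorts the entries by key first, so equal maps
  // always produce the same hash.
  def canonicalHash(m: Map[Int, String]): Int =
    m.toSeq.sortBy(_._1).foldLeft(Seed) { case (h, (k, v)) =>
      MurmurHash3.mix(MurmurHash3.mix(h, k.##), v.##)
    }

  def main(args: Array[String]): Unit = {
    val elements = Seq.tabulate(4)(i => i -> s"v$i")
    val a = elements.toMap          // inserted as 0, 1, 2, 3
    val b = elements.reverse.toMap  // inserted as 3, 2, 1, 0

    println(s"equal maps:             ${a == b}")
    println(s"order-sensitive hashes: ${orderSensitiveHash(a)} vs ${orderSensitiveHash(b)}")
    println(s"canonical hashes:       ${canonicalHash(a)} vs ${canonicalHash(b)}")
  }
}
```

The two maps compare equal, yet only the canonical hash is guaranteed to agree for both. A hash-based partitioner needs that guarantee to place equal keys in the same partition.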
 

 



