[ 
https://issues.apache.org/jira/browse/SPARK-50525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-50525:
-----------------------------------
    Labels: correctness pull-request-available  (was: correctness)

> Do not allow repartition by map
> -------------------------------
>
>                 Key: SPARK-50525
>                 URL: https://issues.apache.org/jira/browse/SPARK-50525
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Herman van Hövell
>            Priority: Blocker
>              Labels: correctness, pull-request-available
>
>  
> We allow users to repartition by a map column. This leads to incorrect 
> results.
> {code:java}
> // Create a sequence of maps that all have the same elements, but a different
> // insertion order.
> import scala.util.Random
> // With 4 elements, toMap yields a scala.collection.immutable.Map$Map4, which
> // retains the insertion order.
> val elements = Seq.tabulate(4)(i => i -> s"v$i")
> val maps = Seq.fill(10)(Random.shuffle(elements).toMap)
> // Check if they are all the same in scala land.
> assert(maps.distinct.size == 1)
> // This fails, which is good.
> maps.toDF.distinct.show()
> // This should return a single partition. However, it returns multiple
> // partitions.
> maps.toDF.repartition(4, $"value").groupBy(spark_partition_id()).show()
> // +--------------------+-----+
> // |SPARK_PARTITION_ID()|count|
> // +--------------------+-----+
> // |                   0|    2|
> // |                   1|    4|
> // |                   2|    2|
> // |                   3|    2|
> // +--------------------+-----+{code}
>  
>  
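
One plausible reading of the reproduction above, offered as a sketch and not as Spark's actual code path: if the partitioning hash walks a map's entries in their stored order, two maps that compare equal can still hash differently. The plain-Scala snippet below illustrates that idea without Spark. `orderedEntryHash` is a hypothetical helper introduced only for this illustration; it uses the standard library's `MurmurHash3.orderedHash` as a stand-in for any order-sensitive hash.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical stand-in for an order-sensitive hash over map entries.
// This is NOT Spark's implementation; it only illustrates the failure mode.
def orderedEntryHash(m: Map[Int, String]): Int =
  MurmurHash3.orderedHash(m.toSeq)

// Same four entries, inserted in opposite orders. Maps of up to four
// entries iterate in insertion order.
val a = Map(0 -> "v0", 1 -> "v1", 2 -> "v2", 3 -> "v3")
val b = Map(3 -> "v3", 2 -> "v2", 1 -> "v1", 0 -> "v0")

// Equal as Scala maps, so their built-in hash codes agree.
assert(a == b)
assert(a.hashCode == b.hashCode)

// The entry sequences differ, so an order-sensitive hash over them
// is free to differ as well.
assert(a.toSeq != b.toSeq)
println(s"ordered hashes equal: ${orderedEntryHash(a) == orderedEntryHash(b)}")
```

Under this assumption, rows holding logically equal maps can be routed to different partitions, which would match the counts in the reported output.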



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
