[ https://issues.apache.org/jira/browse/SPARK-50525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-50525:
-----------------------------------
    Labels: correctness pull-request-available  (was: correctness)

> Do not allow repartition by map
> -------------------------------
>
>                 Key: SPARK-50525
>                 URL: https://issues.apache.org/jira/browse/SPARK-50525
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Herman van Hövell
>            Priority: Blocker
>              Labels: correctness, pull-request-available
>
>
> We allow users to repartition by a map column. This leads to incorrect results.
> {code:java}
> // Create a sequence of maps that all have the same elements, but a different insertion order.
> import scala.util.Random
> import org.apache.spark.sql.functions.spark_partition_id
> // 4 elements uses a scala.collection.immutable.Map$Map4, which retains the insertion order.
> val elements = Seq.tabulate(4)(i => i -> s"v$i")
> val maps = Seq.fill(10)(Random.shuffle(elements).toMap)
> // Check that they are all the same in Scala land.
> assert(maps.distinct.size == 1)
> // This fails, which is good.
> maps.toDF.distinct.show()
> // This should return a single partition. However, it returns multiple partitions.
> maps.toDF.repartition(4, $"value").groupBy(spark_partition_id()).count().show()
> // +--------------------+-----+
> // |SPARK_PARTITION_ID()|count|
> // +--------------------+-----+
> // |                   0|    2|
> // |                   1|    4|
> // |                   2|    2|
> // |                   3|    2|
> // +--------------------+-----+{code}
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org