Herman van Hövell created SPARK-50525:
-----------------------------------------

             Summary: Do not allow repartition by map
                 Key: SPARK-50525
                 URL: https://issues.apache.org/jira/browse/SPARK-50525
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Herman van Hövell


We allow users to repartition by a map column. This leads to incorrect results: maps that contain the same entries in a different insertion order can end up in different partitions.
{code:java}
// Create a sequence of maps that all have the same elements, but a different insertion order.
import scala.util.Random
val elements = Seq.tabulate(4)(i => i -> s"v$i")
// A map with 4 elements uses scala.collection.immutable.Map$Map4, which retains the insertion order.
val maps = Seq.fill(10)(Random.shuffle(elements).toMap)

// Check if they are all the same in scala land.
assert(maps.distinct.size == 1)

// This fails, which is good.
maps.toDF.distinct.show()

// This should return a single partition. However, it returns multiple partitions.
maps.toDF.repartition(4, $"value").groupBy(spark_partition_id()).count().show()
// +--------------------+-----+
// |SPARK_PARTITION_ID()|count|
// +--------------------+-----+
// |                   0|    2|
// |                   1|    4|
// |                   2|    2|
// |                   3|    2|
// +--------------------+-----+{code}
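
The effect can be sketched outside Spark. The snippet below is a plain-Scala illustration, not Spark's partitioning code: it assumes that a hash computed over a map's entries in their stored order behaves like the `orderSensitiveHash` helper here. The object name and both helper functions are hypothetical and exist only for this example.

```scala
import scala.util.hashing.MurmurHash3

object MapOrderHashSketch {
  // Arbitrary fixed seed for the illustration.
  private val Seed = 42

  // Hashes the entries in iteration (insertion) order.
  // Stand-in for any hash that walks a map's stored layout.
  def orderSensitiveHash(m: Map[Int, String]): Int =
    MurmurHash3.orderedHash(m.toSeq, Seed)

  // Sorts the entries by key before hashing, so insertion order no longer matters.
  def canonicalHash(m: Map[Int, String]): Int =
    MurmurHash3.orderedHash(m.toSeq.sortBy(_._1), Seed)

  def main(args: Array[String]): Unit = {
    // Same entries, opposite insertion order (both are Map4 instances).
    val a = Map(0 -> "v0", 1 -> "v1", 2 -> "v2", 3 -> "v3")
    val b = Map(3 -> "v3", 2 -> "v2", 1 -> "v1", 0 -> "v0")

    assert(a == b)                 // equal as maps
    assert(a.toSeq != b.toSeq)     // but they iterate in different orders
    assert(canonicalHash(a) == canonicalHash(b))

    println(s"order-sensitive hashes equal: ${orderSensitiveHash(a) == orderSensitiveHash(b)}")
    println(s"canonical hashes equal:       ${canonicalHash(a) == canonicalHash(b)}")
  }
}
```

Two maps that compare equal can still present their entries in different orders, so any hash that depends on that order may route them to different partitions. Hashing a canonical (key-sorted) form removes the dependence.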