[ https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-38653:
---------------------------------
    Component/s: SQL
                 (was: Spark Core)

> Repartition by Column that is Int not working properly only on particular
> numbers (11, 33)
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38653
>                 URL: https://issues.apache.org/jira/browse/SPARK-38653
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>         Environment: This was running on EMR 6.4.0, using Spark 3.1.2 in an
> EMR Notebook, writing to S3.
>            Reporter: John Engelhart
>            Priority: Major
>
> My understanding is that when you call .repartition(column), every row with
> the same key in that column goes to the same partition, and no two distinct
> keys should ever be repartitioned into the same part file. That behavior
> holds with a String column. It also holds with an Int column, except for
> certain values: in my use case, the magic numbers 11 and 33.
> {code:java}
> // Int-based column repartition
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
>   repartition($"collectionIndex").write.mode("overwrite").parquet("path")
> // Produces two part files
>
> // String-based column repartition
> spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
>   repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
> // Produces three part files
> {code}
> {code:java}
> // Not working as expected: two of the three keys share a part file
> spark.read.parquet("path/part-00000...").distinct.show
> spark.read.parquet("path/part-00001...").distinct.show
>
> // Working as expected: one key per part file
> spark.read.parquet("path1/part-00000...").distinct.show
> spark.read.parquet("path1/part-00001...").distinct.show
> spark.read.parquet("path1/part-00002...").distinct.show
> {code}
> !image-2022-03-24-22-16-44-560.png!
> This problem really manifested itself when doing something like:
> {code:java}
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
>   repartition($"collectionIndex").write.mode("overwrite").partitionBy("collectionIndex").parquet("path")
> {code}
> because you end up with incorrect partitions where the data is commingled.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
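Editor's note for context: Spark's repartition(col) uses hash partitioning (each row goes to a partition derived from a hash of the key modulo the partition count), so two distinct keys can legitimately land in the same partition when their hashes collide modulo numPartitions. The sketch below illustrates that kind of collision; it uses Python's built-in hash and a hypothetical partition count of 10 purely for illustration, not Spark's actual Murmur3 hash or its default of 200 shuffle partitions.

```python
# Illustrative only: hash-style partitioning maps each key to
# hash(key) % num_partitions, so distinct keys can share a partition.
# Python's hash() stands in for Spark's Murmur3 hash here.
def partition_for(key: int, num_partitions: int) -> int:
    """Assign a key to a partition by hashing modulo the partition count."""
    return hash(key) % num_partitions

num_partitions = 10  # hypothetical partition count
keys = [1, 11, 33]
assignments = {k: partition_for(k, num_partitions) for k in keys}
print(assignments)
# Keys 1 and 11 map to the same partition, while 33 gets its own:
# three distinct keys, only two distinct partitions (two part files).
```

Under this model, part files produced by repartition(col) are not guaranteed to hold a single key each; only partitionBy on write groups rows into one directory per key value.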