[ https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-38653:
---------------------------------
    Component/s: SQL
                     (was: Spark Core)

> Repartition by an Int column not working properly on particular numbers (11, 33)
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-38653
>                 URL: https://issues.apache.org/jira/browse/SPARK-38653
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>         Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an 
> EMR Notebook writing to S3
>            Reporter: John Engelhart
>            Priority: Major
>
> My understanding is that when you call .repartition(someColumn), each unique key 
> in that column goes to its own partition; two different keys should never be 
> repartitioned into the same part file. That behavior holds with a String column. 
> It also holds with an Int column, except for certain values; in my use case the 
> magic numbers are 11 and 33.
> {code:java}
> // Int-based column repartition
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
>   repartition($"collectionIndex").write.mode("overwrite").parquet("path")
> // Produces two part files
>
> // String-based column repartition
> spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
>   repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
> // Produces three part files
> {code}
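>
> A quicker way to see the key-to-partition assignment, without writing part files, 
> is a rough sketch like this (assuming the same session and implicits as above; 
> spark_partition_id() is the built-in that reports each row's physical partition):
> {code:java}
> // Rough sketch: show which shuffle partition each key lands in after the
> // repartition. Two keys reporting the same pid will share a part file.
> import org.apache.spark.sql.functions.spark_partition_id
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
>   repartition($"collectionIndex").
>   withColumn("pid", spark_partition_id()).
>   show()
> {code}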
>  
> {code:java}
> // Not working as expected
> spark.read.parquet("path/part-00000...").distinct.show
> spark.read.parquet("path/part-00001...").distinct.show
>
> // Working as expected
> spark.read.parquet("path1/part-00000...").distinct.show
> spark.read.parquet("path1/part-00001...").distinct.show
> spark.read.parquet("path1/part-00002...").distinct.show
> {code}
> !image-2022-03-24-22-16-44-560.png!
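> The per-file checks above can also be done in one pass; a rough sketch, assuming 
> the same output path and using the built-ins input_file_name() and collect_set():
> {code:java}
> // Group the written rows by the part file they came from; any file whose
> // set holds more than one distinct key shows two keys sharing a partition.
> import org.apache.spark.sql.functions.{collect_set, input_file_name}
> spark.read.parquet("path").
>   withColumn("file", input_file_name()).
>   groupBy("file").
>   agg(collect_set($"collectionIndex")).
>   show(false)
> {code}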
> This problem really manifested itself when doing something like
> {code:java}
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
>   repartition($"collectionIndex").write.mode("overwrite").
>   partitionBy("collectionIndex").parquet("path")
> {code}
> because you end up with incorrect partitions in which data from different keys 
> is commingled.
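>
> If repartition(column) assigns rows by hashing the key rather than giving each 
> distinct key its own partition (my reading of the observed behavior, not a 
> confirmed explanation), the grouping can be reproduced with a rough sketch like 
> this, assuming the default spark.sql.shuffle.partitions of 200 and the built-in 
> hash() (Murmur3):
> {code:java}
> // Rough sketch: compute the shuffle partition id each key would get under
> // hash partitioning with 200 partitions; matching pid values would explain
> // two keys ending up in the same part file.
> import org.apache.spark.sql.functions.{hash, lit, pmod}
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
>   withColumn("pid", pmod(hash($"collectionIndex"), lit(200))).
>   show()
> {code}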


