John Engelhart created SPARK-38653:
--------------------------------------

             Summary: Repartition by Int column not working properly for particular numbers (11, 33)
                 Key: SPARK-38653
                 URL: https://issues.apache.org/jira/browse/SPARK-38653
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.2
         Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an EMR 
Notebook writing to S3
            Reporter: John Engelhart


My understanding is that when you call .repartition(a column), all rows with a given key in that column go to the same partition, and two different keys should never be repartitioned into the same part file. That behavior holds with a String column. It also holds with an Int column, except for certain numbers; in my use case, the magic numbers are 11 and 33.
{code:java}
// Int-based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex")
  .repartition($"collectionIndex").write.mode("overwrite").parquet("path")
// Produces two part files

// String-based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex")
  .repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
// Produces three part files
{code}
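The key-to-partition assignment can also be inspected without writing to S3 by tagging rows with spark_partition_id() after the repartition. A minimal sketch, assuming a spark-shell / EMR notebook session where spark and spark.implicits._ are already in scope:
{code:java}
// Sketch: show which shuffle partition each key lands in after repartition.
import org.apache.spark.sql.functions.spark_partition_id

spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex")
  .repartition($"collectionIndex")
  .withColumn("partition", spark_partition_id())
  .show()
// If two keys (e.g. 11 and 33) report the same partition id, they will end up
// in the same part file on write.
{code}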
 
{code:java}
// Not working as expected (Int-keyed output)
spark.read.parquet("path/part-00000...").distinct.show
spark.read.parquet("path/part-00001...").distinct.show

// Working as expected (String-keyed output)
spark.read.parquet("path1/part-00000...").distinct.show
spark.read.parquet("path1/part-00001...").distinct.show
spark.read.parquet("path1/part-00002...").distinct.show
{code}
!image-2022-03-24-22-09-26-917.png!
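As a quick way to confirm the part-file counts reported above, the output directories can be listed with the Hadoop FileSystem API that ships with Spark. A sketch; "path" and "path1" stand in for the actual output locations used above:
{code:java}
// Sketch: count the part files produced by each write.
import org.apache.hadoop.fs.Path

val hadoopConf = spark.sparkContext.hadoopConfiguration
val intParts = new Path("path").getFileSystem(hadoopConf)
  .listStatus(new Path("path")).map(_.getPath.getName).filter(_.startsWith("part-"))
val strParts = new Path("path1").getFileSystem(hadoopConf)
  .listStatus(new Path("path1")).map(_.getPath.getName).filter(_.startsWith("part-"))
println(s"Int column: ${intParts.length} part files; String column: ${strParts.length} part files")
{code}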


