John Engelhart created SPARK-38653:
--------------------------------------

             Summary: Repartition by Int column not working properly for particular numbers (11, 33)
                 Key: SPARK-38653
                 URL: https://issues.apache.org/jira/browse/SPARK-38653
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.2
         Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an EMR 
Notebook writing to S3
            Reporter: John Engelhart


My understanding is that when you call .repartition(a column), all rows with a given key in that column go to the same partition, and two different keys should never be repartitioned into the same part file. That behavior holds with a String column. It also holds with an Int column, except for certain numbers; in my use case, the magic numbers are 11 and 33.
{code:java}
// Int-based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex")
  .repartition($"collectionIndex").write.mode("overwrite").parquet("path")
// Produces two part files

// String-based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex")
  .repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
// Produces three part files
{code}
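The key-to-partition assignment can also be inspected without writing to S3 by tagging rows with spark_partition_id() after the repartition. A minimal sketch, assuming a spark-shell / EMR notebook session where spark and spark.implicits._ are already in scope:
{code:java}
// Sketch: show which shuffle partition each key lands in after repartition.
import org.apache.spark.sql.functions.spark_partition_id

spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex")
  .repartition($"collectionIndex")
  .withColumn("partition", spark_partition_id())
  .show()
// If two keys (e.g. 11 and 33) report the same partition id, they will end up
// in the same part file on write.
{code}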
 
{code:java}
// Not working as expected (Int-keyed output)
spark.read.parquet("path/part-00000...").distinct.show
spark.read.parquet("path/part-00001...").distinct.show

// Working as expected (String-keyed output)
spark.read.parquet("path1/part-00000...").distinct.show
spark.read.parquet("path1/part-00001...").distinct.show
spark.read.parquet("path1/part-00002...").distinct.show
{code}
!image-2022-03-24-22-09-26-917.png!
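As a quick way to confirm the part-file counts reported above, the output directories can be listed with the Hadoop FileSystem API that ships with Spark. A sketch; "path" and "path1" stand in for the actual output locations used above:
{code:java}
// Sketch: count the part files produced by each write.
import org.apache.hadoop.fs.Path

val hadoopConf = spark.sparkContext.hadoopConfiguration
val intParts = new Path("path").getFileSystem(hadoopConf)
  .listStatus(new Path("path")).map(_.getPath.getName).filter(_.startsWith("part-"))
val strParts = new Path("path1").getFileSystem(hadoopConf)
  .listStatus(new Path("path1")).map(_.getPath.getName).filter(_.startsWith("part-"))
println(s"Int column: ${intParts.length} part files; String column: ${strParts.length} part files")
{code}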


