[ 
https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Engelhart updated SPARK-38653:
-----------------------------------
    Description: 
My understanding is that when you call .repartition(someColumn), every row with the 
same key in that column goes to the same partition, and no two distinct keys should 
ever be repartitioned into the same part file. That behavior holds with a String 
column. It also holds with an Int column, except for certain numbers; in my use 
case, the magic numbers are 11 and 33.
{code:java}
//Int based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path")
//Produces two part files
//String based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
//Produces three part files {code}
 
{code:java}
//Not working as expected
spark.read.parquet("path/part-00000...").distinct.show
spark.read.parquet("path/part-00001...").distinct.show

//Working as expected
spark.read.parquet("path1/part-00000...").distinct.show
spark.read.parquet("path1/part-00001...").distinct.show
spark.read.parquet("path1/part-00002...").distinct.show {code}
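For context on how keys map to part files: repartition by an expression uses hash partitioning, i.e. each row is assigned to pmod(hash(key), numPartitions), where hash is Spark's Murmur3-based hash (seed 42) and numPartitions here falls back to spark.sql.shuffle.partitions, default 200. The sketch below is a hedged, pure-Python reimplementation written from the Murmur3 x86_32 algorithm, not taken from Spark source, and it assumes the default 200 partitions:

```python
# Hypothetical reimplementation (assumption, not Spark source code): Spark's
# hash() of an Int is Murmur3 x86_32 over the 4-byte value with seed 42, and
# repartition($"col") assigns rows to pmod(hash, spark.sql.shuffle.partitions).

def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    x &= 0xFFFFFFFF
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def murmur3_hash_int(value: int, seed: int = 42) -> int:
    """Murmur3 x86_32 of one little-endian 32-bit int; returns a signed int32."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    k1 = (value * c1) & 0xFFFFFFFF
    k1 = _rotl32(k1, 15)
    k1 = (k1 * c2) & 0xFFFFFFFF
    h1 = seed ^ k1
    h1 = _rotl32(h1, 13)
    h1 = (h1 * 5 + 0xE6546B64) & 0xFFFFFFFF
    h1 ^= 4  # total input length in bytes
    # final avalanche mix
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & 0xFFFFFFFF
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & 0xFFFFFFFF
    h1 ^= h1 >> 16
    return h1 - 0x100000000 if h1 & 0x80000000 else h1  # as signed int32

def partition_for(value: int, num_partitions: int = 200) -> int:
    """Partition id as pmod(hash, numPartitions); Python's % already behaves as pmod."""
    return murmur3_hash_int(value) % num_partitions

print({v: partition_for(v) for v in (1, 11, 33)})  # -> {1: 43, 11: 163, 33: 163}
```

Under these assumptions, 1 maps to partition 43 while 11 and 33 both map to partition 163, which would account for the two part files in the Int case: the keys 11 and 33 collide modulo 200 rather than each getting their own partition.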
!image-2022-03-24-22-09-26-917.png!

  was:
My understanding is when you call .repartition(a column). For each unique key 
in that field. The data will go to that partition. There should never be two 
keys repartitioned the same part. That behavior is true with a String column. 
That behavior is also true with an Int column except on certain numbers. In my 
use case. The magic numbers 11 and 33.
{code:java}
//Int based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path")
//Produces two part files
//String based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
//Produces three part files {code}
 
{code:java}
//Not working as expected
spark.read.parquet("path/part-00000...").distinct.show
spark.read.parquet("path/part-00001...").distinct.show

//Working as expected
spark.read.parquet("path1/part-00000...").distinct.show
spark.read.parquet("path1/part-00001...").distinct.show
spark.read.parquet("path1/part-00002...").distinct.show {code}
!image-2022-03-24-22-09-26-917.png!


> Repartition by Int column not working properly, but only on particular 
> numbers (11, 33)
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38653
>                 URL: https://issues.apache.org/jira/browse/SPARK-38653
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.2
>         Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an 
> EMR Notebook writing to S3
>            Reporter: John Engelhart
>            Priority: Major
>
> My understanding is that when you call .repartition(someColumn), every row with the 
> same key in that column goes to the same partition, and no two distinct keys should 
> ever be repartitioned into the same part file. That behavior holds with a String 
> column. It also holds with an Int column, except for certain numbers; in my use 
> case, the magic numbers are 11 and 33.
> {code:java}
> //Int based column repartition
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path")
> //Produces two part files
> //String based column repartition
> spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
> //Produces three part files {code}
>  
> {code:java}
> //Not working as expected
> spark.read.parquet("path/part-00000...").distinct.show
> spark.read.parquet("path/part-00001...").distinct.show
> //Working as expected
> spark.read.parquet("path1/part-00000...").distinct.show
> spark.read.parquet("path1/part-00001...").distinct.show
> spark.read.parquet("path1/part-00002...").distinct.show {code}
> !image-2022-03-24-22-09-26-917.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
