[ https://issues.apache.org/jira/browse/SPARK-46105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790742#comment-17790742 ]

Josh Rosen commented on SPARK-46105:
------------------------------------

{quote}The reason for raising this as a bug is I have a scenario where my final 
dataframe returns 0 records in EKS (local Spark) with a single node (driver and 
executor on the same node), but it returns 1 in EMR; both use the same Spark 
version 3.3.3.
{quote}
To clarify: by "returns 0 records", are you referring to the record count of 
the DataFrame (i.e. whether isEmpty returns true or false) or to the partition 
count? In other words, are you saying that EMR returns an incorrect record 
count, or that it returns an unexpected partition count?
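For reference, the partition count and the record count can legitimately 
diverge for an empty DataFrame on 3.3+. A minimal sketch of the distinction 
(assuming a plain spark-shell session with the usual {{spark}} session, as in 
the transcripts below):

{code:scala}
val df = spark.emptyDataFrame.repartition(1)

// Partition count: 1 on Spark 3.3.x / 3.5.x, 0 on 3.2.x (per the transcripts below).
df.rdd.getNumPartitions

// Record count: the DataFrame is empty in every version, regardless of partitioning.
df.isEmpty   // true
df.count()   // 0
{code}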

> df.emptyDataFrame shows 1 if we repartition(1) in Spark 3.3.x and above
> -----------------------------------------------------------------------
>
>                 Key: SPARK-46105
>                 URL: https://issues.apache.org/jira/browse/SPARK-46105
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.3
>         Environment: EKS
> EMR
>            Reporter: dharani_sugumar
>            Priority: Major
>         Attachments: Screenshot 2023-11-26 at 11.54.58 AM.png
>
>
> Version: 3.3.3
>  
> scala> val df = spark.emptyDataFrame
> df: org.apache.spark.sql.DataFrame = []
> scala> df.rdd.getNumPartitions
> res0: Int = 0
> scala> df.repartition(1).rdd.getNumPartitions
> res1: Int = 1
> scala> df.repartition(1).rdd.isEmpty()
> res2: Boolean = true
> Version: 3.2.4
> scala> val df = spark.emptyDataFrame
> df: org.apache.spark.sql.DataFrame = []
> scala> df.rdd.getNumPartitions
> res0: Int = 0
> scala> df.repartition(1).rdd.getNumPartitions
> res1: Int = 0
> scala> df.repartition(1).rdd.isEmpty()
> res2: Boolean = true
>  
> Version: 3.5.0
> scala> val df = spark.emptyDataFrame
> df: org.apache.spark.sql.DataFrame = []
> scala> df.rdd.getNumPartitions
> res0: Int = 0
> scala> df.repartition(1).rdd.getNumPartitions
> res1: Int = 1
> scala> df.repartition(1).rdd.isEmpty()
> res2: Boolean = true
>  
> When we do a repartition(1) on an empty dataframe, the resultant partition 
> count is 1 in versions 3.3.x and 3.5.x, whereas when I do the same in version 
> 3.2.x, the resultant partition count is 0. May I know why this behaviour 
> changed from 3.2.x to the higher versions?
>  
> The reason for raising this as a bug is I have a scenario where my final 
> dataframe returns 0 records in EKS (local Spark) with a single node (driver 
> and executor on the same node), but it returns 1 in EMR; both use the same 
> Spark version 3.3.3. I'm not sure why this behaves differently in the two 
> environments. As an interim solution, I had to repartition an empty dataframe 
> if my final dataframe is empty, which returns 1 for 3.3.3. I would like to 
> know whether this is really a bug, or whether this behaviour will persist in 
> future versions and cannot be changed.
>  
> Because, if we go for a Spark upgrade and this behaviour changes, we will 
> face the issue again. 
> Please confirm on this.
>  
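Regarding the interim solution and the upgrade concern described above: a 
rough, illustrative sketch of an emptiness check that does not depend on the 
partition count (standard Dataset API only; not the reporter's actual code):

{code:scala}
// The partition count of an empty DataFrame after repartition(1) differs between
// 3.2.x (0) and 3.3.x+ (1), so branching on getNumPartitions is version-dependent.
// Checking the data itself behaves the same way across these versions.
val finalDf = spark.emptyDataFrame   // stand-in for the real final dataframe

if (finalDf.isEmpty) {
  // handle the empty case explicitly instead of inferring it from the partition count
}
{code}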


