[jira] [Comment Edited] (SPARK-49515) numInputRows is 0 when one of input relation is empty

Jungtaek Lim (Jira) Sun, 12 Jan 2025 18:47:04 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-49515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912324#comment-17912324
 ]


Jungtaek Lim edited comment on SPARK-49515 at 1/13/25 2:46 AM:
---------------------------------------------------------------

> I guess problem still exists when falling back to "old way" 

My fix is actually performing "double" fallback. If the old way does not work 
after falling back, it just picks up the result of new way, because the reason 
fallback is triggered might be due to valid pruning, like the example in this 
Jira ticket.

Unfortunately, this requires changes to the constructor of catalyst classes, 
which is a sort of breaking change (catalyst is private API but people has been 
using it), hence we can't port the fix back.


was (Author: kabhwan):
> I guess problem still exists when falling back to "old way" 

My fix is actually performing "double" fallback. If the old way does not work 
after falling back, it just picks up the result of new way, because the reason 
fallback is triggered is due to valid pruning, like the example in this Jira 
ticket.

Unfortunately, this requires changes to the constructor of catalyst classes, 
which is a sort of breaking change (catalyst is private API but people has been 
using it), hence we can't port the fix back.

> numInputRows is 0 when one of input relation is empty
> -----------------------------------------------------
>
>                 Key: SPARK-49515
>                 URL: https://issues.apache.org/jira/browse/SPARK-49515
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>         Environment: Reproducible in Spark 3.4 , probably all versions after 
> introducing org.apache.spark.sql.catalyst.optimizer.PropagateEmptyRelation
>  
>            Reporter: Eunjin Song
>            Priority: Trivial
>         Attachments: Query 0.png, Query 1.png, Query 2.png
>
>
> [https://github.com/apache/spark/blob/2ed6c3e511f322c5fd01953736c376a85ff2c687/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L481]
>  
> In case  # relation of logical plan != # relation of physical plan, 
> numInputRows captured as 0 for all sources. However it includes the case that 
> empty relations in union is removed from physical plan by 
> "org.apache.spark.sql.catalyst.optimizer.PropagateEmptyRelation".
>  
> Repro code:
> import org.apache.spark.sql.streaming.Trigger
> val df1 = spark.readStream.schema(sch).parquet("/streaming/parquet1")
> val df2 = spark.readStream.schema(sch).parquet("/streaming/emptyparquet")
> df1.union(df2).writeStream.option("checkpointLocation", 
> "/streaming/checkpoint8").format("parquet").trigger(Trigger.Once()).option("path",
>  "/streaming/result9").start
>  
> workaround:
> spark.conf.set("spark.sql.optimizer.excludedRules","org.apache.spark.sql.catalyst.optimizer.PropagateEmptyRelation")



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (SPARK-49515) numInputRows is 0 when one of input relation is empty

Reply via email to