[ 
https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-39834:
------------------------------------

    Assignee: Jungtaek Lim

> Include the origin stats and constraints for LogicalRDD if it comes from 
> DataFrame
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-39834
>                 URL: https://issues.apache.org/jira/browse/SPARK-39834
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>
> With SPARK-39748, Spark includes the origin logical plan for LogicalRDD if it 
> comes from a DataFrame, in order to carry over stats and to provide the 
> information needed to possibly connect two disconnected logical plans into one.
> After we introduced the change, we found several issues:
> 1. One of the major use cases for DataFrame.checkpoint is ML, especially 
> iterative algorithms, where its purpose is to "prune" the logical plan. 
> Including the origin logical plan works against that purpose, and we risk 
> ending up with nested LogicalRDDs that grow the logical plan without bound.
> 2. We use the logical plan to carry over stats, but the correct stats 
> are in the optimized plan.
> 3. (Not an issue, but something we missed) constraints are also something we 
> can carry over.
> To address the above issues, it would be better to include stats and 
> constraints in LogicalRDD rather than the logical plan.
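
The nesting concern in item 1 can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: `Plan`, `Stats`, `plan_size`, `optimized_stats`, and both `checkpoint_*` helpers are hypothetical names made up for illustration, and the stats values are placeholders. It assumes a checkpoint is modeled as producing a `LogicalRDD` leaf that either keeps a reference to the whole origin plan or keeps only stats and constraints.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Stats:
    # Placeholder for plan statistics (hypothetical fields).
    size_in_bytes: int
    row_count: Optional[int] = None


@dataclass(frozen=True)
class Plan:
    # Toy single-child plan node; not Spark's LogicalPlan.
    name: str
    child: Optional["Plan"] = None
    stats: Optional[Stats] = None
    constraints: FrozenSet[str] = frozenset()


def plan_size(plan: Optional[Plan]) -> int:
    """Count the nodes reachable from this plan."""
    count = 0
    while plan is not None:
        count += 1
        plan = plan.child
    return count


def optimized_stats(plan: Plan) -> Stats:
    # Assumption: stands in for stats taken from the optimized plan.
    return Stats(size_in_bytes=1024, row_count=10)


def checkpoint_with_origin_plan(plan: Plan) -> Plan:
    # Toy version of keeping the origin plan: the leaf references everything.
    return Plan("LogicalRDD", child=plan)


def checkpoint_with_stats(plan: Plan) -> Plan:
    # Toy version of the proposal: keep only stats and constraints.
    return Plan(
        "LogicalRDD",
        stats=optimized_stats(plan),
        constraints=frozenset({"id IS NOT NULL"}),
    )


def iterate(checkpoint: Callable[[Plan], Plan], steps: int) -> Plan:
    """Simulate an iterative algorithm that checkpoints every step."""
    plan = Plan("Scan")
    for _ in range(steps):
        plan = Plan("Project", child=plan)  # one step of work
        plan = checkpoint(plan)             # intended to prune the plan
    return plan


nested = iterate(checkpoint_with_origin_plan, 5)
pruned = iterate(checkpoint_with_stats, 5)
print(plan_size(nested), plan_size(pruned))
```

In this model the plan that keeps its origin grows by two nodes per iteration, while the plan that keeps only stats and constraints stays a single leaf and still exposes the carried-over information.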



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
