[ https://issues.apache.org/jira/browse/SPARK-39834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jungtaek Lim reassigned SPARK-39834:
------------------------------------

    Assignee: Jungtaek Lim

> Include the origin stats and constraints for LogicalRDD if it comes from DataFrame
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-39834
>                 URL: https://issues.apache.org/jira/browse/SPARK-39834
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>
> With SPARK-39748, Spark includes the origin logical plan in LogicalRDD if it
> comes from a DataFrame, in order to carry over stats and to provide
> information that could connect two otherwise disconnected logical plans into one.
> After introducing the change, we found several issues:
> 1. One of the major use cases for DataFrame.checkpoint is ML, especially
> iterative algorithms, where the purpose is to "prune" the logical plan. That
> runs against the purpose of including the origin logical plan, and we risk
> ending up with nested LogicalRDDs that grow the logical plan without bound.
> 2. We rely on the logical plan to carry over stats, but the correct stats
> information is in the optimized plan.
> 3. (Not an issue, but a missed spot) Constraints are also something we can
> carry over.
> To address the above issues, it would be better to include stats and
> constraints in LogicalRDD rather than the logical plan.
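
Issue 1 and the proposed fix can be illustrated with a small toy model. This is a sketch in plain Python, not Spark code: the `Plan` class, both `checkpoint_*` functions, and `iterate` are hypothetical names made up for illustration, and the stats and constraints are placeholder values. It only shows why keeping the origin plan makes the plan grow on every iteration, while copying stats and constraints into a leaf node keeps it pruned. It does not model issue 2 (taking stats from the optimized plan).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node (hypothetical; not Spark's LogicalPlan)."""
    name: str
    child: Optional["Plan"] = None
    size_in_bytes: int = 0                 # stand-in for plan statistics
    constraints: Tuple[str, ...] = ()      # stand-in for plan constraints

    def node_count(self) -> int:
        # Number of nodes reachable from this node, including itself.
        return 1 + (self.child.node_count() if self.child else 0)


def checkpoint_keep_origin(plan: Plan) -> Plan:
    # Models the approach described for SPARK-39748: the new LogicalRDD
    # keeps a reference to the plan it was created from.
    return Plan("LogicalRDD", child=plan,
                size_in_bytes=plan.size_in_bytes,
                constraints=plan.constraints)


def checkpoint_keep_stats(plan: Plan) -> Plan:
    # Models the proposal in this ticket: copy stats and constraints into
    # the new leaf and drop the reference to the origin plan.
    return Plan("LogicalRDD", child=None,
                size_in_bytes=plan.size_in_bytes,
                constraints=plan.constraints)


def iterate(checkpoint: Callable[[Plan], Plan], rounds: int) -> Plan:
    # Mimics an iterative algorithm: transform, then checkpoint, repeatedly.
    plan = Plan("Scan", size_in_bytes=1024, constraints=("x IS NOT NULL",))
    for _ in range(rounds):
        plan = Plan("Project", child=plan,
                    size_in_bytes=plan.size_in_bytes,
                    constraints=plan.constraints)
        plan = checkpoint(plan)
    return plan


if __name__ == "__main__":
    nested = iterate(checkpoint_keep_origin, 5)
    pruned = iterate(checkpoint_keep_stats, 5)
    print(nested.node_count(), pruned.node_count())
    print(pruned.size_in_bytes, pruned.constraints)
```

In this model the plan that keeps its origin gains two nodes per iteration, while the stats-only variant stays a single leaf that still carries the same size estimate and constraints.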