[jira] [Commented] (SPARK-49834) SPIP: Single-pass Analyzer for Catalyst

Vladimir Golubev (Jira) Sat, 05 Oct 2024 02:53:04 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-49834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887081#comment-17887081
 ]


Vladimir Golubev commented on SPARK-49834:
------------------------------------------

Vote passed with 15 +1s (8 bindings), 0 +0s, 0 -1s.

> SPIP: Single-pass Analyzer for Catalyst
> ---------------------------------------
>
>                 Key: SPARK-49834
>                 URL: https://issues.apache.org/jira/browse/SPARK-49834
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Vladimir Golubev
>            Priority: Major
>              Labels: SPIP
>
> This is a SPIP for a new single-pass Analyzer framework for Catalyst.
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
> We propose to introduce a new Analyzer framework for Catalyst. This framework 
> will use a single post-order traversal to resolve the parsed logical plan. We 
> think this is necessary because the current algorithm frequently encounters 
> high latency or errors from excessive looping of the fixed-point rule batch, 
> and is generally hard to reason about.
> *Q2. What problem is this proposal NOT designed to solve?*
> This is a proposal to replace the current fixed-point framework. The 
> following are NON-goals:
>  * Improve average latency of average queries. In general the latency is 
> dominated by the RPC calls.
>  * Complete clean-slate rewrite. Instead, the new approach will reuse helper 
> logic from the existing rules as much as possible.
> *Q3. How is it done today, and what are the limits of current practice?*
> The current Analyzer is based on rules and a fixed-point model - rules run in 
> batches over the plan tree until it's fully resolved. There are almost no 
> invariants or notion if name scope. The rules traverse the plan many times 
> without changing it and sometimes the analysis does not reach a fixed point 
> at all. There are unobvious dependencies between the rules. 
> *Q4. What is new in your approach and why do you think it will be successful?*
> The analysis approach used in other known systems implements a different 
> bottom-up single-pass model:
> * Go through the parsed tree in a single post-order traversal
> * Keep the resolved references/relations in a state
> * The code resolving the current node knows that all the child nodes have 
> been resolved
> *Q5. Who cares? If you are successful, what difference will it make?*
> All the Spark developers will benefit from a framework with more obvious 
> invariants. Also, developers who are already familiar with other systems 
> would find it easier to apply their knowledge. Fewer regressions should occur 
> as a result of ongoing development in this area, making Spark deployments 
> more reliable. Large logical plans or work with wide tables will experience a 
> major compilation speedup. And we will be able to resolve the cases where the 
> current framework cannot resolve the plan tree in a fixed number of 
> iterations.
> *Q6. What are the risks?*
> The exact path to 100% rewrite is not clear, because the Analyzer consists of 
> many rules with unobvious dependencies between them. We think that this issue 
> does not block the development and can be resolved as we progress on the 
> implementation.
> *Q7. How long will it take?*
> The reasonable amount of time might be 18 months. We will need to enable the 
> new Analyzer once we are 100% sure it supports all the old Analyzer 
> functionality. We will most likely have a long tail of small tricky issues at 
> the end of this effort.
> *Q8. What are the mid-term and final "exams" to check for success?*
> We propose to rewrite the Analyzer by incrementally implementing subsets of 
> SQL and DataFrame functionality. We first start with a subset of SQL 
> operators and expressions and further progress by expanding the surface area, 
> eventually supporting DataFrames too. As we progress with the development, we 
> will enable more and more unit tests to work with both implementations.
> *Also refer to:*
>  - [Discussion 
> 1|https://lists.apache.org/thread/qqggswc7zl34zh2pdtn99rzp4o64yykf]
>  - [Discussion 
> 2|https://lists.apache.org/thread/40l9zsqb8dvxfk46d6dxb106ly98tmcl]
>  - [Discussion 
> 3|https://lists.apache.org/thread/dsd71423glzdbxkzl9ktb1tjl32nv1ro]
>  - [SPIP 
> document|https://docs.google.com/document/d/1dWxvrJV-0joGdLtWbvJ0uNyTocDMJ90rPRNWa4T56Og]
>  - [Vote|https://lists.apache.org/thread/tdlkpp7h2rs5k89pnmrvs64hd1t8f76b]
>  - [Vote 
> result|https://lists.apache.org/thread/9pc9p36mzvb78md9zzs4j5hz63nqgc49]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-49834) SPIP: Single-pass Analyzer for Catalyst

Reply via email to