[ https://issues.apache.org/jira/browse/SPARK-49834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887081#comment-17887081 ]
Vladimir Golubev commented on SPARK-49834:
------------------------------------------

Vote passed with 15 +1s (8 binding), 0 +0s, 0 -1s.

> SPIP: Single-pass Analyzer for Catalyst
> ---------------------------------------
>
>                 Key: SPARK-49834
>                 URL: https://issues.apache.org/jira/browse/SPARK-49834
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Vladimir Golubev
>            Priority: Major
>              Labels: SPIP
>
> This is a SPIP for a new single-pass Analyzer framework for Catalyst.
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely
> no jargon.*
> We propose to introduce a new Analyzer framework for Catalyst. This framework
> will use a single post-order traversal to resolve the parsed logical plan. We
> think this is necessary because the current algorithm frequently runs into
> high latency or errors caused by excessive looping of the fixed-point rule
> batch, and is generally hard to reason about.
>
> *Q2. What problem is this proposal NOT designed to solve?*
> This is a proposal to replace the current fixed-point framework. The
> following are NON-goals:
> * Improving the average latency of typical queries. In general, that latency
> is dominated by RPC calls.
> * A complete clean-slate rewrite. Instead, the new approach will reuse helper
> logic from the existing rules as much as possible.
>
> *Q3. How is it done today, and what are the limits of current practice?*
> The current Analyzer is based on rules and a fixed-point model: rules run in
> batches over the plan tree until it is fully resolved. There are almost no
> invariants and no notion of name scope. The rules traverse the plan many
> times without changing it, and sometimes the analysis does not reach a fixed
> point at all. There are non-obvious dependencies between the rules.
>
> *Q4. What is new in your approach and why do you think it will be successful?*
> Other known systems analyze queries with a different, bottom-up single-pass
> model:
> * Go through the parsed tree in a single post-order traversal
> * Keep the resolved references/relations in a state
> * The code resolving the current node knows that all the child nodes have
> been resolved
>
> *Q5. Who cares? If you are successful, what difference will it make?*
> All Spark developers will benefit from a framework with more obvious
> invariants. Developers who are already familiar with other systems will also
> find it easier to apply their knowledge. Fewer regressions should occur as a
> result of ongoing development in this area, making Spark deployments more
> reliable. Queries with large logical plans or wide tables will experience a
> major compilation speedup. And we will be able to resolve the cases where the
> current framework cannot resolve the plan tree in a fixed number of
> iterations.
>
> *Q6. What are the risks?*
> The exact path to a 100% rewrite is not clear, because the Analyzer consists
> of many rules with non-obvious dependencies between them. We think that this
> issue does not block the development and can be resolved as we progress on
> the implementation.
>
> *Q7. How long will it take?*
> A reasonable estimate is 18 months. We will enable the new Analyzer only once
> we are 100% sure it supports all of the old Analyzer's functionality. We will
> most likely have a long tail of small, tricky issues at the end of this
> effort.
>
> *Q8. What are the mid-term and final "exams" to check for success?*
> We propose to rewrite the Analyzer by incrementally implementing subsets of
> SQL and DataFrame functionality. We will start with a subset of SQL operators
> and expressions and then expand the surface area, eventually supporting
> DataFrames too. As development progresses, we will enable more and more unit
> tests to run against both implementations.
>
> *Also refer to:*
> - [Discussion 1|https://lists.apache.org/thread/qqggswc7zl34zh2pdtn99rzp4o64yykf]
> - [Discussion 2|https://lists.apache.org/thread/40l9zsqb8dvxfk46d6dxb106ly98tmcl]
> - [Discussion 3|https://lists.apache.org/thread/dsd71423glzdbxkzl9ktb1tjl32nv1ro]
> - [SPIP document|https://docs.google.com/document/d/1dWxvrJV-0joGdLtWbvJ0uNyTocDMJ90rPRNWa4T56Og]
> - [Vote|https://lists.apache.org/thread/tdlkpp7h2rs5k89pnmrvs64hd1t8f76b]
> - [Vote result|https://lists.apache.org/thread/9pc9p36mzvb78md9zzs4j5hz63nqgc49]
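
For readers who want a concrete picture of the bottom-up single-pass model described under Q4, here is a toy sketch in Python. It is not Catalyst code and does not use any Spark API: `Node`, `Resolved`, `SinglePassResolver`, and the dictionary-based catalog are hypothetical names made up for illustration. It only shows the three properties listed in the proposal: one post-order traversal, resolved information carried as state (here, the output columns each resolved child hands to its parent), and a guarantee that children are already resolved when the parent is handled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Unresolved plan node: an operator name, its arguments, and its children."""
    op: str
    args: Tuple[str, ...] = ()
    children: List["Node"] = field(default_factory=list)


@dataclass
class Resolved:
    """Resolved plan node together with the columns it exposes to its parent."""
    op: str
    output: Tuple[str, ...]
    children: List["Resolved"]


class SinglePassResolver:
    def __init__(self, catalog: Dict[str, Tuple[str, ...]]):
        self.catalog = catalog  # hypothetical catalog: table name -> column names
        self.visits = 0         # how many times a node has been resolved

    def resolve(self, node: Node) -> Resolved:
        # Post-order: resolve the children first, so that they are already
        # resolved by the time this node is handled.
        children = [self.resolve(child) for child in node.children]
        self.visits += 1

        if node.op == "relation":
            (table,) = node.args
            if table not in self.catalog:
                raise LookupError(f"table not found: {table}")
            return Resolved("relation", self.catalog[table], children)

        # Name scope of this node: the columns produced by its resolved children.
        scope = tuple(col for child in children for col in child.output)
        missing = [name for name in node.args if name not in scope]
        if missing:
            raise LookupError(f"cannot resolve {missing} in scope {list(scope)}")

        if node.op == "project":
            return Resolved("project", node.args, children)
        if node.op == "filter":
            return Resolved("filter", scope, children)
        raise ValueError(f"unsupported operator: {node.op}")


# Roughly "SELECT a FROM t WHERE b", written as an unresolved tree.
plan = Node("project", ("a",), [Node("filter", ("b",), [Node("relation", ("t",))])])
resolver = SinglePassResolver({"t": ("a", "b", "c")})
resolved = resolver.resolve(plan)
print(resolved.output, resolver.visits)  # ('a',) 3
```

Each of the three nodes is visited exactly once, and an unknown column fails immediately at the node that references it instead of surviving repeated rule iterations. A real analyzer has to handle far more (aliases, subqueries, type coercion, and so on), so this is only meant to convey the shape of the traversal.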