[ https://issues.apache.org/jira/browse/SPARK-49834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vladimir Golubev updated SPARK-49834:
-------------------------------------
    Description:
This is a SPIP for a new single-pass Analyzer framework for Catalyst.

*Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.*

We propose to introduce a new Analyzer framework for Catalyst. This framework will use a single post-order traversal to resolve the parsed logical plan. We think this is necessary because the current algorithm frequently encounters high latency or errors from excessive looping of the fixed-point rule batch, and is generally hard to reason about.

*Q2. What problem is this proposal NOT designed to solve?*

This is a proposal to replace the current fixed-point framework. The following are NON-goals:
 * Improving the average latency of average queries. In general, that latency is dominated by RPC calls.
 * A complete clean-slate rewrite. Instead, the new approach will reuse helper logic from the existing rules as much as possible.

*Q3. How is it done today, and what are the limits of current practice?*

The current Analyzer is based on rules and a fixed-point model: rules run in batches over the plan tree until it is fully resolved. There are almost no invariants and no notion of name scope. The rules traverse the plan many times without changing it, and sometimes the analysis does not reach a fixed point at all. There are non-obvious dependencies between the rules.

*Q4. What is new in your approach and why do you think it will be successful?*

The analysis approach used in other known systems implements a different, bottom-up single-pass model:
 * Go through the parsed tree in a single post-order traversal
 * Keep the resolved references/relations in a state
 * The code resolving the current node knows that all the child nodes have been resolved

*Q5. Who cares? If you are successful, what difference will it make?*

All Spark developers will benefit from a framework with more obvious invariants. Developers who are already familiar with other systems will also find it easier to apply their knowledge. Fewer regressions should occur as a result of ongoing development in this area, making Spark deployments more reliable. Queries with large logical plans or wide tables will see a major compilation speedup. We will also be able to resolve the cases where the current framework cannot resolve the plan tree in a fixed number of iterations.

*Q6. What are the risks?*

The exact path to a 100% rewrite is not clear, because the Analyzer consists of many rules with non-obvious dependencies between them. We think that this issue does not block the development and can be resolved as we progress on the implementation.

*Q7. How long will it take?*

A reasonable estimate is 18 months. We should enable the new Analyzer only once we are 100% sure it supports all of the old Analyzer's functionality. We will most likely have a long tail of small, tricky issues at the end of this effort.

*Q8. What are the mid-term and final "exams" to check for success?*

We propose to rewrite the Analyzer by incrementally implementing subsets of SQL and DataFrame functionality. We will start with a subset of SQL operators and expressions and then expand the surface area, eventually supporting DataFrames too. As development progresses, we will enable more and more unit tests to run against both implementations.

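The bottom-up model from Q4 can be illustrated with a minimal sketch. It is a toy in Python, not Catalyst code: the node classes (Relation, Filter, Project), the Resolver class, and the dictionary catalog are all hypothetical names chosen for illustration. It only shows the three properties listed under Q4: one post-order traversal, resolved names carried as state (here, the child's output columns act as the name scope), and each node being resolved only after its child.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


# Hypothetical toy plan nodes; these are NOT Catalyst classes.
@dataclass
class Relation:
    table: str                 # unresolved table name (leaf node)


@dataclass
class Filter:
    column: str                # column referenced by the predicate
    child: "Plan"


@dataclass
class Project:
    columns: List[str]         # columns selected from the child
    child: "Plan"


Plan = Union[Relation, Filter, Project]


class Resolver:
    """Resolves a plan in one post-order traversal (assumed toy model)."""

    def __init__(self, catalog: Dict[str, List[str]]):
        self.catalog = catalog  # table name -> column names
        self.visits = 0         # number of nodes resolved so far

    def resolve(self, node: Plan) -> List[str]:
        """Return the node's output columns, i.e. its name scope."""
        if isinstance(node, Relation):
            self.visits += 1
            if node.table not in self.catalog:
                raise LookupError(f"unknown table: {node.table}")
            return list(self.catalog[node.table])

        if not isinstance(node, (Filter, Project)):
            raise TypeError(f"unsupported node: {node!r}")

        # Post-order: the child is resolved before the current node,
        # so its scope is guaranteed to be available here.
        scope = self.resolve(node.child)
        self.visits += 1

        if isinstance(node, Filter):
            if node.column not in scope:
                raise LookupError(f"unknown column: {node.column}")
            return scope        # a filter does not change the columns

        missing = [c for c in node.columns if c not in scope]
        if missing:
            raise LookupError(f"unknown columns: {missing}")
        return list(node.columns)


# Roughly: SELECT name FROM people WHERE age ...
plan = Project(["name"], Filter("age", Relation("people")))
resolver = Resolver({"people": ["id", "name", "age"]})
output = resolver.resolve(plan)
```

In this sketch every node is visited exactly once, and an unresolvable name fails immediately at the node that references it instead of after a bounded number of rule-batch iterations. A real implementation would have to handle far more than this toy does (multiple children, aliases, subqueries, and so on).
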
*Also refer to:*
 - [Discussion 1|https://lists.apache.org/thread/qqggswc7zl34zh2pdtn99rzp4o64yykf]
 - [Discussion 2|https://lists.apache.org/thread/40l9zsqb8dvxfk46d6dxb106ly98tmcl]
 - [Discussion 3|https://lists.apache.org/thread/dsd71423glzdbxkzl9ktb1tjl32nv1ro]
 - [SPIP document|https://docs.google.com/document/d/1dWxvrJV-0joGdLtWbvJ0uNyTocDMJ90rPRNWa4T56Og]
 - [Vote|https://lists.apache.org/thread/tdlkpp7h2rs5k89pnmrvs64hd1t8f76b]
 - [Vote result|https://lists.apache.org/thread/9pc9p36mzvb78md9zzs4j5hz63nqgc49]

> SPIP: Single-pass Analyzer for Catalyst
> ---------------------------------------
>
>                 Key: SPARK-49834
>                 URL: https://issues.apache.org/jira/browse/SPARK-49834
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Vladimir Golubev
>            Priority: Major
>              Labels: SPIP

--
This message was sent by Atlassian Jira
(v8.20.10#820010)