[ https://issues.apache.org/jira/browse/SPARK-49834 ]
Vladimir Golubev updated SPARK-49834:
-------------------------------------

Description:

This is a SPIP for a new single-pass Analyzer framework for Catalyst.

*Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.*

We propose to introduce a new Analyzer framework for Catalyst. This framework will resolve the parsed logical plan in a single post-order traversal. We think this is necessary because the current algorithm frequently suffers high latency, or fails outright, when the fixed-point rule batch loops excessively; it is also hard to reason about in general and does not scale to large logical plans and other edge-case scenarios.

*Q2. What problem is this proposal NOT designed to solve?*

This is a proposal to replace the current fixed-point framework. The following are NON-goals:
* Improving the average latency of typical queries. In general, that latency is dominated by RPC calls.
* A complete clean-slate rewrite. Instead, the new approach will reuse helper logic from the existing rules as much as possible.

*Q3. How is it done today, and what are the limits of current practice?*

The current Analyzer is rule-based and runs to a fixed point: rules run in batches over the plan tree until it is fully resolved (see the fixed-point sketch below). There are almost no invariants and no notion of name scope. The rules traverse the plan many times without changing it, and sometimes the analysis does not reach a fixed point at all. There are non-obvious dependencies between the rules.

*Q4. What is new in your approach and why do you think it will be successful?*

The new framework resolves the plan in a single post-order traversal, maintaining explicit name scopes and invariants as it walks the tree (see the single-pass sketch below). We think it can be successful because single-pass analysis gives the framework obvious invariants that developers can rely on, and because it will reuse the well-tested helper logic of the existing rules rather than starting from scratch.

*Q5. Who cares? If you are successful, what difference will it make?*

All Spark developers will benefit from a framework with more obvious invariants. Developers who are already familiar with other systems will also find it easier to apply their knowledge. Fewer regressions should occur as a result of ongoing development in this area, making Spark deployments more reliable. Queries with large logical plans or wide tables will see a major compilation speedup. And we will be able to handle the cases where the current framework cannot resolve the plan tree in a fixed number of iterations.

*Q6. What are the risks?*

The exact path to a 100% rewrite is not clear, because the Analyzer consists of many rules with non-obvious dependencies between them. We think this issue does not block development and can be resolved as we progress on the implementation.

*Q7. How long will it take?*

A reasonable estimate is 18 months. We will enable the new Analyzer only once we are fully confident that it supports all of the old Analyzer's functionality. We will most likely face a long tail of small, tricky issues at the end of this effort.

*Q8. What are the mid-term and final "exams" to check for success?*

We propose to rewrite the Analyzer by incrementally implementing subsets of SQL and DataFrame functionality. We will start with a subset of SQL operators and expressions, then progressively expand the surface area, eventually supporting DataFrames as well. As development progresses, we will enable more and more unit tests to run against both implementations.
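To illustrate the current practice described in Q3, here is a minimal sketch of a fixed-point rule executor. The types and names here (LogicalPlan, Rule, executeBatch, maxIterations) are hypothetical simplifications for illustration, not the actual Catalyst API:

{code:scala}
object FixedPointSketch {
  trait LogicalPlan
  trait Rule {
    def apply(plan: LogicalPlan): LogicalPlan
  }

  // Each batch re-runs all of its rules over the whole plan until the plan
  // stops changing or the iteration budget is exhausted.
  def executeBatch(rules: Seq[Rule], plan: LogicalPlan, maxIterations: Int): LogicalPlan = {
    var current = plan
    var iteration = 0
    var changed = true
    while (changed && iteration < maxIterations) {
      val next = rules.foldLeft(current)((p, rule) => rule(p))
      changed = next != current // many iterations may change nothing at all
      current = next
      iteration += 1
    }
    // If the budget is hit before a fixed point, analysis fails.
    if (changed)
      throw new IllegalStateException(
        s"Plan did not reach a fixed point after $maxIterations iterations")
    current
  }
}
{code}

Every rule is re-applied to the whole tree on every iteration, so a large plan can be traversed many times even when most passes change nothing; this is the latency and convergence problem the SPIP targets.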
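By contrast, here is a minimal sketch of single-pass, post-order resolution with an explicit name scope. Again, the types (Attribute, UnresolvedRelation, Project, resolve) are hypothetical simplifications, not the real Catalyst classes:

{code:scala}
object SinglePassSketch {
  case class Attribute(name: String)

  sealed trait LogicalPlan
  case class UnresolvedRelation(table: String, output: Seq[Attribute]) extends LogicalPlan
  case class Project(names: Seq[String], child: LogicalPlan) extends LogicalPlan

  // Post-order traversal: resolve the child first, then resolve this node's
  // names against the scope (the output attributes) the child produced.
  // Each node is visited exactly once.
  def resolve(plan: LogicalPlan): (LogicalPlan, Seq[Attribute]) = plan match {
    case r @ UnresolvedRelation(_, output) =>
      (r, output) // a leaf introduces its own output as the current scope
    case Project(names, child) =>
      val (resolvedChild, scope) = resolve(child)
      val output = names.map { n =>
        scope.find(_.name == n).getOrElse(
          throw new IllegalArgumentException(s"Cannot resolve column '$n'"))
      }
      (Project(names, resolvedChild), output)
  }
}
{code}

Because resolution state flows bottom-up through an explicit scope, the framework can enforce its invariants at every node instead of re-scanning the tree until nothing changes.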
*Also refer to:*
- [SPIP document|https://docs.google.com/document/d/1dWxvrJV-0joGdLtWbvJ0uNyTocDMJ90rPRNWa4T56Og]
> SPIP: Single-pass Analyzer for Catalyst
> ---------------------------------------
>
> Key: SPARK-49834
> URL: https://issues.apache.org/jira/browse/SPARK-49834
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Vladimir Golubev
> Priority: Major
> Labels: SPIP