[ https://issues.apache.org/jira/browse/SPARK-49834 ]
Vladimir Golubev updated SPARK-49834:
-------------------------------------

Description:

This is a SPIP for a new single-pass Analyzer framework for Catalyst.

*Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.*

We propose to introduce a new Analyzer framework for Catalyst. This framework will resolve the parsed logical plan in a single post-order traversal. We think this is necessary because the current algorithm frequently suffers high latency, or fails outright, when the fixed-point rule batch loops excessively; it is also hard to reason about in general and does not scale to large logical plans and other edge-case scenarios.

*Q2. What problem is this proposal NOT designed to solve?*

This is a proposal to replace the current fixed-point framework. The following are NON-goals:
* Improving the average latency of typical queries. In general, that latency is dominated by RPC calls.
* A complete clean-slate rewrite. Instead, the new approach will reuse helper logic from the existing rules as much as possible.

*Q3. How is it done today, and what are the limits of current practice?*

The current Analyzer is rule-based and runs to a fixed point: rules run in batches over the plan tree until it is fully resolved (see the fixed-point sketch below). There are almost no invariants and no notion of name scope. The rules traverse the plan many times without changing it, and sometimes the analysis does not reach a fixed point at all. There are non-obvious dependencies between the rules.

*Q4. What is new in your approach and why do you think it will be successful?*

The new framework resolves the plan in a single post-order traversal, maintaining explicit name scopes and invariants as it walks the tree (see the single-pass sketch below). We think it can be successful because single-pass analysis gives the framework obvious invariants that developers can rely on, and because it will reuse the well-tested helper logic of the existing rules rather than starting from scratch.

*Q5. Who cares? If you are successful, what difference will it make?*

All Spark developers will benefit from a framework with more obvious invariants. Developers who are already familiar with other systems will also find it easier to apply their knowledge. Fewer regressions should occur as a result of ongoing development in this area, making Spark deployments more reliable. Queries with large logical plans or wide tables will see a major compilation speedup. And we will be able to handle the cases where the current framework cannot resolve the plan tree in a fixed number of iterations.

*Q6. What are the risks?*

The exact path to a 100% rewrite is not clear, because the Analyzer consists of many rules with non-obvious dependencies between them. We think this issue does not block development and can be resolved as we progress on the implementation.

*Q7. How long will it take?*

A reasonable estimate is 18 months. We will enable the new Analyzer only once we are fully confident that it supports all of the old Analyzer's functionality. We will most likely face a long tail of small, tricky issues at the end of this effort.

*Q8. What are the mid-term and final "exams" to check for success?*

We propose to rewrite the Analyzer by incrementally implementing subsets of SQL and DataFrame functionality. We will start with a subset of SQL operators and expressions, then progressively expand the surface area, eventually supporting DataFrames as well. As development progresses, we will enable more and more unit tests to run against both implementations.
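To illustrate the current practice described in Q3, here is a minimal sketch of a fixed-point rule executor. The types and names here (LogicalPlan, Rule, executeBatch, maxIterations) are hypothetical simplifications for illustration, not the actual Catalyst API:

{code:scala}
object FixedPointSketch {
  trait LogicalPlan
  trait Rule {
    def apply(plan: LogicalPlan): LogicalPlan
  }

  // Each batch re-runs all of its rules over the whole plan until the plan
  // stops changing or the iteration budget is exhausted.
  def executeBatch(rules: Seq[Rule], plan: LogicalPlan, maxIterations: Int): LogicalPlan = {
    var current = plan
    var iteration = 0
    var changed = true
    while (changed && iteration < maxIterations) {
      val next = rules.foldLeft(current)((p, rule) => rule(p))
      changed = next != current // many iterations may change nothing at all
      current = next
      iteration += 1
    }
    // If the budget is hit before a fixed point, analysis fails.
    if (changed)
      throw new IllegalStateException(
        s"Plan did not reach a fixed point after $maxIterations iterations")
    current
  }
}
{code}

Every rule is re-applied to the whole tree on every iteration, so a large plan can be traversed many times even when most passes change nothing; this is the latency and convergence problem the SPIP targets.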
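By contrast, here is a minimal sketch of single-pass, post-order resolution with an explicit name scope. Again, the types (Attribute, UnresolvedRelation, Project, resolve) are hypothetical simplifications, not the real Catalyst classes:

{code:scala}
object SinglePassSketch {
  case class Attribute(name: String)

  sealed trait LogicalPlan
  case class UnresolvedRelation(table: String, output: Seq[Attribute]) extends LogicalPlan
  case class Project(names: Seq[String], child: LogicalPlan) extends LogicalPlan

  // Post-order traversal: resolve the child first, then resolve this node's
  // names against the scope (the output attributes) the child produced.
  // Each node is visited exactly once.
  def resolve(plan: LogicalPlan): (LogicalPlan, Seq[Attribute]) = plan match {
    case r @ UnresolvedRelation(_, output) =>
      (r, output) // a leaf introduces its own output as the current scope
    case Project(names, child) =>
      val (resolvedChild, scope) = resolve(child)
      val output = names.map { n =>
        scope.find(_.name == n).getOrElse(
          throw new IllegalArgumentException(s"Cannot resolve column '$n'"))
      }
      (Project(names, resolvedChild), output)
  }
}
{code}

Because resolution state flows bottom-up through an explicit scope, the framework can enforce its invariants at every node instead of re-scanning the tree until nothing changes.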
*Also refer to:*
- [SPIP document|https://docs.google.com/document/d/1dWxvrJV-0joGdLtWbvJ0uNyTocDMJ90rPRNWa4T56Og]
> SPIP: Single-pass Analyzer for Catalyst
> ---------------------------------------
>
> Key: SPARK-49834
> URL: https://issues.apache.org/jira/browse/SPARK-49834
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Vladimir Golubev
> Priority: Major
> Labels: SPIP