Vladimir Golubev created SPARK-49834:
----------------------------------------

             Summary: SPIP: Single-pass Analyzer for Catalyst
                 Key: SPARK-49834
                 URL: https://issues.apache.org/jira/browse/SPARK-49834
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Vladimir Golubev


This is a SPIP for a new single-pass Analyzer framework for Catalyst.

*Q1. What are you trying to do? Articulate your objectives using absolutely no 
jargon.*

We propose to introduce a new Analyzer framework for Catalyst. This framework 
will use a single post-order traversal to resolve the parsed logical plan. We 
think this is necessary because the current algorithm frequently encounters 
high latency or errors from excessive looping of the fixed-point rule batch, 
is generally hard to reason about, and does not scale for large logical 
plans and other edge-case scenarios.
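To make the idea concrete, here is a minimal, purely illustrative sketch of single-pass, post-order resolution (the `Node`, `resolve`, and `all_resolved` names are hypothetical, not Catalyst's actual API): every node is visited exactly once, after all of its children have already been resolved.

```python
# Hypothetical sketch of single-pass, post-order plan resolution.
# Names are illustrative only and do not reflect Catalyst internals.

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.resolved = False

def resolve(node):
    """Post-order traversal: resolve children first, then the node itself."""
    node.children = [resolve(c) for c in node.children]
    # Every child is resolved at this point, so the node can be resolved
    # locally in one step -- no repeated rule batches, no fixed point.
    node.resolved = True
    return node

def all_resolved(node):
    return node.resolved and all(all_resolved(c) for c in node.children)

plan = Node("Project", [Node("Filter", [Node("Relation")])])
resolved = resolve(plan)
```

The key property is that resolution cost is proportional to the size of the plan tree, rather than to the number of rule-batch iterations needed to converge.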

*Q2. What problem is this proposal NOT designed to solve?*

This is a proposal to replace the current fixed-point framework. The following 
are NON-goals:
 * Improve average latency of small queries. In general, their latency is 
dominated by RPC calls.
 * A complete clean-slate rewrite. Instead, the new approach will reuse helper 
logic from the existing rules as much as possible.

*Q3. How is it done today, and what are the limits of current practice?*

The current Analyzer is based on rules and a fixed-point model: batches of 
rules run repeatedly over the plan tree until it is fully resolved. There are 
almost no invariants and no notion of name scope. Rules often traverse the 
plan many times without changing it, and sometimes the analysis does not reach 
a fixed point at all. There are also non-obvious dependencies between the rules.
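The failure modes above can be seen in a toy model of fixed-point execution (the rule and plan representations here are hypothetical stand-ins, not Spark code): every rule re-traverses the whole plan on every pass, and analysis either converges after several full passes or hits an iteration cap without converging.

```python
# Illustrative-only model of the fixed-point approach: run all rules over
# the plan until a full pass changes nothing, or give up at a cap.

def run_to_fixed_point(plan, rules, max_iterations=100):
    for i in range(max_iterations):
        changed = False
        for rule in rules:            # every rule...
            new_plan = rule(plan)     # ...re-traverses the whole plan
            if new_plan != plan:
                plan, changed = new_plan, True
        if not changed:
            return plan, i + 1        # converged after i + 1 full passes
    # The cap was hit: analysis "did not reach a fixed point".
    raise RuntimeError("max iterations reached without convergence")

# Toy rule over a tuple-based "plan": each pass peels one unresolved marker,
# so a plan nested N deep needs N + 1 full passes to converge.
peel = lambda p: p[1] if isinstance(p, tuple) and p[0] == "unresolved" else p
plan = ("unresolved", ("unresolved", "relation"))
resolved, passes = run_to_fixed_point(plan, [peel])
```

Even in this trivial example the number of passes grows with plan depth, which is the latency problem the proposal describes; a rule set that keeps oscillating would exhaust `max_iterations` entirely.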

*Q4. What is new in your approach and why do you think it will be successful?*

The new approach resolves the parsed logical plan in a single post-order 
traversal with explicit invariants and name scopes, instead of repeatedly 
re-running rule batches until a fixed point is reached. We think it will be 
successful because the single pass eliminates the convergence and latency 
problems of the fixed-point model, and because the new framework will reuse 
helper logic from the existing rules rather than being a clean-slate rewrite, 
which keeps the migration incremental and testable against the current 
Analyzer.

*Q5. Who cares? If you are successful, what difference will it make?*

All Spark developers will benefit from a framework with more explicit 
invariants. Developers who are already familiar with analyzers in other 
systems will also find it easier to apply their knowledge. Fewer regressions 
should occur as a result of ongoing development in this area, making Spark 
deployments more reliable. Compilation of large logical plans and queries 
over wide tables will speed up significantly, and we will be able to handle 
the cases where the current framework cannot resolve the plan tree in a fixed 
number of iterations.

*Q6. What are the risks?*

The exact path to a 100% rewrite is not clear, because the Analyzer consists 
of many rules with non-obvious dependencies between them.

We think that this issue does not block the development and can be resolved as 
we progress on the implementation.

*Q7. How long will it take?*

A reasonable estimate is 18 months. We will enable the new Analyzer only once 
we are 100% sure it supports all of the old Analyzer's functionality. We will 
most likely face a long tail of small, tricky issues at the end of this 
effort.

*Q8. What are the mid-term and final "exams" to check for success?*

We propose to rewrite the Analyzer by incrementally implementing subsets of 
SQL and DataFrame functionality: start with a subset of SQL operators and 
expressions, then progressively expand the surface area, eventually supporting 
DataFrames as well. As development progresses, we will enable more and more 
unit tests to run against both implementations.
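One way to picture the migration test strategy is a parity harness that runs each unit test through both analyzers and compares the resolved plans. This is a hypothetical sketch; the analyzer functions below are stand-ins, not real Spark entry points.

```python
# Hypothetical parity harness for the dual-run test strategy: a test
# "graduates" to the new framework once both analyzers agree on its plan.

def check_parity(parsed_plan, fixed_point_analyzer, single_pass_analyzer):
    expected = fixed_point_analyzer(parsed_plan)
    actual = single_pass_analyzer(parsed_plan)
    return expected == actual

# Toy stand-in analyzers that happen to agree on this input.
legacy_analyzer = lambda plan: sorted(plan)
new_analyzer = lambda plan: sorted(plan, reverse=False)
agrees = check_parity([3, 1, 2], legacy_analyzer, new_analyzer)
```

The fraction of the existing test suite passing such a parity check gives a natural mid-term "exam" score; 100% parity is the final one.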

*Also refer to:*
 - [SPIP document|https://docs.google.com/document/d/1dWxvrJV-0joGdLtWbvJ0uNyTocDMJ90rPRNWa4T56Og]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
