Vladimir Golubev created SPARK-49834:
----------------------------------------

             Summary: SPIP: Single-pass Analyzer for Catalyst
                 Key: SPARK-49834
                 URL: https://issues.apache.org/jira/browse/SPARK-49834
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Vladimir Golubev


This is a SPIP for a new single-pass Analyzer framework for Catalyst.

*Q1. What are you trying to do? Articulate your objectives using absolutely no 
jargon.*

We propose to introduce a new Analyzer framework for Catalyst. This framework 
will use a single post-order traversal to resolve the parsed logical plan. We 
think this is necessary because the current algorithm frequently encounters 
high latency or errors from excessive looping of the fixed-point rule batch, 
is generally hard to reason about, and does not scale for large logical 
plans and other edge-case scenarios.
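To make the idea concrete, here is a minimal, purely illustrative sketch of single-pass, post-order resolution (the `Node`, `resolve`, and `all_resolved` names are hypothetical, not Catalyst's actual API): every node is visited exactly once, after all of its children have already been resolved.

```python
# Hypothetical sketch of single-pass, post-order plan resolution.
# Names are illustrative only and do not reflect Catalyst internals.

class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.resolved = False

def resolve(node):
    """Post-order traversal: resolve children first, then the node itself."""
    node.children = [resolve(c) for c in node.children]
    # Every child is resolved at this point, so the node can be resolved
    # locally in one step -- no repeated rule batches, no fixed point.
    node.resolved = True
    return node

def all_resolved(node):
    return node.resolved and all(all_resolved(c) for c in node.children)

plan = Node("Project", [Node("Filter", [Node("Relation")])])
resolved = resolve(plan)
```

The key property is that resolution cost is proportional to the size of the plan tree, rather than to the number of rule-batch iterations needed to converge.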

*Q2. What problem is this proposal NOT designed to solve?*

This is a proposal to replace the current fixed-point framework. The following 
are NON-goals:
 * Improve average latency of small queries. In general, their latency is 
dominated by RPC calls.
 * A complete clean-slate rewrite. Instead, the new approach will reuse helper 
logic from the existing rules as much as possible.

*Q3. How is it done today, and what are the limits of current practice?*

The current Analyzer is based on rules and a fixed-point model: batches of 
rules run repeatedly over the plan tree until it is fully resolved. There are 
almost no invariants and no notion of name scope. Rules often traverse the 
plan many times without changing it, and sometimes the analysis does not reach 
a fixed point at all. There are also non-obvious dependencies between the rules.
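The failure modes above can be seen in a toy model of fixed-point execution (the rule and plan representations here are hypothetical stand-ins, not Spark code): every rule re-traverses the whole plan on every pass, and analysis either converges after several full passes or hits an iteration cap without converging.

```python
# Illustrative-only model of the fixed-point approach: run all rules over
# the plan until a full pass changes nothing, or give up at a cap.

def run_to_fixed_point(plan, rules, max_iterations=100):
    for i in range(max_iterations):
        changed = False
        for rule in rules:            # every rule...
            new_plan = rule(plan)     # ...re-traverses the whole plan
            if new_plan != plan:
                plan, changed = new_plan, True
        if not changed:
            return plan, i + 1        # converged after i + 1 full passes
    # The cap was hit: analysis "did not reach a fixed point".
    raise RuntimeError("max iterations reached without convergence")

# Toy rule over a tuple-based "plan": each pass peels one unresolved marker,
# so a plan nested N deep needs N + 1 full passes to converge.
peel = lambda p: p[1] if isinstance(p, tuple) and p[0] == "unresolved" else p
plan = ("unresolved", ("unresolved", "relation"))
resolved, passes = run_to_fixed_point(plan, [peel])
```

Even in this trivial example the number of passes grows with plan depth, which is the latency problem the proposal describes; a rule set that keeps oscillating would exhaust `max_iterations` entirely.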

*Q4. What is new in your approach and why do you think it will be successful?*

The new approach resolves the parsed logical plan in a single post-order 
traversal with explicit invariants and name scopes, instead of repeatedly 
re-running rule batches until a fixed point is reached. We think it will be 
successful because the single pass eliminates the convergence and latency 
problems of the fixed-point model, and because the new framework will reuse 
helper logic from the existing rules rather than being a clean-slate rewrite, 
which keeps the migration incremental and testable against the current 
Analyzer.

*Q5. Who cares? If you are successful, what difference will it make?*

All Spark developers will benefit from a framework with more explicit 
invariants. Developers who are already familiar with analyzers in other 
systems will also find it easier to apply their knowledge. Fewer regressions 
should occur as a result of ongoing development in this area, making Spark 
deployments more reliable. Compilation of large logical plans and queries 
over wide tables will speed up significantly, and we will be able to handle 
the cases where the current framework cannot resolve the plan tree in a fixed 
number of iterations.

*Q6. What are the risks?*

The exact path to a 100% rewrite is not clear, because the Analyzer consists 
of many rules with non-obvious dependencies between them.

We think that this issue does not block the development and can be resolved as 
we progress on the implementation.

*Q7. How long will it take?*

A reasonable estimate is 18 months. We will enable the new Analyzer only once 
we are 100% sure it supports all of the old Analyzer's functionality. We will 
most likely face a long tail of small, tricky issues at the end of this 
effort.

*Q8. What are the mid-term and final "exams" to check for success?*

We propose to rewrite the Analyzer by incrementally implementing subsets of 
SQL and DataFrame functionality: start with a subset of SQL operators and 
expressions, then progressively expand the surface area, eventually supporting 
DataFrames as well. As development progresses, we will enable more and more 
unit tests to run against both implementations.
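One way to picture the migration test strategy is a parity harness that runs each unit test through both analyzers and compares the resolved plans. This is a hypothetical sketch; the analyzer functions below are stand-ins, not real Spark entry points.

```python
# Hypothetical parity harness for the dual-run test strategy: a test
# "graduates" to the new framework once both analyzers agree on its plan.

def check_parity(parsed_plan, fixed_point_analyzer, single_pass_analyzer):
    expected = fixed_point_analyzer(parsed_plan)
    actual = single_pass_analyzer(parsed_plan)
    return expected == actual

# Toy stand-in analyzers that happen to agree on this input.
legacy_analyzer = lambda plan: sorted(plan)
new_analyzer = lambda plan: sorted(plan, reverse=False)
agrees = check_parity([3, 1, 2], legacy_analyzer, new_analyzer)
```

The fraction of the existing test suite passing such a parity check gives a natural mid-term "exam" score; 100% parity is the final one.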

*Also refer to:*
 - [SPIP document|https://docs.google.com/document/d/1dWxvrJV-0joGdLtWbvJ0uNyTocDMJ90rPRNWa4T56Og]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
