Breaking up SqlValidatorImpl

James Starr Fri, 23 Aug 2024 12:27:07 -0700

I am working on rebasing my for pull request (
https://github.com/apache/calcite/pull/3042), for CALCITE-5418(
https://issues.apache.org/jira/browse/CALCITE-5418) which stores the
correlated variable ids subquery instead of the RelNodes(join, project,
filter).  This change was briefly discussed here(
https://lists.apache.org/thread/fvf35njbddcck9hwcqnnxowl49n4nkth).


Storing the correlated IDs directly on the subquery provides several
benefits:

   - The RelNode interface would no longer need to expose correlated IDs.
   - There are several existing bugs with how correlated IDs are populated
   in SqlToRelConverter, particularly for Project and Join operations.
   - Rules and other rewrite logic for Project, Filter, and Join would no
   longer need to propagate correlated IDs.
   - Various edge cases could be removed from SqlToRelConverter.
   - The code handling correlated variables would be localized to
   subqueries and lateral joins, instead of being spread across the entire
   codebase whereever expressions can occur.


However, since I started this work, there have been several changes to
SqlValidatorImpl. Although my initial changes to SqlValidatorImpl were not
overly complex, they were time-consuming due to the unwieldy nature of
SqlValidatorImpl. Specifically, I’ve been working on populating a map of
lateral joins to linked lists of lateral scopes.

SqlValidatorImpl currently performs three main functions:
1.  Building a set of data structures for resolving scopes.
2.  Validates the query and resolving types.  This requires scope data
structure generated in 1.
3.  Exposes the scoped and type data structure to SqlToRelConverter.

Currently, SqlValidatorImpl is 7,921 lines long. The first step consists of
11 methods, each with up to 9 arguments, that recursively walk the tree.
These methods alone account for nearly 2,000 lines. The data structures for
resolving scopes, along with helper functions/methods, take up about 500
lines. The remaining validation logic comprises over 4,000 lines. The
combination of recursive methods with high argument counts, modifying the
state of SqlValidatorImpl, all within a single file, makes working with
SqlValidatorImpl cumbersome.

I propose splitting SqlValidatorImpl into several smaller objects. While
the recursive algorithms with high arguments would still exist, separating
them into their own file would allow for a more focused approach. Here’s
how I suggest breaking it down:

   1. *SqlValidatorImpl* - A facade to maintain the existing API and
   orchestrate the composition of the objects below.
   2. *SqlQueryScopes* - Contains the data structures for mapping SqlNodes
   to scopes and types, along with relevant helper methods.
   3. *SqlNodeValidator* - Contains most of the methods currently prefixed
   with validate.
   4. *SqlQueryScopePopulator* - Contains the methods prefixed with register
   .
   5. *NamespaceBuilder* - Formalizes an implicit API for overriding
   namespace creation.

There would be a circular reference between NamespaceBuilder and
SqlValidatorImpl, as there already exists a circular reference between
SqlValidatorImpl and all namespace objects. Introducing SqlQueryScopes
would help clarify the interactions between SqlNodeValidator,
SqlQueryScopePopulator, and SqlToRelConverter. Additionally, a set of
factory methods for components 2-5 would be added to SqlValidatorConfig to
allow them to be overridden.

Looking forward to your thoughts on this approach.
James

Breaking up SqlValidatorImpl

Reply via email to