Re: Breaking up SqlValidatorImpl

James Starr Tue, 10 Sep 2024 14:34:02 -0700

Here's a light touch cleanup of the text:
------------------------------

Julian,


I’m referring to the new approach of using RexSubQuery and
SubQueryRemovalRule. Nested correlated variables only work in a limited
subset of circumstances. Populating the Rel nodes with correlated IDs also
fails in several cases, such as subqueries in windows and subqueries in
SELECTs with aggregates. However, I agree this topic deserves its own
discussion, which I initiated some time ago. Perhaps a new thread needs to
be started. I bring up the issue of populating the correlated IDs as
context for my intention to refactor SqlValidatorImpl.

While simply moving code around doesn’t necessarily simplify things,
breaking it into smaller, logical chunks (though not too small) does help.
I’m not focusing on the high argument count, that was just an example of
the current complexity. I agree that immutability simplifies things, and I
don’t intend to change that structure or algorithm. I also agree that
refactoring shouldn’t change any logic, which I clarified in my follow-up
to Mihai. Each of the above classes would be placed in its own file.

My refactor would only involve reorganizing existing logic into separate
classes while avoiding cyclic interactions between them.  I take your
response as tacit approval of the proposed refactor, contingent on the
actual changes and other details.

James

On Tue, Sep 10, 2024 at 11:10 AM Julian Hyde <jh...@apache.org> wrote:

> If the validator is 11,000 lines of code in one file and you make the
> inner classes top-level classes, you still have 11,000 lines of code.
> Moving code around doesn't inherently make it simpler. Not saying it
> shouldn't be done...
>
> I agree with your observation that the tree of namespaces and scopes
> is complex and can be separated from the other validation logic. That
> tree is built as the first step of validation (before the real
> validation starts), and is immutable.
>
> If I recall correctly, some of those methods to create namespaces and
> scopes have 11 arguments because that allows us to make those classes
> immutable. An immutable tree is extremely valuable. We must keep that.
>
> Regarding the handling of correlated variables. It's legacy code
> (decorrelation better handled, in many or all cases, by the
> decorrelation rules), the map is a mess (see immutability, above). But
> you should not touch any of that logic while refactoring the
> validator. You will break stuff. I would much prefer that we fully
> obsolete the old decorrelation logic.
>
> Regarding how correlating variables are represented in RelNodes. It
> seems OK to me. I am prepared to be persuaded. But it has absolutely
> nothing to do with the validator, and should be a separate
> conversation thread.
>
> Julian
>
>
> On Fri, Aug 23, 2024 at 5:42 PM Mihai Budiu <mbu...@gmail.com> wrote:
> >
> > One thing I find unpleasant about the current validator implementation
> is that it leaves many implicit casts implicit. It would be much simpler
> for many subsequent passes if the validator would insert explicit casts
> where casts are needed (e.g., when adding a varchar and an int), and it
> would catch many bugs earlier. Subsequent rewrite passes also have to make
> assumptions about the implicit casts are that are missing, and they may
> make different assumptions from the validator. I have this problem all the
> time in the code generator.
> >
> > But this is somewhat independent on your current proposal. I hope that
> someday this could be fixed too.
> >
> > Mihai
> > ________________________________
> > From: James Starr <jamesst...@gmail.com>
> > Sent: Friday, August 23, 2024 2:40 PM
> > To: dev@calcite.apache.org <dev@calcite.apache.org>
> > Subject: Re: Breaking up SqlValidatorImpl
> >
> > I would break up the validator in 3 PRs with no functionality change:
> > extract SqlQueryScopes and NamespaceBuilder, extract the
> > SqlQueryScopePopulator, and final PR for extracting SqlNodeValidator.
> Then
> > I would apply functionalchanges on top.
> >
> > On Fri, Aug 23, 2024 at 2:02 PM Mihai Budiu <mbu...@gmail.com> wrote:
> >
> > > So do you want to refactor the code in two steps, where the first step
> > > would break the validator into multiple classes and keep the exact same
> > > behavior, and in a second step rework the handling of correlation
> variables?
> > >
> > > That sounds great. If you want to do both at the same time, it may be
> > > harder to review.
> > >
> > > Mihai
> > >
> > > ________________________________
> > > From: James Starr <jamesst...@gmail.com>
> > > Sent: Friday, August 23, 2024 12:26 PM
> > > To: dev@calcite.apache.org <dev@calcite.apache.org>
> > > Subject: Breaking up SqlValidatorImpl
> > >
> > > I am working on rebasing my for pull request (
> > > https://github.com/apache/calcite/pull/3042), for CALCITE-5418(
> > > https://issues.apache.org/jira/browse/CALCITE-5418) which stores the
> > > correlated variable ids subquery instead of the RelNodes(join, project,
> > > filter).  This change was briefly discussed here(
> > > https://lists.apache.org/thread/fvf35njbddcck9hwcqnnxowl49n4nkth).
> > >
> > > Storing the correlated IDs directly on the subquery provides several
> > > benefits:
> > >
> > >    - The RelNode interface would no longer need to expose correlated
> IDs.
> > >    - There are several existing bugs with how correlated IDs are
> populated
> > >    in SqlToRelConverter, particularly for Project and Join operations.
> > >    - Rules and other rewrite logic for Project, Filter, and Join would
> no
> > >    longer need to propagate correlated IDs.
> > >    - Various edge cases could be removed from SqlToRelConverter.
> > >    - The code handling correlated variables would be localized to
> > >    subqueries and lateral joins, instead of being spread across the
> entire
> > >    codebase whereever expressions can occur.
> > >
> > >
> > > However, since I started this work, there have been several changes to
> > > SqlValidatorImpl. Although my initial changes to SqlValidatorImpl were
> not
> > > overly complex, they were time-consuming due to the unwieldy nature of
> > > SqlValidatorImpl. Specifically, I’ve been working on populating a map
> of
> > > lateral joins to linked lists of lateral scopes.
> > >
> > > SqlValidatorImpl currently performs three main functions:
> > > 1.  Building a set of data structures for resolving scopes.
> > > 2.  Validates the query and resolving types.  This requires scope data
> > > structure generated in 1.
> > > 3.  Exposes the scoped and type data structure to SqlToRelConverter.
> > >
> > > Currently, SqlValidatorImpl is 7,921 lines long. The first step
> consists of
> > > 11 methods, each with up to 9 arguments, that recursively walk the
> tree.
> > > These methods alone account for nearly 2,000 lines. The data
> structures for
> > > resolving scopes, along with helper functions/methods, take up about
> 500
> > > lines. The remaining validation logic comprises over 4,000 lines. The
> > > combination of recursive methods with high argument counts, modifying
> the
> > > state of SqlValidatorImpl, all within a single file, makes working with
> > > SqlValidatorImpl cumbersome.
> > >
> > > I propose splitting SqlValidatorImpl into several smaller objects.
> While
> > > the recursive algorithms with high arguments would still exist,
> separating
> > > them into their own file would allow for a more focused approach.
> Here’s
> > > how I suggest breaking it down:
> > >
> > >    1. *SqlValidatorImpl* - A facade to maintain the existing API and
> > >    orchestrate the composition of the objects below.
> > >    2. *SqlQueryScopes* - Contains the data structures for mapping
> SqlNodes
> > >    to scopes and types, along with relevant helper methods.
> > >    3. *SqlNodeValidator* - Contains most of the methods currently
> prefixed
> > >    with validate.
> > >    4. *SqlQueryScopePopulator* - Contains the methods prefixed with
> > > register
> > >    .
> > >    5. *NamespaceBuilder* - Formalizes an implicit API for overriding
> > >    namespace creation.
> > >
> > > There would be a circular reference between NamespaceBuilder and
> > > SqlValidatorImpl, as there already exists a circular reference between
> > > SqlValidatorImpl and all namespace objects. Introducing SqlQueryScopes
> > > would help clarify the interactions between SqlNodeValidator,
> > > SqlQueryScopePopulator, and SqlToRelConverter. Additionally, a set of
> > > factory methods for components 2-5 would be added to
> SqlValidatorConfig to
> > > allow them to be overridden.
> > >
> > > Looking forward to your thoughts on this approach.
> > > James
> > >
>

Re: Breaking up SqlValidatorImpl

Reply via email to