I am working on rebasing my for pull request ( https://github.com/apache/calcite/pull/3042), for CALCITE-5418( https://issues.apache.org/jira/browse/CALCITE-5418) which stores the correlated variable ids subquery instead of the RelNodes(join, project, filter). This change was briefly discussed here( https://lists.apache.org/thread/fvf35njbddcck9hwcqnnxowl49n4nkth).
Storing the correlated IDs directly on the subquery provides several benefits: - The RelNode interface would no longer need to expose correlated IDs. - There are several existing bugs with how correlated IDs are populated in SqlToRelConverter, particularly for Project and Join operations. - Rules and other rewrite logic for Project, Filter, and Join would no longer need to propagate correlated IDs. - Various edge cases could be removed from SqlToRelConverter. - The code handling correlated variables would be localized to subqueries and lateral joins, instead of being spread across the entire codebase whereever expressions can occur. However, since I started this work, there have been several changes to SqlValidatorImpl. Although my initial changes to SqlValidatorImpl were not overly complex, they were time-consuming due to the unwieldy nature of SqlValidatorImpl. Specifically, I’ve been working on populating a map of lateral joins to linked lists of lateral scopes. SqlValidatorImpl currently performs three main functions: 1. Building a set of data structures for resolving scopes. 2. Validates the query and resolving types. This requires scope data structure generated in 1. 3. Exposes the scoped and type data structure to SqlToRelConverter. Currently, SqlValidatorImpl is 7,921 lines long. The first step consists of 11 methods, each with up to 9 arguments, that recursively walk the tree. These methods alone account for nearly 2,000 lines. The data structures for resolving scopes, along with helper functions/methods, take up about 500 lines. The remaining validation logic comprises over 4,000 lines. The combination of recursive methods with high argument counts, modifying the state of SqlValidatorImpl, all within a single file, makes working with SqlValidatorImpl cumbersome. I propose splitting SqlValidatorImpl into several smaller objects. While the recursive algorithms with high arguments would still exist, separating them into their own file would allow for a more focused approach. Here’s how I suggest breaking it down: 1. *SqlValidatorImpl* - A facade to maintain the existing API and orchestrate the composition of the objects below. 2. *SqlQueryScopes* - Contains the data structures for mapping SqlNodes to scopes and types, along with relevant helper methods. 3. *SqlNodeValidator* - Contains most of the methods currently prefixed with validate. 4. *SqlQueryScopePopulator* - Contains the methods prefixed with register . 5. *NamespaceBuilder* - Formalizes an implicit API for overriding namespace creation. There would be a circular reference between NamespaceBuilder and SqlValidatorImpl, as there already exists a circular reference between SqlValidatorImpl and all namespace objects. Introducing SqlQueryScopes would help clarify the interactions between SqlNodeValidator, SqlQueryScopePopulator, and SqlToRelConverter. Additionally, a set of factory methods for components 2-5 would be added to SqlValidatorConfig to allow them to be overridden. Looking forward to your thoughts on this approach. James