Thanks, Mihai, for reaching out. I've often run into these roadblocks
where I want to push Calcite a little bit further, and the framework
"almost" has what I want built in but falls short at the last step. I've
worked on this problem a bit and have tried using the SqlValidator APIs
to accomplish my goal, but they feel clunkier to work with than the
RelNode tree. I almost wish I had more control over SqlToRelConverter
so I could inject my own relational-tree logic, lest I have to write my
own converter (which seems challenging!). Maybe there is something I've
missed, but I'll update this thread when I figure it out. A sketch of
what I've tried so far is below, and I've appended the column-collecting
visitor from my original mail as a P.S. at the bottom.
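For concreteness, here is roughly the harness I've been experimenting
with. The employees table is a stand-in for my real catalog, and the
two converter flags are just the knobs I've found so far; I'm not
claiming they are the right ones:

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.type.SqlTypeName;
import org.apache.calcite.sql2rel.SqlToRelConverter;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class PlanRepro {
  public static void main(String[] args) throws Exception {
    SchemaPlus root = Frameworks.createRootSchema(true);
    // Stand-in for my real catalog: employees(uuid, name, age).
    root.add("employees", new AbstractTable() {
      @Override public RelDataType getRowType(RelDataTypeFactory tf) {
        return tf.builder()
            .add("uuid", SqlTypeName.VARCHAR)
            .add("name", SqlTypeName.VARCHAR)
            .add("age", SqlTypeName.INTEGER)
            .build();
      }
    });

    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(root)
        // Case-insensitive lookup so unquoted "employees" resolves.
        .parserConfig(SqlParser.config().withCaseSensitive(false))
        // The knobs I've tried so far; they disable some rewrites but
        // the pre-aggregate LogicalProject is still injected.
        .sqlToRelConverterConfig(SqlToRelConverter.config()
            .withExpand(false)
            .withTrimUnusedFields(false))
        .build();

    Planner planner = Frameworks.getPlanner(config);
    SqlNode parsed = planner.parse(
        "SELECT 1 FROM employees e GROUP BY uuid HAVING AVG(age + 5) > 3");
    SqlNode validated = planner.validate(parsed);  // also resolves identifiers
    RelNode rel = planner.rel(validated).rel;
    System.out.println(RelOptUtil.toString(rel));
  }
}

As far as I can tell, no combination of these flags removes the
pre-aggregate project, presumably because Calcite's Aggregate can only
reference input columns by ordinal, so the AVG(age + 5) argument has to
be computed somewhere below it.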
Thanks,
Logan

On Thu, Aug 22, 2024 at 9:39 PM Mihai Budiu <mbu...@gmail.com> wrote:

> Here is a short primer on the 3 core Calcite program representations:
> https://www.feldera.com/blog/calcite-irs/
>
> If you look at the SqlNode tree you will have the original query
> representation, as submitted to the compiler, but the identifiers won't
> be resolved (you won't be able to tell which of the input tables column
> X comes from); name resolution is very tricky in SQL. Identifier
> resolution is done by the validation phase, which is normally invoked
> by the SqlToRelConverter. The validator has another important function,
> which is to reject queries that are syntactically correct but
> semantically incorrect. So you may want to validate the queries before
> you analyze them further.
>
> When the validator does name resolution it also rewrites the SqlNode
> tree, replacing identifiers with their resolved form (this is the
> function "expand"). Ideally you should be able to get your hands on
> this expanded SqlNode tree, where identifiers are annotated with the
> namespace they belong to, but I don't know how to do that. Maybe there
> is a SqlToRelConverter API to do that?
>
> The SqlToRelConverter does a few optimizations, but not too many.
> Its behavior is driven by a Config, which you can set to disable most
> optimizations.
>
> The RelNode tree is much easier to analyze and interpret, and this is
> what I would try to use.
>
> Mihai
> ________________________________
> From: JinxTheKid <logansmith...@gmail.com>
> Sent: Thursday, August 22, 2024 9:02 PM
> To: dev@calcite.apache.org <dev@calcite.apache.org>
> Subject: [Question][Advice] Analyzing RelNodes and RexNodes Accurately
>
> Hi Calcite community,
>
> I'm exploring using the Rel/Rex Nodes to answer questions about a
> query. One of the questions I want to ask is "what columns are used by
> functions in a query's projections". Naively, I would like to just walk
> the Rel/Rex Nodes and collect column origins when I'm at specific
> RexCalls, but I've noticed that Calcite generates superfluous
> ProjectNodes when using SqlToRelConverter, which throws a wrench in my
> original approach for analyzing the query.
>
> For example, a query I might want to interrogate (in practice the
> queries are much more complex) is:
>
> SELECT
>   1
> FROM
>   employees e
> GROUP BY
>   uuid
> HAVING
>   AVG(age + 5) > 3
>
> The query generates the following logical plan in Calcite. The issue
> with this plan is that if I look at it blindly, I cannot tell whether
> the LogicalProject corresponds to a SELECT in the original query or was
> injected by Calcite to create a better plan. I would ideally want
> something that models the original query more closely. I could use the
> SqlNode tree, but I thought the RelNode tree would be a better place to
> analyze the query in this manner.
>
> LogicalProject(EXPR$0=[1])
>   LogicalFilter(condition=[>($1, 3)])
>     LogicalAggregate(group=[{0}], agg#0=[AVG($1)])
>       LogicalProject(uuid=[$0], $f1=[+($2, 5)])
>         LogicalTableScan(table=[[employees]])
>
> In the previous plan, the second LogicalProject is (at least to my
> understanding) superfluous, because it was not written in the original
> query, and an alternative plan could be generated that does not use the
> extra project:
>
> LogicalProject(EXPR$0=[1])
>   LogicalFilter(condition=[>($1, 3)])
>     LogicalAggregate(group=[{0}], agg#0=[AVG(+($2, 5))])
>       LogicalTableScan(table=[[employees]])
>
> When I look at the plan Apache Spark creates for the same query, I get
> a plan that is closer to how the original query was written. Some
> values are repeated in the plan (at least in the explanation), but no
> values are projected that weren't present in the original query. To me,
> this is a more desirable plan to analyze, since I as the developer have
> to do less work to understand the origin of the plan's operators; i.e.,
> I'm not guessing whether a generated project is Spark's or mine.
>
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [1 AS 1#6]
>    +- Filter (isnotnull(avg((age#2 + 5))#8) AND (avg((age#2 + 5))#8 > 3.0))
>       +- HashAggregate(keys=[uuid#0], functions=[avg((age#2 + 5))])
>          +- Exchange hashpartitioning(uuid#0, 200), ENSURE_REQUIREMENTS, [plan_id=22]
>             +- HashAggregate(keys=[uuid#0], functions=[partial_avg((age#2 + 5))])
>                +- LocalTableScan [uuid#0, age#2]
>
> I have two questions from my experiments:
> 1. Is my approach of walking the tree on the right track, and is there
> a way to model the strategy using Calcite primitives that I might have
> missed? I've played around with CoreRules a bit but haven't made
> significant progress in that area.
> 2. Is there a way to generate a plan closer to the Spark version, or
> are there other community efforts that might point me in the right
> direction for achieving the desired behavior?
>
> Thanks,
> Logan
>
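P.S. To make the "walk the tree" part of my original question concrete,
below is roughly the visitor I've been using (the class and method names
are just mine). It collects the input columns that appear inside a
RexCall in each Project's expressions and maps them back to base-table
columns via RelMetadataQuery.getColumnOrigins. On the Calcite plan above
it happily reports employees.age from the injected $f1 project, which is
exactly the output I can't tell apart from a user-written projection.

import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rel.metadata.RelMetadataQuery;
import org.apache.calcite.rex.RexCall;
import org.apache.calcite.rex.RexInputRef;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.rex.RexVisitorImpl;

public class FunctionColumnCollector {
  // For every Project in the tree, print the input columns that appear
  // inside a function call (RexCall) in its projection list.
  public static void collectFunctionInputs(RelNode rel) {
    if (rel instanceof Project) {
      Project project = (Project) rel;
      RelMetadataQuery mq = rel.getCluster().getMetadataQuery();
      Set<RexInputRef> refs = new LinkedHashSet<>();
      for (RexNode expr : project.getProjects()) {
        expr.accept(new RexVisitorImpl<Void>(true) {
          private boolean insideCall;
          @Override public Void visitCall(RexCall call) {
            boolean saved = insideCall;
            insideCall = true;      // refs below here are "used by a function"
            super.visitCall(call);  // deep-visit the operands
            insideCall = saved;
            return null;
          }
          @Override public Void visitInputRef(RexInputRef ref) {
            if (insideCall) {
              refs.add(ref);
            }
            return null;
          }
        });
      }
      for (RexInputRef ref : refs) {
        // Trace each reference through the inputs to base-table columns.
        System.out.println(ref + " -> "
            + mq.getColumnOrigins(project.getInput(), ref.getIndex()));
      }
    }
    for (RelNode input : rel.getInputs()) {
      collectFunctionInputs(input);
    }
  }
}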