Here is a short primer on the 3 core Calcite program representations:
https://www.feldera.com/blog/calcite-irs/

If you look at the SqlNode tree you will have the original query
representation, as submitted to the compiler, but the identifiers won't be
resolved (you won't be able to tell which of the input tables column X
comes from); name resolution is very tricky in SQL. Identifier resolution
is done by the validation phase, which is normally invoked by the
SqlToRelConverter. The validator has another important function, which is
to reject queries that are syntactically correct but semantically
incorrect. So you may want to validate the queries before you analyze them
further.
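
For example, using the Frameworks/Planner APIs, something like this (just a
sketch; the schema is a placeholder for however you register your tables):

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

// Sketch: parse, validate, then convert. The root schema below is empty;
// you would have to register the "employees" table on it yourself.
SchemaPlus rootSchema = Frameworks.createRootSchema(true);
FrameworkConfig frameworkConfig = Frameworks.newConfigBuilder()
    .defaultSchema(rootSchema)
    .build();
Planner planner = Frameworks.getPlanner(frameworkConfig);

SqlNode parsed = planner.parse(sql);           // syntax only
SqlNode validated = planner.validate(parsed);  // rejects semantically bad queries
RelNode rel = planner.rel(validated).rel;      // runs the SqlToRelConverter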

When the validator does name resolution it also rewrites the SqlNode tree,
replacing identifiers with their resolved form (this is the "expand"
function). Ideally you should be able to get your hands on this expanded
SqlNode tree, where identifiers are annotated with the namespace they
belong to, but I don't know how to do that. Maybe there is a
SqlToRelConverter API to do that?
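
One thing that might work (untested) is to build the validator yourself
with identifier expansion enabled and keep the SqlNode that validate()
returns, which is the rewritten tree:

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.validate.SqlValidator;
import org.apache.calcite.sql.validate.SqlValidatorUtil;

// Untested sketch: opTab, catalogReader and typeFactory are placeholders
// for objects you have to construct from your operator table and schema.
SqlValidator validator = SqlValidatorUtil.newValidator(
    opTab, catalogReader, typeFactory,
    SqlValidator.Config.DEFAULT.withIdentifierExpansion(true));
SqlNode expanded = validator.validate(parsedSqlNode);  // the rewritten tree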

The SqlToRelConverter does a few optimizations, but not too many.
Its behavior is driven by a Config, which you can set to disable most 
optimizations.
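
For example (a sketch; double-check the option names against the Calcite
version you use):

import org.apache.calcite.sql2rel.SqlToRelConverter;

// Sketch: a Config that keeps the converter from rewriting too much.
SqlToRelConverter.Config converterConfig = SqlToRelConverter.config()
    .withTrimUnusedFields(false)  // keep all input fields
    .withExpand(false);           // leave sub-queries as RexSubQuery
// pass converterConfig to the SqlToRelConverter constructor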

The RelNode tree is much easier to analyze and interpret, and this is what I 
would try to use.
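
For the "which columns do the functions in a projection use" question, a
rough sketch (the class and method names are mine, not an existing Calcite
utility) could combine a RexVisitor with the column-origin metadata:

import java.util.HashSet;
import java.util.Set;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rel.metadata.RelColumnOrigin;
import org.apache.calcite.rel.metadata.RelMetadataQuery;
import org.apache.calcite.rex.RexCall;
import org.apache.calcite.rex.RexInputRef;
import org.apache.calcite.rex.RexNode;
import org.apache.calcite.rex.RexVisitorImpl;

class ProjectionColumnCollector {
  // For every function call in a Project, resolve the input columns it
  // references back to table columns via the column-origin metadata.
  static Set<RelColumnOrigin> columnsUsedByCalls(Project project) {
    RelMetadataQuery mq = project.getCluster().getMetadataQuery();
    RelNode input = project.getInput();
    Set<RelColumnOrigin> origins = new HashSet<>();
    for (RexNode expr : project.getProjects()) {
      expr.accept(new RexVisitorImpl<Void>(true) {
        @Override public Void visitCall(RexCall call) {
          for (RexNode operand : call.getOperands()) {
            if (operand instanceof RexInputRef) {
              Set<RelColumnOrigin> o = mq.getColumnOrigins(
                  input, ((RexInputRef) operand).getIndex());
              if (o != null) {
                origins.addAll(o);
              }
            }
          }
          return super.visitCall(call);  // descend into nested calls
        }
      });
    }
    return origins;
  }
}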

Mihai
________________________________
From: JinxTheKid <[email protected]>
Sent: Thursday, August 22, 2024 9:02 PM
To: [email protected] <[email protected]>
Subject: [Question][Advice] Analyzing RelNodes and RexNodes Accurately

Hi Calcite community,

I'm exploring using the Rel/Rex Nodes to answer questions about a query.
One of the questions I want to ask is "what columns are used by functions
in a query's projections". Naively, I would like to just walk the Rel/Rex
Nodes and collect column origins when I'm at specific RexCalls, but I've
noticed that Calcite generates superfluous ProjectNodes when using
SqlToRelConverter, which throws a wrench in my original approach for
analyzing the query.
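
Roughly, the walk I had in mind looks like this (simplified sketch; the
column collection is elided and rootRel stands for the converted plan):

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.RelVisitor;
import org.apache.calcite.rel.core.Project;

// Simplified sketch of the naive walk over the RelNode tree.
new RelVisitor() {
  @Override public void visit(RelNode node, int ordinal, RelNode parent) {
    if (node instanceof Project) {
      // inspect ((Project) node).getProjects() for RexCalls and
      // collect the column origins of their operands
    }
    super.visit(node, ordinal, parent);  // recurse into the inputs
  }
}.go(rootRel);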

For example, a query I might want to interrogate (in practice the queries
are much more complex) is:

SELECT
    1
FROM
    employees e
GROUP BY
    uuid
HAVING
    AVG(age + 5) > 3

The query generates a logical plan in Calcite that looks like the
following. The issue with this plan is that if I blindly look at the plan,
I do not know if the LogicalProject corresponds to a SELECT in the original
query, or if it is injected by Calcite to create a better plan. I would
ideally want something that models the original query more closely. I could
use the SqlNode tree, but I thought that the RelTree would be a better
place to analyze the query in this manner.

LogicalProject(EXPR$0=[1])
  LogicalFilter(condition=[>($1, 3)])
    LogicalAggregate(group=[{0}], agg#0=[AVG($1)])
      LogicalProject(uuid=[$0], $f1=[+($2, 5)])
        LogicalTableScan(table=[[employees]])

In the previous plan, the second project (at least to my understanding) is
superfluous because this second projection was not written in the original
query, and an alternative plan could be generated that does not use an
extra project.

LogicalProject(EXPR$0=[1])
  LogicalFilter(condition=[>($1, 3)])
    LogicalAggregate(group=[{0}], agg#0=[AVG(+($2, 5))])
      LogicalTableScan(table=[[employees]])

When I take a look at a plan created by Apache Spark for the same query, I
get a plan that is closer to how the original query was written. Some
values are repeated in the plan (at least in the explanation), but no
values are projected in the plan that weren't present in the original
query. To me, this would be a more desirable plan to analyze, since I as
the developer have to do less work to understand where each node in the
plan comes from; i.e. I'm not guessing whether a generated project is
Spark's project or mine.

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [1 AS 1#6]
   +- Filter (isnotnull(avg((age#2 + 5))#8) AND (avg((age#2 + 5))#8 > 3.0))
      +- HashAggregate(keys=[uuid#0], functions=[avg((age#2 + 5))])
         +- Exchange hashpartitioning(uuid#0, 200), ENSURE_REQUIREMENTS, [plan_id=22]
            +- HashAggregate(keys=[uuid#0], functions=[partial_avg((age#2 + 5))])
               +- LocalTableScan [uuid#0, age#2]


I have two questions from my experiments:
1. Is my approach of walking the tree on the right track, and is there a
way to model the strategy using Calcite primitives that I might have
missed? I've played around with CoreRules a bit (see the sketch after these
questions) but haven't made significant progress in that area.
2. Is there a way to generate a plan closer to the Spark version, or are
there other community efforts that might point me in the right direction
for achieving the desired behavior?
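
For reference, the kind of thing I tried with CoreRules looks roughly like
this (sketch):

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

// Run a couple of project-related rules over the logical plan and see
// whether the extra LogicalProject goes away. logicalPlan is the RelNode
// produced by the SqlToRelConverter.
HepProgram program = new HepProgramBuilder()
    .addRuleInstance(CoreRules.PROJECT_MERGE)
    .addRuleInstance(CoreRules.PROJECT_REMOVE)
    .build();
HepPlanner hepPlanner = new HepPlanner(program);
hepPlanner.setRoot(logicalPlan);
RelNode rewritten = hepPlanner.findBestExp();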

Thanks,
Logan
