Thanks Mihai for reaching out.

I've often run into these roadblocks where I want to push Calcite a little
bit further, and the framework "almost" has what I want built in but falls
short at the last step. I've spent some time on this problem and have
tried using the SqlValidator APIs to accomplish my goal, but they feel
clunkier to work with than the RelNode tree. I almost wish I had more
control over SqlToRelConverter so I could inject my own relational tree
logic, lest I have to write my own converter (which seems challenging!).
Maybe there is something I've missed; I'll update this thread when I
figure it out.
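
For concreteness, my SqlNode-side attempt looks roughly like the following
(simplified; "validated" is the node returned by the validator, and the
real version has to handle aliases, expressions nested in other clauses,
subqueries, etc., which is where it gets clunky):

  import org.apache.calcite.sql.SqlCall;
  import org.apache.calcite.sql.SqlIdentifier;
  import org.apache.calcite.sql.SqlNode;
  import org.apache.calcite.sql.SqlSelect;
  import org.apache.calcite.sql.util.SqlBasicVisitor;
  import java.util.ArrayList;
  import java.util.List;

  // Collect identifiers that appear as direct operands of function
  // calls in the SELECT list.
  List<SqlIdentifier> columnsInCalls = new ArrayList<>();
  SqlSelect select = (SqlSelect) validated;
  for (SqlNode item : select.getSelectList()) {
    item.accept(new SqlBasicVisitor<Void>() {
      @Override public Void visit(SqlCall call) {
        for (SqlNode operand : call.getOperandList()) {
          if (operand instanceof SqlIdentifier) {
            columnsInCalls.add((SqlIdentifier) operand);
          }
        }
        return super.visit(call);  // keep descending into nested calls
      }
    });
  }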

Thanks,
Logan

On Thu, Aug 22, 2024 at 9:39 PM Mihai Budiu <mbu...@gmail.com> wrote:

> Here is a short primer on the 3 core Calcite program representations:
> https://www.feldera.com/blog/calcite-irs/
>
> If you look at the SqlNode tree you will have the original query
> representation, as submitted to the compiler, but the identifiers won't be
> resolved (you won't be able to tell which of the input tables column X
> comes from); name resolution is very tricky in SQL. Identifier resolution
> is done by the validation phase, which is normally invoked by the
> SqlToRelConverter. The validator has another important function, which is
> to reject queries that are syntactically correct but semantically
> incorrect. So you may want to validate the queries before you analyze them
> further.
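>
> In code, that pipeline is roughly the following (a minimal sketch using
> the Frameworks/Planner APIs; it assumes you already have a SchemaPlus
> named rootSchema with your tables registered, and the checked exceptions
> are omitted):
>
>   import org.apache.calcite.rel.RelNode;
>   import org.apache.calcite.rel.RelRoot;
>   import org.apache.calcite.sql.SqlNode;
>   import org.apache.calcite.tools.FrameworkConfig;
>   import org.apache.calcite.tools.Frameworks;
>   import org.apache.calcite.tools.Planner;
>
>   FrameworkConfig config = Frameworks.newConfigBuilder()
>       .defaultSchema(rootSchema)
>       .build();
>   Planner planner = Frameworks.getPlanner(config);
>   SqlNode parsed = planner.parse(sql);
>   SqlNode validated = planner.validate(parsed); // rejects semantic errors
>   RelRoot root = planner.rel(validated);        // SqlToRelConverter runs here
>   RelNode rel = root.rel;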
>
> When the validator does name resolution it also rewrites the SqlNode tree,
> replacing identifiers with their resolved form (this is the function
> "expand"). Ideally you should be able to get your hands on this expanded
> SqlNode tree, where identifiers are annotated with the namespace they
> belong to, but I don't know how to do that. Maybe there is a
> SqlToRelConverter API to do that?
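>
> One avenue that might work (an untested sketch; the catalog reader and
> type factory setup are elided, and I'm assuming the identifier-expansion
> flag on the validator config is the relevant knob): if you construct the
> validator yourself, validate() hands you back the rewritten tree:
>
>   import org.apache.calcite.sql.fun.SqlStdOperatorTable;
>   import org.apache.calcite.sql.validate.SqlValidator;
>   import org.apache.calcite.sql.validate.SqlValidatorUtil;
>
>   SqlValidator validator = SqlValidatorUtil.newValidator(
>       SqlStdOperatorTable.instance(),
>       catalogReader,  // a CalciteCatalogReader over your schema (elided)
>       typeFactory,    // e.g. a JavaTypeFactoryImpl
>       SqlValidator.Config.DEFAULT.withIdentifierExpansion(true));
>   SqlNode expanded = validator.validate(parsed); // the expanded SqlNode tree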
>
> The SqlToRelConverter does a few optimizations, but not too many.
> Its behavior is driven by a Config, which you can configure to disable
> most of them.
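>
> For example (a sketch; see the Config javadoc for the full set of flags):
>
>   import org.apache.calcite.sql2rel.SqlToRelConverter;
>
>   SqlToRelConverter.Config config = SqlToRelConverter.config()
>       .withExpand(false)             // keep sub-queries as RexSubQuery
>       .withTrimUnusedFields(false);  // don't prune fields while converting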
>
> The RelNode tree is much easier to analyze and interpret, and this is what
> I would try to use.
>
> Mihai
> ________________________________
> From: JinxTheKid <logansmith...@gmail.com>
> Sent: Thursday, August 22, 2024 9:02 PM
> To: dev@calcite.apache.org <dev@calcite.apache.org>
> Subject: [Question][Advice] Analyzing RelNodes and RexNodes Accurately
>
> Hi Calcite community,
>
> I'm exploring using the Rel/Rex nodes to answer questions about a query.
> One of the questions I want to ask is "what columns are used by functions
> in a query's projections". Naively, I would like to just walk the Rel/Rex
> nodes and collect column origins when I'm at specific RexCalls, but I've
> noticed that Calcite generates superfluous Project nodes when using
> SqlToRelConverter, which throws a wrench in my original approach to
> analyzing the query.
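>
> Concretely, the naive walk looks something like this (simplified;
> "project" here is a LogicalProject I'm visiting, and mapping the
> collected indexes back to table columns is a separate step, e.g. via
> RelMetadataQuery.getColumnOrigins):
>
>   import org.apache.calcite.rex.RexCall;
>   import org.apache.calcite.rex.RexInputRef;
>   import org.apache.calcite.rex.RexNode;
>   import org.apache.calcite.rex.RexVisitorImpl;
>   import java.util.HashSet;
>   import java.util.Set;
>
>   Set<Integer> columnsUsedByFunctions = new HashSet<>();
>   for (RexNode expr : project.getProjects()) {
>     expr.accept(new RexVisitorImpl<Void>(true) { // deep = recurse
>       @Override public Void visitCall(RexCall call) {
>         for (RexNode operand : call.getOperands()) {
>           if (operand instanceof RexInputRef) {
>             columnsUsedByFunctions.add(((RexInputRef) operand).getIndex());
>           }
>         }
>         return super.visitCall(call);
>       }
>     });
>   }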
>
> For example, a query I might want to interrogate (in practice the queries
> are much more complex) is:
>
> SELECT
>     1
> FROM
>     employees e
> GROUP BY
>     uuid
> HAVING
>     AVG(age + 5) > 3
>
> The query generates a logical plan in Calcite that looks like the
> following. The issue with this plan is that if I look at it blindly,
> I do not know whether the LogicalProject corresponds to a SELECT in the
> original query, or whether it was injected by Calcite to create a better
> plan. I would ideally want something that models the original query more
> closely. I could use the SqlNode tree, but I thought the RelNode tree
> would be a better place to analyze the query in this manner.
>
> LogicalProject(EXPR$0=[1])
>   LogicalFilter(condition=[>($1, 3)])
>     LogicalAggregate(group=[{0}], agg#0=[AVG($1)])
>       LogicalProject(uuid=[$0], $f1=[+($2, 5)])
>         LogicalTableScan(table=[[employees]])
>
> In the previous plan, the second project (or at least my understanding
> of it) is superfluous: that projection was not written in the original
> query, and an alternative plan could be generated that does not use an
> extra project.
>
> LogicalProject(EXPR$0=[1])
>   LogicalFilter(condition=[>($1, 3)])
>     LogicalAggregate(group=[{0}], agg#0=[AVG(+($2, 5))])
>       LogicalTableScan(table=[[employees]])
>
> When I look at the plan Apache Spark creates for the same query, I get
> one that is closer to how the original query was written. Some values
> are repeated in the plan (at least in the explanation), but no values
> are projected that weren't present in the original query. To me, this
> is a more desirable plan to analyze, since I as the developer have to
> do less work to understand the origin of the plan's nodes; i.e. I'm not
> guessing whether a given project is Spark's or mine.
>
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [1 AS 1#6]
>    +- Filter (isnotnull(avg((age#2 + 5))#8) AND (avg((age#2 + 5))#8 > 3.0))
>       +- HashAggregate(keys=[uuid#0], functions=[avg((age#2 + 5))])
>          +- Exchange hashpartitioning(uuid#0, 200), ENSURE_REQUIREMENTS, [plan_id=22]
>             +- HashAggregate(keys=[uuid#0], functions=[partial_avg((age#2 + 5))])
>                +- LocalTableScan [uuid#0, age#2]
>
>
> I have two questions from my experiments:
> 1. Is my approach of walking the tree on the right track, and is there a
> way to model the strategy using Calcite primitives that I might have
> missed? I've played around with CoreRules a bit (roughly the sketch shown
> below) but haven't made significant progress in that area.
> 2. Is there a way to generate a plan closer to the Spark version, or are
> there other community efforts that might point me in the right direction
> for achieving the desired behavior?
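>
> For reference, my CoreRules experiments are roughly of this shape (a
> sketch; I'm not even sure these are the right rules for this plan):
>
>   import org.apache.calcite.plan.hep.HepPlanner;
>   import org.apache.calcite.plan.hep.HepProgram;
>   import org.apache.calcite.rel.RelNode;
>   import org.apache.calcite.rel.rules.CoreRules;
>
>   HepProgram program = HepProgram.builder()
>       .addRuleInstance(CoreRules.PROJECT_MERGE)
>       .addRuleInstance(CoreRules.AGGREGATE_PROJECT_MERGE)
>       .build();
>   HepPlanner hepPlanner = new HepPlanner(program);
>   hepPlanner.setRoot(logicalPlan); // the RelNode from SqlToRelConverter
>   RelNode rewritten = hepPlanner.findBestExp();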
>
> Thanks,
> Logan
>
