In my efforts to continue learning about the internals of Calcite, I've
been experimenting with applying Calcite's Babel parser to the task of
extracting fully qualified field references from a query.  The use case
here would be to enumerate all the fields of a schema that are "touched" by
a given query, so that table owners can be told which particular fields
their consumers actually use.

It seems there has been some prior interest in this use case, but the few
Stack Overflow posts I could find on the topic are likely outdated.
One post that has proved helpful was this one from Julian
<https://stackoverflow.com/a/37554628/215608>, which explains the roles of
namespaces and scopes in the validation process.

The comments in that thread mention that SqlScopedShuttle only maintains a
stack of SqlNode, but it seems to have changed since then to keep a stack of
SqlValidatorScope instead?  For the most part, I seem to be able to use that
Deque to match identifiers to their tables, following an example drawn from
the getFieldOrigins method of SqlValidatorImpl
<https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/validate/SqlValidator.html#getFieldOrigins(org.apache.calcite.sql.SqlNode)>.
My shuttle just uses the visit method for SqlIdentifier to try to qualify
each identifier against the namespace associated with the current scope,
discarding any that cannot be resolved.
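
To make the idea concrete, here is a toy model of that lookup in plain
Java.  It is only a sketch and deliberately uses no Calcite types: the
Table and Scope classes, qualifyAll and demo are hypothetical names made up
for illustration, and walking the Deque outward is a simplification that
stands in for whatever parent-scope lookup the real validator performs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Toy model only: none of these classes are Calcite types. */
final class ScopeStackSketch {

  /** Hypothetical stand-in for a table namespace: a name plus its columns. */
  static final class Table {
    final String name;
    final Set<String> columns;

    Table(String name, Set<String> columns) {
      this.name = name;
      this.columns = columns;
    }
  }

  /** Hypothetical stand-in for a scope: the aliases visible in one query block. */
  static final class Scope {
    final Map<String, Table> tablesByAlias;

    Scope(Map<String, Table> tablesByAlias) {
      this.tablesByAlias = tablesByAlias;
    }

    /** Returns "table.column" if the identifier resolves in this scope. */
    Optional<String> qualify(String identifier) {
      String[] parts = identifier.split("\\.");
      if (parts.length == 2) {
        // alias.column: look the alias up directly.
        Table t = tablesByAlias.get(parts[0]);
        boolean found = t != null && t.columns.contains(parts[1]);
        return found ? Optional.of(t.name + "." + parts[1]) : Optional.empty();
      }
      if (parts.length != 1) {
        return Optional.empty();
      }
      // Bare column: accept it only if exactly one visible table has it.
      List<String> matches = new ArrayList<>();
      for (Table t : tablesByAlias.values()) {
        if (t.columns.contains(parts[0])) {
          matches.add(t.name + "." + parts[0]);
        }
      }
      return matches.size() == 1 ? Optional.of(matches.get(0)) : Optional.empty();
    }
  }

  /** Qualifies each identifier, silently dropping those that do not resolve. */
  static List<String> qualifyAll(Deque<Scope> scopes, List<String> identifiers) {
    List<String> qualified = new ArrayList<>();
    for (String id : identifiers) {
      // An ArrayDeque used as a stack iterates from the most recent push,
      // so the innermost scope is tried first.
      for (Scope scope : scopes) {
        Optional<String> q = scope.qualify(id);
        if (q.isPresent()) {
          qualified.add(q.get());
          break;
        }
      }
    }
    return qualified;
  }

  /** Identifiers a visitor might meet in: SELECT o.id, total FROM orders AS o */
  static List<String> demo() {
    Deque<Scope> scopes = new ArrayDeque<>();
    scopes.push(new Scope(Map.of("o", new Table("orders", Set.of("id", "total")))));
    // "o" (the alias itself) and "missing" are not columns, so they are dropped.
    return qualifyAll(scopes, List.of("o.id", "total", "o", "missing"));
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Dropping unresolved identifiers keeps the visitor simple, but it also hides
genuine misses, which is part of why the UNNEST case goes wrong without any
visible error.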

Here's my really naive approach, with some test cases expressed as YAML
<https://gist.github.com/DeaconDesperado/416f11bf91d0ca60b3c30b22626f4178>.
Obviously there are some cases that fail badly with this approach, one of
which I've included, wherein an UNNEST is used on an array field containing
subrecords.

A few questions I have:

   - Does this approach make sense for this use case, or would it make more
   sense to use an implementation of SqlValidator, which seems to maintain
   more state as to the scopes + namespaces?
   - Would it be easier to approach this particular use case from
   navigating the generated relational algebra rather than the SQL AST?
   - Does the static factory method for the shuttle make sense
   idiomatically with the intended usage for SqlScopedShuttle?  How should
   one typically obtain the initialScope?  I've tried to cover both the
   top-level SqlSelect and the possibility of a list of CTEs using WITH.

Many thanks!
