Thanks for the informative reply, Michael. What I'm trying to accomplish with Catalyst are external domain-model resolution and security-related constraint-handling transformations that depend more on the syntactic (nested TreeNode) structure of the query than on the actual semantics of the nodes in the query tree, though I understand this is an atypical use case compared to the existing Analyzer and Optimizer phases in Catalyst. I also hadn't made explicit in my post that a difference between the official Catalyst code base and the fork I created is that I don't evaluate expressions in Spark; instead, they are embedded as-is in the final SQL query 'physical plan' and executed on a relational database (in this context a concept like expression nullability no longer has semantic meaning). I'll look into RuleExecutor, then, to see if I can make magic happen.
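To make concrete the kind of thing I have in mind, here is a minimal, Spark-free sketch of the fixed-point pattern that RuleExecutor implements. The Node and Rule types below are toy stand-ins of my own, not Catalyst's actual classes; only the driver loop mirrors RuleExecutor's FixedPoint strategy.

```scala
// Toy tree (NOT Catalyst's TreeNode) to illustrate the fixed-point pattern.
sealed trait Node {
  def children: Seq[Node]
  def withChildren(cs: Seq[Node]): Node
}
case class Lit(v: Int) extends Node {
  val children: Seq[Node] = Nil
  def withChildren(cs: Seq[Node]): Node = this
}
case class Add(l: Node, r: Node) extends Node {
  def children: Seq[Node] = Seq(l, r)
  def withChildren(cs: Seq[Node]): Node = Add(cs(0), cs(1))
}

// A rule is a partial rewrite applied to every node, top-down.
type Rule = PartialFunction[Node, Node]

def transform(n: Node, rule: Rule): Node = {
  val applied = rule.applyOrElse(n, identity[Node])
  applied.withChildren(applied.children.map(transform(_, rule)))
}

// Run a batch of rules until the tree stops changing (fixed point),
// which is what RuleExecutor's FixedPoint strategy does.
def executeToFixedPoint(plan: Node, rules: Seq[Rule], maxIter: Int = 100): Node = {
  var current = plan
  var changed = true
  var i = 0
  while (changed && i < maxIter) {
    val next = rules.foldLeft(current)((p, r) => transform(p, r))
    changed = next != current
    current = next
    i += 1
  }
  current
}

// Example rule: constant-fold additions of two literals.
val constantFold: Rule = { case Add(Lit(a), Lit(b)) => Lit(a + b) }

val folded = executeToFixedPoint(Add(Add(Lit(1), Lit(2)), Lit(3)), Seq(constantFold))
// folded == Lit(6) -- the inner Add folds on the first pass, the outer on the second.
```

The point of the fixed-point loop is that a single top-down pass can expose new match opportunities (as above, where folding the inner Add enables folding the outer one), so the batch reruns until the tree is stable.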
P.S. I've been informed by someone on the mailing list that some other Spark developers are working on a similar concept, called the push-down of everything: https://issues.apache.org/jira/browse/SPARK-12449

Roland

________________________________
From: Michael Armbrust <mich...@databricks.com>
Sent: Tuesday, December 22, 2015 12:52 AM
To: Roland Reumerman
Cc: dev@spark.apache.org
Subject: Re: Expression/LogicalPlan dichotomy in Spark SQL Catalyst

> Why was the choice made in Catalyst to make LogicalPlan/QueryPlan and Expression separate subclasses of TreeNode, instead of e.g. also make QueryPlan inherit from Expression?

I think this is a pretty common way to model things (glancing at postgres it looks similar). Expressions and plans are pretty different concepts. An expression can be evaluated on a single input row and returns a single value. In contrast, a query plan operates on a relation and has a schema with many different atomic values.

> The code also contains duplicate functionality, like LeafNode/LeafExpression, UnaryNode/UnaryExpression and BinaryNode/BinaryExpression.

These traits actually have different semantics for expressions vs. plans (i.e. a UnaryExpression's nullability is based on its child's nullability, whereas this would not make sense for a UnaryNode, which does not have a concept of nullability).

> This makes whole-tree transformations really cumbersome since we've got to deal with 'pivot points' for these 2 types of TreeNodes, where a recursive transformation can only be done on 1 specific type of children, and then has to be dealt with again within the same PartialFunction for the other type in which the matching case(s) can be nested.

It is not clear to me that you actually want these transformations to happen seamlessly. For example, the resolution rules for subqueries are different than normal plans because you have to reason about correlation.
That said, it seems like you should be able to do some magic in RuleExecutor to make sure that things like the optimizer descend seamlessly into nested query plans.
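For illustration, here is a toy, Spark-free sketch of what such seamless descent could look like: plans contain expressions, one expression kind (a scalar subquery) contains a plan, and a single recursive transform pushes a plan rewrite through that pivot point. The types are made up for the example; the pattern is analogous to wrapping an optimizer in a rule that recurses into subquery plans.

```scala
// Toy model (NOT Spark's actual classes) of the plan/expression pivot:
// plans contain expressions, and a ScalarSubquery expression contains a plan.
sealed trait Expr
case class Col(name: String) extends Expr
case class ScalarSubquery(plan: Plan) extends Expr

sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(cond: Expr, child: Plan) extends Plan

// Apply a plan rewrite everywhere, including plans nested inside
// subquery expressions -- the "descend seamlessly" behavior.
def transformPlanEverywhere(p: Plan, rule: PartialFunction[Plan, Plan]): Plan = {
  val rewritten = rule.applyOrElse(p, identity[Plan])
  rewritten match {
    case Filter(cond, child) =>
      // Pivot point: cross from the plan world into the expression world
      // and back into any plan nested inside a subquery expression.
      val newCond = cond match {
        case ScalarSubquery(sub) => ScalarSubquery(transformPlanEverywhere(sub, rule))
        case other               => other
      }
      Filter(newCond, transformPlanEverywhere(child, rule))
    case leaf => leaf
  }
}

// Example: rename a table everywhere, even inside the subquery.
val plan    = Filter(ScalarSubquery(Scan("t_old")), Scan("t_old"))
val renamed = transformPlanEverywhere(plan, { case Scan("t_old") => Scan("t_new") })
// renamed == Filter(ScalarSubquery(Scan("t_new")), Scan("t_new"))
```

The design choice this illustrates: the two node families stay separate, and a single rule at the subquery case is enough to make a plan-level transformation total over the whole tree, without merging the plan and expression hierarchies.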