Thanks for the informative reply, Michael. What I'm trying to accomplish with Catalyst are external domain-model resolution and security-related constraint-handling transformations that depend more on the syntactic (nested TreeNode) structure of the query than on the actual semantics of the nodes in the query tree, though I understand this is an atypical use case compared to the existing Analyzer and Optimizer phases in Catalyst. I also hadn't made explicit in my post that a difference between the official Catalyst code base and the fork I created is that I don't evaluate expressions in Spark; instead, they are embedded as-is in the final SQL query 'physical plan' and executed on a relational database (in this context a concept like expression nullability no longer has semantic meaning). I'll look into RuleExecutor, then, to see if I can make magic happen.
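To make concrete the kind of thing I have in mind, here is a minimal, Spark-free sketch of the fixed-point pattern that RuleExecutor implements. The Node and Rule types below are toy stand-ins of my own, not Catalyst's actual classes; only the driver loop mirrors RuleExecutor's FixedPoint strategy.

```scala
// Toy tree (NOT Catalyst's TreeNode) to illustrate the fixed-point pattern.
sealed trait Node {
  def children: Seq[Node]
  def withChildren(cs: Seq[Node]): Node
}
case class Lit(v: Int) extends Node {
  val children: Seq[Node] = Nil
  def withChildren(cs: Seq[Node]): Node = this
}
case class Add(l: Node, r: Node) extends Node {
  def children: Seq[Node] = Seq(l, r)
  def withChildren(cs: Seq[Node]): Node = Add(cs(0), cs(1))
}

// A rule is a partial rewrite applied to every node, top-down.
type Rule = PartialFunction[Node, Node]

def transform(n: Node, rule: Rule): Node = {
  val applied = rule.applyOrElse(n, identity[Node])
  applied.withChildren(applied.children.map(transform(_, rule)))
}

// Run a batch of rules until the tree stops changing (fixed point),
// which is what RuleExecutor's FixedPoint strategy does.
def executeToFixedPoint(plan: Node, rules: Seq[Rule], maxIter: Int = 100): Node = {
  var current = plan
  var changed = true
  var i = 0
  while (changed && i < maxIter) {
    val next = rules.foldLeft(current)((p, r) => transform(p, r))
    changed = next != current
    current = next
    i += 1
  }
  current
}

// Example rule: constant-fold additions of two literals.
val constantFold: Rule = { case Add(Lit(a), Lit(b)) => Lit(a + b) }

val folded = executeToFixedPoint(Add(Add(Lit(1), Lit(2)), Lit(3)), Seq(constantFold))
// folded == Lit(6) -- the inner Add folds on the first pass, the outer on the second.
```

The point of the fixed-point loop is that a single top-down pass can expose new match opportunities (as above, where folding the inner Add enables folding the outer one), so the batch reruns until the tree is stable.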
P.S. I've been informed by someone on the mailing list that some other Spark developers are working on a similar concept, called the push-down of everything: https://issues.apache.org/jira/browse/SPARK-12449

Roland

________________________________
From: Michael Armbrust <mich...@databricks.com>
Sent: Tuesday, December 22, 2015 12:52 AM
To: Roland Reumerman
Cc: dev@spark.apache.org
Subject: Re: Expression/LogicalPlan dichotomy in Spark SQL Catalyst

> Why was the choice made in Catalyst to make LogicalPlan/QueryPlan and Expression separate subclasses of TreeNode, instead of e.g. also make QueryPlan inherit from Expression?

I think this is a pretty common way to model things (glancing at postgres it looks similar). Expressions and plans are pretty different concepts. An expression can be evaluated on a single input row and returns a single value. In contrast, a query plan operates on a relation and has a schema with many different atomic values.

> The code also contains duplicate functionality, like LeafNode/LeafExpression, UnaryNode/UnaryExpression and BinaryNode/BinaryExpression.

These traits actually have different semantics for expressions vs. plans (i.e. a UnaryExpression's nullability is based on its child's nullability, whereas this would not make sense for a UnaryNode, which does not have a concept of nullability).

> This makes whole-tree transformations really cumbersome since we've got to deal with 'pivot points' for these 2 types of TreeNodes, where a recursive transformation can only be done on 1 specific type of children, and then has to be dealt with again within the same PartialFunction for the other type in which the matching case(s) can be nested.

It is not clear to me that you actually want these transformations to happen seamlessly. For example, the resolution rules for subqueries are different than normal plans because you have to reason about correlation.
That said, it seems like you should be able to do some magic in RuleExecutor to make sure that things like the optimizer descend seamlessly into nested query plans.
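For illustration, here is a toy, Spark-free sketch of what such seamless descent could look like: plans contain expressions, one expression kind (a scalar subquery) contains a plan, and a single recursive transform pushes a plan rewrite through that pivot point. The types are made up for the example; the pattern is analogous to wrapping an optimizer in a rule that recurses into subquery plans.

```scala
// Toy model (NOT Spark's actual classes) of the plan/expression pivot:
// plans contain expressions, and a ScalarSubquery expression contains a plan.
sealed trait Expr
case class Col(name: String) extends Expr
case class ScalarSubquery(plan: Plan) extends Expr

sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(cond: Expr, child: Plan) extends Plan

// Apply a plan rewrite everywhere, including plans nested inside
// subquery expressions -- the "descend seamlessly" behavior.
def transformPlanEverywhere(p: Plan, rule: PartialFunction[Plan, Plan]): Plan = {
  val rewritten = rule.applyOrElse(p, identity[Plan])
  rewritten match {
    case Filter(cond, child) =>
      // Pivot point: cross from the plan world into the expression world
      // and back into any plan nested inside a subquery expression.
      val newCond = cond match {
        case ScalarSubquery(sub) => ScalarSubquery(transformPlanEverywhere(sub, rule))
        case other               => other
      }
      Filter(newCond, transformPlanEverywhere(child, rule))
    case leaf => leaf
  }
}

// Example: rename a table everywhere, even inside the subquery.
val plan    = Filter(ScalarSubquery(Scan("t_old")), Scan("t_old"))
val renamed = transformPlanEverywhere(plan, { case Scan("t_old") => Scan("t_new") })
// renamed == Filter(ScalarSubquery(Scan("t_new")), Scan("t_new"))
```

The design choice this illustrates: the two node families stay separate, and a single rule at the subquery case is enough to make a plan-level transformation total over the whole tree, without merging the plan and expression hierarchies.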