GitHub user staple opened a pull request:

    https://github.com/apache/spark/pull/1706

    [SPARK-2781] Check resolution of LogicalPlans in Analyzer.

    LogicalPlan contains a ‘resolved’ attribute indicating that all of its 
execution requirements have been resolved. This attribute is not checked before 
query execution. The analyzer contains a step to check that all Expressions are 
resolved, but this is not equivalent to checking all LogicalPlans. In 
particular, the Union plan’s implementation of ‘resolved’ verifies that 
the types of its children’s columns are compatible. Because the analyzer does 
not check that a Union plan is resolved, it is possible to execute a Union plan 
that outputs different types in the same column.  See SPARK-XXX for an example.
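
    As a rough illustration of that check (a simplified sketch; exact member 
names in the Catalyst source may differ), Union's 'resolved' attribute has 
roughly this shape:

        import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, LogicalPlan}

        // Simplified sketch: a Union-like plan counts as resolved only when both
        // children are resolved and their output columns have pairwise compatible
        // data types.
        case class UnionSketch(left: LogicalPlan, right: LogicalPlan) extends BinaryNode {
          override def output = left.output

          override lazy val resolved: Boolean =
            childrenResolved &&
              left.output.zip(right.output).forall { case (l, r) => l.dataType == r.dataType }
        }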
    
    This patch adds two checks to the analyzer’s CheckResolution rule. First, 
each logical plan is checked to see if it is not resolved despite its children 
being resolved. This allows the ‘problem’ unresolved plan to be included in 
the TreeNodeException for reporting. Then as a backstop the root plan is 
checked to see if it is resolved, which recursively checks that the entire plan 
tree is resolved. Note that the resolved attribute is implemented recursively, 
and this patch also explicitly checks the resolved attribute on each logical 
plan in the tree, so some of this checking is redundant. I assume the query 
plan trees will not be large enough for this redundant checking to meaningfully 
impact performance.
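
    Roughly, the two checks take the following shape (an illustrative sketch, 
not the literal patch; the error messages and exact structure of the rule may 
differ):

        import org.apache.spark.sql.catalyst.errors.TreeNodeException
        import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
        import org.apache.spark.sql.catalyst.rules.Rule

        object CheckResolutionSketch extends Rule[LogicalPlan] {
          def apply(plan: LogicalPlan): LogicalPlan = {
            // First check: report the specific plan node that is unresolved even
            // though all of its children are resolved.
            val checked = plan transform {
              case p if !p.resolved && p.childrenResolved =>
                throw new TreeNodeException(p, "Unresolved plan found")
            }
            // Backstop: 'resolved' is defined recursively, so checking the root
            // verifies that the entire tree is resolved.
            if (!checked.resolved) {
              throw new TreeNodeException(checked, "Unresolved plan in query")
            }
            checked
          }
        }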
    
    Because this patch starts validating that LogicalPlans are resolved before 
execution, I had to fix some cases where unresolved plans were passing through 
the analyzer as part of the implementation of the hive query system. In 
particular, HiveContext applies the CreateTables and PreInsertionCasts rules 
manually after the optimizer runs, meaning query plans are not resolved until 
after the optimizer. I moved these rules to the analyzer stage (for hive 
queries only), in the process completing a code TODO indicating the rules 
should be moved to the analyzer.
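
    For context, the pre-patch arrangement in HiveContext's QueryExecution was 
roughly along these lines (an approximation, not an exact quote of the old 
code):

        // Hive-specific rules applied by hand, after analysis, while producing
        // the optimized plan (approximate pre-patch shape; exact code may differ).
        override lazy val optimizedPlan =
          optimizer(catalog.PreInsertionCasts(catalog.CreateTables(analyzed)))

    With the rules running in the analyzer instead, the analyzed plan is already 
fully resolved when the new resolution checks and the optimizer see it.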
    
    It’s worth noting that moving the CreateTables rule means introducing an 
analyzer rule with a significant side effect - in this case the side effect is 
creating a hive table. The rule will only attempt to create a table once even 
if its batch is executed multiple times, because it converts the 
InsertIntoCreatedTable plan it matches against into an InsertIntoTable. 
Additionally, these hive rules must be added to the Resolution batch rather 
than as a separate batch because hive rules may be needed to resolve
non-root nodes, leaving the root to be resolved on a subsequent batch 
iteration. For example, the hive compatibility test auto_smb_mapjoin_14, and 
others, make use of a query plan where the root is a Union and its children are 
each a hive InsertIntoTable.
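
    To make the idempotency point above concrete, the rule has roughly this 
shape (a sketch only: createTable and lookupRelation stand in for the metastore 
helpers, and the real InsertIntoCreatedTable / InsertIntoTable nodes may take 
different parameters):

        import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
        import org.apache.spark.sql.catalyst.rules.Rule

        object CreateTablesSketch extends Rule[LogicalPlan] {
          def apply(plan: LogicalPlan): LogicalPlan = plan transform {
            case InsertIntoCreatedTable(databaseName, tableName, child) =>
              // Side effect: the hive table is created here...
              createTable(databaseName, tableName, child.output)
              // ...but because the matched node is rewritten to InsertIntoTable,
              // later iterations of the Resolution batch no longer match this
              // case, so the table is created at most once.
              InsertIntoTable(
                lookupRelation(databaseName, tableName), Map.empty, child,
                overwrite = false)
          }
        }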
    
    Mixing the custom hive rules with standard analyzer rules initially 
resulted in an additional failure because of different policies between spark 
sql and hive when casting a boolean to a string. Hive casts booleans to strings 
as “true” / “false” while spark sql casts booleans to strings as 
“1” / “0” (causing the cast1.q test to fail). This behavior is a result 
of the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at the 
implementation of BooleanCasts I think converting to “1”/“0” is 
potentially a programming mistake. (If the BooleanCasts rule is disabled, 
casting produces “true”/“false” instead.) I believe “true” / 
“false” is the right behavior for spark sql, so I changed the cast so 
booleans are converted to “true”/“false”, consistent with hive; 
none of the existing spark tests failed.
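
    As a small sanity check of the intended behavior (a sketch assuming the 
current package layout and eval signature, which may change), a boolean cast 
to a string should evaluate to “true”/“false”:

        import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
        import org.apache.spark.sql.catalyst.types.StringType

        // Booleans cast to strings should follow hive's "true"/"false" convention
        // rather than "1"/"0".
        assert(Cast(Literal(true), StringType).eval(null) == "true")
        assert(Cast(Literal(false), StringType).eval(null) == "false")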
    
    Finally, in some initial testing with hive it appears that an implicit type 
coercion of boolean to string results in a lowercase string, e.g. CONCAT( TRUE, 
“” ) -> “true” while an explicit cast produces an all caps string, e.g. 
CAST( TRUE AS STRING ) -> “TRUE”.  The change I’ve made just converts to 
lowercase strings in all cases.  I believe it is at least more correct than the 
existing spark sql implementation where all Cast expressions become “1” / 
“0”.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/staple/spark SPARK-2781

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1706.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1706
    
----
commit 80a1136d58d719f9f029858a42e068553be87eb0
Author: Aaron Staple <aaron.sta...@gmail.com>
Date:   2014-07-31T20:42:32Z

    [SPARK-2781] Check resolution of LogicalPlans in Analyzer.

----

