GitHub user staple opened a pull request:
https://github.com/apache/spark/pull/1706
[SPARK-2781] Check resolution of LogicalPlans in Analyzer.
LogicalPlan contains a "resolved" attribute indicating that all of its
execution requirements have been resolved. This attribute is not checked before
query execution. The analyzer contains a step to check that all Expressions are
resolved, but this is not equivalent to checking all LogicalPlans. In
particular, the Union plan's implementation of "resolved" verifies that
the types of its children's columns are compatible. Because the analyzer does
not check that a Union plan is resolved, it is possible to execute a Union plan
that outputs different types in the same column. See SPARK-XXX for an example.
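For reference, the existing Union definition looks roughly like the following
(a simplified sketch, not copied verbatim from the Spark source):

    import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, LogicalPlan}

    // Simplified sketch of Union: it is only "resolved" once its children are
    // resolved and their corresponding output columns have the same data type.
    case class Union(left: LogicalPlan, right: LogicalPlan) extends BinaryNode {
      override def output = left.output
      override lazy val resolved =
        childrenResolved &&
          left.output.zip(right.output).forall { case (l, r) => l.dataType == r.dataType }
    }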
This patch adds two checks to the analyzer's CheckResolution rule. First,
each logical plan is checked to see whether it is unresolved despite its
children being resolved. This allows the "problem" unresolved plan to be
included in the TreeNodeException for reporting. Then, as a backstop, the root
plan is checked to see if it is resolved, which recursively checks that the
entire plan tree is resolved. Note that the resolved attribute is implemented
recursively, and this patch also explicitly checks the resolved attribute on
each logical plan in the tree. I assume the query plan trees will not be large
enough for this redundant checking to meaningfully impact performance.
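A rough sketch of the shape of these two checks (illustrative only; the actual
patch reports the offending plan via a TreeNodeException, while a plain
exception is used here to keep the sketch short):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    object CheckResolutionSketch extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = {
        // First check: flag the specific node that is unresolved even though its
        // children are resolved, so the "problem" plan can be reported.
        val checked = plan transform {
          case p if !p.resolved && p.childrenResolved =>
            throw new RuntimeException(s"Unresolved plan found:\n$p")
        }
        // Backstop: "resolved" is defined recursively, so checking the root
        // verifies the entire tree.
        if (!checked.resolved) {
          throw new RuntimeException(s"Unresolved plan in tree:\n$checked")
        }
        checked
      }
    }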
Because this patch starts validating that LogicalPlans are resolved before
execution, I had to fix some cases where unresolved plans were passing through
the analyzer as part of the implementation of the hive query system. In
particular, HiveContext applies the CreateTables and PreInsertionCasts rules
manually after the optimizer runs, meaning query plans are not resolved until
after the optimizer. I moved these rules to the analyzer stage (for hive
queries only), in the process completing a code TODO indicating the rules
should be moved to the analyzer.
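The general idea looks something like the following sketch; the extendedRules
hook and the exact wiring are assumptions made for illustration, not
necessarily how the patch implements it:

    // Sketch: inside HiveContext, where catalog and functionRegistry are defined.
    // The hive-specific rules are handed to the analyzer so they run during
    // analysis (the Resolution batch) rather than after the optimizer.
    override protected[sql] lazy val analyzer =
      new Analyzer(catalog, functionRegistry, caseSensitive = false) {
        // Assumed hook for appending extra resolution rules.
        override val extendedRules =
          catalog.CreateTables ::
          catalog.PreInsertionCasts ::
          Nil
      }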
It's worth noting that moving the CreateTables rule means introducing an
analyzer rule with a significant side effect - in this case the side effect is
creating a hive table. The rule will only attempt to create a table once even
if its batch is executed multiple times, because it converts the
InsertIntoCreatedTable plan it matches against into an InsertIntoTable.
Additionally, these hive rules must be added to the Resolution batch rather
than run as a separate batch, because the hive rules may be needed to resolve
non-root nodes, leaving the root to be resolved on a subsequent batch
iteration. For example, the hive compatibility test auto_smb_mapjoin_14, and
others, make use of a query plan where the root is a Union and its children are
each a hive InsertIntoTable.
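Roughly, the shape of such a rule is as follows (a simplified sketch:
InsertIntoCreatedTable and InsertIntoTable are the plan nodes named above and
are assumed to be in scope, while createHiveTable and lookupRelation are
hypothetical stand-ins for the metastore interaction):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    object CreateTablesSketch extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case InsertIntoCreatedTable(db, tableName, child) =>
          // Side effect: create the table in the hive metastore.
          createHiveTable(db, tableName, child.output)
          // Rewrite the node so this case no longer matches, ensuring the table
          // is created only once even if the Resolution batch runs repeatedly.
          InsertIntoTable(lookupRelation(db, tableName), Map.empty, child, overwrite = false)
      }
    }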
Mixing the custom hive rules with standard analyzer rules initially
resulted in an additional failure because of different policies between spark
sql and hive when casting a boolean to a string. Hive casts booleans to strings
as "true" / "false", while spark sql casts booleans to strings as
"1" / "0" (causing the cast1.q test to fail). This behavior is a result
of the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at the
implementation of BooleanCasts I think converting to "1"/"0" is
potentially a programming mistake. (If the BooleanCasts rule is disabled,
casting produces "true"/"false" instead.) I believe "true"/"false" should be
the behavior for spark sql, so I changed the behavior so that booleans are
converted to "true"/"false", consistent with hive, and none of the existing
spark tests failed.
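For reference, the underlying Cast expression itself already renders booleans
as "true"/"false" when the BooleanCasts rewrite does not intervene; a minimal
sketch (the package names and eval signature assume the Catalyst API of this
era):

    import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
    import org.apache.spark.sql.catalyst.types.StringType

    // A boolean-to-string Cast evaluates to "true"/"false"; with this patch the
    // analyzed query behaves the same instead of being rewritten to "1"/"0".
    val asString = Cast(Literal(true), StringType)
    assert(asString.eval(null) == "true")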
Finally, in some initial testing with hive it appears that an implicit type
coercion of boolean to string results in a lowercase string, e.g. CONCAT(TRUE,
'') -> 'true', while an explicit cast produces an all caps string, e.g.
CAST(TRUE AS STRING) -> 'TRUE'. The change I've made just converts to
lowercase strings in all cases. I believe it is at least more correct than the
existing spark sql implementation where all Cast expressions become "1" /
"0".
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/staple/spark SPARK-2781
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1706.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1706
commit 80a1136d58d719f9f029858a42e068553be87eb0
Author: Aaron Staple
Date: 2014-07-31T20:42:32Z
[SPARK-2781] Check resolution of LogicalPlans in Analyzer.
---