GitHub user staple opened a pull request: https://github.com/apache/spark/pull/1706
[SPARK-2781] Check resolution of LogicalPlans in Analyzer.

LogicalPlan contains a 'resolved' attribute indicating that all of its execution requirements have been resolved. This attribute is not checked before query execution. The analyzer contains a step to check that all Expressions are resolved, but this is not equivalent to checking all LogicalPlans. In particular, the Union plan's implementation of 'resolved' verifies that the types of its children's columns are compatible. Because the analyzer does not check that a Union plan is resolved, it is possible to execute a Union plan that outputs different types in the same column. See SPARK-XXX for an example, and the illustration below.

This patch adds two checks to the analyzer's CheckResolution rule (see the sketch below). First, each logical plan is checked to see if it is not resolved despite its children being resolved. This allows the 'problem' unresolved plan to be included in the TreeNodeException for reporting. Then, as a backstop, the root plan is checked to see if it is resolved, which recursively checks that the entire plan tree is resolved. Note that the resolved attribute is implemented recursively, and this patch also explicitly checks the resolved attribute on each logical plan in the tree. I assume the query plan trees will not be large enough for this redundant checking to meaningfully impact performance.

Because this patch starts validating that LogicalPlans are resolved before execution, I had to fix some cases where unresolved plans were passing through the analyzer as part of the implementation of the hive query system. In particular, HiveContext applies the CreateTables and PreInsertionCasts rules manually after the optimizer runs, meaning query plans are not resolved until after the optimizer. I moved these rules to the analyzer stage (for hive queries only), in the process completing a code TODO indicating the rules should be moved to the analyzer.

It's worth noting that moving the CreateTables rule means introducing an analyzer rule with a significant side effect - in this case the side effect is creating a hive table. The rule will only attempt to create a table once even if its batch is executed multiple times, because it converts the InsertIntoCreatedTable plan it matches against into an InsertIntoTable. Additionally, these hive rules must be added to the Resolution batch rather than as a separate batch, because the hive rules may be needed to resolve non-root nodes, leaving the root to be resolved on a subsequent batch iteration. For example, the hive compatibility test auto_smb_mapjoin_14, and others, make use of a query plan where the root is a Union and its children are each a hive InsertIntoTable.

Mixing the custom hive rules with standard analyzer rules initially resulted in an additional failure, because of differing policies between spark sql and hive when casting a boolean to a string. Hive casts booleans to strings as 'true' / 'false' while spark sql casts booleans to strings as '1' / '0' (causing the cast1.q test to fail). This behavior is a result of the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at the implementation of BooleanCasts I think converting to '1'/'0' is potentially a programming mistake. (If the BooleanCasts rule is disabled, casting produces 'true'/'false' instead.) I believe 'true' / 'false' should be the behavior for spark sql - I changed the behavior so booleans are converted to 'true'/'false' to be consistent with hive, and none of the existing spark tests failed.
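To illustrate the Union issue mentioned above, here is a hypothetical spark-shell session (table, class, and column names are made up; the Spark 1.x API of the time is assumed):

    // Hypothetical spark-shell session. Before this patch the mismatched
    // column types slipped past the analyzer and the query executed; after
    // it, the query fails analysis with a TreeNodeException.
    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    case class IntRow(c: Int)
    case class StrRow(c: String)
    sc.parallelize(Seq(IntRow(1))).registerAsTable("ints")
    sc.parallelize(Seq(StrRow("a"))).registerAsTable("strs")

    // A single output column mixing Int and String values:
    sql("SELECT c FROM ints UNION ALL SELECT c FROM strs").collect()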
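And a minimal sketch of the two CheckResolution checks described above, written against the Catalyst API of the time (the rule as actually committed may differ in detail):

    import org.apache.spark.sql.catalyst.errors.TreeNodeException
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    object CheckResolutionSketch extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = {
        // Check 1: report the lowest 'problem' node, i.e. a plan that is
        // still unresolved even though all of its children are resolved,
        // so the exception points at the node that failed to resolve.
        plan.foreach {
          case p if !p.resolved && p.childrenResolved =>
            throw new TreeNodeException(p, "Unresolved plan found")
          case _ =>
        }
        // Check 2 (backstop): 'resolved' is defined recursively, so
        // checking the root verifies the entire plan tree.
        if (!plan.resolved) {
          throw new TreeNodeException(plan, "Unresolved plan in tree")
        }
        plan
      }
    }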
Finally, in some initial testing with hive it appears that an implicit type coercion of boolean to string results in a lowercase string, e.g. CONCAT( TRUE, '' ) -> 'true', while an explicit cast produces an all caps string, e.g. CAST( TRUE AS STRING ) -> 'TRUE'. The change I've made just converts to lowercase strings in all cases (see the sketch at the end of this message). I believe it is at least more correct than the existing spark sql implementation, where all Cast expressions become '1' / '0'.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/staple/spark SPARK-2781

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1706.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1706

----

commit 80a1136d58d719f9f029858a42e068553be87eb0
Author: Aaron Staple <aaron.sta...@gmail.com>
Date:   2014-07-31T20:42:32Z

    [SPARK-2781] Check resolution of LogicalPlans in Analyzer.

----
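P.S. A small sketch of the cast behavior change discussed above, as a hypothetical spark-shell session (table and class names are made up; the Spark 1.x API of the time is assumed):

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    case class Flag(b: Boolean)
    sc.parallelize(Seq(Flag(true), Flag(false))).registerAsTable("flags")

    // Before this patch the BooleanCasts rule rewrote the cast so this
    // returned "1" and "0"; after it, "true" and "false", matching hive's
    // implicit coercion (CONCAT(TRUE, '') -> 'true').
    sql("SELECT CAST(b AS STRING) FROM flags").collect()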