Hello Michael Brown,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/4419

to look at the new patch set (#6).

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions
......................................................................

IMPALA-4101: qgen: Hive join predicates should only contains equality functions

Background:

Hive only supports equi-joins in its JOIN clause, while Postgres and
Impala also support other comparison operators in join conditions,
such as <, <=, >, and >=. This change modifies the
QueryGenerator._create_relational_join_condition and
QueryGenerator._create_boolean_func_tree methods so that they construct
only equality join conditions when the query profile restricts the
allowed join signatures, as the HiveProfile does.

The _create_boolean_func_tree method is invoked via QueryGenerator ->
create_query -> _create_from_clause -> _create_join_clause ->
_create_relational_join_condition -> _create_boolean_func_tree. This
method is invoked when constructing the JOIN, WHERE, and HAVING
clauses. It creates a tree of functions that would typically be found
in any of these clauses.

Changes:

The parameter "signatures" is added to the method
_create_boolean_func_tree. It lists the function signatures the method
is allowed to use. Previously, this list of signatures was populated by
calling _funcs_to_allowed_signatures(FUNCS); if "signatures" is not
specified, the code falls back to the results of that method.

A new method in the DefaultProfile called get_allowed_join_signatures
is introduced. It returns a list of function signatures that are
allowed within a JOIN clause. The DefaultProfile allows all given
signatures, while the HiveProfile only allows the Equals and And
functions, as well as any function that operates over only one column.
The reason for these restrictions is that Hive only allows equality
joins, does not allow OR operators in the join clause, and has some
restrictions on functions that operate over multiple different tables.

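The filtering described above can be sketched roughly as follows. This
is a minimal illustration, not the qgen code from this patch: the
Signature tuple, the ALL_SIGNATURES list, the num_args field used to
approximate "operates over only one column", and the simplified
create_boolean_func_tree function are all hypothetical stand-ins. Only
the names DefaultProfile, HiveProfile, get_allowed_join_signatures and
the "signatures" parameter come from the change description.

```python
from collections import namedtuple

# Hypothetical stand-in for a qgen function signature: a function name plus
# the number of column arguments it takes. The real qgen classes differ.
Signature = namedtuple("Signature", ["func_name", "num_args"])

# Hypothetical stand-in for the result of _funcs_to_allowed_signatures(FUNCS).
ALL_SIGNATURES = [
    Signature("Equals", 2),
    Signature("And", 2),
    Signature("Or", 2),
    Signature("LessThan", 2),
    Signature("GreaterThanOrEquals", 2),
    Signature("IsNull", 1),
]


class DefaultProfile(object):
    def get_allowed_join_signatures(self, signatures):
        # The default profile places no restriction on JOIN predicates.
        return list(signatures)


class HiveProfile(DefaultProfile):
    def get_allowed_join_signatures(self, signatures):
        # Keep Equals and And, plus any function over a single column
        # (approximated here, as an assumption, by num_args == 1).
        return [s for s in signatures
                if s.func_name in ("Equals", "And") or s.num_args == 1]


def create_boolean_func_tree(signatures=None):
    # Simplified stand-in: when no explicit list is passed, fall back to
    # every known signature, mirroring the default described above.
    if signatures is None:
        signatures = ALL_SIGNATURES
    return [s.func_name for s in signatures]


hive_signatures = HiveProfile().get_allowed_join_signatures(ALL_SIGNATURES)
print(create_boolean_func_tree(hive_signatures))
```

With these stand-ins, the Hive profile drops Or and the inequality
functions, while calling create_boolean_func_tree() without arguments
still draws from the full list.
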
The last of these restrictions, on functions that operate over
multiple tables, is somewhat subtle: if one side of the equals operator
contains a function that operates over two different tables, the other
side of the operator cannot reference either of those tables. While it
is possible to have functions that take in multiple input parameters,
the inputs must be taken from specific tables to prevent Hive from
throwing a compile-time exception. Adding support for this in the qgen
code would require significant effort and modification to some core
methods (_create_relational_join_condition and
_populate_func_with_vals), so it's best to disable these functions for
Hive altogether.

Note that _create_boolean_func_tree still allows OR operators due to
some logic around its "and_or_fill_ratio" variable. The plan is to fix
this in a future patch that specifically focuses on removing OR
operators from Hive JOIN clauses.

A minor change to discrepancy_searcher makes the logs print "Hive"
instead of "Impala" when running against Hive.

Testing:

* Added a new unit test that ensures the HiveProfile only returns
  equality joins
* Unit tests pass
* Tested against Hive locally
* Tested against Impala via Leopard
* Tested against Impala via the Discrepancy Checker

Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
---
M tests/comparison/discrepancy_searcher.py
M tests/comparison/query_generator.py
M tests/comparison/query_profile.py
A tests/comparison/tests/hive/test_hive_create_relational_join_condition.py
4 files changed, 137 insertions(+), 19 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/6
--
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp <dkn...@cloudera.com>
Gerrit-Reviewer: Michael Brown <mi...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-Reviewer: stak...@cloudera.com