[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Taras Bobrovytsky has submitted this change and it was merged. Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. IMPALA-4101: qgen: Hive join predicates should only contains equality functions Background: Hive only supports equi-joins in its JOIN clause, while Postgres and Impala support more complex functions such as <, <=, >, >=, etc. This change modifies the QueryGenerator._create_relational_join_condition and QueryGenerator._create_boolean_func_tree methods to only construct equality join conditions under certain conditions. The _create_boolean_func_tree method is invoked via QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause -> _create_relational_join_condition -> _create_boolean_func_tree. This method is invoked when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of functions that would typically be found in any of these clauses. Changes: The parameter "signatures" is added to the method _create_boolean_func_tree, and it lists out all the allowed signatures the function is allowed to use. Previously, this list of signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if "signatures" is not specified, then the code defaults back to the results of that method. A new method in the DefaultProfile called get_allowed_join_signatures is introduced and returns a list of function signatures that are allowed within a JOIN clause. The DefaultProfile allows all given signatures, while the HiveProfile only allows for the Equals and And functions, as well as any function that operates over only one column. The reason for these restrictions is that Hive only allows equality joins, does not allow OR operators in the join clause, and has some restrictions on functions that operate over multiple different tables. This last restriction is somewhat subtle; if one side of the equals operator contains a function that operates over two different tables, the other side of the operator cannot contain either of those tables. While it is possible to have functions that take in multiple input parameters, the inputs must be taken from specific tables to prevent Hive from throwing a compile time exception. Adding support for this in qgen code will require significant effort and modification to some core methods (_create_relational_join_condition and _populate_func_with_vals), so it's best to disable these for Hive altogether. Note that the _create_boolean_func_tree still allows for OR operators due to some logic around its "and_or_fill_ratio" variable. The plan is to fix this in a future patch that specifically focuses on removing OR operators from Hive JOIN clauses. Minor change to discrepancy_searcher so that the logs print out "Hive" instead of "Impala" when running against Hive. Testing: * Added a new unit test that ensures the HiveProfile only returns equality joins * Unit tests pass * Tested against Hive locally * Tested against Impala via Leopard * Tested against Impala via the Discrepancy Checker Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Reviewed-on: http://gerrit.cloudera.org:8080/4419 Reviewed-by: Taras BobrovytskyTested-by: Taras Bobrovytsky --- M tests/comparison/discrepancy_searcher.py M tests/comparison/query_generator.py M tests/comparison/query_profile.py A tests/comparison/tests/hive/test_hive_create_relational_join_condition.py 4 files changed, 139 insertions(+), 20 deletions(-) Approvals: Taras Bobrovytsky: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 12 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sahil Takiar Gerrit-Reviewer: David Knupp Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. Patch Set 11: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sahil TakiarGerrit-Reviewer: David Knupp Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. Patch Set 11: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sahil TakiarGerrit-Reviewer: David Knupp Gerrit-Reviewer: Michael Brown Gerrit-Reviewer: Sahil Takiar Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: No
Re: [Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Done, thanks Henry! --Sahil On Mon, Sep 26, 2016 at 3:57 PM, Henry Robinsonwrote: > stak...@cloudera.com - Could you please fill out your Gerrit profile? In > the top-right corner there's a drop-down menu with 'settings' as an option. > > That stops you from showing up on these reviews as 'Anonymous Coward'. > > On 25 September 2016 at 18:14, Anonymous Coward (Code Review) < > ger...@cloudera.org> wrote: > >> Hello Michael Brown, >> >> I'd like you to reexamine a change. Please visit >> >> http://gerrit.cloudera.org:8080/4419 >> >> to look at the new patch set (#9). >> >> Change subject: IMPALA-4101: qgen: Hive join predicates should only >> contains equality functions >> .. >> >> IMPALA-4101: qgen: Hive join predicates should only contains equality >> functions >> >> Background: >> >> Hive only supports equi-joins in its JOIN clause, while Postgres and >> Impala support more >> complex functions such as <, <=, >, >=, etc. This change modifies the >> QueryGenerator._create_relational_join_condition and >> QueryGenerator._create_boolean_func_tree methods to only construct >> equality join >> conditions under certain conditions. >> >> The _create_boolean_func_tree method is invoked via >> QueryGenerator -> create_query -> _create_from_clause -> >> _create_join_clause -> >> _create_relational_join_condition -> _create_boolean_func_tree. This >> method is invoked >> when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree >> of functions >> that would typically be found in any of these clauses. >> >> Changes: >> >> The parameter "signatures" is added to the method >> _create_boolean_func_tree, and it lists >> out all the allowed signatures the function is allowed to use. >> Previously, this list of >> signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), >> and if >> "signatures" is not specified, then the code defaults back to the results >> of that method. >> A new method in the DefaultProfile called get_allowed_join_signatures is >> introduced and >> returns a list of function signatures that are allowed within a JOIN >> clause. The >> DefaultProfile allows all given signatures, while the HiveProfile only >> allows for the >> Equals and And functions, as well as any function that operates over only >> one column. >> The reason for these restrictions is that Hive only allows equality >> joins, does not allow >> OR operators in the join clause, and has some restrictions on functions >> that operate over >> multiple different tables. This last restriction is somewhat subtle; if >> one side of the >> equals operator contains a function that operates over two different >> tables, the other >> side of the operator cannot contain either of those tables. While it is >> possible to have >> functions that take in multiple input parameters, the inputs must be >> taken from specific >> tables to prevent Hive from throwing a compile time exception. Adding >> support for this in >> qgen code will require significant effort and modification to some core >> methods >> (_create_relational_join_condition and _populate_func_with_vals), so >> it's best to disable >> these for Hive altogether. >> >> Note that the _create_boolean_func_tree still allows for OR operators due >> to some logic >> around its "and_or_fill_ratio" variable. The plan is to fix this in a >> future patch that >> specifically focuses on removing OR operators from Hive JOIN clauses. >> >> Minor change to discrepancy_searcher so that the logs print out "Hive" >> instead of >> "Impala" when running against Hive. >> >> Testing: >> >> * Added a new unit test that ensures the HiveProfile only returns >> equality joins >> * Unit tests pass >> * Tested against Hive locally >> * Tested against Impala via Leopard >> * Tested against Impala via the Discrepancy Checker >> >> Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 >> --- >> M tests/comparison/discrepancy_searcher.py >> M tests/comparison/query_generator.py >> M tests/comparison/query_profile.py >> A tests/comparison/tests/hive/test_hive_create_relational_join >> _condition.py >> 4 files changed, 139 insertions(+), 20 deletions(-) >> >> >> git pull ssh://gerrit.cloudera.org:29418/Impala-ASF >> refs/changes/19/4419/9 >> -- >> To view, visit http://gerrit.cloudera.org:8080/4419 >> To unsubscribe, visit http://gerrit.cloudera.org:8080/settings >> >> Gerrit-MessageType: newpatchset >> Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 >> Gerrit-PatchSet: 9 >> Gerrit-Project: Impala-ASF >> Gerrit-Branch: master >> Gerrit-Owner: stak...@cloudera.com >> Gerrit-Reviewer: David Knupp >> Gerrit-Reviewer: Michael Brown >> Gerrit-Reviewer: Taras Bobrovytsky >> Gerrit-Reviewer: stak...@cloudera.com >> >> -- >> You received this message because you are subscribed to the Google Groups >> "impala-cr" group. >>
Re: [Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
stak...@cloudera.com - Could you please fill out your Gerrit profile? In the top-right corner there's a drop-down menu with 'settings' as an option. That stops you from showing up on these reviews as 'Anonymous Coward'. On 25 September 2016 at 18:14, Anonymous Coward (Code Review) < ger...@cloudera.org> wrote: > Hello Michael Brown, > > I'd like you to reexamine a change. Please visit > > http://gerrit.cloudera.org:8080/4419 > > to look at the new patch set (#9). > > Change subject: IMPALA-4101: qgen: Hive join predicates should only > contains equality functions > .. > > IMPALA-4101: qgen: Hive join predicates should only contains equality > functions > > Background: > > Hive only supports equi-joins in its JOIN clause, while Postgres and > Impala support more > complex functions such as <, <=, >, >=, etc. This change modifies the > QueryGenerator._create_relational_join_condition and > QueryGenerator._create_boolean_func_tree methods to only construct > equality join > conditions under certain conditions. > > The _create_boolean_func_tree method is invoked via > QueryGenerator -> create_query -> _create_from_clause -> > _create_join_clause -> > _create_relational_join_condition -> _create_boolean_func_tree. This > method is invoked > when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree > of functions > that would typically be found in any of these clauses. > > Changes: > > The parameter "signatures" is added to the method > _create_boolean_func_tree, and it lists > out all the allowed signatures the function is allowed to use. Previously, > this list of > signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), > and if > "signatures" is not specified, then the code defaults back to the results > of that method. > A new method in the DefaultProfile called get_allowed_join_signatures is > introduced and > returns a list of function signatures that are allowed within a JOIN > clause. The > DefaultProfile allows all given signatures, while the HiveProfile only > allows for the > Equals and And functions, as well as any function that operates over only > one column. > The reason for these restrictions is that Hive only allows equality joins, > does not allow > OR operators in the join clause, and has some restrictions on functions > that operate over > multiple different tables. This last restriction is somewhat subtle; if > one side of the > equals operator contains a function that operates over two different > tables, the other > side of the operator cannot contain either of those tables. While it is > possible to have > functions that take in multiple input parameters, the inputs must be taken > from specific > tables to prevent Hive from throwing a compile time exception. Adding > support for this in > qgen code will require significant effort and modification to some core > methods > (_create_relational_join_condition and _populate_func_with_vals), so it's > best to disable > these for Hive altogether. > > Note that the _create_boolean_func_tree still allows for OR operators due > to some logic > around its "and_or_fill_ratio" variable. The plan is to fix this in a > future patch that > specifically focuses on removing OR operators from Hive JOIN clauses. > > Minor change to discrepancy_searcher so that the logs print out "Hive" > instead of > "Impala" when running against Hive. > > Testing: > > * Added a new unit test that ensures the HiveProfile only returns equality > joins > * Unit tests pass > * Tested against Hive locally > * Tested against Impala via Leopard > * Tested against Impala via the Discrepancy Checker > > Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 > --- > M tests/comparison/discrepancy_searcher.py > M tests/comparison/query_generator.py > M tests/comparison/query_profile.py > A tests/comparison/tests/hive/test_hive_create_relational_ > join_condition.py > 4 files changed, 139 insertions(+), 20 deletions(-) > > > git pull ssh://gerrit.cloudera.org:29418/Impala-ASF > refs/changes/19/4419/9 > -- > To view, visit http://gerrit.cloudera.org:8080/4419 > To unsubscribe, visit http://gerrit.cloudera.org:8080/settings > > Gerrit-MessageType: newpatchset > Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 > Gerrit-PatchSet: 9 > Gerrit-Project: Impala-ASF > Gerrit-Branch: master > Gerrit-Owner: stak...@cloudera.com > Gerrit-Reviewer: David Knupp> Gerrit-Reviewer: Michael Brown > Gerrit-Reviewer: Taras Bobrovytsky > Gerrit-Reviewer: stak...@cloudera.com > > -- > You received this message because you are subscribed to the Google Groups > "impala-cr" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to impala-cr+unsubscr...@cloudera.com. > For more options, visit https://groups.google.com/a/cloudera.com/d/optout. > -- Henry Robinson Software Engineer
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. Patch Set 9: (2 comments) http://gerrit.cloudera.org:8080/#/c/4419/7/tests/comparison/query_generator.py File tests/comparison/query_generator.py: Line 1442: signature = self.profile.choose_relational_func_signature(relational_signatures) > Yeah, only_use_equality_join_predicates() is never called inside _create_bo Agreed. http://gerrit.cloudera.org:8080/#/c/4419/9/tests/comparison/query_generator.py File tests/comparison/query_generator.py: PS9, Line 1214: this should be 4 spaces, not 2. -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com Gerrit-Reviewer: David KnuppGerrit-Reviewer: Michael Brown Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: stak...@cloudera.com Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Hello Michael Brown, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4419 to look at the new patch set (#9). Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. IMPALA-4101: qgen: Hive join predicates should only contains equality functions Background: Hive only supports equi-joins in its JOIN clause, while Postgres and Impala support more complex functions such as <, <=, >, >=, etc. This change modifies the QueryGenerator._create_relational_join_condition and QueryGenerator._create_boolean_func_tree methods to only construct equality join conditions under certain conditions. The _create_boolean_func_tree method is invoked via QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause -> _create_relational_join_condition -> _create_boolean_func_tree. This method is invoked when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of functions that would typically be found in any of these clauses. Changes: The parameter "signatures" is added to the method _create_boolean_func_tree, and it lists out all the allowed signatures the function is allowed to use. Previously, this list of signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if "signatures" is not specified, then the code defaults back to the results of that method. A new method in the DefaultProfile called get_allowed_join_signatures is introduced and returns a list of function signatures that are allowed within a JOIN clause. The DefaultProfile allows all given signatures, while the HiveProfile only allows for the Equals and And functions, as well as any function that operates over only one column. The reason for these restrictions is that Hive only allows equality joins, does not allow OR operators in the join clause, and has some restrictions on functions that operate over multiple different tables. This last restriction is somewhat subtle; if one side of the equals operator contains a function that operates over two different tables, the other side of the operator cannot contain either of those tables. While it is possible to have functions that take in multiple input parameters, the inputs must be taken from specific tables to prevent Hive from throwing a compile time exception. Adding support for this in qgen code will require significant effort and modification to some core methods (_create_relational_join_condition and _populate_func_with_vals), so it's best to disable these for Hive altogether. Note that the _create_boolean_func_tree still allows for OR operators due to some logic around its "and_or_fill_ratio" variable. The plan is to fix this in a future patch that specifically focuses on removing OR operators from Hive JOIN clauses. Minor change to discrepancy_searcher so that the logs print out "Hive" instead of "Impala" when running against Hive. Testing: * Added a new unit test that ensures the HiveProfile only returns equality joins * Unit tests pass * Tested against Hive locally * Tested against Impala via Leopard * Tested against Impala via the Discrepancy Checker Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 --- M tests/comparison/discrepancy_searcher.py M tests/comparison/query_generator.py M tests/comparison/query_profile.py A tests/comparison/tests/hive/test_hive_create_relational_join_condition.py 4 files changed, 139 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/9 -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com Gerrit-Reviewer: David KnuppGerrit-Reviewer: Michael Brown Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: stak...@cloudera.com
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Hello Michael Brown, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4419 to look at the new patch set (#8). Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. IMPALA-4101: qgen: Hive join predicates should only contains equality functions Background: Hive only supports equi-joins in its JOIN clause, while Postgres and Impala support more complex functions such as <, <=, >, >=, etc. This change modifies the QueryGenerator._create_relational_join_condition and QueryGenerator._create_boolean_func_tree methods to only construct equality join conditions under certain conditions. The _create_boolean_func_tree method is invoked via QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause -> _create_relational_join_condition -> _create_boolean_func_tree. This method is invoked when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of functions that would typically be found in any of these clauses. Changes: The parameter "signatures" is added to the method _create_boolean_func_tree, and it lists out all the allowed signatures the function is allowed to use. Previously, this list of signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if "signatures" is not specified, then the code defaults back to the results of that method. A new method in the DefaultProfile called get_allowed_join_signatures is introduced and returns a list of function signatures that are allowed within a JOIN clause. The DefaultProfile allows all given signatures, while the HiveProfile only allows for the Equals and And functions, as well as any function that operates over only one column. The reason for these restrictions is that Hive only allows equality joins, does not allow OR operators in the join clause, and has some restrictions on functions that operate over multiple different tables. This last restriction is somewhat subtle; if one side of the equals operator contains a function that operates over two different tables, the other side of the operator cannot contain either of those tables. While it is possible to have functions that take in multiple input parameters, the inputs must be taken from specific tables to prevent Hive from throwing a compile time exception. Adding support for this in qgen code will require significant effort and modification to some core methods (_create_relational_join_condition and _populate_func_with_vals), so it's best to disable these for Hive altogether. Note that the _create_boolean_func_tree still allows for OR operators due to some logic around its "and_or_fill_ratio" variable. The plan is to fix this in a future patch that specifically focuses on removing OR operators from Hive JOIN clauses. Minor change to discrepancy_searcher so that the logs print out "Hive" instead of "Impala" when running against Hive. Testing: * Added a new unit test that ensures the HiveProfile only returns equality joins * Unit tests pass * Tested against Hive locally * Tested against Impala via Leopard * Tested against Impala via the Discrepancy Checker Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 --- M tests/comparison/discrepancy_searcher.py M tests/comparison/query_generator.py M tests/comparison/query_profile.py A tests/comparison/tests/hive/test_hive_create_relational_join_condition.py 4 files changed, 140 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/8 -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com Gerrit-Reviewer: David KnuppGerrit-Reviewer: Michael Brown Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: stak...@cloudera.com
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Hello Michael Brown, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4419 to look at the new patch set (#7). Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. IMPALA-4101: qgen: Hive join predicates should only contains equality functions Background: Hive only supports equi-joins in its JOIN clause, while Postgres and Impala support more complex functions such as <, <=, >, >=, etc. This change modifies the QueryGenerator._create_relational_join_condition and QueryGenerator._create_boolean_func_tree methods to only construct equality join conditions under certain conditions. The _create_boolean_func_tree method is invoked via QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause -> _create_relational_join_condition -> _create_boolean_func_tree. This method is invoked when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of functions that would typically be found in any of these clauses. Changes: The parameter "signatures" is added to the method _create_boolean_func_tree, and it lists out all the allowed signatures the function is allowed to use. Previously, this list of signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if "signatures" is not specified, then the code defaults back to the results of that method. A new method in the DefaultProfile called get_allowed_join_signatures is introduced and returns a list of function signatures that are allowed within a JOIN clause. The DefaultProfile allows all given signatures, while the HiveProfile only allows for the Equals and And functions, as well as any function that operates over only one column. The reason for these restrictions is that Hive only allows equality joins, does not allow OR operators in the join clause, and has some restrictions on functions that operate over multiple different tables. This last restriction is somewhat subtle; if one side of the equals operator contains a function that operates over two different tables, the other side of the operator cannot contain either of those tables. While it is possible to have functions that take in multiple input parameters, the inputs must be taken from specific tables to prevent Hive from throwing a compile time exception. Adding support for this in qgen code will require significant effort and modification to some core methods (_create_relational_join_condition and _populate_func_with_vals), so it's best to disable these for Hive altogether. Note that the _create_boolean_func_tree still allows for OR operators due to some logic around its "and_or_fill_ratio" variable. The plan is to fix this in a future patch that specifically focuses on removing OR operators from Hive JOIN clauses. Minor change to discrepancy_searcher so that the logs print out "Hive" instead of "Impala" when running against Hive. Testing: * Added a new unit test that ensures the HiveProfile only returns equality joins * Unit tests pass * Tested against Hive locally * Tested against Impala via Leopard * Tested against Impala via the Discrepancy Checker Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 --- M tests/comparison/discrepancy_searcher.py M tests/comparison/query_generator.py M tests/comparison/query_profile.py A tests/comparison/tests/hive/test_hive_create_relational_join_condition.py 4 files changed, 137 insertions(+), 19 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/7 -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com Gerrit-Reviewer: David KnuppGerrit-Reviewer: Michael Brown Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: stak...@cloudera.com
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. Patch Set 5: Code-Review-1 (1 comment) http://gerrit.cloudera.org:8080/#/c/4419/5/tests/comparison/query_generator.py File tests/comparison/query_generator.py: Line 1436: if self.profile.only_use_equality_join_predicates(): I don't think this whole patch is necessary. We just need only_use_equality_join_predicates to always be true. This can be done by setting this probability http://github.mtv.cloudera.com/CDH/Impala/blob/cdh5-trunk/tests/comparison/query_profile.py#L220 to 1. -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com Gerrit-Reviewer: David KnuppGerrit-Reviewer: Michael Brown Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: stak...@cloudera.com Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Michael Brown has posted comments on this change. Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. Patch Set 5: Code-Review+1 A committer needs to look at this for +2 -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com Gerrit-Reviewer: David KnuppGerrit-Reviewer: Michael Brown Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: stak...@cloudera.com Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Michael Brown has posted comments on this change. Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. Patch Set 3: (3 comments) http://gerrit.cloudera.org:8080/#/c/4419/3//COMMIT_MSG Commit Message: PS3, Line 35: subtle, Nit: this is a comma splice. You can change the comma to something like a colon or semi-colon, or use a period and create a new sentence after "subtle." PS3, Line 41: its Nit: "it's" PS3, Line 45: it's Nit: "its" :) -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com Gerrit-Reviewer: David KnuppGerrit-Reviewer: Michael Brown Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: stak...@cloudera.com Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
Michael Brown has posted comments on this change. Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. Patch Set 3: (3 comments) Thanks for adding a test! http://gerrit.cloudera.org:8080/#/c/4419/3/tests/comparison/query_generator.py File tests/comparison/query_generator.py: Line 1364: ''' Add documentation about the signatures parameter. http://gerrit.cloudera.org:8080/#/c/4419/3/tests/comparison/query_profile.py File tests/comparison/query_profile.py: PS3, Line 661: if signature.func in (Equals, And): : allowed_join_signatures.append(signature) : elif len(signature.args) == 1: Maybe use an or expression for these two conditions and remove the elif? http://gerrit.cloudera.org:8080/#/c/4419/3/tests/comparison/tests/hive/test_create_join_clause.py File tests/comparison/tests/hive/test_create_join_clause.py: PS3, Line 31: class MockQueryProfile(HiveProfile): 1. I prefer "Fake" instead of "Mock". I prefer the use of "mock" only when using actual mock libraries. Is this reasonable? 2. Use Hive in the name, to explain it's a fake HiveProfile. -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com Gerrit-Reviewer: David KnuppGerrit-Reviewer: Michael Brown Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: stak...@cloudera.com Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions
stak...@cloudera.com has uploaded a new patch set (#2). Change subject: IMPALA-4101: qgen: Hive join predicates should only contains equality functions .. IMPALA-4101: qgen: Hive join predicates should only contains equality functions Background: Hive only supports equi-joins in its JOIN clause, while Postgres and Impala support more complex functions such as <, <=, >, >=, etc. This change modifies the QueryGenerator._create_relational_join_condition and QueryGenerator._create_boolean_func_tree methods to only construct equality join conditions under certain conditions. The _create_boolean_func_tree method is invoked via QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause -> _create_relational_join_condition -> _create_boolean_func_tree. This method is invoked when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of functions that would typically be found in any of these clauses. Changes: The parameter "signatures" is added to the method _create_boolean_func_tree, and it lists out all the allowed signatures the function is allowed to use. Previously, this list of signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if "signatures" is not specified, then the code defaults back to the results of that method. A new method in the DefaultProfile called get_allowed_join_signatures is introduced and returns a list of function signatures that are allowed within a JOIN clause. The DefaultProfile allows all given signatures, while the HiveProfile only allows for the Equals and And functions, as well as any function that operates over only one column. The reason for these restrictions is that Hive only allows equality joins, does not allow OR operators in the join clause, and does not allow any operator that operates over multiple columns. Note that the _create_boolean_func_tree still allows for OR operators due to some logic around it's "and_or_fill_ratio" variable. The plan is to fix this in a future patch that specifically focuses on removing OR operators from Hive JOIN clauses. Minor change to discrepancy_searcher so that the logs print out "Hive" instead of "Impala" when running against Hive. Testing: * Added a new unit test that ensures the HiveProfile only returns equality joins * Tested against Hive locally * Tested against Impala via Leopard * Tested against Impala via the Discrepancy Checker inside an Impala Docker container Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 --- M tests/comparison/discrepancy_searcher.py M tests/comparison/query_generator.py M tests/comparison/query_profile.py A tests/comparison/tests/hive/test_create_join_clause.py 4 files changed, 110 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/2 -- To view, visit http://gerrit.cloudera.org:8080/4419 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: stak...@cloudera.com