[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-27 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has submitted this change and it was merged.

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..


IMPALA-4101: qgen: Hive join predicates should only contains equality functions

Background:

Hive only supports equi-joins in its JOIN clause, while Postgres and Impala 
support more
complex functions such as <, <=, >, >=, etc. This change modifies the
QueryGenerator._create_relational_join_condition and
QueryGenerator._create_boolean_func_tree methods to only construct equality join
conditions under certain conditions.

The _create_boolean_func_tree method is invoked via
QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause ->
_create_relational_join_condition -> _create_boolean_func_tree. This method is 
invoked
when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of 
functions
that would typically be found in any of these clauses.

Changes:

The parameter "signatures" is added to the method _create_boolean_func_tree, 
and it lists
out all the allowed signatures the function is allowed to use. Previously, this 
list of
signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if
"signatures" is not specified, then the code defaults back to the results of 
that method.
A new method in the DefaultProfile called get_allowed_join_signatures is 
introduced and
returns a list of function signatures that are allowed within a JOIN clause. The
DefaultProfile allows all given signatures, while the HiveProfile only allows 
for the
Equals and And functions, as well as any function that operates over only one 
column.
The reason for these restrictions is that Hive only allows equality joins, does 
not allow
OR operators in the join clause, and has some restrictions on functions that 
operate over
multiple different tables. This last restriction is somewhat subtle; if one 
side of the
equals operator contains a function that operates over two different tables, 
the other
side of the operator cannot contain either of those tables. While it is 
possible to have
functions that take in multiple input parameters, the inputs must be taken from 
specific
tables to prevent Hive from throwing a compile time exception. Adding support 
for this in
qgen code will require significant effort and modification to some core methods
(_create_relational_join_condition and _populate_func_with_vals), so it's best 
to disable
these for Hive altogether.

Note that the _create_boolean_func_tree still allows for OR operators due to 
some logic
around its "and_or_fill_ratio" variable. The plan is to fix this in a future 
patch that
specifically focuses on removing OR operators from Hive JOIN clauses.

Minor change to discrepancy_searcher so that the logs print out "Hive" instead 
of
"Impala" when running against Hive.

Testing:

* Added a new unit test that ensures the HiveProfile only returns equality joins
* Unit tests pass
* Tested against Hive locally
* Tested against Impala via Leopard
* Tested against Impala via the Discrepancy Checker

Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Reviewed-on: http://gerrit.cloudera.org:8080/4419
Reviewed-by: Taras Bobrovytsky 
Tested-by: Taras Bobrovytsky 
---
M tests/comparison/discrepancy_searcher.py
M tests/comparison/query_generator.py
M tests/comparison/query_profile.py
A tests/comparison/tests/hive/test_hive_create_relational_join_condition.py
4 files changed, 139 insertions(+), 20 deletions(-)

Approvals:
  Taras Bobrovytsky: Looks good to me, approved; Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Taras Bobrovytsky 


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-27 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..


Patch Set 11: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-27 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..


Patch Set 11: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-HasComments: No


Re: [Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-26 Thread Sahil Takiar
Done, thanks Henry!

--Sahil

On Mon, Sep 26, 2016 at 3:57 PM, Henry Robinson  wrote:

> stak...@cloudera.com - Could you please fill out your Gerrit profile? In
> the top-right corner there's a drop-down menu with 'settings' as an option.
>
> That stops you from showing up on these reviews as 'Anonymous Coward'.
>
> On 25 September 2016 at 18:14, Anonymous Coward (Code Review) <
> ger...@cloudera.org> wrote:
>
>> Hello Michael Brown,
>>
>> I'd like you to reexamine a change.  Please visit
>>
>> http://gerrit.cloudera.org:8080/4419
>>
>> to look at the new patch set (#9).
>>
>> Change subject: IMPALA-4101: qgen: Hive join predicates should only
>> contains equality functions
>> ..
>>
>> IMPALA-4101: qgen: Hive join predicates should only contains equality
>> functions
>>
>> Background:
>>
>> Hive only supports equi-joins in its JOIN clause, while Postgres and
>> Impala support more
>> complex functions such as <, <=, >, >=, etc. This change modifies the
>> QueryGenerator._create_relational_join_condition and
>> QueryGenerator._create_boolean_func_tree methods to only construct
>> equality join
>> conditions under certain conditions.
>>
>> The _create_boolean_func_tree method is invoked via
>> QueryGenerator -> create_query -> _create_from_clause ->
>> _create_join_clause ->
>> _create_relational_join_condition -> _create_boolean_func_tree. This
>> method is invoked
>> when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree
>> of functions
>> that would typically be found in any of these clauses.
>>
>> Changes:
>>
>> The parameter "signatures" is added to the method
>> _create_boolean_func_tree, and it lists
>> out all the allowed signatures the function is allowed to use.
>> Previously, this list of
>> signatures was populated by calling _funcs_to_allowed_signatures(FUNCS),
>> and if
>> "signatures" is not specified, then the code defaults back to the results
>> of that method.
>> A new method in the DefaultProfile called get_allowed_join_signatures is
>> introduced and
>> returns a list of function signatures that are allowed within a JOIN
>> clause. The
>> DefaultProfile allows all given signatures, while the HiveProfile only
>> allows for the
>> Equals and And functions, as well as any function that operates over only
>> one column.
>> The reason for these restrictions is that Hive only allows equality
>> joins, does not allow
>> OR operators in the join clause, and has some restrictions on functions
>> that operate over
>> multiple different tables. This last restriction is somewhat subtle; if
>> one side of the
>> equals operator contains a function that operates over two different
>> tables, the other
>> side of the operator cannot contain either of those tables. While it is
>> possible to have
>> functions that take in multiple input parameters, the inputs must be
>> taken from specific
>> tables to prevent Hive from throwing a compile time exception. Adding
>> support for this in
>> qgen code will require significant effort and modification to some core
>> methods
>> (_create_relational_join_condition and _populate_func_with_vals), so
>> it's best to disable
>> these for Hive altogether.
>>
>> Note that the _create_boolean_func_tree still allows for OR operators due
>> to some logic
>> around its "and_or_fill_ratio" variable. The plan is to fix this in a
>> future patch that
>> specifically focuses on removing OR operators from Hive JOIN clauses.
>>
>> Minor change to discrepancy_searcher so that the logs print out "Hive"
>> instead of
>> "Impala" when running against Hive.
>>
>> Testing:
>>
>> * Added a new unit test that ensures the HiveProfile only returns
>> equality joins
>> * Unit tests pass
>> * Tested against Hive locally
>> * Tested against Impala via Leopard
>> * Tested against Impala via the Discrepancy Checker
>>
>> Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
>> ---
>> M tests/comparison/discrepancy_searcher.py
>> M tests/comparison/query_generator.py
>> M tests/comparison/query_profile.py
>> A tests/comparison/tests/hive/test_hive_create_relational_join
>> _condition.py
>> 4 files changed, 139 insertions(+), 20 deletions(-)
>>
>>
>>   git pull ssh://gerrit.cloudera.org:29418/Impala-ASF
>> refs/changes/19/4419/9
>> --
>> To view, visit http://gerrit.cloudera.org:8080/4419
>> To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
>>
>> Gerrit-MessageType: newpatchset
>> Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
>> Gerrit-PatchSet: 9
>> Gerrit-Project: Impala-ASF
>> Gerrit-Branch: master
>> Gerrit-Owner: stak...@cloudera.com
>> Gerrit-Reviewer: David Knupp 
>> Gerrit-Reviewer: Michael Brown 
>> Gerrit-Reviewer: Taras Bobrovytsky 
>> Gerrit-Reviewer: stak...@cloudera.com
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "impala-cr" group.
>> 

Re: [Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-26 Thread Henry Robinson
stak...@cloudera.com - Could you please fill out your Gerrit profile? In
the top-right corner there's a drop-down menu with 'settings' as an option.

That stops you from showing up on these reviews as 'Anonymous Coward'.

On 25 September 2016 at 18:14, Anonymous Coward (Code Review) <
ger...@cloudera.org> wrote:

> Hello Michael Brown,
>
> I'd like you to reexamine a change.  Please visit
>
> http://gerrit.cloudera.org:8080/4419
>
> to look at the new patch set (#9).
>
> Change subject: IMPALA-4101: qgen: Hive join predicates should only
> contains equality functions
> ..
>
> IMPALA-4101: qgen: Hive join predicates should only contains equality
> functions
>
> Background:
>
> Hive only supports equi-joins in its JOIN clause, while Postgres and
> Impala support more
> complex functions such as <, <=, >, >=, etc. This change modifies the
> QueryGenerator._create_relational_join_condition and
> QueryGenerator._create_boolean_func_tree methods to only construct
> equality join
> conditions under certain conditions.
>
> The _create_boolean_func_tree method is invoked via
> QueryGenerator -> create_query -> _create_from_clause ->
> _create_join_clause ->
> _create_relational_join_condition -> _create_boolean_func_tree. This
> method is invoked
> when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree
> of functions
> that would typically be found in any of these clauses.
>
> Changes:
>
> The parameter "signatures" is added to the method
> _create_boolean_func_tree, and it lists
> out all the allowed signatures the function is allowed to use. Previously,
> this list of
> signatures was populated by calling _funcs_to_allowed_signatures(FUNCS),
> and if
> "signatures" is not specified, then the code defaults back to the results
> of that method.
> A new method in the DefaultProfile called get_allowed_join_signatures is
> introduced and
> returns a list of function signatures that are allowed within a JOIN
> clause. The
> DefaultProfile allows all given signatures, while the HiveProfile only
> allows for the
> Equals and And functions, as well as any function that operates over only
> one column.
> The reason for these restrictions is that Hive only allows equality joins,
> does not allow
> OR operators in the join clause, and has some restrictions on functions
> that operate over
> multiple different tables. This last restriction is somewhat subtle; if
> one side of the
> equals operator contains a function that operates over two different
> tables, the other
> side of the operator cannot contain either of those tables. While it is
> possible to have
> functions that take in multiple input parameters, the inputs must be taken
> from specific
> tables to prevent Hive from throwing a compile time exception. Adding
> support for this in
> qgen code will require significant effort and modification to some core
> methods
> (_create_relational_join_condition and _populate_func_with_vals), so it's
> best to disable
> these for Hive altogether.
>
> Note that the _create_boolean_func_tree still allows for OR operators due
> to some logic
> around its "and_or_fill_ratio" variable. The plan is to fix this in a
> future patch that
> specifically focuses on removing OR operators from Hive JOIN clauses.
>
> Minor change to discrepancy_searcher so that the logs print out "Hive"
> instead of
> "Impala" when running against Hive.
>
> Testing:
>
> * Added a new unit test that ensures the HiveProfile only returns equality
> joins
> * Unit tests pass
> * Tested against Hive locally
> * Tested against Impala via Leopard
> * Tested against Impala via the Discrepancy Checker
>
> Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
> ---
> M tests/comparison/discrepancy_searcher.py
> M tests/comparison/query_generator.py
> M tests/comparison/query_profile.py
> A tests/comparison/tests/hive/test_hive_create_relational_
> join_condition.py
> 4 files changed, 139 insertions(+), 20 deletions(-)
>
>
>   git pull ssh://gerrit.cloudera.org:29418/Impala-ASF
> refs/changes/19/4419/9
> --
> To view, visit http://gerrit.cloudera.org:8080/4419
> To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
>
> Gerrit-MessageType: newpatchset
> Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
> Gerrit-PatchSet: 9
> Gerrit-Project: Impala-ASF
> Gerrit-Branch: master
> Gerrit-Owner: stak...@cloudera.com
> Gerrit-Reviewer: David Knupp 
> Gerrit-Reviewer: Michael Brown 
> Gerrit-Reviewer: Taras Bobrovytsky 
> Gerrit-Reviewer: stak...@cloudera.com
>
> --
> You received this message because you are subscribed to the Google Groups
> "impala-cr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to impala-cr+unsubscr...@cloudera.com.
> For more options, visit https://groups.google.com/a/cloudera.com/d/optout.
>



-- 
Henry Robinson
Software Engineer

[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-26 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4419/7/tests/comparison/query_generator.py
File tests/comparison/query_generator.py:

Line 1442: signature = 
self.profile.choose_relational_func_signature(relational_signatures)
> Yeah, only_use_equality_join_predicates() is never called inside _create_bo
Agreed.


http://gerrit.cloudera.org:8080/#/c/4419/9/tests/comparison/query_generator.py
File tests/comparison/query_generator.py:

PS9, Line 1214:   
this should be 4 spaces, not 2.


-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: stak...@cloudera.com
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-25 Thread Anonymous Coward (Code Review)
Hello Michael Brown,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4419

to look at the new patch set (#9).

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..

IMPALA-4101: qgen: Hive join predicates should only contains equality functions

Background:

Hive only supports equi-joins in its JOIN clause, while Postgres and Impala 
support more
complex functions such as <, <=, >, >=, etc. This change modifies the
QueryGenerator._create_relational_join_condition and
QueryGenerator._create_boolean_func_tree methods to only construct equality join
conditions under certain conditions.

The _create_boolean_func_tree method is invoked via
QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause ->
_create_relational_join_condition -> _create_boolean_func_tree. This method is 
invoked
when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of 
functions
that would typically be found in any of these clauses.

Changes:

The parameter "signatures" is added to the method _create_boolean_func_tree, 
and it lists
out all the allowed signatures the function is allowed to use. Previously, this 
list of
signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if
"signatures" is not specified, then the code defaults back to the results of 
that method.
A new method in the DefaultProfile called get_allowed_join_signatures is 
introduced and
returns a list of function signatures that are allowed within a JOIN clause. The
DefaultProfile allows all given signatures, while the HiveProfile only allows 
for the
Equals and And functions, as well as any function that operates over only one 
column.
The reason for these restrictions is that Hive only allows equality joins, does 
not allow
OR operators in the join clause, and has some restrictions on functions that 
operate over
multiple different tables. This last restriction is somewhat subtle; if one 
side of the
equals operator contains a function that operates over two different tables, 
the other
side of the operator cannot contain either of those tables. While it is 
possible to have
functions that take in multiple input parameters, the inputs must be taken from 
specific
tables to prevent Hive from throwing a compile time exception. Adding support 
for this in
qgen code will require significant effort and modification to some core methods
(_create_relational_join_condition and _populate_func_with_vals), so it's best 
to disable
these for Hive altogether.

Note that the _create_boolean_func_tree still allows for OR operators due to 
some logic
around its "and_or_fill_ratio" variable. The plan is to fix this in a future 
patch that
specifically focuses on removing OR operators from Hive JOIN clauses.

Minor change to discrepancy_searcher so that the logs print out "Hive" instead 
of
"Impala" when running against Hive.

Testing:

* Added a new unit test that ensures the HiveProfile only returns equality joins
* Unit tests pass
* Tested against Hive locally
* Tested against Impala via Leopard
* Tested against Impala via the Discrepancy Checker

Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
---
M tests/comparison/discrepancy_searcher.py
M tests/comparison/query_generator.py
M tests/comparison/query_profile.py
A tests/comparison/tests/hive/test_hive_create_relational_join_condition.py
4 files changed, 139 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/9
-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: stak...@cloudera.com


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-24 Thread Anonymous Coward (Code Review)
Hello Michael Brown,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4419

to look at the new patch set (#8).

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..

IMPALA-4101: qgen: Hive join predicates should only contains equality functions

Background:

Hive only supports equi-joins in its JOIN clause, while Postgres and Impala 
support more
complex functions such as <, <=, >, >=, etc. This change modifies the
QueryGenerator._create_relational_join_condition and
QueryGenerator._create_boolean_func_tree methods to only construct equality join
conditions under certain conditions.

The _create_boolean_func_tree method is invoked via
QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause ->
_create_relational_join_condition -> _create_boolean_func_tree. This method is 
invoked
when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of 
functions
that would typically be found in any of these clauses.

Changes:

The parameter "signatures" is added to the method _create_boolean_func_tree, 
and it lists
out all the allowed signatures the function is allowed to use. Previously, this 
list of
signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if
"signatures" is not specified, then the code defaults back to the results of 
that method.
A new method in the DefaultProfile called get_allowed_join_signatures is 
introduced and
returns a list of function signatures that are allowed within a JOIN clause. The
DefaultProfile allows all given signatures, while the HiveProfile only allows 
for the
Equals and And functions, as well as any function that operates over only one 
column.
The reason for these restrictions is that Hive only allows equality joins, does 
not allow
OR operators in the join clause, and has some restrictions on functions that 
operate over
multiple different tables. This last restriction is somewhat subtle; if one 
side of the
equals operator contains a function that operates over two different tables, 
the other
side of the operator cannot contain either of those tables. While it is 
possible to have
functions that take in multiple input parameters, the inputs must be taken from 
specific
tables to prevent Hive from throwing a compile time exception. Adding support 
for this in
qgen code will require significant effort and modification to some core methods
(_create_relational_join_condition and _populate_func_with_vals), so it's best 
to disable
these for Hive altogether.

Note that the _create_boolean_func_tree still allows for OR operators due to 
some logic
around its "and_or_fill_ratio" variable. The plan is to fix this in a future 
patch that
specifically focuses on removing OR operators from Hive JOIN clauses.

Minor change to discrepancy_searcher so that the logs print out "Hive" instead 
of
"Impala" when running against Hive.

Testing:

* Added a new unit test that ensures the HiveProfile only returns equality joins
* Unit tests pass
* Tested against Hive locally
* Tested against Impala via Leopard
* Tested against Impala via the Discrepancy Checker

Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
---
M tests/comparison/discrepancy_searcher.py
M tests/comparison/query_generator.py
M tests/comparison/query_profile.py
A tests/comparison/tests/hive/test_hive_create_relational_join_condition.py
4 files changed, 140 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/8
-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: stak...@cloudera.com


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-23 Thread Anonymous Coward (Code Review)
Hello Michael Brown,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4419

to look at the new patch set (#7).

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..

IMPALA-4101: qgen: Hive join predicates should only contains equality functions

Background:

Hive only supports equi-joins in its JOIN clause, while Postgres and Impala 
support more
complex functions such as <, <=, >, >=, etc. This change modifies the
QueryGenerator._create_relational_join_condition and
QueryGenerator._create_boolean_func_tree methods to only construct equality join
conditions under certain conditions.

The _create_boolean_func_tree method is invoked via
QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause ->
_create_relational_join_condition -> _create_boolean_func_tree. This method is 
invoked
when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of 
functions
that would typically be found in any of these clauses.

Changes:

The parameter "signatures" is added to the method _create_boolean_func_tree, 
and it lists
out all the allowed signatures the function is allowed to use. Previously, this 
list of
signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if
"signatures" is not specified, then the code defaults back to the results of 
that method.
A new method in the DefaultProfile called get_allowed_join_signatures is 
introduced and
returns a list of function signatures that are allowed within a JOIN clause. The
DefaultProfile allows all given signatures, while the HiveProfile only allows 
for the
Equals and And functions, as well as any function that operates over only one 
column.
The reason for these restrictions is that Hive only allows equality joins, does 
not allow
OR operators in the join clause, and has some restrictions on functions that 
operate over
multiple different tables. This last restriction is somewhat subtle; if one 
side of the
equals operator contains a function that operates over two different tables, 
the other
side of the operator cannot contain either of those tables. While it is 
possible to have
functions that take in multiple input parameters, the inputs must be taken from 
specific
tables to prevent Hive from throwing a compile time exception. Adding support 
for this in
qgen code will require significant effort and modification to some core methods
(_create_relational_join_condition and _populate_func_with_vals), so it's best 
to disable
these for Hive altogether.

Note that the _create_boolean_func_tree still allows for OR operators due to 
some logic
around its "and_or_fill_ratio" variable. The plan is to fix this in a future 
patch that
specifically focuses on removing OR operators from Hive JOIN clauses.

Minor change to discrepancy_searcher so that the logs print out "Hive" instead 
of
"Impala" when running against Hive.

Testing:

* Added a new unit test that ensures the HiveProfile only returns equality joins
* Unit tests pass
* Tested against Hive locally
* Tested against Impala via Leopard
* Tested against Impala via the Discrepancy Checker

Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
---
M tests/comparison/discrepancy_searcher.py
M tests/comparison/query_generator.py
M tests/comparison/query_profile.py
A tests/comparison/tests/hive/test_hive_create_relational_join_condition.py
4 files changed, 137 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/7
-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: stak...@cloudera.com


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-22 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..


Patch Set 5: Code-Review-1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4419/5/tests/comparison/query_generator.py
File tests/comparison/query_generator.py:

Line 1436: if self.profile.only_use_equality_join_predicates():
I don't think this whole patch is necessary. We just need 
only_use_equality_join_predicates to always be true.

This can be done by setting this probability 
http://github.mtv.cloudera.com/CDH/Impala/blob/cdh5-trunk/tests/comparison/query_profile.py#L220
 to 1.


-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: stak...@cloudera.com
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-21 Thread Michael Brown (Code Review)
Michael Brown has posted comments on this change.

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..


Patch Set 5: Code-Review+1

A committer needs to look at this for +2

-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: stak...@cloudera.com
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-21 Thread Michael Brown (Code Review)
Michael Brown has posted comments on this change.

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4419/3//COMMIT_MSG
Commit Message:

PS3, Line 35: subtle,
Nit: this is a comma splice. You can change the comma to something like a colon 
or semi-colon, or use a period and create a new sentence after "subtle."


PS3, Line 41: its 
Nit: "it's"


PS3, Line 45: it's
Nit: "its" :)


-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: stak...@cloudera.com
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-21 Thread Michael Brown (Code Review)
Michael Brown has posted comments on this change.

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..


Patch Set 3:

(3 comments)

Thanks for adding a test!

http://gerrit.cloudera.org:8080/#/c/4419/3/tests/comparison/query_generator.py
File tests/comparison/query_generator.py:

Line 1364: '''
Add documentation about the signatures parameter.


http://gerrit.cloudera.org:8080/#/c/4419/3/tests/comparison/query_profile.py
File tests/comparison/query_profile.py:

PS3, Line 661:   if signature.func in (Equals, And):
 : allowed_join_signatures.append(signature)
 :   elif len(signature.args) == 1:
Maybe use an or expression for these two conditions and remove the elif?


http://gerrit.cloudera.org:8080/#/c/4419/3/tests/comparison/tests/hive/test_create_join_clause.py
File tests/comparison/tests/hive/test_create_join_clause.py:

PS3, Line 31:   class MockQueryProfile(HiveProfile):
1. I prefer "Fake" instead of "Mock". I prefer the use of "mock" only when 
using actual mock libraries. Is this reasonable?

2. Use Hive in the name, to explain it's a fake HiveProfile.


-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com
Gerrit-Reviewer: David Knupp 
Gerrit-Reviewer: Michael Brown 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: stak...@cloudera.com
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4101: qgen: Hive join predicates should only contains equality functions

2016-09-14 Thread Anonymous Coward (Code Review)
stak...@cloudera.com has uploaded a new patch set (#2).

Change subject: IMPALA-4101: qgen: Hive join predicates should only contains 
equality functions
..

IMPALA-4101: qgen: Hive join predicates should only contains equality functions

Background:

Hive only supports equi-joins in its JOIN clause, while Postgres and Impala 
support more
complex functions such as <, <=, >, >=, etc. This change modifies the
QueryGenerator._create_relational_join_condition and
QueryGenerator._create_boolean_func_tree methods to only construct equality join
conditions under certain conditions.

The _create_boolean_func_tree method is invoked via
QueryGenerator -> create_query -> _create_from_clause -> _create_join_clause ->
_create_relational_join_condition -> _create_boolean_func_tree. This method is 
invoked
when constructing the JOIN, WHERE, and HAVING clauses. It creates a tree of 
functions
that would typically be found in any of these clauses.

Changes:

The parameter "signatures" is added to the method _create_boolean_func_tree, 
and it lists
out all the allowed signatures the function is allowed to use. Previously, this 
list of
signatures was populated by calling _funcs_to_allowed_signatures(FUNCS), and if
"signatures" is not specified, then the code defaults back to the results of 
that method.
A new method in the DefaultProfile called get_allowed_join_signatures is 
introduced and
returns a list of function signatures that are allowed within a JOIN clause. The
DefaultProfile allows all given signatures, while the HiveProfile only allows 
for the
Equals and And functions, as well as any function that operates over only one 
column.
The reason for these restrictions is that Hive only allows equality joins, does 
not allow
OR operators in the join clause, and does not allow any operator that operates 
over
multiple columns.

Note that the _create_boolean_func_tree still allows for OR operators due to 
some logic
around it's "and_or_fill_ratio" variable. The plan is to fix this in a future 
patch that
specifically focuses on removing OR operators from Hive JOIN clauses.

Minor change to discrepancy_searcher so that the logs print out "Hive" instead 
of
"Impala" when running against Hive.

Testing:

* Added a new unit test that ensures the HiveProfile only returns equality joins
* Tested against Hive locally
* Tested against Impala via Leopard
* Tested against Impala via the Discrepancy Checker inside an Impala Docker 
container

Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
---
M tests/comparison/discrepancy_searcher.py
M tests/comparison/query_generator.py
M tests/comparison/query_profile.py
A tests/comparison/tests/hive/test_create_join_clause.py
4 files changed, 110 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4419/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4419
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibe8832a03cfa0d7ecc293ec6db6db2bcb34ab459
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: stak...@cloudera.com