[Impala-ASF-CR] IMPALA-10434: Fix impala-shell's unicode regressions on Python2

2021-01-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16960 )

Change subject: IMPALA-10434: Fix impala-shell's unicode regressions on Python2
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8013/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icc4a8d31311a5c59e5fc0e65fe09f770df41bea4
Gerrit-Change-Number: 16960
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Laszlo Gaal 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sun, 17 Jan 2021 03:49:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT query option

2021-01-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16922 )

Change subject: IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT 
query option
..


Patch Set 2: Verified+1

Build Successful

https://jenkins.impala.io/job/gerrit-docs-auto-test/616/ : Doc tests passed.


--
To view, visit http://gerrit.cloudera.org:8080/16922

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3d422889c433062456748a953b33e3d43799be14
Gerrit-Change-Number: 16922
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Sun, 17 Jan 2021 03:42:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT query option

2021-01-16 Thread Fucun Chu (Code Review)
Fucun Chu has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16922 )

Change subject: IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT 
query option
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16922/1/docs/topics/impala_join_rows_produced_limit.xml
File docs/topics/impala_join_rows_produced_limit.xml:

http://gerrit.cloudera.org:8080/#/c/16922/1/docs/topics/impala_join_rows_produced_limit.xml@44
PS1, Line 44: any one of the
> To be more accurate:  ...when any one of the joins in the query produces mo
Done



--
To view, visit http://gerrit.cloudera.org:8080/16922

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3d422889c433062456748a953b33e3d43799be14
Gerrit-Change-Number: 16922
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Fucun Chu 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Sun, 17 Jan 2021 03:35:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT query option

2021-01-16 Thread Fucun Chu (Code Review)
Hello Aman Sinha, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16922

to look at the new patch set (#2).

Change subject: IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT 
query option
..

IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT query option

- Minor edit

Change-Id: I3d422889c433062456748a953b33e3d43799be14
---
M docs/impala.ditamap
A docs/topics/impala_join_rows_produced_limit.xml
2 files changed, 73 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/22/16922/2
--
To view, visit http://gerrit.cloudera.org:8080/16922

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I3d422889c433062456748a953b33e3d43799be14
Gerrit-Change-Number: 16922
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT query option

2021-01-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16922 )

Change subject: IMPALA-10421: [DOCS] Documented the JOIN_ROWS_PRODUCED_LIMIT 
query option
..


Patch Set 2:

Build Started https://jenkins.impala.io/job/gerrit-docs-auto-test/616/

Testing docs change - this change appears to modify docs/ and no code. This is 
experimental - please report any issues to tarmstr...@cloudera.com or on this 
JIRA: IMPALA-7317


--
To view, visit http://gerrit.cloudera.org:8080/16922

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3d422889c433062456748a953b33e3d43799be14
Gerrit-Change-Number: 16922
Gerrit-PatchSet: 2
Gerrit-Owner: Fucun Chu 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Sun, 17 Jan 2021 03:35:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10434: Fix impala-shell's unicode regressions on Python2

2021-01-16 Thread Quanlong Huang (Code Review)
Quanlong Huang has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/16960 )


Change subject: IMPALA-10434: Fix impala-shell's unicode regressions on Python2
..

IMPALA-10434: Fix impala-shell's unicode regressions on Python2

To make impala-shell compatible with Python3, we explicitly distinguish
bytes and text in Python2 by decoding the bytes for all inputs.

Regression 1: multiple queries in one line with unicode chars will break

In precmd() of impala-shell, if there are multiple queries present in
one input line, we split it into individual queries (by
sqlparse.split()) and append them back to the 'cmdqueue'. They will be
passed to precmd() again. In our Python2 implementation, precmd()
expects them to be str type, and will decode them into unicode type.
However, the output type of sqlparse.split() is unicode, which is
already text and should not be decoded again. Calling decode() on a
unicode value makes Python2 implicitly encode it to str first. This may
raise UnicodeEncodeError since the implicit encoding uses 'ascii'.
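
The failure mode in Regression 1 can be reproduced in spirit on Python3.
The following is a minimal sketch, not the code from the patch: to_text
is a hypothetical helper that decodes only when it is handed bytes, so a
value that is already text (such as a query re-queued after
sqlparse.split()) passes through untouched. The explicit
encode("ascii") call stands in for the implicit encode that Python2
performs when decode() is called on a unicode value.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding only when the input is raw bytes.

    Hypothetical helper for illustration; not taken from impala_shell.py.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def implicit_ascii_encode_fails(text):
    """Mimic Python2's implicit unicode -> str step, which uses 'ascii'."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


query = "select '你好'"
raw = query.encode("utf-8")

# Raw input is decoded once; already-decoded text is left alone.
decoded = to_text(raw)
requeued = to_text(decoded)
```

Guarding on the input type keeps the second pass through the helper a
no-op, which is the property the re-queued queries need.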

Regression 2: multi-line query with unicode chars will break when
command history is enabled

In _check_for_command_completion(), when calling
readline.replace_history_item in Python2, we encode the completed_cmd
into bytes. However, we shouldn't overwrite completed_cmd with the
encoded bytes since the return type is expected to be unicode.

Tests:
 - Add tests for these two regressions in Python2.

Change-Id: Icc4a8d31311a5c59e5fc0e65fe09f770df41bea4
---
M shell/impala_shell.py
M tests/shell/test_shell_interactive.py
2 files changed, 30 insertions(+), 7 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/16960/1
--
To view, visit http://gerrit.cloudera.org:8080/16960

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Icc4a8d31311a5c59e5fc0e65fe09f770df41bea4
Gerrit-Change-Number: 16960
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang 


[Impala-ASF-CR] IMPALA-10296: Fix analytic limit pushdown when predicates are present

2021-01-16 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16942 )

Change subject: IMPALA-10296: Fix analytic limit pushdown when predicates are 
present
..


Patch Set 11:

(12 comments)

Sending comments based on 1st pass of the planner changes.

http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@9
PS11, Line 9: when
nit: remove


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@22
PS11, Line 22: limit
did you mean partition limit >= order by limit ?


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@32
PS11, Line 32: was
nit: 'is'


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@40
PS11, Line 40: This patch implements tie handling in the backend (I took most
I had previously wondered about how the planner and backend  work for this 
functionality would be combined but it now starts falling into place with the 
handling of the duplicates :-)


http://gerrit.cloudera.org:8080/#/c/16942/11//COMMIT_MSG@68
PS11, Line 68: The for
nit: 'The elapsed time for..' ?


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java
File fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@507
PS11, Line 507: upper top-n.
Since we do a distributed top-n, there are 3 top-n's in the plan and the 'upper 
top-n' may be confusing. Here it refers to the outermost top-n or final top-n 
or something  similar.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@511
PS11, Line 511: doesn't matter
nit: worth clarifying that it doesn't matter for the purpose of the pushdown 
decision.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@530
PS11, Line 530: include all of the rows in the final
This does not literally mean all rows in the final partition right ? Should it 
be all eligible rows or all relevant rows ? (based on the P+N value)


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@531
PS11, Line 531: was
nit: 'we'


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/AnalyticEvalNode.java@603
PS11, Line 603: if (analyticLimit < limit) return falseStatus;
One special case where this could work is if each partition had a maximum of 
analyticLimit rows.  Then we know we are not excluding rows as we iterate 
through the partitions until we reach the final limit.  Of course, this 
knowledge is not readily available or may not be relied upon due to ndv 
estimates but in practice I suspect this may not be uncommon.
For this patch, it makes sense to be conservative in applying the optimization.


http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

http://gerrit.cloudera.org:8080/#/c/16942/11/fe/src/main/java/org/apache/impala/planner/SortNode.java@83
PS11, Line 83: row.s
nit: 'rows'


http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test:

http://gerrit.cloudera.org:8080/#/c/16942/11/testdata/workloads/functional-planner/queries/PlannerTest/limit-pushdown-analytic.test@1139
PS11, Line 1139: # rank() predicate is not pushed down because TOPN_BYTES_LIMIT 
prevents conversion
The plan shows the lower top-n, which indicates the rank was pushed down. 
Perhaps TOPN_BYTES_LIMIT should be even smaller if this is supposed to be a 
negative test.



--
To view, visit http://gerrit.cloudera.org:8080/16942

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I801d7799b0d649c73d2dd1703729a9b58a662509
Gerrit-Change-Number: 16942
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sun, 17 Jan 2021 00:36:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-01-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 48:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/8012/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/16720

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 48
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 16 Jan 2021 18:12:30 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2021-01-16 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#48). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..

IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on 
equi-join predicate

This patch adds a new class of predicates called overlap predicates
to aid in the determination of whether a Parquet row group or a page
overlaps with a range computed from an equi hash join. If not, then
the entire row group or page is skipped.  Rows that survive this way
can then be subjected to the row-level overlap test against the same
overlap predicate.

For the following query, the min and max in the overlap predicate are
computed with the values from the join column from table 'b'. To evaluate
the overlap predicate, these two values are compared against the min/max
of each row group or page at the scan node for 'a'.

  select straight_join count(*)
  from lineitem a join [SHUFFLE] lineitem b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";
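
The comparison described above amounts to a closed-interval overlap
test. The following is a minimal sketch under stated assumptions, not the
implementation in the patch: Range, overlaps and surviving_chunks are
hypothetical names, and the date values are made up for illustration.

```python
from collections import namedtuple

# Closed interval [lo, hi]; used both for the filter range and for the
# min/max statistics of a row group or page.
Range = namedtuple("Range", ["lo", "hi"])


def overlaps(stats, filter_range):
    """Two closed intervals overlap unless one ends before the other starts."""
    return stats.lo <= filter_range.hi and filter_range.lo <= stats.hi


def surviving_chunks(chunk_stats, filter_range):
    """Return the indexes of row groups or pages that cannot be skipped."""
    return [i for i, s in enumerate(chunk_stats) if overlaps(s, filter_range)]


# ISO-formatted dates compare correctly as plain strings.
join_range = Range("1992-02-01", "1992-03-15")
row_groups = [
    Range("1992-01-01", "1992-01-31"),
    Range("1992-01-20", "1992-02-10"),
    Range("1992-03-15", "1992-04-30"),
    Range("1993-01-01", "1993-12-31"),
]
kept = surviving_chunks(row_groups, join_range)
```

In this example only the second and third row groups intersect the join
range, so the other two would be skipped without being read.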

An overlap predicate associated with the column type J (in hash table)
and scan column type S will be formed when one of the following is true:
   Both J and S are booleans
   Both J and S are integers (tinyint, smallint, int, or bigint)
   Both J and S are approximate numeric (float or double)
   Both J and S are decimals with the same precision and scale
   Both J and S are strings (STRING, CHAR or VARCHAR)
   Both J and S are date
   Both J and S are timestamp
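
The type rules listed above can be sketched as a small compatibility
check. This is an illustrative sketch only: ColType, TYPE_FAMILIES and
can_form_overlap_predicate are hypothetical names, not identifiers from
the patch.

```python
from collections import namedtuple

# precision and scale are only meaningful for DECIMAL.
ColType = namedtuple("ColType", ["name", "precision", "scale"],
                     defaults=(None, None))

# Each set groups the types that may be paired with one another.
TYPE_FAMILIES = [
    {"BOOLEAN"},
    {"TINYINT", "SMALLINT", "INT", "BIGINT"},
    {"FLOAT", "DOUBLE"},
    {"STRING", "CHAR", "VARCHAR"},
    {"DATE"},
    {"TIMESTAMP"},
]


def can_form_overlap_predicate(j, s):
    """Return True if join type j and scan type s satisfy the listed rules."""
    if j.name == "DECIMAL" and s.name == "DECIMAL":
        # Decimals must agree on both precision and scale.
        return (j.precision, j.scale) == (s.precision, s.scale)
    return any(j.name in fam and s.name in fam for fam in TYPE_FAMILIES)
```

For example, can_form_overlap_predicate(ColType("INT"), ColType("BIGINT"))
is true, while pairing DATE with TIMESTAMP is rejected.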

The overlap predicate is implemented as a min/max filter. Unlike existing
min/max filters, the MAX_NUM_RUNTIME_FILTERS query option does not apply
to min/max filters created for overlap predicates. An overlap predicate
will be evaluated as long as the overlap ratio is less than a threshold
specified in a new query option 'minmax_filter_threshold'. Setting the
threshold to its minimal value 0.0 disables the feature, and setting it
to the maximal value 1.0 applies the filtering in all cases.
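
The threshold decision can be sketched as follows. The message does not
define how the overlap ratio is computed, so the definition below (the
fraction of the column's value range covered by the filter range) is an
assumption made purely for illustration, and both function names are
hypothetical.

```python
def overlap_ratio(col_min, col_max, filter_min, filter_max):
    """Fraction of the column's value range covered by the filter range.

    Assumed definition for illustration; the patch may compute it differently.
    """
    width = col_max - col_min
    if width <= 0:
        # Degenerate column range: either fully covered or not at all.
        return 1.0 if filter_min <= col_min <= filter_max else 0.0
    lo = max(col_min, filter_min)
    hi = min(col_max, filter_max)
    return max(0.0, hi - lo) / width


def should_apply_filter(ratio, threshold):
    """Apply the filter only while the ratio is below the threshold."""
    return ratio < threshold


ratio = overlap_ratio(0, 100, 20, 30)
```

Under this reading, a threshold of 0.0 never passes the check, and a
threshold of 1.0 passes for anything short of complete overlap, a case
in which the filter could not skip anything anyway.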

In addition, two new run-time profile counters are added to report the
number of row groups or pages filtered out via the overlap predicates
respectively:
  1. NumRuntimeFilteredRowGroups
  2. NumRuntimeFilteredPages

Testing:
1. Unit tested on various column types with TPCH and TPCDS tables.
   Benefits were significant when the join column on the outer table
   is sorted and there exist many row groups or pages that do not
   overlap with the min/max filters implementing the predicates;
2. Added new tests in min_max_filters.test to demonstrate the number
   of filtered out pages and row groups with the two new profile counters;
3. Added new tests in runtime-filter-propagation.test to demonstrate
   that the overlap predicates work with different column types;
4. Added data type specific overlap method tests in min-max-filter-test.cc;
5. Core testing;
6. Performance measurement.

To do in follow-up JIRAs:
1. Improve filtering efficiency;
2. Apply the overlap predicate on partition columns;
3. IR code-gen for various MinMaxFilter::EvalOverlap methods.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/raw-value.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/string-value.cc
M be/src/runtime/string-value.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-ir.cc
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit-large.test
M testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit.test
A testdata/workloads/functional-planner/queries/PlannerTest/disable-overlap-filter.test
M 
testdata/workloads