[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 24: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 24
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 17 Nov 2018 01:48:03 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Reviewed-on: http://gerrit.cloudera.org:8080/8319
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-common.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A be/src/util/mem-util.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
27 files changed, 883 insertions(+), 133 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 25
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 23:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1387/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 23
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 16 Nov 2018 22:22:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 24:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3470/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 24
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 16 Nov 2018 21:52:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 24: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 24
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 16 Nov 2018 21:52:07 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-16 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 23: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 23
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 16 Nov 2018 21:50:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-16 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 22: Code-Review+1

I filed IMPALA-7862 and updated the comments


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 22
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 16 Nov 2018 21:47:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-16 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#23).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-common.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A be/src/util/mem-util.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
27 files changed, 883 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/23
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 23
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 22: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8319/20/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/20/be/src/util/mem-util.h@41
PS20, Line 41:
> Actually I feel like a negative stride is even valid and has reasonable sem
That makes sense to me, thanks.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 22
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 21:30:04 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 22:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1374/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 22
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 21:11:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#22).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-common.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A be/src/util/mem-util.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
27 files changed, 883 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/22
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 22
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8319/20/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/20/be/src/util/mem-util.h@41
PS20, Line 41: DCHECK_GE
> Can the stride ever be zero? If not, DCHECK_GT?
Actually I feel like a negative stride is even valid and has reasonable 
semantics - if you wanted to write an array backwards for example. So I'll just 
remove the DCHECK.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 20
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 20:34:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8319/20/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/20/be/src/util/mem-util.h@41
PS20, Line 41: DCHECK_GE
Can the stride ever be zero? If not, DCHECK_GT?



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 20
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 20:06:11 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 21:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1373/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 21
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 19:36:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 20:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1372/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 20
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 19:10:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 21:

(1 comment)

Rebased

http://gerrit.cloudera.org:8080/#/c/8319/21/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/21/be/src/exec/parquet-column-readers.cc@948
PS21, Line 948:   // TODO: converting timestamps after validating them can move 
them out of range. We
Had to resolve a conflict here while rebasing. I solved it by duplicating the 
comment.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 21
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 19:01:44 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#21).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-common.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A be/src/util/mem-util.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
27 files changed, 884 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/21
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 21
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 19:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/exec/parquet-column-readers.cc@313
PS19, Line 313:   void ReadPositions(int64_t num_to_read, int tuple_size, 
uint8_t* tuple_mem) RESTRICT;
> Is there a reason that *tuple_mem is not restricted here?
Done


http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/exec/parquet-column-readers.cc@745
PS19, Line 745:   return NeedsConversionInline() ?
> This might be more readable as an if-statement
Done


http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/exec/parquet-column-readers.cc@897
PS19, Line 897:   // TODO: converting timestamps after validating them can move 
them out of range. We
> Do we have a Jira for this already?
I'm not sure - Csaba, do you know if there is one?


http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/runtime/tuple.h@55
PS19, Line 55: to
> nit: extra word
Done


http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/runtime/tuple.h@73
PS19, Line 73: to
> nit: double 'to'
Done


http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/util/mem-util.h@34
PS19, Line 34:   T* current;
> can you initialize to nullptr (or add a brief comment why we don't)?
Instead switched to an explicit constructor that asserts that it's non-NULL.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 19
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 18:46:04 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#20).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-common.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A be/src/util/mem-util.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
27 files changed, 881 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/20
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 20
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 19:

(5 comments)

I did a pass out of curiosity and only had a few nits.

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/exec/parquet-column-readers.cc@313
PS19, Line 313:   void ReadPositions(int64_t num_to_read, int tuple_size, 
uint8_t* tuple_mem) RESTRICT;
Is there a reason that *tuple_mem is not restricted here?


http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/exec/parquet-column-readers.cc@745
PS19, Line 745:   return NeedsConversionInline() ?
This might be more readable as an if-statement


http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/exec/parquet-column-readers.cc@897
PS19, Line 897:   // TODO: converting timestamps after validating them can move 
them out of range. We
Do we have a Jira for this already?


http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/runtime/tuple.h@55
PS19, Line 55: to
nit: extra word


http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/util/mem-util.h@34
PS19, Line 34:   T* current;
can you initialize to nullptr (or add a brief comment why we don't)?



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 19
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 17:21:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 19: Code-Review+1

(1 comment)

I read it through again and haven't found anything.
I let Csaba give the final +2

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/runtime/tuple.h@78
PS19, Line 78: /// if the tuple is logically NULL.
Thanks for updating this!



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 19
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 14:52:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-15 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 19: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/8319/19/be/src/runtime/tuple.h@73
PS19, Line 73: to
nit: double 'to'


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h@41
PS17, Line 41: // memcpy() is necessary because 'current' may not be aligned.
> Ok, wow, that comment is out of date and has some wrong info. Fixed it in t
Thanks for improving the comment!



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 19
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 15 Nov 2018 13:49:29 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 19:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1366/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 19
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 14 Nov 2018 18:30:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 19:

I think Zoltan was also planning to do a pass over it, so I'll let you two 
figure out if that still makes sense (I would definitely appreciate more eyes 
on it).


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 19
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 14 Nov 2018 17:56:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 17:

(2 comments)

WRT to your earlier comment about breaking up parquet-column-readers, I agree 
that would be a good idea. Readability and compile times are a big issue.

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@165
PS17, Line 165: numeric_limits::max();
> I am ok with the current solution - maybe the handling of max_level_ == 0 c
Done


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h@41
PS17, Line 41: // memcpy() is necessary because 'current' may not be aligned.
> Thanks for the explenation! I have misunderstood this comment: https://gith
Ok, wow, that comment is out of date and has some wrong info. Fixed it in the 
next PS.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 14 Nov 2018 17:43:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-14 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#19).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-common.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A be/src/util/mem-util.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
27 files changed, 873 insertions(+), 133 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/19
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 19
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-14 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 17:

(2 comments)

lgtm - I can also give +2 if you do not plan to wait for other people to look 
at the change.

Note that gerrit-verify-dryrun has already started for 
https://gerrit.cloudera.org/#/c/11057/, which is very likely to conflict with 
your changes.

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@165
PS17, Line 165: numeric_limits::max();
> That does seem like something we'd like to prevent, but I also think adding
I am ok with the current solution - maybe the handling of max_level_ == 0 could 
be mentioned in the .h, as this my be different than what the reader assumes.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h@41
PS17, Line 41: // memcpy() is necessary because 'current' may not be aligned.
> There's no guarantee generally that slots in tuples are aligned. The code i
Thanks for the explenation! I have misunderstood this comment: 
https://github.com/apache/impala/blob/cd26e807f18ef5fa729d7e15d0492a1284990122/be/src/runtime/tuple.h#L56

It says that there can be padding to ensure alignment, but not that every types 
has to be aligned. I see this happening in IMPALA-7367 - we can win a lot with 
packing StringValues, but doing the same is more problematic for 
TimestampValues due to boost:: members.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 14 Nov 2018 16:39:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 18:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1352/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 18
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 12 Nov 2018 21:30:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 17:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@165
PS17, Line 165: numeric_limits::max();
> Using such large values worries me a bit, it is easy to create an overflow 
That does seem like something we'd like to prevent, but I also think adding 
more complexity by threading through another parameter is a bad idea. It makes 
it hard to tell what's an invariant of the code and what's unnecessary 
defensive code. In some ways it's better to just use the maximum valid value 
here and use assertions to flush out any bugs rather than leave them latent.

There's no guarantee in general that the number of levels returned by this 
decoded is < num_buffered_values_ or < numeric_limits::max() so it 
wouldn't be preserving an invariant, just adding a special case. I added a 
small comment here justifying that it's the maximum run length allowed by the 
Parquet standard.

If we're concerned about overflows, one option is to switch to using int64_t 
throughout the column reader for handling value counts. I think we can convince 
ourselves that this will never overflow since the inputs like batch_size and 
the Parquet run lengths are all guaranteed to fit in an int32_t.

I can go ahead with that if you agree.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@648
PS17, Line 648: materializing
> This seems a bit misleading to me: we do not have to materialize the batch
Done


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@667
PS17, Line 667: num_buffered_values_
> num_buffered_values_ >= def_level_repeats should be always true at this poi
I'm not sure if it's technically invalid, but I agree it would be a weird way 
to encode the file. I looked at adding an UNLIKELY and I think it's net worse - 
min() is more readable to me and there shouldn't be  a measurable performance 
difference outside of the inner loop.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@769
PS17, Line 769: if (UNLIKELY(!ConvertSlot(curr_val, 
tuple->GetSlot(tuple_offset_ {
> Doing the conversion after the validation check means that timestamps that
I added a TODO to the timestamp ConvertSlot() implementation


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@834
PS17, Line 834: StrideWriter out{out_vals, stride};
  : // Hoist data_ into a local variable to prevent GCC from 
storing it every loop
  : // iteration.
  : uint8_t* data = data_;
  : for (int64_t i = 0; i < count; ++i, out.SkipNext(1)) {
  :   if (UNLIKELY(!DecodeValue(&data, 
data_end_, out.current))) {
  : return false;
  :   }
  : }
  : data_ = data;
> I think that ParquetPlainEncoder would be a better place for this logic, e.
Done


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@859
PS17, Line 859:   int pos_slot_offset = pos_slot_desc()->tuple_offset();
  :   for (int64_t i = 0; i < num_to_read; ++i, tuple_mem += 
tuple_size) {
  : Tuple* tuple = reinterpret_cast(tuple_mem);
  : ReadPositionBatched(rep_levels_.CacheGetNext(),
  : 
reinterpret_cast(tuple->GetSlot(pos_slot_offset)));
  :   }
> This could be also simplified a bit by using stride writer.
Done. Also added a new method to StrideWriter that returns the current element 
and advances it, since that pattern has occurred a few times now.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h@41
PS17, Line 41: // memcpy() is necessary because 'current' may not be aligned.
> In which case can it point the unaligned memory? Normally it should point t
There's no guarantee generally that slots in tuples are aligned. The code is a 
bit muddled because some places attempt to maintain alignment. As long as I've 
been working on the code there have been cases where the tuples aren't aligned 
though. E.g. various operators like the Sorter, Join and Agg didn't add padding 
after variable-length data to ensure that the start of the next tuple was 
aligned.

Generally I think we should be moving towards packing tuples as densely as 
possible instead of trying to align things, simply because the performance 
penalty of unaligned memory access on modern x86 is minimal and cache capacity 
is much more of a limiting factor. E.g. the per

[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-12 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#18).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-common.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A be/src/util/mem-util.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
27 files changed, 848 insertions(+), 117 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/18
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 18
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-10 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 17:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@165
PS17, Line 165: numeric_limits::max();
Using such large values worries me a bit, it is easy to create an overflow with 
innocent looking code ( e.g. if (r.NextRepeatedRunLength() + buffered_values < 
output_buffer_size) ).

I would prefer to give it a valid value, like have a member with the number of 
values in the page.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@648
PS17, Line 648: materializing
This seems a bit misleading to me: we do not have to materialize the batch to 
know num_def_levels_to_consume, just check some other conditions (mainly 
rep_levels_.CacheRemaining())/


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@667
PS17, Line 667: num_buffered_values_
num_buffered_values_ >= def_level_repeats should be always true at this point 
in a valid Parquet file, as num_buffered_values_ means the remaining (non NULL) 
values in the page, and def levels of the page should not contain information 
about rows outside the page, so I think that the handling of the 
num_buffered_values_ < def_level_repeats case could go to an UNLIKELY branch.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@769
PS17, Line 769: if (UNLIKELY(!ConvertSlot(curr_val, 
tuple->GetSlot(tuple_offset_ {
Doing the conversion after the validation check means that timestamps that were 
moved out of range by the conversion do not set the NULL indicator.

This was not changed by your patch, but a TODO could be added.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@834
PS17, Line 834: StrideWriter out{out_vals, stride};
  : // Hoist data_ into a local variable to prevent GCC from 
storing it every loop
  : // iteration.
  : uint8_t* data = data_;
  : for (int64_t i = 0; i < count; ++i, out.SkipNext(1)) {
  :   if (UNLIKELY(!DecodeValue(&data, 
data_end_, out.current))) {
  : return false;
  :   }
  : }
  : data_ = data;
I think that ParquetPlainEncoder would be a better place for this logic, e.g. 
with a int DecodeBatch(const uint8_t* buffer, const uint8_t* buffer_end, int 
fixed_len_size, int64_t stride,
 InternalType* v) functions that calls ParquetPlainEncoder::Decode and returns 
the total encoded_length.

I generally prefer to move logic out from parquet-column-readers.cc when it 
make sense, as this file already contains a lot of complex stuff, which makes 
it hard to get a "big picture" about parquet reading.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/exec/parquet-column-readers.cc@859
PS17, Line 859:   int pos_slot_offset = pos_slot_desc()->tuple_offset();
  :   for (int64_t i = 0; i < num_to_read; ++i, tuple_mem += 
tuple_size) {
  : Tuple* tuple = reinterpret_cast(tuple_mem);
  : ReadPositionBatched(rep_levels_.CacheGetNext(),
  : 
reinterpret_cast(tuple->GetSlot(pos_slot_offset)));
  :   }
This could be also simplified a bit by using stride writer.


http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h
File be/src/util/mem-util.h:

http://gerrit.cloudera.org:8080/#/c/8319/17/be/src/util/mem-util.h@41
PS17, Line 41: // memcpy() is necessary because 'current' may not be aligned.
In which case can it point the unaligned memory? Normally it should point to 
slots in tuples, and those should have the proper alignment for the type, or at 
least that's what I thought until now.

Another reason for using memcpy instead of assignment could be that we are 
avoiding the copy constructor for types that have one.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 10 Nov 2018 19:45:09 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 17:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1346/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 09 Nov 2018 19:06:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-09 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 16:

(5 comments)

Addressed comments and ran exhaustive tests overnight.

http://gerrit.cloudera.org:8080/#/c/8319/16/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8319/16/be/src/util/dict-encoding.h@464
PS16, Line 464: uint8_t* curr_value = reinterpret_cast(first_value);
> It could be useful to create a simple struct with a pointer + stride that h
Sorry about that, it wasn't a deliberate oversight, just sloppiness on my part. 
I took the idea and used it in a bunch of places - internally in many functions 
and also at function boundaries where the callee advances the pointer (it 
simplified things).

I didn't use it in the BitPacking interfaces, which are pretty low level and I 
don't think would benefit from it, and in other interfaces where the pointer 
wasn't advanced by the callee (since it didn't lead to simpler code).


http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README@240
PS15, Line 240: t as
> I have no problem with the code, but I am a bit surprised that Hive still t
I don't think there's a limit in the Parquet spec - it stores the day as julian 
days. Afaik the limit just comes from Boost.


http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README@244
PS15, Line 244:
> nit: double "code"
Done


http://gerrit.cloudera.org:8080/#/c/8319/15/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py:

http://gerrit.cloudera.org:8080/#/c/8319/15/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@105
PS15, Line 105: # Test that timestamp validation also works as expected 
when converting timestamps.
> This test seem to be an independent test that was appended here to avoid re
Done. That will be nice if we can avoid using a custom cluster test.


http://gerrit.cloudera.org:8080/#/c/8319/15/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8319/15/tests/query_test/test_scanners.py@98
PS15, Line 98: new_vector = deepcopy(vector)
 : new_vector.get_value('exec_option')['batch_size'] = 
vector.get_value('batch_size')
> Can you add a comment about the reason for doing this?
Done. For whatever reason this test has the query options as top-level test 
dimensions rather than part of the exec_option dimension, so they need to be 
copied over to have any effect.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 09 Nov 2018 18:33:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-09 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#17).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
A be/src/util/mem-util.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
26 files changed, 813 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/17
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 17
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-08 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 16:

(5 comments)

For some reason I did not realize until now that this is no longer WIP, sorry 
for the delay.

http://gerrit.cloudera.org:8080/#/c/8319/16/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8319/16/be/src/util/dict-encoding.h@464
PS16, Line 464: uint8_t* curr_value = reinterpret_cast(first_value);
It could be useful to create a simple struct with a pointer + stride that hides 
the stride related logic. I wrote something similar in an old comment in March, 
but now I would go for the simplest solution possible, like

template
struct StrideWriter {
  T* current;
  uint64_t stride;

  void SkipNext(int num) {
DCHECK(stride % alignof(T) == 0);
current = reinterpret_cast(reinterpret_cast(current) + 
stride*num));
  }

  void SetNext(T& val) {
memcpy(current, &val, sizeof(T));
SkipNext(1);
  }
}

I do not want to force this, but I think that it could make some parts of the 
code shorter and a bit easier to understand.


http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README@240
PS15, Line 240: t as
I have no problem with the code, but I am a bit surprised that Hive still these 
values without any issue. I thought that Parquet also had the 1400..1 
limitation because of Impala, but probably I am wrong.


http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README@244
PS15, Line 244:
nit: double "code"


http://gerrit.cloudera.org:8080/#/c/8319/15/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py:

http://gerrit.cloudera.org:8080/#/c/8319/15/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@105
PS15, Line 105: # Test that timestamp validation also works as expected 
when converting timestamps.
This test seem to be an independent test that was appended here to avoid 
restarting the Impala cluster. Can you add a comment about this, and maybe also 
move the test to a separate functions that is called here? (I guess that I 
should have done the same when I added test_stat_filtering).

Note that once https://gerrit.cloudera.org/#/c/11057/ will be merged, then the 
"converting + validating" paths in parquet-column-readers.cc should become 
reachable with the default flags using int64 timestamps.


http://gerrit.cloudera.org:8080/#/c/8319/15/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8319/15/tests/query_test/test_scanners.py@98
PS15, Line 98: new_vector = deepcopy(vector)
 : new_vector.get_value('exec_option')['batch_size'] = 
vector.get_value('batch_size')
Can you add a comment about the reason for doing this?



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 08 Nov 2018 22:59:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 16:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1329/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 08 Nov 2018 18:33:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-08 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#16).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
24 files changed, 749 insertions(+), 106 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/16
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 16:

Rebased and added HS2_TYPES to test_chars


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 08 Nov 2018 17:58:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 15:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1325/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 08 Nov 2018 02:16:30 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-08 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 14:

(2 comments)

I'm planning to do another passes in the following days to better understand 
the details.

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc@597
PS14, Line 597:   // If the file is corrupt, we may have more cached def levels 
than values in the page.
> I think I misdiagnosed a fuzz test failure and this is the wrong fix (and u
Thanks for the explanation!


http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc@210
PS14, Line 210: local_offset
> I did this so that the compiler could hoist reads of 'offset' out of the lo
Oh I see.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 08 Nov 2018 14:27:43 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 14:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc@162
PS14, Line 162: int32_t ParquetLevelDecoder::NextRepeatedRunLength() {
There was a subtle bug here discovered by the fuzz test. When max_level_ is 0, 
we don't actually initialise rle_decoder_, which means that the 
NextNumRepeats() check below is not valid and can hit a DCHECK.

CacheNextBatch() handles the max_level_ == 0 case specially and we need to do 
the same here.

I added some DCHECKs, variable initialisation and other defensiveness in the 
process of debugging how NextRepeatedRunLength() got called when rle_decoder_ 
wasn't initialised. I think they're worth keeping in.


http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc@597
PS14, Line 597:   // If the file is corrupt, we may have more cached def levels 
than values in the page.
> Shouldn't we raise a warning when the file is corrupt?
I think I misdiagnosed a fuzz test failure and this is the wrong fix (and 
unnecessary).

It looks like the CacheHasNext() check should ensure that the below loop 
doesn't execute more than num_buffered_values_ times since the cache is 
populated with at most num_buffered_values_.

I also don't think there's a way to easily cross-check the def levels and the 
number of values in the page metadata - the only way we know the exact number 
of def levels is the page metadata. We can end up with "extra" literal values 
at the end of the encoded def levels since the literal run length is always a 
multiple of 8.


http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc@210
PS14, Line 210: local_offset
> nit: it seems that 'local_offset' is unnecessary and we could just use 'off
I did this so that the compiler could hoist reads of 'offset' out of the loop 
below. Otherwise it thinks 'tuple_mem' and 'offset' might alias and the loop 
below ends up with unnecessary loads of 'offset' in it.

The simpler solution is to pass the argument by value. I did this and left a 
comment explaining why it's significant.


http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/util/rle-encoding.h@132
PS14, Line 132:   int64_t dict_len, OutType* values, int64_t stride) 
WARN_UNUSED_RESULT;
> nit: you could use a default argument instead of having two functions, i.e.
Good point.


http://gerrit.cloudera.org:8080/#/c/8319/14/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/8319/14/testdata/datasets/functional/functional_schema_template.sql@717
PS14, Line 717: INSERT OVERWRITE TABLE {db_name}{db_suffix}.{table_name} SELECT 
c.* FROM functional_parquet.complextypestbl c join functional.alltypes sort by 
id;
> Do we need the same insert stmt after LOAD and after DEPENDENT_LOAD_HIVE?
Yeah, actually LOAD is never executed since we don't create a text version of 
the table.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 08 Nov 2018 01:45:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-08 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#15).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
24 files changed, 745 insertions(+), 106 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/15
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 15
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 14:

Still rerunning exhaustive tests but I feel sufficiently confident to get 
another round of feedback.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 08 Nov 2018 01:46:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-11-07 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 14:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc@597
PS14, Line 597:   // If the file is corrupt, we may have more cached def levels 
than values in the page.
Shouldn't we raise a warning when the file is corrupt?


http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc@210
PS14, Line 210: local_offset
nit: it seems that 'local_offset' is unnecessary and we could just use 'offset' 
in L212


http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/util/rle-encoding.h@132
PS14, Line 132:   int64_t dict_len, OutType* values, int64_t stride) 
WARN_UNUSED_RESULT;
nit: you could use a default argument instead of having two functions, i.e. 
"..., int64_t stride = sizeof(OutType))".


http://gerrit.cloudera.org:8080/#/c/8319/14/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/8319/14/testdata/datasets/functional/functional_schema_template.sql@717
PS14, Line 717: INSERT OVERWRITE TABLE {db_name}{db_suffix}.{table_name} SELECT 
c.* FROM functional_parquet.complextypestbl c join functional.alltypes sort by 
id;
Do we need the same insert stmt after LOAD and after DEPENDENT_LOAD_HIVE?



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 07 Nov 2018 17:23:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 13:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8319/13/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py:

http://gerrit.cloudera.org:8080/#/c/8319/13/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@105
PS13, Line 105: .
flake8: E501 line too long (91 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/8319/13/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@106
PS13, Line 106: 1
flake8: E501 line too long (94 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/8319/13/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@107
PS13, Line 107: 2
flake8: E501 line too long (95 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 24 Oct 2018 02:28:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1136/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 23 Oct 2018 21:02:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1135/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 23 Oct 2018 20:53:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-23 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#14).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
24 files changed, 731 insertions(+), 93 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/14
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-23 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#13).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.

TODO:
* Before merging, run fuzz test for longer

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
24 files changed, 729 insertions(+), 93 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/13
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 13
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 12:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/936/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 12
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Oct 2018 16:01:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#12).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested

TODO:
* Before merging, run fuzz test for longer
* ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-packing.h
M be/src/util/bit-packing.inline.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
24 files changed, 729 insertions(+), 93 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/12
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 12
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#11).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.

These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested

TODO:
* Before merging, run fuzz test for longer
* ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
22 files changed, 705 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/11
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 11
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 12:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8319/12/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py:

http://gerrit.cloudera.org:8080/#/c/8319/12/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@105
PS12, Line 105: .
flake8: E501 line too long (91 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/8319/12/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@106
PS12, Line 106: 1
flake8: E501 line too long (94 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/8319/12/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@107
PS12, Line 107: 2
flake8: E501 line too long (95 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 12
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Oct 2018 15:28:50 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 10:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/935/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Oct 2018 15:17:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8319/10/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/10/be/src/exec/parquet-column-readers.cc@557
PS10, Line 557:   // Fast path to materialize a run of values with the same 
definition level. This
I could also consider enhancing test coverage by adding a debug action to 
randomly go down the slow path.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Oct 2018 15:15:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 9:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/934/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Oct 2018 15:10:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#10).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

These changes should enable further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested

TODO:
* Before merging, run fuzz test for longer
* ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
22 files changed, 705 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/10
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 10:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8319/10/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py:

http://gerrit.cloudera.org:8080/#/c/8319/10/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@105
PS10, Line 105: .
flake8: E501 line too long (91 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/8319/10/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@106
PS10, Line 106: 1
flake8: E501 line too long (94 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/8319/10/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@107
PS10, Line 107: 2
flake8: E501 line too long (95 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 10
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Oct 2018 14:52:51 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 9:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8319/9/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py:

http://gerrit.cloudera.org:8080/#/c/8319/9/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@105
PS9, Line 105: .
flake8: E501 line too long (91 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/8319/9/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@106
PS9, Line 106: 1
flake8: E501 line too long (94 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/8319/9/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@107
PS9, Line 107: 2
flake8: E501 line too long (95 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Oct 2018 14:51:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
..


Patch Set 9:

Good news! I finally added the missing test coverage.


--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Oct 2018 14:51:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123: Columnar decoding in Parquet

2018-10-04 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Zoltan Borok-Nagy, Csaba Ringhofer, Alex Behm, Mostafa 
Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8319

to look at the new patch set (#9).

Change subject: IMPALA-4123: Columnar decoding in Parquet
..

IMPALA-4123: Columnar decoding in Parquet

These changes should enable further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression.

Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.

Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.

Microbenchmarks on uncompressed parquet show scans only doing
dictionary decoding on uncompressed Parquet is ~75% faster:

   set mt_dop=1;
   select min(l_returnflag) from lineitem;

Testing:
We have alltypes agg with a mix of null and non-null.

Many tables have long runs of non-null values.

Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
  scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
  shapes - NULL collections, empty collections, etc, particularly runs
  of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested

TODO:
* Before merging, run fuzz test for longer
* ASAN and exhaustive runs.

Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M be/src/util/bit-stream-utils.h
M be/src/util/bit-stream-utils.inline.h
M be/src/util/dict-encoding.h
M be/src/util/rle-encoding.h
M testdata/bin/generate-schema-statements.py
M testdata/data/README
A testdata/data/out_of_range_timestamp2_hive_211.parquet
A testdata/data/out_of_range_timestamp_hive_211.parquet
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/chars.test
M 
testdata/workloads/functional-query/queries/QueryTest/nested-types-scanner-position.test
M testdata/workloads/functional-query/queries/QueryTest/nested-types-tpch.test
A 
testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/scanners-many-nulls.test
M tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
M tests/query_test/test_scanners.py
22 files changed, 717 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/8319/9
--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 9
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy