Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )

Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 8:

(37 comments)

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/hdfs-scan-node-base.h@272
PS8, Line 272:   io::ScanRange* AllocateScanRange(hdfsFS fs, const char* file,
nit: same line wrapping for both methods


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@360
PS8, Line 360: parquet::ColumnOrder*
nit: make this a const& or non-const ptr. If a const* is required to be null, 
mention in the comment.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@428
PS8, Line 428: Page
nit: lower case


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@429
PS8, Line 429: eliminated
nit: missing comma


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@552
PS8, Line 552:   /// Reads statistics data from the page index.
This looks like a candidate to explain the various parameters in some details, 
in particular fn_name


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@803
PS7, Line 803: nReader* scalar_reader :
> I only did it this way to not make the signature awkward. If I change 'skip
That's a good point. One could pass skip_ranges by value and then let the 
compiler elide the copies but consistency with out style is important, too.

To improve the quality of the abstraction and readability, I'd also consider 
making a copy of skip_ranges inside the method and then sorting it there. If it 
is small, the extra copy won't matter, and if it is large, the sorting will 
dominate the runtime, anyways.

Passing by ptr would also be OK I think, the order could be (num_rows, 
&skip_ranges, &candidate_ranges), which I think is fine.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@526
PS8, Line 526:
nit: double space


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc
File be/src/exec/parquet/parquet-common-test.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@32
PS8, Line 32: TEST(ParquetCommon, CalculateCandidateRanges) {
Please add some comments to the TEST functions that briefly explain what the 
test does.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@32
PS8, Line 32: CalculateCandidateRanges
rename


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@49
PS8, Line 49:           CalculateCandidateRanges({{1, 8}, {3, 8}, {7, 8}}, 10), 
{{0, 0}, {9, 9}});
Let's add some tests about error conditions, either here or elsewhere (this 
method might DCHECK). Currently the ranges are populated from 
page_locations.first_row_index which could have corrupted data. 
GetRowRangesForPage() could have some error handling but 
ComputeCandidateRanges() feels far away enough to warrant its own error 
checking.

* number of rows smaller than the beginning or end of one of the ranges
* Negative boundaries
* Last < first


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@52
PS8, Line 52: int
It would be good to have some tests that validate that values > INT_MAX don't 
break stuff for all places where they can occur.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@61
PS8, Line 61: CalculateCandidatePages
rename


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@77
PS8, Line 77:   ValidatePages({0, 10, 20, 50, 70}, {{9, 9}, {10, 10}, {50, 
50}}, 100, {0, 1, 3});
Some tests for error conditions would be good here, too.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@72
PS8, Line 72: don't want to skip.
"want to read" or "want to scan"?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@74
PS8, Line 74: const
nit: I think we usually omit this when passing by value


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc@92
PS8, Line 92: RowRange GetRowRangeForPage(const parquet::RowGroup& row_group,
nit: Return by pointer. I'm open to have a discussion whether we want to make 
use of copy-elision in such cases, and enforce it by putting some assert(false) 
into the copy c'tor.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc@95
PS8, Line 95:   auto& page_locations = offset_index.page_locations;
nit: const auto&?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc@106
PS8, Line 106: void ComputeCandidateRanges(const vector<RowRange>& skip_ranges,
Should we mark this one as static?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-level-decoder.h
File be/src/exec/parquet/parquet-level-decoder.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-level-decoder.h@109
PS8, Line 109:   // Prepare cache for reading if it's empty.
Mention what this does and what the return value means?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h
File be/src/exec/parquet/parquet-page-index.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@32
PS8, Line 32: /// Helper class for reading the Parquet page index.
Mention briefly what this class does, i.e. the buffer management required to 
read and deserialize the page index


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@37
PS8, Line 37:   /// It needs to be called before the serialization methods.
nit: Maybe make this read a bit more fluently, and mention what it does before 
the requirements?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@61
PS8, Line 61:   ScopedBuffer raw_page_index_;
rename to page_index_buffer_ or just buffer_?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@66
PS8, Line 66:
missing word?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@68
PS8, Line 68:   int64_t max_offset_;
We could replace this with the buffer size to make the relation between these 
and the butter size more obvious, or even remove it and use the buffer size 
directly instead.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc
File be/src/exec/parquet/parquet-page-index.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@43
PS8, Line 43:   base_offset_ = numeric_limits<int64_t>::max();
            :   max_offset_ = -1;
remove?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@47
PS8, Line 47:       int64_t ci_start = col_chunk.column_index_offset;
We should have a test that makes sure that this logic behaves well for broken 
files.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@54
PS8, Line 54:       if (idx_max > max_offset_) max_offset_ = idx_max;
This is just max3(ci_end, oi_end, max_offset_), right? I just found out that in 
C++11, std::max takes an initializer list, so you could write this as 
max({ci_end, oi_end, max_offset}).


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@57
PS8, Line 57:   if (base_offset_ >= max_offset_) return Status::OK();
Should we have a comment to explain why we don't return an error here, given we 
return an error when trying to deserialize later?


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@108
PS8, Line 108:   DCHECK_GE(file_offset, base_offset_);
This might be slightly more readable by changing it do DCHECK_GE(buffer_offset, 
0) below.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/service/query-options.cc@730
PS8, Line 730:             iequals(value, "true") || iequals(value, "1"));
Use IsTrue(), might also fix elsewhere since the other occurrence doesn't look 
like it would likely result in a merge conflict.


http://gerrit.cloudera.org:8080/#/c/12065/8/common/thrift/ImpalaInternalService.thrift
File common/thrift/ImpalaInternalService.thrift:

http://gerrit.cloudera.org:8080/#/c/12065/8/common/thrift/ImpalaInternalService.thrift@331
PS8, Line 331: Page
Refer to the comment in ImpalaService.thrift like the other options?


http://gerrit.cloudera.org:8080/#/c/12065/8/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/12065/8/common/thrift/ImpalaService.thrift@384
PS8, Line 384: Page
We should mention what this does, similar to PARQUET_READ_STATISTICS.

nit: lowercase "page".


http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@323
PS8, Line 323: hacked version of hive
The previous file doesn't mention the Hive hack


http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@353
PS8, Line 353: hacked version of hive 
same


http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@402
PS8, Line 402: hacked version of hive
same here


http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test
File 
testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test:

http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test@1
PS8, Line 1: ====
I think these tests would be easier to follow if they had a brief comment 
explaining what they're supposed to test. Can you add comments here and in the 
other .test files?


http://gerrit.cloudera.org:8080/#/c/12065/8/tests/query_test/test_parquet_stats.py
File tests/query_test/test_parquet_stats.py:

http://gerrit.cloudera.org:8080/#/c/12065/8/tests/query_test/test_parquet_stats.py@77
PS8, Line 77:   def test_page_index(self, vector, unique_database):
Please add a comment to this function.



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Apr 2019 21:42:19 +0000
Gerrit-HasComments: Yes

Reply via email to