Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 15:

(25 comments)

Still need to do a pass over the tests.

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@108
PS15, Line 108: PAGE_INDEX_TRUNCATE_LENGTH
PAGE_INDEX_MAX_STRING_LENGTH?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@130
PS15, Line 130: Write
nit: "Writes"


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@48
PS15, Line 48: using namespace parquet;
Can you remove this? I find omitting the parquet:: prefix confusing, and we 
often explicitly specify it below.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@312
PS15, Line 312:   OffsetIndex offset_index_;
I find it useful to prefix Parquet types with parquet:: and we do that in parts 
of this file already. Can you add it in places where it's missing and remove 
the "using" above? Let me know if you feel we shouldn't prefix them.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@651
PS15, Line 651:       location.compressed_page_size = -1;
Probably doesn't have much of an effect, but should we set the size to 0 for an 
empty page?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@668
PS15, Line 668:     location.compressed_page_size = 
page.header.compressed_page_size + len;
Should we add a comment for this line to explain that one includes the length 
and the other doesn't, to make the intent explicit?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@752
PS15, Line 752:   string min_val, max_val;
I think we usually do one declaration per line, but I don't feel strongly about 
it.

It also might be safer to enclose this block in a nested scope to prevent 
future editors from re-using min_val and max_val after they have been moved 
below.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@754
PS15, Line 754:     Status s_min = TruncateMinValue(page_stats.min_value, 
PAGE_INDEX_TRUNCATE_LENGTH,
Could this truncate non-string values if PAGE_INDEX_TRUNCATE_LENGTH was small 
enough? If so, we should add a check that it isn't, or at least add a comment 
to the definition of PAGE_INDEX_TRUNCATE_LENGTH.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@761
PS15, Line 761:     column_index_.null_pages.push_back(true);
Maybe DCHECK that both are false, to catch errors that result in only one being 
set?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1220
PS15, Line 1220:   // Currently we only support Parquet files with one row 
group.
Can you add a sentence why this particular code relies on this limited support?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225
PS15, Line 1225: ++)
nit: ++i

Or use a range based loop like you do in L1254?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@88
PS15, Line 88:       if (a != a && b != b) return 0;
Can we use std::isnan()?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@113
PS15, Line 113: Used for merging page statistics into column chunk statistics.
I think it's better to omit usage from the comment here. Instead say that it 
maintains internal state that tracks whether the values are ordered.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@164
PS15, Line 164: Let's a
nit: "We assume..." so it reads less like a suggestions.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@165
PS15, Line 165: If not all values are equal
What happens for NaN, i.e. when all values are NaN?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@54
PS15, Line 54:         if (MinMaxTrait<T>::Compare(prev_page_max_value_, 
cs->max_value_) > 0 ||
I cannot convince myself that prev_page_max_value_ has to be initialized here. 
Is it?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@66
PS15, Line 66:     Update(cs->min_value_, cs->max_value_);
If you do the order checks after this line, you can assume that 
has_min_max_values_ is true.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@224
PS15, Line 224:     Update(cs->min_value_, cs->max_value_);
same here


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@33
PS15, Line 33: void EvalTruncation(const string& original, const string& 
expected_result,
Why did you fold both test methods into one and use an enum for the test, but 
not for the helpers? Could they both be one method with an enum? It seems that 
TruncateMin is a special case of TruncateMax.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@41
PS15, Line 41:   EXPECT_EQ(expected_result, result);
This checks result without checking the return status of TruncateMin/Max


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@45
PS15, Line 45:   EvalTruncation("0123456789", "0123456789", 100, MIN);
Can you add tests with empty string, string that has \0 in it, and char min and 
max values? Also have at least one string that's not already sorted in 
ascending order.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68
PS15, Line 68:
Can you add a test for [0, 0x7F, 0xFF] to check that we don't rely on signed 
char overflow behavior?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.h@29
PS15, Line 29: Status TruncateMinValue(const std::string& str, int32_t 
max_length, std::string* result);
These seem similar to numeric rounding. Maybe name them TruncateUp and 
TruncateDown?


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.h@36
PS15, Line 36: Status TruncateMaxValue(const std::string& str, int32_t 
max_length, std::string* result);
This one should probably have a WARN_IF_UNUSED annotation since it modifies 
*result, even when hitting an error. If the caller forgets to handle the error, 
it might cause nasty bugs.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.cc@46
PS15, Line 46:   (*result)[i] += 1;
(*result)[i] could be 0x7F here and this addition could overflow, resulting in 
undefined behavior.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 15
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 08 May 2018 23:26:38 +0000
Gerrit-HasComments: Yes

Reply via email to