Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 15: (25 comments) Still need to do a pass over the tests. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@108 PS15, Line 108: PAGE_INDEX_TRUNCATE_LENGTH PAGE_INDEX_MAX_STRING_LENGTH? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.h@130 PS15, Line 130: Write nit: "Writes" http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@48 PS15, Line 48: using namespace parquet; Can you remove this? I find omitting the parquet:: prefix confusing, and we often explicitly specify it below. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@312 PS15, Line 312: OffsetIndex offset_index_; I find it useful to prefix Parquet types with parquet:: and we do that in parts of this file already. Can you add it in places where it's missing and remove the "using" above? Let me know if you feel we shouldn't prefix them. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@651 PS15, Line 651: location.compressed_page_size = -1; Probably doesn't have much of an effect, but should we set the size to 0 for an empty page? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@668 PS15, Line 668: location.compressed_page_size = page.header.compressed_page_size + len; Should we add a comment for this line to explain that one includes the length and the other doesn't, to make the intent explicit? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@752 PS15, Line 752: string min_val, max_val; I think we usually do one declaration per line, but I don't feel strongly about it. It also might be safer to enclose this block in a nested scope to prevent future editors from re-using min_val and max_val after they have been moved below. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@754 PS15, Line 754: Status s_min = TruncateMinValue(page_stats.min_value, PAGE_INDEX_TRUNCATE_LENGTH, Could this truncate non-string values if PAGE_INDEX_TRUNCATE_LENGTH was small enough? If so, we should add a check that it isn't, or at least add a comment to the definition of PAGE_INDEX_TRUNCATE_LENGTH. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@761 PS15, Line 761: column_index_.null_pages.push_back(true); Maybe DCHECK that both are false, to catch errors that result in only one being set? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1220 PS15, Line 1220: // Currently we only support Parquet files with one row group. Can you add a sentence why this particular code relies on this limited support? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225 PS15, Line 1225: ++) nit: ++i Or use a range based loop like you do in L1254? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@88 PS15, Line 88: if (a != a && b != b) return 0; Can we use std::isnan()? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@113 PS15, Line 113: Used for merging page statistics into column chunk statistics. I think it's better to omit usage from the comment here. Instead say that it maintains internal state that tracks whether the values are ordered. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@164 PS15, Line 164: Let's a nit: "We assume..." so it reads less like a suggestions. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.h@165 PS15, Line 165: If not all values are equal What happens for NaN, i.e. when all values are NaN? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@54 PS15, Line 54: if (MinMaxTrait<T>::Compare(prev_page_max_value_, cs->max_value_) > 0 || I cannot convince myself that prev_page_max_value_ has to be initialized here. Is it? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@66 PS15, Line 66: Update(cs->min_value_, cs->max_value_); If you do the order checks after this line, you can assume that has_min_max_values_ is true. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/parquet-column-stats.inline.h@224 PS15, Line 224: Update(cs->min_value_, cs->max_value_); same here http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@33 PS15, Line 33: void EvalTruncation(const string& original, const string& expected_result, Why did you fold both test methods into one and use an enum for the test, but not for the helpers? Could they both be one method with an enum? It seems that TruncateMin is a special case of TruncateMax. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@41 PS15, Line 41: EXPECT_EQ(expected_result, result); This checks result without checking the return status of TruncateMin/Max http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@45 PS15, Line 45: EvalTruncation("0123456789", "0123456789", 100, MIN); Can you add tests with empty string, string that has \0 in it, and char min and max values? Also have at least one string that's not already sorted in ascending order. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68 PS15, Line 68: Can you add a test for [0, 0x7F, 0xFF] to check that we don't rely on signed char overflow behavior? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.h File be/src/util/string-util.h: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.h@29 PS15, Line 29: Status TruncateMinValue(const std::string& str, int32_t max_length, std::string* result); These seem similar to numeric rounding. Maybe name them TruncateUp and TruncateDown? http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.h@36 PS15, Line 36: Status TruncateMaxValue(const std::string& str, int32_t max_length, std::string* result); This one should probably have a WARN_IF_UNUSED annotation since it modifies *result, even when hitting an error. If the caller forgets to handle the error, it might cause nasty bugs. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util.cc@46 PS15, Line 46: (*result)[i] += 1; (*result)[i] could be 0x7F here and this addition could overflow, resulting in undefined behavior. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 15 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 08 May 2018 23:26:38 +0000 Gerrit-HasComments: Yes