Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 6: (10 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector<bool> null_pages_; > If all strings in a page have the same 1KB prefix, then performance and spa About leaving max_value unset: we cannot really do that, because 'max_values' is a required list of binary data. We can only set it to empty string, but an empty string can also be a valid value. And an empty string is less than all strings, so not really a good choice. About determining the prefix size: why don't we calculate it dynamically based on the actual data and with some heuristic? E.g for max_value.: min_element( length(max_value), length(common_prefix(min_value, max_value)) * 1.5 + num_values / 100, 4 kB ) http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc@727 PS6, Line 727: // max_values_ and mark the values as valid. In case min and max values are not > It is a bit misleading to by talking about marking the min/max values valid Yeah, I agree that the comment is misleading, so I updated it. Currently the write path logic is reversed however. We mark the page as null if it doesn't have min and max values, and we mark it as not-null if it has min and max values as well. http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h@54 PS6, Line 54: if (MinMaxTrait<T>::Compare(prev_page_max_value_, cs->max_value_) == 1 || > I know it is not Java but it might be readable to use a java-like contract I changed the ifs to use < and > against 0 to allow some flexibility for the Compare() implementations. However, for now I leave the definition of Compare as it is, because I can't use a - b for string values, therefore I'd need to write one more template specialization for strings. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131 PS6, Line 131: for column_info in columns: > The next block is a good candidate to move to a separate function. You meant validate_page_locations(), right? http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@134 PS6, Line 134: previous_loc = column_info.offset_index.page_locations[0] : for current_loc in column_info.offset_index.page_locations[1:]: : assert previous_loc.offset < current_loc.offset : assert previous_loc.first_row_index < current_loc.first_row_index : previous_loc = current_loc > I would move this to a separate function like "validate_page_locations". Done http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@139 PS6, Line 139: null_pages = column_info.column_index.null_pages > The next empty line could be moved before this line to group it together wi I refactored a lot, this comment doesn't really apply anymore. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@157 PS6, Line 157: if column_info.stats: > This is more subjective than the comment in line 131, but I would move the I created validate functions for the members of ColumnIndex and OffsetIndex. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: for null_page in null_pages: : assert null_page > Python has any() and all(). However, it might actually be better to use enu OK, used enumerate, and added a message to the assert. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@166 PS6, Line 166: return > This "return" should be "continue", but it would be even better to move to Right, now it's in a different function, called _validate_min_max_values http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@169 PS6, Line 169: for null_page, value in zip(null_pages, min_values): : if not null_page: : current_value = decode_stats_value(column_info.schema, value) : assert current_value >= min_value : : max_value = decode_stats_value(column_info.schema, max_value_str) : for null_page, value in zip(null_pages, max_values): : if not null_page: : current_value = decode_stats_value(column_info.schema, value) : assert current_value <= max_value > These two for loops could be merged. I would also prefer more exact variabl I prefer two simple loops than a bigger, more complicated one, with two "current values". Improved the namings, though. Thanks for the ideas, I added a check for the sum of the page null counts. I'll add further tests once I'm implemented the truncation of min/max values. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Wed, 28 Mar 2018 00:04:22 +0000 Gerrit-HasComments: Yes