Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 16: (20 comments) http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225 PS15, Line 1225: > Switched to ++i. Ah, I had overlooked row_group->columns[i] http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@33 PS15, Line 33: void EvalTruncation(const string& original, const string& expected_result, > In the test it was easier to organize the code like this. Let's leave it as it is, I also don't feel strongly about it. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68 PS15, Line 68: EvalTruncation(normal_plus_max, "abcdeg" + string(4, 0), 10, UP); > Added new tess. Already had tests with 0xFF at L57-L66 Thanks for adding more tests. I meant the exact 3-character string consisting of {0, (char)0x7F, (char)0xFF} to make sure that truncating up would hit the signed char overflow. Apologies for not being more clear. http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc@46 PS16, Line 46: unsigned char uch = (*result)[i]; I think this could use a comment and/or an explicit cast to make the intent clear. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@36 PS16, Line 36: PAGE_INDEX_TRUNCATE_LENGTH update http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@55 PS16, Line 55: """ nit: PEP8 says the closing """ should be on a new line. You can use flake8 to check the whole file. It was recently added to our codebase: https://github.com/apache/impala/commit/2fb73f94b425fde488166a19f78050ddbc3d7b50 http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@90 PS16, Line 90: column_info = ColumnInfo._make((schema, stats, offset_index, column_index, I think you should be able to just pass all the parameters to the ctor "ColumnInfo(schema, stats,...). Currently you wrap it in a tuple to get an iterable to pass to _make(), but that should not be necessary. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@96 PS16, Line 96: nametuples "ColumnInfo objects" or "column infos" http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@97 PS16, Line 97: a row group the first row group? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@110 PS16, Line 110: current_loc in page_locations[1:]: You can zip a list with itself to create (prev, cur) pairs: In [6]: l = range(5) In [7]: zip(l[:-1], l[1:]) Out[7]: [(0, 1), (1, 2), (2, 3), (3, 4)] http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@124 PS16, Line 124: null_page nit: page_is_null might be more clear http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@182 PS16, Line 182: previous_max = None Would it make sense to validate the ordering instead of re-computing it? I.e., extract all the min_values and max_values into lists and then just all assert(sorted(min_values))? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@214 PS16, Line 214: sorting_column unused? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@216 PS16, Line 216: skip_col_idx Stale comment? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@249 PS16, Line 249: vector.get_value('exec_option')['num_nodes'] = 1 move closer to the comment explaining the intent http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@321 PS16, Line 321: ends nit: end http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@348 PS16, Line 348: followed by : # zeroes. I had missed this looking at the truncation code. Why do we keep the zeroes at the end? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@31 PS16, Line 31: The range must start with a serialized : thrift object. That does not actually seem to be a requirement of the function. Should we remove the sentence from the comment? Or enforce it? http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@39 PS16, Line 39: """Creates a thrift protocol object from a memory buffer. The buffer should Mention "serialized_object" in the comment, or rename the parameter to buffer? serialized_object_buffer? From the comment it'd not entirely obvious that "serialized_object" actually is the buffer. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@171 PS16, Line 171: read_serialized_object Can read_serialized_object also create the protocol and return it? You could even pass the class into the method and then get the object out. read_serialized_object(ColumnIndex, filename, file_pos, length) I don't feel strongly about it, but it might make the code simpler. Please ignore if you find it adds complexity. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 16 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Wed, 09 May 2018 17:52:00 +0000 Gerrit-HasComments: Yes