Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 16:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/exec/hdfs-parquet-table-writer.cc@1225
PS15, Line 1225:
> Switched to ++i.
Ah, I had overlooked row_group->columns[i]


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@33
PS15, Line 33: void EvalTruncation(const string& original, const string& 
expected_result,
> In the test it was easier to organize the code like this.
Let's leave it as it is, I also don't feel strongly about it.


http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68
PS15, Line 68:   EvalTruncation(normal_plus_max, "abcdeg" + string(4, 0), 10, 
UP);
> Added new tess. Already had tests with 0xFF at L57-L66
Thanks for adding more tests. I meant the exact 3-character string consisting 
of {0, (char)0x7F, (char)0xFF} to make sure that truncating up would hit the 
signed char overflow. Apologies for not being more clear.


http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc@46
PS16, Line 46:   unsigned char uch = (*result)[i];
I think this could use a comment and/or an explicit cast to make the intent 
clear.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@36
PS16, Line 36: PAGE_INDEX_TRUNCATE_LENGTH
update


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@55
PS16, Line 55: """
nit: PEP8 says the closing """ should be on a new line. You can use flake8 to 
check the whole file. It was recently added to our codebase: 
https://github.com/apache/impala/commit/2fb73f94b425fde488166a19f78050ddbc3d7b50


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@90
PS16, Line 90:       column_info = ColumnInfo._make((schema, stats, 
offset_index, column_index,
I think you should be able to just pass all the parameters to the ctor 
"ColumnInfo(schema, stats,...). Currently you wrap it in a tuple to get an 
iterable to pass to _make(), but that should not be necessary.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@96
PS16, Line 96: nametuples
"ColumnInfo objects" or "column infos"


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@97
PS16, Line 97: a row group
the first row group?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@110
PS16, Line 110: current_loc in page_locations[1:]:
You can zip a list with itself to create (prev, cur) pairs:

 In [6]: l = range(5)

 In [7]: zip(l[:-1], l[1:])
 Out[7]: [(0, 1), (1, 2), (2, 3), (3, 4)]


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@124
PS16, Line 124: null_page
nit: page_is_null might be more clear


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@182
PS16, Line 182:     previous_max = None
Would it make sense to validate the ordering instead of re-computing it? I.e., 
extract all the min_values and max_values into lists and then just all 
assert(sorted(min_values))?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@214
PS16, Line 214: sorting_column
unused?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@216
PS16, Line 216: skip_col_idx
Stale comment?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@249
PS16, Line 249:     vector.get_value('exec_option')['num_nodes'] = 1
move closer to the comment explaining the intent


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@321
PS16, Line 321: ends
nit: end


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@348
PS16, Line 348: followed by
              :     # zeroes.
I had missed this looking at the truncation code. Why do we keep the zeroes at 
the end?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@31
PS16, Line 31: The range must start with a serialized
             :   thrift object.
That does not actually seem to be a requirement of the function. Should we 
remove the sentence from the comment? Or enforce it?


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@39
PS16, Line 39:   """Creates a thrift protocol object from a memory buffer. The 
buffer should
Mention "serialized_object" in the comment, or rename the parameter to buffer? 
serialized_object_buffer? From the comment it'd not entirely obvious that 
"serialized_object" actually is the buffer.


http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@171
PS16, Line 171: read_serialized_object
Can read_serialized_object also create the protocol and return it? You could 
even pass the class into the method and then get the object out.

    read_serialized_object(ColumnIndex, filename, file_pos, length)

I don't feel strongly about it, but it might make the code simpler. Please 
ignore if you find it adds complexity.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 16
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 09 May 2018 17:52:00 +0000
Gerrit-HasComments: Yes

Reply via email to