Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 6:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector<bool> null_pages_;
> If all strings in a page have the same 1KB prefix, then performance and spa
About leaving max_value unset: we cannot really do that, because 'max_values' 
is a required list of binary data. We can only set it to empty string, but an 
empty string can also be a valid value. And an empty string is less than all 
strings, so not really a good choice.


About determining the prefix size: why don't we calculate it dynamically based 
on the actual data and with some heuristic? E.g for max_value.:

 min_element(
   length(max_value),
   length(common_prefix(min_value, max_value)) * 1.5 + num_values / 100,
   4 kB
 )


http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/hdfs-parquet-table-writer.cc@727
PS6, Line 727:   // max_values_ and mark the values as valid. In case min and 
max values are not
> It is a bit misleading to by talking about marking the min/max values valid
Yeah, I agree that the comment is misleading, so I updated it.

Currently the write path logic is reversed however. We mark the page as null if 
it doesn't have min and max values, and we mark it as not-null if it has min 
and max values as well.


http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/6/be/src/exec/parquet-column-stats.inline.h@54
PS6, Line 54:         if (MinMaxTrait<T>::Compare(prev_page_max_value_, 
cs->max_value_) == 1 ||
> I know it is not Java but it might be readable to use a java-like contract
I changed the ifs to use < and > against 0 to allow some flexibility for the 
Compare() implementations.

However, for now I leave the definition of Compare as it is, because I can't 
use a - b for string values, therefore I'd need to write one more template 
specialization for strings.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131
PS6, Line 131:       for column_info in columns:
> The next block is a good candidate to move to a separate function.
You meant validate_page_locations(), right?


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@134
PS6, Line 134:         previous_loc = column_info.offset_index.page_locations[0]
             :         for current_loc in 
column_info.offset_index.page_locations[1:]:
             :           assert previous_loc.offset < current_loc.offset
             :           assert previous_loc.first_row_index < 
current_loc.first_row_index
             :           previous_loc = current_loc
> I would move this to a separate function like "validate_page_locations".
Done


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@139
PS6, Line 139:         null_pages = column_info.column_index.null_pages
> The next empty line could be moved before this line to group it together wi
I refactored a lot, this comment doesn't really apply anymore.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@157
PS6, Line 157:         if column_info.stats:
> This is more subjective than the comment in line 131, but I would move the
I created validate functions for the members of ColumnIndex and OffsetIndex.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163:             for null_page in null_pages:
             :               assert null_page
> Python has any() and all(). However, it might actually be better to use enu
OK, used enumerate, and added a message to the assert.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@166
PS6, Line 166:             return
> This "return" should be "continue", but it would be even better to move to
Right, now it's in a different function, called _validate_min_max_values


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@169
PS6, Line 169:           for null_page, value in zip(null_pages, min_values):
             :             if not null_page:
             :               current_value = 
decode_stats_value(column_info.schema, value)
             :               assert current_value >= min_value
             :
             :           max_value = decode_stats_value(column_info.schema, 
max_value_str)
             :           for null_page, value in zip(null_pages, max_values):
             :             if not null_page:
             :               current_value = 
decode_stats_value(column_info.schema, value)
             :               assert current_value <= max_value
> These two for loops could be merged. I would also prefer more exact variabl
I prefer two simple loops than a bigger, more complicated one, with two 
"current values". Improved the namings, though.

Thanks for the ideas, I added a check for the sum of the page null counts.
I'll add further tests once I'm implemented the truncation of min/max values.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 28 Mar 2018 00:04:22 +0000
Gerrit-HasComments: Yes

Reply via email to