Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
......................................................................


Patch Set 16:

(5 comments)

For some reason I did not realize until now that this is no longer WIP, sorry 
for the delay.

http://gerrit.cloudera.org:8080/#/c/8319/16/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8319/16/be/src/util/dict-encoding.h@464
PS16, Line 464: uint8_t* curr_value = reinterpret_cast<uint8_t*>(first_value);
It could be useful to create a simple struct with a pointer + stride that hides 
the stride related logic. I wrote something similar in an old comment in March, 
but now I would go for the simplest solution possible, like

template<typename T>
struct StrideWriter {
  T* current;
  uint64_t stride;

  void SkipNext(int num) {
    DCHECK(stride % alignof(T) == 0);
    current = reinterpret_cast<T*>(reinterpret_cast<uint8_t*>(current) + 
stride*num));
  }

  void SetNext(T& val) {
    memcpy(current, &val, sizeof(T));
    SkipNext(1);
  }
}

I do not want to force this, but I think that it could make some parts of the 
code shorter and a bit easier to understand.


http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README@240
PS15, Line 240: t as
I have no problem with the code, but I am a bit surprised that Hive still these 
values without any issue. I thought that Parquet also had the 1400..10000 
limitation because of Impala, but probably I am wrong.


http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README@244
PS15, Line 244:
nit: double "code"


http://gerrit.cloudera.org:8080/#/c/8319/15/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py
File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py:

http://gerrit.cloudera.org:8080/#/c/8319/15/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@105
PS15, Line 105:     # Test that timestamp validation also works as expected 
when converting timestamps.
This test seem to be an independent test that was appended here to avoid 
restarting the Impala cluster. Can you add a comment about this, and maybe also 
move the test to a separate functions that is called here? (I guess that I 
should have done the same when I added test_stat_filtering).

Note that once https://gerrit.cloudera.org/#/c/11057/ will be merged, then the 
"converting + validating" paths in parquet-column-readers.cc should become 
reachable with the default flags using int64 timestamps.


http://gerrit.cloudera.org:8080/#/c/8319/15/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8319/15/tests/query_test/test_scanners.py@98
PS15, Line 98:     new_vector = deepcopy(vector)
             :     new_vector.get_value('exec_option')['batch_size'] = 
vector.get_value('batch_size')
Can you add a comment about the reason for doing this?



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 16
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 08 Nov 2018 22:59:07 +0000
Gerrit-HasComments: Yes

Reply via email to