Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/8319 )
Change subject: IMPALA-4123: Columnar decoding in Parquet ...................................................................... Patch Set 16: (5 comments) For some reason I did not realize until now that this is no longer WIP, sorry for the delay. http://gerrit.cloudera.org:8080/#/c/8319/16/be/src/util/dict-encoding.h File be/src/util/dict-encoding.h: http://gerrit.cloudera.org:8080/#/c/8319/16/be/src/util/dict-encoding.h@464 PS16, Line 464: uint8_t* curr_value = reinterpret_cast<uint8_t*>(first_value); It could be useful to create a simple struct with a pointer + stride that hides the stride related logic. I wrote something similar in an old comment in March, but now I would go for the simplest solution possible, like template<typename T> struct StrideWriter { T* current; uint64_t stride; void SkipNext(int num) { DCHECK(stride % alignof(T) == 0); current = reinterpret_cast<T*>(reinterpret_cast<uint8_t*>(current) + stride*num)); } void SetNext(T& val) { memcpy(current, &val, sizeof(T)); SkipNext(1); } } I do not want to force this, but I think that it could make some parts of the code shorter and a bit easier to understand. http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README@240 PS15, Line 240: t as I have no problem with the code, but I am a bit surprised that Hive still these values without any issue. I thought that Parquet also had the 1400..10000 limitation because of Impala, but probably I am wrong. http://gerrit.cloudera.org:8080/#/c/8319/15/testdata/data/README@244 PS15, Line 244: nit: double "code" http://gerrit.cloudera.org:8080/#/c/8319/15/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py File tests/custom_cluster/test_hive_parquet_timestamp_conversion.py: http://gerrit.cloudera.org:8080/#/c/8319/15/tests/custom_cluster/test_hive_parquet_timestamp_conversion.py@105 PS15, Line 105: # Test that timestamp validation also works as expected when converting timestamps. This test seem to be an independent test that was appended here to avoid restarting the Impala cluster. Can you add a comment about this, and maybe also move the test to a separate functions that is called here? (I guess that I should have done the same when I added test_stat_filtering). Note that once https://gerrit.cloudera.org/#/c/11057/ will be merged, then the "converting + validating" paths in parquet-column-readers.cc should become reachable with the default flags using int64 timestamps. http://gerrit.cloudera.org:8080/#/c/8319/15/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/8319/15/tests/query_test/test_scanners.py@98 PS15, Line 98: new_vector = deepcopy(vector) : new_vector.get_value('exec_option')['batch_size'] = vector.get_value('batch_size') Can you add a comment about the reason for doing this? -- To view, visit http://gerrit.cloudera.org:8080/8319 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef Gerrit-Change-Number: 8319 Gerrit-PatchSet: 16 Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Thu, 08 Nov 2018 22:59:07 +0000 Gerrit-HasComments: Yes