Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7822 )
Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet scanner ...................................................................... Patch Set 4: (12 comments) Did a first pass over it - thought I should flush out comments since you've been waiting quite a while for feedback here. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc@1277 PS4, Line 1277: switch (node.element->type) { This case is only going to get bigger with the follow on patches - it would be best to make it a separate function. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@213 PS4, Line 213: template <typename LOGICAL_TYPE, parquet::Type::type PHYSICAL_TYPE> Logical type and physical type I think aren't quite right, since e.g. "StringValue" isn't a logical type. It's really decoding/encoding between Parquet physical types and Impala's internal representation. Maybe INTERNAL_TYPE and PARQUET_PHYSICAL_TYPE? Or PARQUET_TYPE seems ok since that's what parquet calls it. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@376 PS4, Line 376: inline int ParquetPlainEncoder::Decode<Decimal4Value, parquet::Type::BYTE_ARRAY>( I think you can avoid stamping this out if you leave it still parameterized by the internal Impala type, since the logic is the same in all cases. There may be an opportunity to reduce the redundancy above too, since there are functions above with identical bodies. I tried this out on a dummy program and it looks like it's possible: #include <stdio.h> template <typename T, typename S> int foo(T* a, S* b) { return 0; } template <> int foo(int* a, int* b) { return 1; } template <typename T> int foo(T* a, double* b) { return 2; } int main(int argc, char**argv) { int a, b, c; double x, y, z; printf("%d\n", foo(&a, &b)); printf("%d\n", foo(&a, &x)); } http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@66 PS4, Line 66: bool IsSupportedPhysicalType(PrimitiveType impala_type, let's make it static - we don't want to export the symbol to be linked with code outside this file. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@70 PS4, Line 70: ( unnecessary parens http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@173 PS4, Line 173: // TODO: Is this check too strict? I don't see why this shouldn't match the file metadata, this seems valid to me. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@182 PS4, Line 182: parquet::SchemaElement* Why not pass by const ref like above? http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@209 PS4, Line 209: stringstream ss; Can you convert to using Substitute() while we're here? We've been very gradually converting to using that for cases like this. http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@176 PS4, Line 176: TestType<Decimal4Value, parquet::Type::BYTE_ARRAY>(Decimal4Value(test_val), It would be nice to combine the FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY tests. Mostly the tests are the same except for the expected size, right? Could we make them table-driven so that we test all the same cases but just have different expected output? http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193 PS4, Line 193: int32_t Any reason to use sizeof(int..) instead of sizeof(Decimal*Value) like above? Or if this is the better way to express it, should we change it above? http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@212 PS4, Line 212: for (int i = 1; i <=16; ++i) { nit: missing space http://gerrit.cloudera.org:8080/#/c/7822/4/testdata/data/binary_decimal_dictionary.parquet File testdata/data/binary_decimal_dictionary.parquet: PS4: Can you update the README to describe how these files were generated. -- To view, visit http://gerrit.cloudera.org:8080/7822 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e Gerrit-Change-Number: 7822 Gerrit-PatchSet: 4 Gerrit-Owner: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Matthew Jacobs <mjac...@apache.org> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Thu, 12 Oct 2017 00:52:03 +0000 Gerrit-HasComments: Yes