Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )
Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t ...................................................................... Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-column-readers.cc@1536 PS1, Line 1536: DCHECK_EQ(sizeof(Decimal4Value::StorageType), slot_desc->type().GetByteSize()); i dont remember where the file level column metadata validation occurs in code, but can you add this check there in case the metadata is corrupted and/or does not match the hms column metadata http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@62 PS1, Line 62: static int ByteSize(const InternalType& v) { return sizeof(InternalType); } > I'm still confused about what this function does. Can we rename to somethin This methods represents the encoded byte size if the internalType is written to parquet. I guess the confusion lies where in there is special casing for when this is used for encoding/decoding of fixed/variable len types. >From what i understand this used to get the encoded size of an internalType >only for fixed len types, except for String type. ------------------------------- returning -1 makes no sense and it only does that for cases which are impossible, hence the DCHECK(false) associated with them. -------------------------------- The part in HdfsParquetTableWriter is correct since a plain_encoded_value_size_ < 0 means that its a variable len encoded type and since impala supports only string type to be written as a variable len encoded type, it calls on ParquetPlainEncoder::ByteSize to get it encode byte size which. http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@182 PS1, Line 182: /// Decodes t from 'buffer', reading up to the byte before 'buffer_end'. 'buffer' : /// need not be aligned. If PARQUET_TYPE is FIXED_LEN_BYTE_ARRAY then 'fixed_len_size' : /// is the size of the object. Otherwise, it is unused. : /// Returns the number of bytes read or -1 if the value was not decoded successfully. I think it will be super helpful to list which types are decoded by this template method and which ones have specialization. What do you all think? http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@208 PS1, Line 208: // Only used for testing to test decimals stored as INT32. : // Otherwise Impala stores decimals as FIXED_LEN_BYTE_ARRAY. a bit misleading, this implies that impala only supports this type for testing (that too not sure if for reading or writing). and the second line implies this method is only used for writing. http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-metadata-utils.cc@195 PS1, Line 195: if (slot_desc->type().type == TYPE_DECIMAL) { here it is! you should add checks here to make sure int32 and int64 encoded decimal types have the right precision. -- To view, visit http://gerrit.cloudera.org:8080/11000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794 Gerrit-Change-Number: 11000 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Fri, 20 Jul 2018 18:49:33 +0000 Gerrit-HasComments: Yes