Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as 
int64_t/int32_t
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-column-readers.cc@1536
PS1, Line 1536: DCHECK_EQ(sizeof(Decimal4Value::StorageType), 
slot_desc->type().GetByteSize());
i dont remember where the file level column metadata validation occurs in code, 
but can you add this check there in case the metadata is corrupted and/or does 
not match the hms column metadata


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@62
PS1, Line 62:   static int ByteSize(const InternalType& v) { return 
sizeof(InternalType); }
> I'm still confused about what this function does. Can we rename to somethin
This methods represents the encoded byte size if the internalType is written to 
parquet. I guess the confusion lies where in there is special casing for when 
this is used for encoding/decoding of fixed/variable len types.
>From what i understand this used to get the encoded size of an internalType 
>only for fixed len types, except for String type.
-------------------------------
returning -1 makes no sense and it only does that for cases which are 
impossible, hence the DCHECK(false) associated with them.
--------------------------------
The part in HdfsParquetTableWriter is correct since a plain_encoded_value_size_ 
< 0 means that its a variable len encoded type and since impala supports only 
string type to be written as a variable len encoded type, it calls on 
ParquetPlainEncoder::ByteSize to get it encode byte size which.


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@182
PS1, Line 182:   /// Decodes t from 'buffer', reading up to the byte before 
'buffer_end'. 'buffer'
             :   /// need not be aligned. If PARQUET_TYPE is 
FIXED_LEN_BYTE_ARRAY then 'fixed_len_size'
             :   /// is the size of the object. Otherwise, it is unused.
             :   /// Returns the number of bytes read or -1 if the value was 
not decoded successfully.
I think it will be super helpful to list which types are decoded by this 
template method and which ones have specialization. What do you all think?


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@208
PS1, Line 208: // Only used for testing to test decimals stored as INT32.
             :   // Otherwise Impala stores decimals as FIXED_LEN_BYTE_ARRAY.
a bit misleading, this implies that impala only supports this type for testing 
(that too not sure if for reading or writing). and the second line implies this 
method is only used for writing.


http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-metadata-utils.cc@195
PS1, Line 195:  if (slot_desc->type().type == TYPE_DECIMAL) {
here it is! you should add checks here to make sure int32 and int64 encoded 
decimal types have the right precision.



--
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Jul 2018 18:49:33 +0000
Gerrit-HasComments: Yes

Reply via email to