Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/11000 )

Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as 
int64_t/int32_t
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@62
PS1, Line 62:   static int ByteSize(const InternalType& v) { return 
sizeof(InternalType); }
> I'm ok with leaving, but maybe just add an additional comment like "Used in
Added this sentence to the comment of this function.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@196
PS2, Line 196:     // We require that the scale and byte length be set.
> Maybe leave a comment explaining that we don't support automatically conver
I think it is more like IMPALA-2515, since it is more about byte size than 
precision mismatch between the Parquet file metadata and the table metadata in 
the metastore.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@199
PS2, Line 199: ile '$0' column
> this might be a bit confusing since the term "precision" when used in the c
Changed the comment.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@203
PS2, Line 203: rquetPlainEncoder::DecimalSize(
> can you add check for schema_element.type_length, since this section of the
Based on 
https://github.com/apache/parquet-format/blob/518e206c3e6586b76e8315d5f62a8666ed62fa90/src/main/thrift/parquet.thrift#L347
 the type_length has different meaning for FIXED_LEN_BYTE_ARRAY and for the 
other types.

For other types it doesn't even need to be specified.

The check at L216 is a bit different. It takes the precision from the metastore 
(slot_desc()->type().precision), and calculates the minimum byte size 
(expected_len) to store that precision.

Impala is quite strict in the sense that it only allows properly sized 
decimals. IMPALA-2515 is the JIRA to loose this restriction.


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@234
PS2, Line 234:
             :     if (!is_converted_type_decimal) {
             :       // TODO: is this validation useful? It is not required at 
all to read the data and
             :       // might only serve to reject otherwise perfectly readable 
files.
             :       ErrorMsg msg(TErrorCode::PARQUET_BAD_CONVERTED_TYPE, 
filename,
             :           schema_element.name);
             :       RETURN_IF_ERROR(state->LogOrReturnError(msg));
             :     }
             :   } else if (schema_element.__isset.scale || 
schema_element.__isset.precision
             :       || is_converted_type_decimal) {
             :     ErrorMsg msg(TErrorCode::PARQUET_INCOMPATIBLE_DECIMAL, 
filename, schema_element.name,
             :
> can you also add similar checks for precision as well.
This of the code part checks if there is a precision mismatch between Parquet 
file metadata and table metadata regardless of the underlying physical type.

My checks up there (e.g. sizeof(int32_t) != slot_desc->type().GetByteSize()) 
also checks if the precision fits into Impala's internal type (int32_t or 
int64_t). In other words, if Parquet physical type is INT32, then the decimal 
with the given precision must fit into an int32_t.

If the precision doesn't fit it returns an error just like in L217.

Am I missing something? Exactly what checks do you think of?

RETURN_IF_ERROR(state->LogOrReturnError(msg)) only returns on cancellation, 
non-recoverable errors, and when abort_on_error() is true. Do we want that in 
all cases? I think having less bytes than what the precision needs should 
return an error status from ValidateColumn().


http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-plain-test.cc@38
PS2, Line 38: Handle special case of encoding decimal types stored as BYTE_ARRAY
> update comment
IN32 and INT64 are now also mentioned.



--
To view, visit http://gerrit.cloudera.org:8080/11000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Gerrit-Change-Number: 11000
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Aug 2018 14:49:31 +0000
Gerrit-HasComments: Yes

Reply via email to