Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11000 )
Change subject: IMPALA-5542: Impala cannot scan Parquet decimal stored as int64_t/int32_t ...................................................................... Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h File be/src/exec/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/11000/1/be/src/exec/parquet-common.h@62 PS1, Line 62: static int ByteSize(const InternalType& v) { return sizeof(InternalType); } > I'm ok with leaving, but maybe just add an additional comment like "Used in Added this sentence to the comment of this function. http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@196 PS2, Line 196: // We require that the scale and byte length be set. > Maybe leave a comment explaining that we don't support automatically conver I think it is more like IMPALA-2515, since it is more about byte size than precision mismatch between the Parquet file metadata and the table metadata in the metastore. http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@199 PS2, Line 199: ile '$0' column > this might be a bit confusing since the term "precision" when used in the c Changed the comment. http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@203 PS2, Line 203: rquetPlainEncoder::DecimalSize( > can you add check for schema_element.type_length, since this section of the Based on https://github.com/apache/parquet-format/blob/518e206c3e6586b76e8315d5f62a8666ed62fa90/src/main/thrift/parquet.thrift#L347 the type_length has different meaning for FIXED_LEN_BYTE_ARRAY and for the other types. For other types it doesn't even need to be specified. The check at L216 is a bit different. It takes the precision from the metastore (slot_desc()->type().precision), and calculates the minimum byte size (expected_len) to store that precision. Impala is quite strict in the sense that it only allows properly sized decimals. IMPALA-2515 is the JIRA to loose this restriction. http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-metadata-utils.cc@234 PS2, Line 234: : if (!is_converted_type_decimal) { : // TODO: is this validation useful? It is not required at all to read the data and : // might only serve to reject otherwise perfectly readable files. : ErrorMsg msg(TErrorCode::PARQUET_BAD_CONVERTED_TYPE, filename, : schema_element.name); : RETURN_IF_ERROR(state->LogOrReturnError(msg)); : } : } else if (schema_element.__isset.scale || schema_element.__isset.precision : || is_converted_type_decimal) { : ErrorMsg msg(TErrorCode::PARQUET_INCOMPATIBLE_DECIMAL, filename, schema_element.name, : > can you also add similar checks for precision as well. This of the code part checks if there is a precision mismatch between Parquet file metadata and table metadata regardless of the underlying physical type. My checks up there (e.g. sizeof(int32_t) != slot_desc->type().GetByteSize()) also checks if the precision fits into Impala's internal type (int32_t or int64_t). In other words, if Parquet physical type is INT32, then the decimal with the given precision must fit into an int32_t. If the precision doesn't fit it returns an error just like in L217. Am I missing something? Exactly what checks do you think of? RETURN_IF_ERROR(state->LogOrReturnError(msg)) only returns on cancellation, non-recoverable errors, and when abort_on_error() is true. Do we want that in all cases? I think having less bytes than what the precision needs should return an error status from ValidateColumn(). http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-plain-test.cc File be/src/exec/parquet-plain-test.cc: http://gerrit.cloudera.org:8080/#/c/11000/2/be/src/exec/parquet-plain-test.cc@38 PS2, Line 38: Handle special case of encoding decimal types stored as BYTE_ARRAY > update comment IN32 and INT64 are now also mentioned. -- To view, visit http://gerrit.cloudera.org:8080/11000 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794 Gerrit-Change-Number: 11000 Gerrit-PatchSet: 1 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Wed, 01 Aug 2018 14:49:31 +0000 Gerrit-HasComments: Yes