Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
......................................................................

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

For example:
  CREATE TABLE impala_char_insert (s STRING);

  -- all values are CHAR(N) with different N, but all will use the
     biggest N
  INSERT OVERWRITE impala_char_insert VALUES
    (CAST("1" AS CHAR(1))),
    (CAST("12" AS CHAR(2))),
    (CAST("123" AS CHAR(3)));

  SELECT length(s) FROM impala_char_insert;
  3
  3
  3

  -- if inserted individually, the result is
  SELECT length(s) FROM impala_char_insert;
  1
  2
  3

This patch adds the query option VALUES_STMT_AVOID_LOSSY_CHAR_PADDING
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

Note that the behaviour in Hive is different from both behaviours in
Impala: Hive (and PostgreSQL) implicitly remove trailing spaces from
CHAR values when they are cast to other types, which is also lossy.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we would therefore run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Note: the new VALUES_STMT_AVOID_LOSSY_CHAR_PADDING query option and
ALLOW_UNSAFE_CASTS are not allowed to be used at the same time: if both
are set to true and the query contains set operation(s), an error is
returned.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Reviewed-on: http://gerrit.cloudera.org:8080/18999
Reviewed-by: Csaba Ringhofer <csringho...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-char-padding.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-no-lossy-char-padding.test
M tests/query_test/test_chars.py
13 files changed, 331 insertions(+), 10 deletions(-)

Approvals:
  Csaba Ringhofer: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 16
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Peter Rozsa <pro...@cloudera.com>

Reply via email to