[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

For example:
  CREATE TABLE impala_char_insert (s STRING);

  -- all values are CHAR(N) with different N, but all will use the
 biggest N
  INSERT OVERWRITE impala_char_insert VALUES
(CAST("1" AS CHAR(1))),
(CAST("12" AS CHAR(2))),
(CAST("123" AS CHAR(3)));

  SELECT length(s) FROM impala_char_insert;
  3
  3
  3

  -- if inserted individually, the result is
  SELECT length(s) FROM impala_char_insert;
  1
  2
  3

This patch adds the query option VALUES_STMT_AVOID_LOSSY_CHAR_PADDING
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

Note that the behaviour in Hive is different from both behaviours in
Impala: Hive (and PostgreSQL) implicitly remove trailing spaces from
CHAR values when they are cast to other types, which is also lossy.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we would therefore run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Note: the new VALUES_STMT_AVOID_LOSSY_CHAR_PADDING query option and
ALLOW_UNSAFE_CASTS are not allowed to be used at the same time: if both
are set to true and the query contains set operation(s), an error is
returned.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Reviewed-on: http://gerrit.cloudera.org:8080/18999
Reviewed-by: Csaba Ringhofer 
Tested-by: Impala Public Jenkins 
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-char-padding.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-no-lossy-char-padding.test
M tests/query_test/test_chars.py
13 files changed, 331 insertions(+), 10 deletions(-)

Approvals:
  Csaba Ringhofer: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 16
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 15: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 11 Jul 2023 15:48:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 15:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13517/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 11 Jul 2023 11:51:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13515/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 14
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 11 Jul 2023 11:49:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13516/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 13
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 11 Jul 2023 11:48:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#13). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

For example:
  CREATE TABLE impala_char_insert (s STRING);

  -- all values are CHAR(N) with different N, but all will use the
 biggest N
  INSERT OVERWRITE impala_char_insert VALUES
(CAST("1" AS CHAR(1))),
(CAST("12" AS CHAR(2))),
(CAST("123" AS CHAR(3)));

  SELECT length(s) FROM impala_char_insert;
  3
  3
  3

  -- if inserted individually, the result is
  SELECT length(s) FROM impala_char_insert;
  1
  2
  3

This patch adds the query option VALUES_STMT_AVOID_LOSSY_CHAR_PADDING
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

Note that the behaviour in Hive is different from both behaviours in
Impala: Hive (and PostgreSQL) implicitly remove trailing spaces from
CHAR values when they are cast to other types, which is also lossy.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we would therefore run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Note: the new VALUES_STMT_AVOID_LOSSY_CHAR_PADDING query option and
ALLOW_UNSAFE_CASTS are not allowed to be used at the same time: if both
are set to true and the query contains set operation(s), an error is
returned.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-char-padding.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-no-lossy-char-padding.test
M tests/query_test/test_chars.py
13 files changed, 331 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/13
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 13
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#14). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

For example:
  CREATE TABLE impala_char_insert (s STRING);

  -- all values are CHAR(N) with different N, but all will use the
 biggest N
  INSERT OVERWRITE impala_char_insert VALUES
(CAST("1" AS CHAR(1))),
(CAST("12" AS CHAR(2))),
(CAST("123" AS CHAR(3)));

  SELECT length(s) FROM impala_char_insert;
  3
  3
  3

  -- if inserted individually, the result is
  SELECT length(s) FROM impala_char_insert;
  1
  2
  3

This patch adds the query option VALUES_STMT_AVOID_LOSSY_CHAR_PADDING
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

Note that the behaviour in Hive is different from both behaviours in
Impala: Hive (and PostgreSQL) implicitly remove trailing spaces from
CHAR values when they are cast to other types, which is also lossy.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we would therefore run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Note: the new VALUES_STMT_AVOID_LOSSY_CHAR_PADDING query option and
ALLOW_UNSAFE_CASTS are not allowed to be used at the same time: if both
are set to true and the query contains set operation(s), an error is
returned.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-char-padding.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-no-lossy-char-padding.test
M tests/query_test/test_chars.py
13 files changed, 332 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/14
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 14
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 15:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9491/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 11 Jul 2023 11:29:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 15: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 11 Jul 2023 11:28:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-11 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#15). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

For example:
  CREATE TABLE impala_char_insert (s STRING);

  -- all values are CHAR(N) with different N, but all will use the
 biggest N
  INSERT OVERWRITE impala_char_insert VALUES
(CAST("1" AS CHAR(1))),
(CAST("12" AS CHAR(2))),
(CAST("123" AS CHAR(3)));

  SELECT length(s) FROM impala_char_insert;
  3
  3
  3

  -- if inserted individually, the result is
  SELECT length(s) FROM impala_char_insert;
  1
  2
  3

This patch adds the query option VALUES_STMT_AVOID_LOSSY_CHAR_PADDING
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

Note that the behaviour in Hive is different from both behaviours in
Impala: Hive (and PostgreSQL) implicitly remove trailing spaces from
CHAR values when they are cast to other types, which is also lossy.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we would therefore run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Note: the new VALUES_STMT_AVOID_LOSSY_CHAR_PADDING query option and
ALLOW_UNSAFE_CASTS are not allowed to be used at the same time: if both
are set to true and the query contains set operation(s), an error is
returned.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-char-padding.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-no-lossy-char-padding.test
M tests/query_test/test_chars.py
13 files changed, 331 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/15
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 11:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13483/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 11
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Thu, 06 Jul 2023 14:58:24 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-06 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#11). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

For example:
  CREATE TABLE impala_char_insert (s STRING);

  -- all values are CHAR(N) with different N, but all will use the
 biggest N
  INSERT OVERWRITE impala_char_insert VALUES
(CAST("1" AS CHAR(1))),
(CAST("12" AS CHAR(2))),
(CAST("123" AS CHAR(3)));

  SELECT length(s) FROM impala_char_insert;
  3
  3
  3

  -- if inserted individually, the result is
  SELECT length(s) FROM impala_char_insert;
  1
  2
  3

This patch adds the query option VALUES_STMT_AVOID_LOSSY_CHAR_PADDING
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

Note that the behaviour in Hive is different from both behaviours in
Impala: Hive (and PostgreSQL) implicitly remove trailing spaces from
CHAR values when they are cast to other types, which is also lossy.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we would therefore run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-char-padding.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-no-lossy-char-padding.test
M tests/query_test/test_chars.py
13 files changed, 315 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/11
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 11
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13477/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 10
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 04 Jul 2023 13:02:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-04 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#10). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

For example:
  CREATE TABLE impala_char_insert (s STRING);

  -- all values are CHAR(N) with different N, but all will use the
 biggest N
  INSERT OVERWRITE impala_char_insert VALUES
(CAST("1" AS CHAR(1))),
(CAST("12" AS CHAR(2))),
(CAST("123" AS CHAR(3)));

  SELECT length(s) FROM impala_char_insert;
  3
  3
  3

  -- if inserted individually, the result is
  SELECT length(s) FROM impala_char_insert;
  1
  2
  3

This patch adds the query option VALUES_STMT_AVOID_LOSSY_CHAR_PADDING
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

Note that the behaviour in Hive is different from both behaviours in
Impala: Hive implicitly removes trailing spaces from CHAR values when
they are cast to other types, which is also lossy.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we would therefore run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-char-padding.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-no-lossy-char-padding.test
M tests/query_test/test_chars.py
13 files changed, 340 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/10
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 10
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18999/10/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/18999/10/be/src/service/query-options.h@53
PS10, Line 53:   TImpalaQueryOptions::VALUES_STMT_AVOID_LOSSY_CHAR_PADDING 
+ 1);   \
line too long (93 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 10
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 04 Jul 2023 12:39:53 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13475/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 9
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 04 Jul 2023 11:36:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-04 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#9). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

For example:
  CREATE TABLE impala_char_insert (s STRING);

  -- all values are CHAR(N) with different N, but all will use the
 biggest N
  INSERT OVERWRITE impala_char_insert VALUES
(CAST("1" AS CHAR(1))),
(CAST("12" AS CHAR(2))),
(CAST("123" AS CHAR(3)));

  SELECT length(s) FROM impala_char_insert;
  3
  3
  3

  -- if inserted individually, the result is
  SELECT length(s) FROM impala_char_insert;
  1
  2
  3

This patch adds the query option VALUES_STMT_AVOID_LOSSY_CHAR_PADDING
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

Note that the behaviour in Hive is different from both behaviours in
Impala: Hive implicitly removes trailing spaces from CHAR values when
they are cast to other types, which is also lossy.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we would therefore run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/SetOperationStmt.java
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanFragment.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-char-padding.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-no-lossy-char-padding.test
M tests/query_test/test_chars.py
13 files changed, 339 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/9
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 9
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-04 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 9:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/18999/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18999/8//COMMIT_MSG@8
PS8, Line 8:
   : If, in a VALUES clause, for the same column all of the values are 
CHAR
   : types but not all are of the same length, the common type chosen is
   : CHAR(max(lengths)). This means that shorter values are padded with
   : spaces. If the destination column is not CHAR but VARCHAR or 
STRING,
   : this produces different results than if the values in the column 
are
   : inserted individually, in separate statements. This behaviour is
   : suboptimal because information is lost.
> An example could make this clearer.
Done


http://gerrit.cloudera.org:8080/#/c/18999/8//COMMIT_MSG@17
PS8, Line 17:
> I would prefer to add CHAR to the name somehow as it is only applied for th
Done


http://gerrit.cloudera.org:8080/#/c/18999/8//COMMIT_MSG@26
PS8, Line 26:
> Can you add a note about already different behavior in CHAR(N) padding comp
I mentioned Hive. If it is important I can find out what Postgres does.


http://gerrit.cloudera.org:8080/#/c/18999/8/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/18999/8/common/thrift/Query.thrift@646
PS8, Line 646:
> nit: false
Done


http://gerrit.cloudera.org:8080/#/c/18999/8/fe/src/main/java/org/apache/impala/analysis/StatementBase.java
File fe/src/main/java/org/apache/impala/analysis/StatementBase.java:

http://gerrit.cloudera.org:8080/#/c/18999/8/fe/src/main/java/org/apache/impala/analysis/StatementBase.java@200
PS8, Line 200:
> This is no longer true, right?
Right, removed this from the comment.


http://gerrit.cloudera.org:8080/#/c/18999/8/fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
File fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java:

http://gerrit.cloudera.org:8080/#/c/18999/8/fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java@146
PS8, Line 146:
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
 :
> Calling this before analyze looks a bit convoluted.
Done


http://gerrit.cloudera.org:8080/#/c/18999/8/testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test
File 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test:

http://gerrit.cloudera.org:8080/#/c/18999/8/testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test@32
PS8, Line 32:
> Can you also add longer string that will be truncated?
If the value is longer than the destination field, it results in an error. I 
added a test case for that.



--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 9
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 04 Jul 2023 11:13:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18999/9/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/18999/9/be/src/service/query-options.h@53
PS9, Line 53:   TImpalaQueryOptions::VALUES_STMT_AVOID_LOSSY_CHAR_PADDING + 
1);   \
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/18999/9/be/src/service/query-options.h@298
PS9, Line 298:   VALUES_STMT_AVOID_LOSSY_CHAR_PADDING,  
TQueryOptionLevel::REGULAR) 
   \
line too long (129 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 9
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Tue, 04 Jul 2023 11:14:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-07-03 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 8:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/18999/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18999/8//COMMIT_MSG@8
PS8, Line 8:
   : If, in a VALUES clause, for the same column all of the values are 
CHAR
   : types but not all are of the same length, the common type chosen is
   : CHAR(max(lengths)). This means that shorter values are padded with
   : spaces. If the destination column is not CHAR but VARCHAR or 
STRING,
   : this produces different results than if the values in the column 
are
   : inserted individually, in separate statements. This behaviour is
   : suboptimal because information is lost.
An example could make this clearer.


http://gerrit.cloudera.org:8080/#/c/18999/8//COMMIT_MSG@17
PS8, Line 17: VALUES_STMT_NON_LOSSY_COMMON_TYPE
I would prefer to add CHAR to the name somehow as it is only applied for that 
type.
e.g. VALUES_STMT_AVOID_LOSSY_CHAR_PADDING


http://gerrit.cloudera.org:8080/#/c/18999/8//COMMIT_MSG@26
PS8, Line 26: We choose VARCHAR instead of STRING as the common type because 
VARCHAR
Can you add a note about already different behavior in CHAR(N) padding compared 
to Hive? Postgres could be also mentioned.


http://gerrit.cloudera.org:8080/#/c/18999/8/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/18999/8/common/thrift/Query.thrift@646
PS8, Line 646: 0
nit: false


http://gerrit.cloudera.org:8080/#/c/18999/8/fe/src/main/java/org/apache/impala/analysis/StatementBase.java
File fe/src/main/java/org/apache/impala/analysis/StatementBase.java:

http://gerrit.cloudera.org:8080/#/c/18999/8/fe/src/main/java/org/apache/impala/analysis/StatementBase.java@200
PS8, Line 200: his is only used
This is no longer true, right?


http://gerrit.cloudera.org:8080/#/c/18999/8/fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
File fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java:

http://gerrit.cloudera.org:8080/#/c/18999/8/fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java@146
PS8, Line 146: if (analyzer.getQueryCtx().client_request
 : .query_options.values_stmt_non_lossy_common_type) {
 :   // We suppress any AnalysisException thrown by 
castIfAllChars() because it is the
 :   // responsibility of super.analyze() to validate the query 
and report errors. Any
 :   // error that results in castIfAllChars() throwing an 
exception should also trigger
 :   // an exception in super.analyze(), but if there are 
multiple errors, the exception
 :   // thrown by castIfAllChars() may not correspond to the 
error first encountered by
 :   // super.analyze(). To keep error reporting consistent, we 
always choose the
 :   // exception thrown by super.analyze() and suppress 
exceptions thrown by
 :   // castIfAllChars().
 :   try {
 : castIfAllChars(analyzer);
 :   } catch (AnalysisException e) {
 : ex = e;
 :   }
 : }
 : super.analyze(analyzer);
Calling this before analyze looks a bit convoluted.

It would be nicer to do this during the usual analyses, e.g. create function in 
SetOperationStmt like use EnforceNonLossyCommonType() which could return true 
for ValuesStmt with values_stmt_non_lossy_common_type. This bool could be 
passed to castToSetOpCompatibleTypes() which looks to best place for this logic 
for me.


http://gerrit.cloudera.org:8080/#/c/18999/8/testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test
File 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test:

http://gerrit.cloudera.org:8080/#/c/18999/8/testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test@32
PS8, Line 32: (cast("12" as char(2)));
Can you also add longer string that will be truncated?
+ the same for varchars



--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Mon, 03 Jul 2023 13:10:59 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-06-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13449/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Fri, 30 Jun 2023 14:02:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-06-30 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#8). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

This patch adds the query option VALUES_STMT_NON_LOSSY_COMMON_TYPE
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we therefore would run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-common-type.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test
M tests/query_test/test_chars.py
10 files changed, 382 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/8
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 8
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-06-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/13438/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Thu, 29 Jun 2023 17:12:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-06-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 7:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/13431/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Thu, 29 Jun 2023 12:00:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-06-29 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#7). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

This patch adds the query option VALUES_STMT_NON_LOSSY_COMMON_TYPE
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we therefore would run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-common-type.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test
M tests/query_test/test_chars.py
10 files changed, 382 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/7
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 7
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-03-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12705/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Wed, 29 Mar 2023 08:39:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-03-29 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#6). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

This patch adds the query option VALUES_STMT_NON_LOSSY_COMMON_TYPE
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we therefore would run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-common-type.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test
M tests/query_test/test_chars.py
10 files changed, 383 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/6
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 6
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-01-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/12127/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 5
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Mon, 09 Jan 2023 17:15:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2023-01-09 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#5). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

This patch adds the query option VALUES_STMT_NON_LOSSY_COMMON_TYPE
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we therefore would run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-common-type.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test
M tests/query_test/test_chars.py
10 files changed, 383 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/5
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 5
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2022-09-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11417/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 4
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Fri, 23 Sep 2022 12:38:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2022-09-23 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#4). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

This patch adds the query option VALUES_STMT_NON_LOSSY_COMMON_TYPE
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we therefore would run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-common-type.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test
M tests/query_test/test_chars.py
10 files changed, 382 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/4
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 4
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2022-09-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11411/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 2
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Thu, 22 Sep 2022 14:43:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2022-09-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11410/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 3
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 
Gerrit-Comment-Date: Thu, 22 Sep 2022 14:43:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2022-09-22 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

This patch adds the query option VALUES_STMT_NON_LOSSY_COMMON_TYPE
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we therefore would run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-common-type.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test
M tests/query_test/test_chars.py
10 files changed, 375 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/2
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 2
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa 


[Impala-ASF-CR] IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

2022-09-22 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#3). ( 
http://gerrit.cloudera.org:8080/18999 )

Change subject: IMPALA-10753: Incorrect length when multiple CHAR(N) values are 
inserted
..

IMPALA-10753: Incorrect length when multiple CHAR(N) values are inserted

If, in a VALUES clause, for the same column all of the values are CHAR
types but not all are of the same length, the common type chosen is
CHAR(max(lengths)). This means that shorter values are padded with
spaces. If the destination column is not CHAR but VARCHAR or STRING,
this produces different results than if the values in the column are
inserted individually, in separate statements. This behaviour is
suboptimal because information is lost.

This patch adds the query option VALUES_STMT_NON_LOSSY_COMMON_TYPE
which, when set to true, fixes the problem by implicitly casting the
values to the VARCHAR type of the longest value if all values in a
column are CHAR types AND not all have the same length. This VARCHAR
type will be the common type of the column in the VALUES statement.

The new behaviour is not turned on by default because it is a breaking
change.

We choose VARCHAR instead of STRING as the common type because VARCHAR
can be converted to any VARCHAR type shorter or the same length and also
to STRING, while STRING cannot safely be converted to VARCHAR because
its length is not bounded - we therefore would run into problems if the
common type were STRING and the destination column were VARCHAR.

Note: although the VALUES statement is implemented as a special UNION
operation under the hood, this patch doesn't change the behaviour of
explicit UNION statements, it only applies to VALUES statements.

Testing:
 - Added tests verifying that unneeded padding doesn't occur and the
   queries succeed in various situations, e.g. different destination
   column types and multi-column inserts. See
   
testdata/workloads/functional-query/queries/QueryTest/chars-values-clause.test

Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
---
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/StatementBase.java
M fe/src/main/java/org/apache/impala/analysis/ValuesStmt.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-lossy-common-type.test
A 
testdata/workloads/functional-query/queries/QueryTest/chars-values-stmt-non-lossy-common-type.test
M tests/query_test/test_chars.py
10 files changed, 374 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/18999/3
--
To view, visit http://gerrit.cloudera.org:8080/18999
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9e9e189cb3c2be0e741ca3d15a7f97ec3a1b1a86
Gerrit-Change-Number: 18999
Gerrit-PatchSet: 3
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Peter Rozsa