[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase/hbase-scan-node.cc
M be/src/exec/hbase/hbase-scan-node.h
M be/src/exec/hbase/hbase-table-writer.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/rcfile/hdfs-rcfile-scanner.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 26: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Aug 2022 13:55:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 22:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256: return col_descs_[slot_desc->col_path().back()];
> Change the analyzer to disallow complex types in select list if they have b
Thanks for digging into this!

Sorry that my initial confusion is that "col_descs_" are the top-level columns 
of the table, but the last item in SchemaPath is not always the top-level 
column index. Usually the first item of SchemaPath is the top-level column 
index, and the next items are the index inside the nested type. E.g. the 6th 
column in table complextypestbl is

 nested_struct  struct<
   a: int,
   b: array,
   ...
 >

If the query selects "nested_struct.a" in the SelectList, the corresponding 
SchemaPath is [5, 0]. Here [5] is the SchemaPath of "nested_struct". But we are 
using 0 (the last item) here as the index of col_descs_. So I hope we can add a 
test of selecting the binary column directly inside a struct top level 
column.

Maybe I've missed something. Just explaning my confusion.


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116
PS14, Line 116:   private static boolean isLikeableType(Type type) {
> :D poor other types - tbh I don't remember whether the naming was intention
haha


http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py@222
PS22, Line 222: 'BINARY': 'bytes'
nit: it'd be nice to add a trailing comma so future changes don't need to touch 
this line.


http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql@3532
PS22, Line 3532: binary_in_complex_types
Can we add some data to this table and add some e2e tests? e.g.

 select binary_member_col.b from binary_in_complex_types;
 select a.item from binary_in_complex_types t, t.binary_item_col;
 select m.key from binary_in_complex_types t, t.binary_key_col;
 select m.value from binary_in_complex_types t, t.binary_value_col;



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Aug 2022 09:43:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#23).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 23:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py
File testdata/bin/generate-schema-statements.py:

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py@222
PS22, Line 222: 'BINARY': 'bytes'
> nit: it'd be nice to add a trailing comma so future changes don't need to t
Done


http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql@3532
PS22, Line 3532: binary_in_complex_types
> Can we add some data to this table and add some e2e tests? e.g.
Added minimal data and these simple tests.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Aug 2022 12:20:43 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#24).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 24:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256: return col_descs_[slot_desc->col_path().back()];
> Thanks for digging into this!
It turned out that while function would not work for nested columns, we do not 
use it at any place where this could be a problem. Changed the dcheck to ensure 
that the type is not nested (the path's size is 1).



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Aug 2022 14:36:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 23:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@69
PS23, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@89
PS23, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@100
PS23, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@113
PS23, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@123
PS23, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/23/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/23/tests/query_test/test_udfs.py@116
PS23, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Aug 2022 12:15:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 24:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@69
PS24, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@89
PS24, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@100
PS24, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@113
PS24, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@123
PS24, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/24/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/24/tests/query_test/test_udfs.py@116
PS24, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Aug 2022 14:33:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 24:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11184/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Aug 2022 14:53:48 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 23:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11181/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 23
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Aug 2022 12:35:30 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 24: Code-Review+1

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc@477
PS24, Line 477:   || type.type == TYPE_CHAR) {
Could you add a comment about why we don't need TYPE_BINARY here?


http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h@256
PS24, Line 256: back()
nit: front() or [0] seems better (as line 248 uses). Same comment to line 255.


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
File java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java:

http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@40
PS24, Line 40: import org.apache.hadoop.io.Text;
duplicated line


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@319
PS24, Line 319: String
BytesWritable


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@321
PS24, Line 321: get
nit: get() is deprecated by getBytes().


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@326
PS24, Line 326: result
nit: handle null 'result' when 'inputs' is empty.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 24
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Aug 2022 02:30:29 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#25).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase/hbase-scan-node.cc
M be/src/exec/hbase/hbase-scan-node.h
M be/src/exec/hbase/hbase-table-writer.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/rcfile/hdfs-rcfile-scanner.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#26).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase/hbase-scan-node.cc
M be/src/exec/hbase/hbase-scan-node.h
M be/src/exec/hbase/hbase-table-writer.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/rcfile/hdfs-rcfile-scanner.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exec/text/hdfs-text-scanner.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 26: Code-Review+2

Thanks for the great work!


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Aug 2022 08:47:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 26:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8473/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Aug 2022 09:07:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 26:

(6 comments)

Thanks for the comments!
PS 25 was a rebase that included https://gerrit.cloudera.org/#/c/18815/ , which 
is a major refactoring that moved several h/cc files.

PS 26 tries to fix the comments.

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc
File be/src/exec/orc-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc@477
PS24, Line 477:
> Could you add a comment about why we don't need TYPE_BINARY here?
added a comment - note that file was moved during the rebase


http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h@256
PS24, Line 256: 0]];
> nit: front() or [0] seems better (as line 248 uses). Same comment to line 2
Done


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java
File java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java:

http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@40
PS24, Line 40: import org.apache.hadoop.io.Writable;
> duplicated line
Done


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@319
PS24, Line 319: t " +
> BytesWritable
Done


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@321
PS24, Line 321: get
> nit: get() is deprecated by getBytes().
Done


http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@326
PS24, Line 326: nary.s
> nit: handle null 'result' when 'inputs' is empty.
Done



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Aug 2022 08:16:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 25:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/25/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/25/tests/query_test/test_udfs.py@116
PS25, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Aug 2022 06:52:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 25:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11189/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 25
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Aug 2022 07:12:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 26:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/26/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/26/tests/query_test/test_udfs.py@116
PS26, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Aug 2022 08:13:01 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 26:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11190/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 26
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Aug 2022 08:35:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 22:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11136/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 11 Aug 2022 10:05:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 21:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11135/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 11 Aug 2022 09:57:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-11 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 22:

(1 comment)

Some tests were broken by recent changes related fixing comments, ps 21 fixes 
those (and ps 22 fixes some whitespace issues)

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@357
PS14, Line 357: Preconditions.checkState(!childType.isNull());
> Added a precondition to ensure that if one is binary, the other must be str
This change actually broke a number of tests, for example when casting NULL to 
binary or tests the rely on an exact error message about the invalid cast.

Changed the logic to do noop cast only in case of binary<->string cast, and let 
the subsequent logic catch the error that there is no casting functions to/from 
other types.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 11 Aug 2022 09:47:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 22:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@69
PS22, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@89
PS22, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@100
PS22, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@113
PS22, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@123
PS22, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/22/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/22/tests/query_test/test_udfs.py@116
PS22, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 22
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 11 Aug 2022 09:46:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-11 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#22).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 21:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@368
PS21, Line 368:   noOp_ = true;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@369
PS21, Line 369:   return;
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@370
PS21, Line 370: }
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@4278
PS21, Line 4278:"cast(string_col as binary), bool_col, " +
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@632
PS21, Line 632: // NDV is not calculated for BINARY columns
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@633
PS21, Line 633: 
assertFalse(binaryCol.getStats().hasNumDistinctValues());
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
File fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java@243
PS21, Line 243: stats.get(0).toString());
tab used for whitespace


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@69
PS21, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@89
PS21, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@100
PS21, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@113
PS21, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@123
PS21, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/21/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/21/tests/query_test/test_udfs.py@116
PS21, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 21
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 11 Aug 2022 09:36:04 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-11 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#21).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 20:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11132/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:37:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 19:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11131/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:36:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-10 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256: return col_descs_[slot_desc->col_path().back()];
> Checked how Hive handles this, it seems to utf8 encode the nested binary wi
Change the analyzer to disallow complex types in select list if they have 
binary members. Created follow up Jira IMPALA-11491

Also added a test tables + FE tests to check this behavior.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:12:17 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-10 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#20).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-10 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#19).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 20:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@69
PS20, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@89
PS20, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@100
PS20, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@113
PS20, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@123
PS20, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/20/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/20/tests/query_test/test_udfs.py@116
PS20, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 20
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:10:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 19:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/16066/19/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/19/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@297
PS19, Line 297: // Returning complex types with BINARY in select list is 
not yet implemented (IMPALA-11491).
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@69
PS19, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@89
PS19, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@100
PS19, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@113
PS19, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@123
PS19, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/19/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/19/tests/query_test/test_udfs.py@116
PS19, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 19
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 10 Aug 2022 18:07:31 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-05 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 18:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256: return col_descs_[slot_desc->col_path().back()];
> Hmm, I realized that I have to think nested binaries through more carefully
Checked how Hive handles this, it seems to utf8 encode the nested binary with 
replace:
select named_struct("b", unhex("00112233445566778899AABBCCDDEEFF"));
result: {"b":"3DUfw}

select array(unhex("00112233445566778899AABBCCDDEEFF"));
result: ["3DUfw]

A weird behavior around nested binary is that it is not quoted (unlike string):
select named_struct("s", "a", "b", cast("a" as binary));
{"s":"a","b":a}

I have also checked what happens when we read these back from Impala:
- if we return the whole struct, the non-valid utf8 binary is replaced with an 
empty string
- if we return the whole array, it is returned escaped:
"\000\021\"3DUfw"]
- if the member/item is returned directly, we print it the same way as Hive
- if we hex() the result back, it returns it correctly, so Hive does not write 
it in a lossy way

I am thinking about not allowing nested binary for now, and enable them once 
the struct/array printing functions are unified + I have verified whether not 
quoting them is intentional in Hive.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 05 Aug 2022 11:22:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 18:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11105/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 05 Aug 2022 11:04:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-05 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 18:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@872
PS14, Line 872:   AuxColumnType aux_type(columnType);
> Done
Reverted this, as while gcc compiled it, clang had issue I could not solve. 
This is strange to me, as case TYPE_DECIMAL: also has some local variables, but 
I did not want to dive too deeply.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 05 Aug 2022 10:46:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 18:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@69
PS18, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@89
PS18, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@100
PS18, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@113
PS18, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@123
PS18, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/18/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/18/tests/query_test/test_udfs.py@116
PS18, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 05 Aug 2022 10:44:48 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-05 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#18).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 16:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/11100/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Aug 2022 19:28:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-04 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 17:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256: return col_descs_[slot_desc->col_path().back()];
> Added tests with binary in struct, binary in array was only tested manually
Hmm, I realized that I have to think nested binaries through more carefully and 
check how Hive deals with them - the issue is that complex types are expected 
to be returned as utf8 strings, and I don't know what to do with a nested 
BINARY with special characters - probably they are handled by escaping them.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 04 Aug 2022 15:18:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-03 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#17).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 17:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@69
PS17, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@89
PS17, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@100
PS17, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@113
PS17, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@123
PS17, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/17/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/17/tests/query_test/test_udfs.py@116
PS17, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 17
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 03 Aug 2022 18:16:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-03 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 14:

(11 comments)

Ps 15 is rebase + conflict resolution, ps 16 contains fixes for the comments.

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc@161
PS14, Line 161:   data_len = encoded_value.length();
> I just realized we support writing to HBase. It worths adding tests for thi
Thanks for the comment, it turned out that the writer was buggy as 
'base64_encoded_value' went out of scope while 'data' kept a pointer to its 
buffer.

Added a binary column to insertalltypesagg/insertalltypesaggbinary (only used 
by HBase tests) and added few tests to hbase-inserts.test

Also checked manually that Hive can read our data.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc
File be/src/exec/text-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc@359
PS14, Line 359:   return auxType.string_subtype != 
AuxColumnType::StringSubtype::BINARY;
> Add JIRA for codegen on reading BINARY from text files? I guess we just nee
Created a jira.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256: return col_descs_[slot_desc->col_path().back()];
> Not sure if this is correct for nested types. Can we add tests on BINARY in
Added tests with binary in struct, binary in array was only tested manually so 
far.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc@141
PS14, Line 141:   return TPrimitiveType::BINARY;
> Should we return INVALID_TYPE here because of DCHECK(false) so that we catc
Agree with returning INVALID_TYPE to improve error handling in release builds.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@872
PS14, Line 872:   AuxColumnType aux_type(columnType);
> If this is only used in the TYPE_STRING, perhaps it should be initialized t
Done


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@950
PS14, Line 950:   type_entry.__set_type(thrift::TTypeId::STRING_TYPE);
> If DCHECK is false, do we want to set this to STRING_TYPE?  Or maybe we sho
There is no "invalid type" concept in the HS2 thrift api + this field is 
required, so I don't see a better idea here (maybe we could return some status 
in case of error, but that seems a bit too much effort for the unlikely case of 
not catching this scenario in debug builds.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc@212
PS14, Line 212: return "string";
> How come we don't return TypeToOdbcString here?
What do you mean for inlineing? It is also used in some tests, so we can't keep 
it in this file.

About returning TypeToOdbcString - while odbc does support binary type, my 
guess is that existing beeswax clients don't, so it seemed safer to return 
"string"


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@187
PS14, Line 187: // No built-in function needed for BINARY <-> STRING 
conversion.
> I think the comment should also reflect other datatypes versus binary are n
Done


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@357
PS14, Line 357: boolean isStringBinaryCast =
> I may be understanding this wrong...
Added a precondition to ensure that if one is binary, the other must be string.


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116
PS14, Line 116:   private static boolean isLikeableType(Type type) {
> No real comment here. But I feel the other types might get upset for not be
:D poor other types - tbh I don't remember whether the naming was intentional


http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test
File 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 16:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/11087/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 03 Aug 2022 17:55:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 16:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@69
PS16, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@89
PS16, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@100
PS16, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@113
PS16, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@123
PS16, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/16/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/16/tests/query_test/test_udfs.py@116
PS16, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 16
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 03 Aug 2022 17:46:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-03 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#16).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 15:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/11085/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 03 Aug 2022 13:51:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 15:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py
File tests/hs2/test_fetch.py:

http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@69
PS15, Line 69: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@89
PS15, Line 89: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@100
PS15, Line 100: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@113
PS15, Line 113: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@123
PS15, Line 123: "
flake8: E126 continuation line over-indented for hanging indent


http://gerrit.cloudera.org:8080/#/c/16066/15/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/15/tests/query_test/test_udfs.py@116
PS15, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 15
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 03 Aug 2022 13:30:46 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-08-03 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Steve Carlin, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#15).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exec/parquet/parquet-metadata-utils.h
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-07-13 Thread Steve Carlin (Code Review)
Steve Carlin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 14:

(7 comments)

Hopefully these comments are helpful.  Logic seems good, these are all pretty 
much nits.

Thanks!

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc@141
PS14, Line 141:   return TPrimitiveType::BINARY;
Should we return INVALID_TYPE here because of DCHECK(false) so that we catch 
the problem earlier on?


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@872
PS14, Line 872:   AuxColumnType aux_type(columnType);
If this is only used in the TYPE_STRING, perhaps it should be initialized there.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@950
PS14, Line 950:   type_entry.__set_type(thrift::TTypeId::STRING_TYPE);
If DCHECK is false, do we want to set this to STRING_TYPE?  Or maybe we should 
set it to INVALID_TYPE?  Not sure what is best.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc@212
PS14, Line 212: return "string";
How come we don't return TypeToOdbcString here?

(sidenote, I'm rusty on my C++, but why is TypeToOdbcString not inlined?)


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@187
PS14, Line 187: // No built-in function needed for BINARY <-> STRING 
conversion.
I think the comment should also reflect other datatypes versus binary are not 
in the support matrix.


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@357
PS14, Line 357: boolean isStringBinaryCast =
I may be understanding this wrong...

This is slightly inconsistent with the above code.  In the above code, we 
assume that if one is binary, the other must be either binary or string.  Do we 
need to check explicitly here that the other type is string?  I understand why, 
but then maybe we should have a Preconditions check sayin git is either of type 
string or binary?

Not sure what is best here.


http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java:

http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116
PS14, Line 116:   private static boolean isLikeableType(Type type) {
No real comment here. But I feel the other types might get upset for not being 
liked.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Steve Carlin 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 14 Jul 2022 00:01:00 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-06-14 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 14:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc@161
PS14, Line 161:   data_len = encoded_value.length();
I just realized we support writing to HBase. It worths adding tests for this.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc
File be/src/exec/text-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc@359
PS14, Line 359:   return auxType.string_subtype != 
AuxColumnType::StringSubtype::BINARY;
Add JIRA for codegen on reading BINARY from text files? I guess we just need 
the codegen of base64 decoding.


http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256
PS14, Line 256: return col_descs_[slot_desc->col_path().back()];
Not sure if this is correct for nested types. Can we add tests on BINARY inside 
nested types?


http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test
File testdata/workloads/functional-query/queries/QueryTest/binary-type.test:

http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test@68
PS14, Line 68:   where binary_col_with_nulls = cast("01/02/09" as binary)
This is identical to the above query.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 15 Jun 2022 03:17:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-06-13 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 14:

> (4 comments)
 >
 > Thanks for reviving this patch! I just went throught the FE changes
 > and some tests and they LGTM. I'm going to dig into more details.
 >
 > Not sure if I missed it, but do we have tests on writing BINARY
 > columns? Assuming the data loading of functional_parquet.binary_tbl
 > is the test on writing BINARY to parquet, we still need test on
 > writing BINARY to text files.

At first I thought that with the new table writing to text is covered now, but 
then realized that LOAD always uses Hive. Will add some table writing tests in 
the next patch.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 13 Jun 2022 19:12:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-06-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10761/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 13 Jun 2022 18:59:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-06-13 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 14:

(4 comments)

PS 14 extended the testdata and EE tests and also added some new functions: 
concat, murmur_hash (this was needed for multiple count distinct).

In the next patch I plan to fix the older comments but Gabor and Tim

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@44
PS13, Line 44: import org.apache.impala.catalog.Type;
> nit: unused import?
Done


http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java
File fe/src/main/java/org/apache/impala/catalog/Type.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java@681
PS13, Line 681: if
> nit: need one space after "if"
Done


http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java@409
PS13, Line 409: RewritesOk("cast(concat('a', 'b') as binary)", rule, 
"'ab'");
> nit: Can we also test concat() on binary literals? i.e.
The concat for binary was not added yet. Added it in PS 14 + some tests in 
expr-test.cc


http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test
File testdata/workloads/functional-query/queries/QueryTest/udf.test:

http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test@236
PS13, Line 236: bytes(
> bytes()? in case of utf8_mode=true.
Done



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 13 Jun 2022 18:42:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-06-13 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#14).

Change subject: IMPALA-9482: Support for BINARY columns
..

IMPALA-9482: Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java
M 

[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns

2022-06-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482: Support for BINARY columns
..


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/14/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/14/tests/query_test/test_udfs.py@116
PS14, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 14
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 13 Jun 2022 18:38:29 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-13 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 13:

(4 comments)

Thanks for reviving this patch! I just went throught the FE changes and some 
tests and they LGTM. I'm going to dig into more details.

Not sure if I missed it, but do we have tests on writing BINARY columns? 
Assuming the data loading of functional_parquet.binary_tbl is the test on 
writing BINARY to parquet, we still need test on writing BINARY to text files.

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@44
PS13, Line 44: import org.apache.impala.catalog.PrimitiveType;
nit: unused import?


http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java
File fe/src/main/java/org/apache/impala/catalog/Type.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java@681
PS13, Line 681: if
nit: need one space after "if"


http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
File fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java@409
PS13, Line 409: RewritesOk("cast(concat('a', 'b') as binary)", rule, 
"'ab'");
nit: Can we also test concat() on binary literals? i.e.

 concat(cast('a' as binary), cast('b' as binary))


http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test
File testdata/workloads/functional-query/queries/QueryTest/udf.test:

http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test@236
PS13, Line 236: length
bytes()? in case of utf8_mode=true.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 13 Jun 2022 08:29:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10753/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 10 Jun 2022 19:00:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/13/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/13/tests/query_test/test_udfs.py@116
PS13, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 10 Jun 2022 18:42:00 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-10 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#13).

Change subject: IMPALA-9482 Support for BINARY columns
..

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M 

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 12: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 07 Jun 2022 19:36:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 12:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10721/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:31:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/12/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/12/tests/query_test/test_udfs.py@116
PS12, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:13:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8194/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:13:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-07 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 12:

PS 12 adds support for Hive UDFs.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 07 Jun 2022 15:13:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-07 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#12).

Change subject: IMPALA-9482 Support for BINARY columns
..

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java
M 

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 11: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8171/


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 01 Jun 2022 20:48:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 11:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10675/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 01 Jun 2022 16:34:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-01 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 11:

Ps 11 fixed length() behavior when utf8_mode=1 and added BE tests for functions 
and predicates on BINARY.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 01 Jun 2022 16:23:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8171/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 01 Jun 2022 16:23:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/11/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/11/tests/query_test/test_udfs.py@116
PS11, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 01 Jun 2022 16:15:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-06-01 Thread Csaba Ringhofer (Code Review)
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim 
Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#11).

Change subject: IMPALA-9482 Support for BINARY columns
..

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M 

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10521/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 May 2022 15:51:23 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/9/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/9/tests/query_test/test_udfs.py@116
PS9, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 May 2022 15:33:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-05-02 Thread Csaba Ringhofer (Code Review)
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala 
Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#9).

Change subject: IMPALA-9482 Support for BINARY columns
..

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M 

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 8: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8073/


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 May 2022 13:51:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10519/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 May 2022 09:45:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16066/8/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/8/tests/query_test/test_udfs.py@116
PS8, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 May 2022 09:26:31 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8073/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 02 May 2022 09:26:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2022-05-02 Thread Csaba Ringhofer (Code Review)
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala 
Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#8).

Change subject: IMPALA-9482 Support for BINARY columns
..

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/file-metadata-utils.cc
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M 

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-09-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 7:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG
Commit Message:

PS7:
I think you're missing some handling for the parquet_annotate_strings_utf8 
option - it looks like we might set this for BINARY types, which is wrong.


http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG@29
PS7, Line 29: - HS2/Beeswax service
Do we need to update clients? Looks like impala-shell might have some logic for 
HS2 but not sure that it's tested.


http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG@55
PS7, Line 55: - Added FE/EE tests mainly based on the ones added to the DATE 
type
Can we add some more sanity tests for different parts of query processing. I 
think the fact that it's handled as string internally reduces the amount of 
testing required, but some sanity checks would be good. E.g. a join on BINARY, 
a group by on aggregation and a sort by BINARY?


http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG@56
PS7, Line 56: - Ran exhaustive tests.
Can we also add a shell test?


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hbase-scan-node.h
File be/src/exec/hbase-scan-node.h:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hbase-scan-node.h@84
PS7, Line 84:   /// Descriptor of the HBase table.
Can you mention it's set in Prepare()


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hdfs-text-table-writer.cc@107
PS7, Line 107:   // TODO: try to find a more efficient imlementation
Maybe remove the TODO? Or file a JIRA? Not sure that it adds much value here.


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.h
File be/src/exec/text-converter.h:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.h@65
PS7, Line 65: auxType
aux_type here and elsewhere


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc
File be/src/exec/text-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc@353
PS7, Line 353: bool TextConverter::SupportsCodegenWriteSlot(const ColumnType& 
col_type,
I don't love adding more logic without corresponding codegen, we spent so much 
time filling in gaps in the past and it feels bad to add more back in.


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/runtime/descriptors.h
File be/src/runtime/descriptors.h:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/runtime/descriptors.h@209
PS7, Line 209: (
aux_type


http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@259
PS7, Line 259: // NDV is not calculated for BINARY columns (similarly 
to Hive).
This seems kinda arbitrary - is there a real reason not to calculate it? I 
guess not a big deal.


http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java:

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116
PS7, Line 116:   private static boolean isLikeableType(Type type) {
lol at the function name (no need to change).


http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java
File fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java@192
PS7, Line 192: C
nit: Column


http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_scanners.py@1611
PS7, Line 1611:   def add_test_dimensions(cls):
It'd be good to run this for HS2 as well.



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 04 Sep 2020 04:20:59 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-09-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 7:

Let me know when I should take another look


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 01 Sep 2020 23:20:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-26 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc
File be/src/exec/text-converter.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc@355
PS7, Line 355: switch (col_type.type) {
 : case TYPE_CHAR:
 :   return false;
 : case TYPE_STRING:
 :   return auxType.string_subtype != 
AuxColumnType::StringSubtype::BINARY;
 : default:
 :   return true;
This could be simplified a bit in my opinion:
if (col_type.type == TYPE_STRING) return auxType... != BINARY;
return col_type.type != TYPE_CHAR;


http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exprs/utility-functions-ir.cc
File be/src/exprs/utility-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exprs/utility-functions-ir.cc@209
PS7, Line 209: const StringVal& /*input_val*/
Is this part needed?



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 26 Aug 2020 09:20:16 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-25 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 7:

(1 comment)

I only checked the FE part of the patch. Will continue with the rest tomorrow.

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java:

http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@989
PS7, Line 989: // BINARY is only supported for count + min + max.
Lucky that those were the first 3 operators in the for-loop's body.
What if later someone adds another function above this line? Let's consider 
adding these functions for binary after the for loop similarly to ds_hll_union 
(or avg).



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 25 Aug 2020 15:10:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6920/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 13 Aug 2020 19:25:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/service/hs2-util.cc@273
PS7, Line 273: static void 
StringExprValuesToHS2TColumnHelper(ScalarExprEvaluator* expr_eval,
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_udfs.py@116
PS7, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 13 Aug 2020 19:05:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-13 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h
File be/src/exec/text-converter.h:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@63
PS2, Line 63:
> Could you mention the new param in the comment above?
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@88
PS2, Line 88: hat cont
> Could you mention the new param in the comment above?
Done


http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc@273
PS6, Line 273: static void 
StringExprValuesToHS2TColumnHelper(ScalarExprEvaluator* expr_eval,
> line too long (95 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@624
PS6, Line 624: HdfsTable binaryTable = (HdfsTable) 
catalog_.getOrLoadTable("functional",
> line too long (91 > 90)
Done



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 13 Aug 2020 19:05:27 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-13 Thread Csaba Ringhofer (Code Review)
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala 
Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#7).

Change subject: IMPALA-9482 Support for BINARY columns
..

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M bin/rat_exclude_files.txt
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M 

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-13 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 6:

(23 comments)

Thanks for the comments and sorry for the huge time before resolving them!

Patch 3 is just a rebase with some conflict resolution.

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@18
PS2, Line 18: - No NDV is calculated during COMPUTE STATISTICS.
> Are we planning to change this? I think implicit casts from BINARY to STRIN
I have changed the casting rules to always need explicit casting between STRING 
and BINARY. Added more info in the commit message (line 42).


http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@47
PS2, Line 47: INSERT
> Do you plan to do interop tests between Hive and Impala?
We generally don't seem to have such tests, so I didn't add it here. I think it 
would be great to add some simple tests that test all types and fileformats 
with writer as Hive and reader as Impala + vica versa, but I would do this in 
jira.

Note that for some file formats the Hive writer -> Impala reader path is tested 
as Hive writes the tables during dataload.


http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@52
PS2, Line 52:   to test scanning.
> I think we want a lot more coverage in analysis tests, to make sure that we
I have added much more FE tests (similar ones as Attila added for the Date 
type) + a few to specifically check casts.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc
File be/src/exec/hbase-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc@49
PS2, Line 49: null
> nit: nullptr? I see there's a lot of NULLs in the file, so maybe just updat
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc
File be/src/exec/hbase-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@157
PS2, Line 157: if (col_desc.auxType().IsBinaryStringSubtype()) {
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@159
PS2, Line 159: Base64Encode
> Why do we encode it for HBase?
This how Hive does it, BINARY columns are base64 encoded as in text files.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc
File be/src/exec/hdfs-rcfile-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc@593
PS2, Line 593: const AuxColumnType& aux_type =
 : scan_node_->hdfs_table()->GetColumnDesc(slot_desc).
> Yeah I wondered about this too. I think we should consider this in the cont
Yes, I was thinking similarly as what Tim wrote - it would be much nicer if 
string encoding could be handled mainly in the frontend and the backend would 
just need to think about byte arrays. Probably VARCHAR could be also handled 
this way, as AFAIK most code do not enforce the length limit, only 
scanners/writers/client handlers need to deal with it.

Another type that could probably profit from AuxType is timestamp, if we'll 
implement TIMESTAMP WITH (LOCAL) TIMEZONE. It should be probably the same data 
type underneath, but the frontend could pick different functions where the 
conversion matters + scanners would also need the info as different files may 
be written in different timezones.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc
File be/src/exec/hdfs-text-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@103
PS2, Line 103: const ColumnDescriptor& col_desc =
> line too long (96 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@105
PS2, Line 105: const StringValue* string_value = 
reinterpret_cast(value);
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@107
PS2, Line 107:   // TODO: try to find a more efficient imlementation
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h
File be/src/runtime/types.h:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h@67
PS2, Line 67: Aux
> nit: 'Aux' is not too descriptive. Probably EncodedColumnType/StoredColumnT
I didn't change it in the last patch to avoid the additional noise, but I am 
not attached to the current name.


http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@122
PS2, Line 122: ToThrift(PrimitiveType ptype)
> Should this function work as the inverse of ThriftToType() ?
It makes sense to create a version which also uses AuxType, but this was 

[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 6:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/6919/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 13 Aug 2020 18:21:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc@273
PS6, Line 273: static void 
StringExprValuesToHS2TColumnHelper(ScalarExprEvaluator* expr_eval, RowBatch* 
batch,
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java
File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@624
PS6, Line 624: HdfsTable binaryTable = (HdfsTable) 
catalog_.getOrLoadTable("functional", "binary_tbl",
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java
File fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java:

http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java@433
PS6, Line 433: "WITH SERDEPROPERTIES 
('hbase.columns.mapping'=':key,d:string_col,d:binary_col', " +
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/16066/6/tests/query_test/test_udfs.py
File tests/query_test/test_udfs.py:

http://gerrit.cloudera.org:8080/#/c/16066/6/tests/query_test/test_udfs.py@116
PS6, Line 116: y
flake8: E501 line too long (92 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 13 Aug 2020 18:00:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns

2020-08-13 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has uploaded a new patch set (#6). ( 
http://gerrit.cloudera.org:8080/16066 )

Change subject: IMPALA-9482 Support for BINARY columns
..

IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type
- Ran exhaustive tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/exprs/utility-functions.h
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java
M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java
M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java
M