[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Reviewed-on: http://gerrit.cloudera.org:8080/16066 Reviewed-by: Quanlong Huang Tested-by: Impala Public Jenkins --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase/hbase-scan-node.cc M be/src/exec/hbase/hbase-scan-node.h M be/src/exec/hbase/hbase-table-writer.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/rcfile/hdfs-rcfile-scanner.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exec/text/hdfs-text-scanner.cc M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 26: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 26 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Aug 2022 13:55:41 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 22: (4 comments) http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256 PS14, Line 256: return col_descs_[slot_desc->col_path().back()]; > Change the analyzer to disallow complex types in select list if they have b Thanks for digging into this! Sorry that my initial confusion is that "col_descs_" are the top-level columns of the table, but the last item in SchemaPath is not always the top-level column index. Usually the first item of SchemaPath is the top-level column index, and the next items are the index inside the nested type. E.g. the 6th column in table complextypestbl is nested_struct struct< a: int, b: array, ... > If the query selects "nested_struct.a" in the SelectList, the corresponding SchemaPath is [5, 0]. Here [5] is the SchemaPath of "nested_struct". But we are using 0 (the last item) here as the index of col_descs_. So I hope we can add a test of selecting the binary column directly inside a struct top level column. Maybe I've missed something. Just explaning my confusion. http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java: http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116 PS14, Line 116: private static boolean isLikeableType(Type type) { > :D poor other types - tbh I don't remember whether the naming was intention haha http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py File testdata/bin/generate-schema-statements.py: http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py@222 PS22, Line 222: 'BINARY': 'bytes' nit: it'd be nice to add a trailing comma so future changes don't need to touch this line. http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql@3532 PS22, Line 3532: binary_in_complex_types Can we add some data to this table and add some e2e tests? e.g. select binary_member_col.b from binary_in_complex_types; select a.item from binary_in_complex_types t, t.binary_item_col; select m.key from binary_in_complex_types t, t.binary_key_col; select m.value from binary_in_complex_types t, t.binary_value_col; -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 22 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Aug 2022 09:43:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#23). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 23: (2 comments) http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py File testdata/bin/generate-schema-statements.py: http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/bin/generate-schema-statements.py@222 PS22, Line 222: 'BINARY': 'bytes' > nit: it'd be nice to add a trailing comma so future changes don't need to t Done http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/16066/22/testdata/datasets/functional/functional_schema_template.sql@3532 PS22, Line 3532: binary_in_complex_types > Can we add some data to this table and add some e2e tests? e.g. Added minimal data and these simple tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 23 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Aug 2022 12:20:43 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#24). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 24: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256 PS14, Line 256: return col_descs_[slot_desc->col_path().back()]; > Thanks for digging into this! It turned out that while function would not work for nested columns, we do not use it at any place where this could be a problem. Changed the dcheck to ensure that the type is not nested (the path's size is 1). -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 24 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Aug 2022 14:36:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 23: (6 comments) http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@69 PS23, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@89 PS23, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@100 PS23, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@113 PS23, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/23/tests/hs2/test_fetch.py@123 PS23, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/23/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/23/tests/query_test/test_udfs.py@116 PS23, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 23 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Aug 2022 12:15:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 24: (6 comments) http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@69 PS24, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@89 PS24, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@100 PS24, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@113 PS24, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/24/tests/hs2/test_fetch.py@123 PS24, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/24/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/24/tests/query_test/test_udfs.py@116 PS24, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 24 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Aug 2022 14:33:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 24: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11184/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 24 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Aug 2022 14:53:48 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 23: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11181/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 23 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 18 Aug 2022 12:35:30 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 24: Code-Review+1 (6 comments) http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc File be/src/exec/orc-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc@477 PS24, Line 477: || type.type == TYPE_CHAR) { Could you add a comment about why we don't need TYPE_BINARY here? http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h@256 PS24, Line 256: back() nit: front() or [0] seems better (as line 248 uses). Same comment to line 255. http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java File java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java: http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@40 PS24, Line 40: import org.apache.hadoop.io.Text; duplicated line http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@319 PS24, Line 319: String BytesWritable http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@321 PS24, Line 321: get nit: get() is deprecated by getBytes(). http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@326 PS24, Line 326: result nit: handle null 'result' when 'inputs' is empty. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 24 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Aug 2022 02:30:29 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#25). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase/hbase-scan-node.cc M be/src/exec/hbase/hbase-scan-node.h M be/src/exec/hbase/hbase-table-writer.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/rcfile/hdfs-rcfile-scanner.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exec/text/hdfs-text-scanner.cc M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#26). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase/hbase-scan-node.cc M be/src/exec/hbase/hbase-scan-node.h M be/src/exec/hbase/hbase-table-writer.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/rcfile/hdfs-rcfile-scanner.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exec/text/hdfs-text-scanner.cc M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 26: Code-Review+2 Thanks for the great work! -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 26 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Aug 2022 08:47:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 26: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8473/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 26 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Aug 2022 09:07:18 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 26: (6 comments) Thanks for the comments! PS 25 was a rebase that included https://gerrit.cloudera.org/#/c/18815/ , which is a major refactoring that moved several h/cc files. PS 26 tries to fix the comments. http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc File be/src/exec/orc-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/exec/orc-metadata-utils.cc@477 PS24, Line 477: > Could you add a comment about why we don't need TYPE_BINARY here? added a comment - note that file was moved during the rebase http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/24/be/src/runtime/descriptors.h@256 PS24, Line 256: 0]]; > nit: front() or [0] seems better (as line 248 uses). Same comment to line 2 Done http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java File java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java: http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@40 PS24, Line 40: import org.apache.hadoop.io.Writable; > duplicated line Done http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@319 PS24, Line 319: t " + > BytesWritable Done http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@321 PS24, Line 321: get > nit: get() is deprecated by getBytes(). Done http://gerrit.cloudera.org:8080/#/c/16066/24/java/test-hive-udfs/src/main/java/org/apache/impala/TestGenericUdf.java@326 PS24, Line 326: nary.s > nit: handle null 'result' when 'inputs' is empty. Done -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 26 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Aug 2022 08:16:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 25: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/25/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/25/tests/query_test/test_udfs.py@116 PS25, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 25 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Aug 2022 06:52:39 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 25: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11189/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 25 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Aug 2022 07:12:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 26: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/26/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/26/tests/query_test/test_udfs.py@116 PS26, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 26 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Aug 2022 08:13:01 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 26: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11190/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 26 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 19 Aug 2022 08:35:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 22: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11136/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 22 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 11 Aug 2022 10:05:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 21: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11135/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 21 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 11 Aug 2022 09:57:15 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 22: (1 comment) Some tests were broken by recent changes related fixing comments, ps 21 fixes those (and ps 22 fixes some whitespace issues) http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java File fe/src/main/java/org/apache/impala/analysis/CastExpr.java: http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@357 PS14, Line 357: Preconditions.checkState(!childType.isNull()); > Added a precondition to ensure that if one is binary, the other must be str This change actually broke a number of tests, for example when casting NULL to binary or tests the rely on an exact error message about the invalid cast. Changed the logic to do noop cast only in case of binary<->string cast, and let the subsequent logic catch the error that there is no casting functions to/from other types. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 22 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 11 Aug 2022 09:47:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 22: (6 comments) http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@69 PS22, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@89 PS22, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@100 PS22, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@113 PS22, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/22/tests/hs2/test_fetch.py@123 PS22, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/22/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/22/tests/query_test/test_udfs.py@116 PS22, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 22 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 11 Aug 2022 09:46:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#22). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 21: (13 comments) http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java File fe/src/main/java/org/apache/impala/analysis/CastExpr.java: http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@368 PS21, Line 368: noOp_ = true; tab used for whitespace http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@369 PS21, Line 369: return; tab used for whitespace http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@370 PS21, Line 370: } tab used for whitespace http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java: http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@4278 PS21, Line 4278:"cast(string_col as binary), bool_col, " + tab used for whitespace http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java: http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@632 PS21, Line 632: // NDV is not calculated for BINARY columns tab used for whitespace http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@633 PS21, Line 633: assertFalse(binaryCol.getStats().hasNumDistinctValues()); tab used for whitespace http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java File fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java: http://gerrit.cloudera.org:8080/#/c/16066/21/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java@243 PS21, Line 243: stats.get(0).toString()); tab used for whitespace http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@69 PS21, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@89 PS21, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@100 PS21, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@113 PS21, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/21/tests/hs2/test_fetch.py@123 PS21, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/21/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/21/tests/query_test/test_udfs.py@116 PS21, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 21 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 11 Aug 2022 09:36:04 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#21). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 20: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11132/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 20 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Aug 2022 18:37:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 19: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11131/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 19 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Aug 2022 18:36:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 20: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256 PS14, Line 256: return col_descs_[slot_desc->col_path().back()]; > Checked how Hive handles this, it seems to utf8 encode the nested binary wi Change the analyzer to disallow complex types in select list if they have binary members. Created follow up Jira IMPALA-11491 Also added a test tables + FE tests to check this behavior. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 20 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Aug 2022 18:12:17 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#20). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#19). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/analysis/SlotRef.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 20: (6 comments) http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@69 PS20, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@89 PS20, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@100 PS20, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@113 PS20, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/20/tests/hs2/test_fetch.py@123 PS20, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/20/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/20/tests/query_test/test_udfs.py@116 PS20, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 20 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Aug 2022 18:10:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 19: (7 comments) http://gerrit.cloudera.org:8080/#/c/16066/19/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java File fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java: http://gerrit.cloudera.org:8080/#/c/16066/19/fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java@297 PS19, Line 297: // Returning complex types with BINARY in select list is not yet implemented (IMPALA-11491). line too long (96 > 90) http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@69 PS19, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@89 PS19, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@100 PS19, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@113 PS19, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/19/tests/hs2/test_fetch.py@123 PS19, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/19/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/19/tests/query_test/test_udfs.py@116 PS19, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 19 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 10 Aug 2022 18:07:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 18: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256 PS14, Line 256: return col_descs_[slot_desc->col_path().back()]; > Hmm, I realized that I have to think nested binaries through more carefully Checked how Hive handles this, it seems to utf8 encode the nested binary with replace: select named_struct("b", unhex("00112233445566778899AABBCCDDEEFF")); result: {"b":"3DUfw} select array(unhex("00112233445566778899AABBCCDDEEFF")); result: ["3DUfw] A weird behavior around nested binary is that it is not quoted (unlike string): select named_struct("s", "a", "b", cast("a" as binary)); {"s":"a","b":a} I have also checked what happens when we read these back from Impala: - if we return the whole struct, the non-valid utf8 binary is replaced with an empty string - if we return the whole array, it is returned escaped: "\000\021\"3DUfw"] - if the member/item is returned directly, we print it the same way as Hive - if we hex() the result back, it returns it correctly, so Hive does not write it in a lossy way I am thinking about not allowing nested binary for now, and enable them once the struct/array printing functions are unified + I have verified whether not quoting them is intentional in Hive. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 18 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 05 Aug 2022 11:22:57 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 18: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11105/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 18 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 05 Aug 2022 11:04:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 18: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc File be/src/service/hs2-util.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@872 PS14, Line 872: AuxColumnType aux_type(columnType); > Done Reverted this, as while gcc compiled it, clang had issue I could not solve. This is strange to me, as case TYPE_DECIMAL: also has some local variables, but I did not want to dive too deeply. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 18 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 05 Aug 2022 10:46:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 18: (6 comments) http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@69 PS18, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@89 PS18, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@100 PS18, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@113 PS18, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/18/tests/hs2/test_fetch.py@123 PS18, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/18/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/18/tests/query_test/test_udfs.py@116 PS18, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 18 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 05 Aug 2022 10:44:48 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#18). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 16: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/11100/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 16 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 04 Aug 2022 19:28:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 17: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256 PS14, Line 256: return col_descs_[slot_desc->col_path().back()]; > Added tests with binary in struct, binary in array was only tested manually Hmm, I realized that I have to think nested binaries through more carefully and check how Hive deals with them - the issue is that complex types are expected to be returned as utf8 strings, and I don't know what to do with a nested BINARY with special characters - probably they are handled by escaping them. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 17 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 04 Aug 2022 15:18:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#17). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 17: (6 comments) http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@69 PS17, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@89 PS17, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@100 PS17, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@113 PS17, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/17/tests/hs2/test_fetch.py@123 PS17, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/17/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/17/tests/query_test/test_udfs.py@116 PS17, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 17 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 03 Aug 2022 18:16:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 14: (11 comments) Ps 15 is rebase + conflict resolution, ps 16 contains fixes for the comments. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc File be/src/exec/hbase-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc@161 PS14, Line 161: data_len = encoded_value.length(); > I just realized we support writing to HBase. It worths adding tests for thi Thanks for the comment, it turned out that the writer was buggy as 'base64_encoded_value' went out of scope while 'data' kept a pointer to its buffer. Added a binary column to insertalltypesagg/insertalltypesaggbinary (only used by HBase tests) and added few tests to hbase-inserts.test Also checked manually that Hive can read our data. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc File be/src/exec/text-converter.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc@359 PS14, Line 359: return auxType.string_subtype != AuxColumnType::StringSubtype::BINARY; > Add JIRA for codegen on reading BINARY from text files? I guess we just nee Created a jira. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256 PS14, Line 256: return col_descs_[slot_desc->col_path().back()]; > Not sure if this is correct for nested types. Can we add tests on BINARY in Added tests with binary in struct, binary in array was only tested manually so far. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc File be/src/runtime/types.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc@141 PS14, Line 141: return TPrimitiveType::BINARY; > Should we return INVALID_TYPE here because of DCHECK(false) so that we catc Agree with returning INVALID_TYPE to improve error handling in release builds. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc File be/src/service/hs2-util.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@872 PS14, Line 872: AuxColumnType aux_type(columnType); > If this is only used in the TYPE_STRING, perhaps it should be initialized t Done http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@950 PS14, Line 950: type_entry.__set_type(thrift::TTypeId::STRING_TYPE); > If DCHECK is false, do we want to set this to STRING_TYPE? Or maybe we sho There is no "invalid type" concept in the HS2 thrift api + this field is required, so I don't see a better idea here (maybe we could return some status in case of error, but that seems a bit too much effort for the unlikely case of not catching this scenario in debug builds. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc File be/src/service/impala-beeswax-server.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc@212 PS14, Line 212: return "string"; > How come we don't return TypeToOdbcString here? What do you mean for inlineing? It is also used in some tests, so we can't keep it in this file. About returning TypeToOdbcString - while odbc does support binary type, my guess is that existing beeswax clients don't, so it seemed safer to return "string" http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java File fe/src/main/java/org/apache/impala/analysis/CastExpr.java: http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@187 PS14, Line 187: // No built-in function needed for BINARY <-> STRING conversion. > I think the comment should also reflect other datatypes versus binary are n Done http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@357 PS14, Line 357: boolean isStringBinaryCast = > I may be understanding this wrong... Added a precondition to ensure that if one is binary, the other must be string. http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java: http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116 PS14, Line 116: private static boolean isLikeableType(Type type) { > No real comment here. But I feel the other types might get upset for not be :D poor other types - tbh I don't remember whether the naming was intentional http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test File
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 16: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/11087/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 16 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 03 Aug 2022 17:55:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 16: (6 comments) http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@69 PS16, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@89 PS16, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@100 PS16, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@113 PS16, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/16/tests/hs2/test_fetch.py@123 PS16, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/16/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/16/tests/query_test/test_udfs.py@116 PS16, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 16 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 03 Aug 2022 17:46:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#16). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 15: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/11085/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 15 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 03 Aug 2022 13:51:05 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 15: (6 comments) http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py File tests/hs2/test_fetch.py: http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@69 PS15, Line 69: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@89 PS15, Line 89: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@100 PS15, Line 100: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@113 PS15, Line 113: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/15/tests/hs2/test_fetch.py@123 PS15, Line 123: " flake8: E126 continuation line over-indented for hanging indent http://gerrit.cloudera.org:8080/#/c/16066/15/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/15/tests/query_test/test_udfs.py@116 PS15, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 15 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 03 Aug 2022 13:30:46 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Steve Carlin, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#15). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/parquet/hdfs-parquet-table-writer.cc M be/src/exec/parquet/parquet-metadata-utils.cc M be/src/exec/parquet/parquet-metadata-utils.h M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Steve Carlin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 14: (7 comments) Hopefully these comments are helpful. Logic seems good, these are all pretty much nits. Thanks! http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc File be/src/runtime/types.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/types.cc@141 PS14, Line 141: return TPrimitiveType::BINARY; Should we return INVALID_TYPE here because of DCHECK(false) so that we catch the problem earlier on? http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc File be/src/service/hs2-util.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@872 PS14, Line 872: AuxColumnType aux_type(columnType); If this is only used in the TYPE_STRING, perhaps it should be initialized there. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/hs2-util.cc@950 PS14, Line 950: type_entry.__set_type(thrift::TTypeId::STRING_TYPE); If DCHECK is false, do we want to set this to STRING_TYPE? Or maybe we should set it to INVALID_TYPE? Not sure what is best. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc File be/src/service/impala-beeswax-server.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/service/impala-beeswax-server.cc@212 PS14, Line 212: return "string"; How come we don't return TypeToOdbcString here? (sidenote, I'm rusty on my C++, but why is TypeToOdbcString not inlined?) http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java File fe/src/main/java/org/apache/impala/analysis/CastExpr.java: http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@187 PS14, Line 187: // No built-in function needed for BINARY <-> STRING conversion. I think the comment should also reflect other datatypes versus binary are not in the support matrix. http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@357 PS14, Line 357: boolean isStringBinaryCast = I may be understanding this wrong... This is slightly inconsistent with the above code. In the above code, we assume that if one is binary, the other must be either binary or string. Do we need to check explicitly here that the other type is string? I understand why, but then maybe we should have a Preconditions check sayin git is either of type string or binary? Not sure what is best here. http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java: http://gerrit.cloudera.org:8080/#/c/16066/14/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116 PS14, Line 116: private static boolean isLikeableType(Type type) { No real comment here. But I feel the other types might get upset for not being liked. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Steve Carlin Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 14 Jul 2022 00:01:00 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 14: (4 comments) http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc File be/src/exec/hbase-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/hbase-table-writer.cc@161 PS14, Line 161: data_len = encoded_value.length(); I just realized we support writing to HBase. It worths adding tests for this. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc File be/src/exec/text-converter.cc: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/exec/text-converter.cc@359 PS14, Line 359: return auxType.string_subtype != AuxColumnType::StringSubtype::BINARY; Add JIRA for codegen on reading BINARY from text files? I guess we just need the codegen of base64 decoding. http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/14/be/src/runtime/descriptors.h@256 PS14, Line 256: return col_descs_[slot_desc->col_path().back()]; Not sure if this is correct for nested types. Can we add tests on BINARY inside nested types? http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test File testdata/workloads/functional-query/queries/QueryTest/binary-type.test: http://gerrit.cloudera.org:8080/#/c/16066/14/testdata/workloads/functional-query/queries/QueryTest/binary-type.test@68 PS14, Line 68: where binary_col_with_nulls = cast("01/02/09" as binary) This is identical to the above query. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 15 Jun 2022 03:17:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 14: > (4 comments) > > Thanks for reviving this patch! I just went throught the FE changes > and some tests and they LGTM. I'm going to dig into more details. > > Not sure if I missed it, but do we have tests on writing BINARY > columns? Assuming the data loading of functional_parquet.binary_tbl > is the test on writing BINARY to parquet, we still need test on > writing BINARY to text files. At first I thought that with the new table writing to text is covered now, but then realized that LOAD always uses Hive. Will add some table writing tests in the next patch. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jun 2022 19:12:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10761/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jun 2022 18:59:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 14: (4 comments) PS 14 extended the testdata and EE tests and also added some new functions: concat, murmur_hash (this was needed for multiple count distinct). In the next patch I plan to fix the older comments but Gabor and Tim http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java: http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@44 PS13, Line 44: import org.apache.impala.catalog.Type; > nit: unused import? Done http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java File fe/src/main/java/org/apache/impala/catalog/Type.java: http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java@681 PS13, Line 681: if > nit: need one space after "if" Done http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java File fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java: http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java@409 PS13, Line 409: RewritesOk("cast(concat('a', 'b') as binary)", rule, "'ab'"); > nit: Can we also test concat() on binary literals? i.e. The concat for binary was not added yet. Added it in PS 14 + some tests in expr-test.cc http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test File testdata/workloads/functional-query/queries/QueryTest/udf.test: http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test@236 PS13, Line 236: bytes( > bytes()? in case of utf8_mode=true. Done -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jun 2022 18:42:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#14). Change subject: IMPALA-9482: Support for BINARY columns .. IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java M fe/src/main/java/org/apache/impala/hive/executor/HiveLegacyJavaFunction.java M
[Impala-ASF-CR] IMPALA-9482: Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482: Support for BINARY columns .. Patch Set 14: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/14/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/14/tests/query_test/test_udfs.py@116 PS14, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 14 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jun 2022 18:38:29 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 13: (4 comments) Thanks for reviving this patch! I just went throught the FE changes and some tests and they LGTM. I'm going to dig into more details. Not sure if I missed it, but do we have tests on writing BINARY columns? Assuming the data loading of functional_parquet.binary_tbl is the test on writing BINARY to parquet, we still need test on writing BINARY to text files. http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java: http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@44 PS13, Line 44: import org.apache.impala.catalog.PrimitiveType; nit: unused import? http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java File fe/src/main/java/org/apache/impala/catalog/Type.java: http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/main/java/org/apache/impala/catalog/Type.java@681 PS13, Line 681: if nit: need one space after "if" http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java File fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java: http://gerrit.cloudera.org:8080/#/c/16066/13/fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java@409 PS13, Line 409: RewritesOk("cast(concat('a', 'b') as binary)", rule, "'ab'"); nit: Can we also test concat() on binary literals? i.e. concat(cast('a' as binary), cast('b' as binary)) http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test File testdata/workloads/functional-query/queries/QueryTest/udf.test: http://gerrit.cloudera.org:8080/#/c/16066/13/testdata/workloads/functional-query/queries/QueryTest/udf.test@236 PS13, Line 236: length bytes()? in case of utf8_mode=true. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 13 Jun 2022 08:29:15 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10753/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 10 Jun 2022 19:00:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/13/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/13/tests/query_test/test_udfs.py@116 PS13, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 13 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 10 Jun 2022 18:42:00 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#13). Change subject: IMPALA-9482 Support for BINARY columns .. IMPALA-9482 Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type - Ran exhaustive tests. Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java M
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 12: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 07 Jun 2022 19:36:52 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10721/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 07 Jun 2022 15:31:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 12: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/12/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/12/tests/query_test/test_udfs.py@116 PS12, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 07 Jun 2022 15:13:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 12: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8194/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 07 Jun 2022 15:13:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 12: PS 12 adds support for Hive UDFs. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 12 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 07 Jun 2022 15:13:02 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#12). Change subject: IMPALA-9482 Support for BINARY columns .. IMPALA-9482 Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type - Ran exhaustive tests. Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/hive/executor/HiveGenericJavaFunction.java M
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 11: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8171/ -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 11 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 01 Jun 2022 20:48:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 11: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10675/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 11 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 01 Jun 2022 16:34:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 11: Ps 11 fixed length() behavior when utf8_mode=1 and added BE tests for functions and predicates on BINARY. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 11 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 01 Jun 2022 16:23:38 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 11: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8171/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 11 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 01 Jun 2022 16:23:59 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/11/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/11/tests/query_test/test_udfs.py@116 PS11, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 11 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 01 Jun 2022 16:15:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Hello Quanlong Huang, Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#11). Change subject: IMPALA-9482 Support for BINARY columns .. IMPALA-9482 Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type - Ran exhaustive tests. Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java M
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10521/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 9 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 May 2022 15:51:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/9/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/9/tests/query_test/test_udfs.py@116 PS9, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 9 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 May 2022 15:33:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#9). Change subject: IMPALA-9482 Support for BINARY columns .. IMPALA-9482 Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type - Ran exhaustive tests. Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java M
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 8: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/8073/ -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 May 2022 13:51:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 8: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/10519/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 May 2022 09:45:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/16066/8/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/8/tests/query_test/test_udfs.py@116 PS8, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 May 2022 09:26:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 8: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/8073/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 8 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 02 May 2022 09:26:22 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#8). Change subject: IMPALA-9482 Support for BINARY columns .. IMPALA-9482 Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type - Ran exhaustive tests. Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/file-metadata-utils.cc M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java M
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 7: (13 comments) http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG Commit Message: PS7: I think you're missing some handling for the parquet_annotate_strings_utf8 option - it looks like we might set this for BINARY types, which is wrong. http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG@29 PS7, Line 29: - HS2/Beeswax service Do we need to update clients? Looks like impala-shell might have some logic for HS2 but not sure that it's tested. http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG@55 PS7, Line 55: - Added FE/EE tests mainly based on the ones added to the DATE type Can we add some more sanity tests for different parts of query processing. I think the fact that it's handled as string internally reduces the amount of testing required, but some sanity checks would be good. E.g. a join on BINARY, a group by on aggregation and a sort by BINARY? http://gerrit.cloudera.org:8080/#/c/16066/7//COMMIT_MSG@56 PS7, Line 56: - Ran exhaustive tests. Can we also add a shell test? http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hbase-scan-node.h File be/src/exec/hbase-scan-node.h: http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hbase-scan-node.h@84 PS7, Line 84: /// Descriptor of the HBase table. Can you mention it's set in Prepare() http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/hdfs-text-table-writer.cc@107 PS7, Line 107: // TODO: try to find a more efficient imlementation Maybe remove the TODO? Or file a JIRA? Not sure that it adds much value here. http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.h File be/src/exec/text-converter.h: http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.h@65 PS7, Line 65: auxType aux_type here and elsewhere http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc File be/src/exec/text-converter.cc: http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc@353 PS7, Line 353: bool TextConverter::SupportsCodegenWriteSlot(const ColumnType& col_type, I don't love adding more logic without corresponding codegen, we spent so much time filling in gaps in the past and it feels bad to add more back in. http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/runtime/descriptors.h@209 PS7, Line 209: ( aux_type http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java: http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@259 PS7, Line 259: // NDV is not calculated for BINARY columns (similarly to Hive). This seems kinda arbitrary - is there a real reason not to calculate it? I guess not a big deal. http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java File fe/src/main/java/org/apache/impala/analysis/LikePredicate.java: http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/analysis/LikePredicate.java@116 PS7, Line 116: private static boolean isLikeableType(Type type) { lol at the function name (no need to change). http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java File fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java: http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/test/java/org/apache/impala/catalog/PartialCatalogInfoTest.java@192 PS7, Line 192: C nit: Column http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_scanners.py@1611 PS7, Line 1611: def add_test_dimensions(cls): It'd be good to run this for HS2 as well. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 04 Sep 2020 04:20:59 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 7: Let me know when I should take another look -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 01 Sep 2020 23:20:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc File be/src/exec/text-converter.cc: http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exec/text-converter.cc@355 PS7, Line 355: switch (col_type.type) { : case TYPE_CHAR: : return false; : case TYPE_STRING: : return auxType.string_subtype != AuxColumnType::StringSubtype::BINARY; : default: : return true; This could be simplified a bit in my opinion: if (col_type.type == TYPE_STRING) return auxType... != BINARY; return col_type.type != TYPE_CHAR; http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exprs/utility-functions-ir.cc File be/src/exprs/utility-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/exprs/utility-functions-ir.cc@209 PS7, Line 209: const StringVal& /*input_val*/ Is this part needed? -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 26 Aug 2020 09:20:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 7: (1 comment) I only checked the FE part of the patch. Will continue with the rest tomorrow. http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java File fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java: http://gerrit.cloudera.org:8080/#/c/16066/7/fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java@989 PS7, Line 989: // BINARY is only supported for count + min + max. Lucky that those were the first 3 operators in the for-loop's body. What if later someone adds another function above this line? Let's consider adding these functions for binary after the for loop similarly to ds_hll_union (or avg). -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 25 Aug 2020 15:10:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 7: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/6920/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 13 Aug 2020 19:25:18 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/service/hs2-util.cc File be/src/service/hs2-util.cc: http://gerrit.cloudera.org:8080/#/c/16066/7/be/src/service/hs2-util.cc@273 PS7, Line 273: static void StringExprValuesToHS2TColumnHelper(ScalarExprEvaluator* expr_eval, line has trailing whitespace http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/7/tests/query_test/test_udfs.py@116 PS7, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 13 Aug 2020 19:05:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 7: (4 comments) http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h File be/src/exec/text-converter.h: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@63 PS2, Line 63: > Could you mention the new param in the comment above? Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/text-converter.h@88 PS2, Line 88: hat cont > Could you mention the new param in the comment above? Done http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc File be/src/service/hs2-util.cc: http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc@273 PS6, Line 273: static void StringExprValuesToHS2TColumnHelper(ScalarExprEvaluator* expr_eval, > line too long (95 > 90) Done http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java: http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@624 PS6, Line 624: HdfsTable binaryTable = (HdfsTable) catalog_.getOrLoadTable("functional", > line too long (91 > 90) Done -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 7 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 13 Aug 2020 19:05:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/16066 to look at the new patch set (#7). Change subject: IMPALA-9482 Support for BINARY columns .. IMPALA-9482 Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type - Ran exhaustive tests. Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M bin/rat_exclude_files.txt M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java M
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 6: (23 comments) Thanks for the comments and sorry for the huge time before resolving them! Patch 3 is just a rebase with some conflict resolution. http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@18 PS2, Line 18: - No NDV is calculated during COMPUTE STATISTICS. > Are we planning to change this? I think implicit casts from BINARY to STRIN I have changed the casting rules to always need explicit casting between STRING and BINARY. Added more info in the commit message (line 42). http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@47 PS2, Line 47: INSERT > Do you plan to do interop tests between Hive and Impala? We generally don't seem to have such tests, so I didn't add it here. I think it would be great to add some simple tests that test all types and fileformats with writer as Hive and reader as Impala + vica versa, but I would do this in jira. Note that for some file formats the Hive writer -> Impala reader path is tested as Hive writes the tables during dataload. http://gerrit.cloudera.org:8080/#/c/16066/2//COMMIT_MSG@52 PS2, Line 52: to test scanning. > I think we want a lot more coverage in analysis tests, to make sure that we I have added much more FE tests (similar ones as Attila added for the Date type) + a few to specifically check casts. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc File be/src/exec/hbase-scan-node.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-scan-node.cc@49 PS2, Line 49: null > nit: nullptr? I see there's a lot of NULLs in the file, so maybe just updat Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc File be/src/exec/hbase-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@157 PS2, Line 157: if (col_desc.auxType().IsBinaryStringSubtype()) { > line too long (92 > 90) Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hbase-table-writer.cc@159 PS2, Line 159: Base64Encode > Why do we encode it for HBase? This how Hive does it, BINARY columns are base64 encoded as in text files. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc File be/src/exec/hdfs-rcfile-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-rcfile-scanner.cc@593 PS2, Line 593: const AuxColumnType& aux_type = : scan_node_->hdfs_table()->GetColumnDesc(slot_desc). > Yeah I wondered about this too. I think we should consider this in the cont Yes, I was thinking similarly as what Tim wrote - it would be much nicer if string encoding could be handled mainly in the frontend and the backend would just need to think about byte arrays. Probably VARCHAR could be also handled this way, as AFAIK most code do not enforce the length limit, only scanners/writers/client handlers need to deal with it. Another type that could probably profit from AuxType is timestamp, if we'll implement TIMESTAMP WITH (LOCAL) TIMEZONE. It should be probably the same data type underneath, but the frontend could pick different functions where the conversion matters + scanners would also need the info as different files may be written in different timezones. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc File be/src/exec/hdfs-text-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@103 PS2, Line 103: const ColumnDescriptor& col_desc = > line too long (96 > 90) Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@105 PS2, Line 105: const StringValue* string_value = reinterpret_cast(value); > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/exec/hdfs-text-table-writer.cc@107 PS2, Line 107: // TODO: try to find a more efficient imlementation > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h File be/src/runtime/types.h: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.h@67 PS2, Line 67: Aux > nit: 'Aux' is not too descriptive. Probably EncodedColumnType/StoredColumnT I didn't change it in the last patch to avoid the additional noise, but I am not attached to the current name. http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc File be/src/runtime/types.cc: http://gerrit.cloudera.org:8080/#/c/16066/2/be/src/runtime/types.cc@122 PS2, Line 122: ToThrift(PrimitiveType ptype) > Should this function work as the inverse of ThriftToType() ? It makes sense to create a version which also uses AuxType, but this was
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 6: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/6919/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 13 Aug 2020 18:21:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. Patch Set 6: (4 comments) http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc File be/src/service/hs2-util.cc: http://gerrit.cloudera.org:8080/#/c/16066/6/be/src/service/hs2-util.cc@273 PS6, Line 273: static void StringExprValuesToHS2TColumnHelper(ScalarExprEvaluator* expr_eval, RowBatch* batch, line too long (95 > 90) http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java File fe/src/test/java/org/apache/impala/catalog/CatalogTest.java: http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/CatalogTest.java@624 PS6, Line 624: HdfsTable binaryTable = (HdfsTable) catalog_.getOrLoadTable("functional", "binary_tbl", line too long (91 > 90) http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java File fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java: http://gerrit.cloudera.org:8080/#/c/16066/6/fe/src/test/java/org/apache/impala/catalog/local/LocalCatalogTest.java@433 PS6, Line 433: "WITH SERDEPROPERTIES ('hbase.columns.mapping'=':key,d:string_col,d:binary_col', " + line too long (92 > 90) http://gerrit.cloudera.org:8080/#/c/16066/6/tests/query_test/test_udfs.py File tests/query_test/test_udfs.py: http://gerrit.cloudera.org:8080/#/c/16066/6/tests/query_test/test_udfs.py@116 PS6, Line 116: y flake8: E501 line too long (92 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/16066 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Gerrit-Change-Number: 16066 Gerrit-PatchSet: 6 Gerrit-Owner: Csaba Ringhofer Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Gabor Kaszab Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 13 Aug 2020 18:00:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-9482 Support for BINARY columns
Csaba Ringhofer has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/16066 ) Change subject: IMPALA-9482 Support for BINARY columns .. IMPALA-9482 Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type - Ran exhaustive tests. Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 --- M be/src/exec/hbase-scan-node.cc M be/src/exec/hbase-scan-node.h M be/src/exec/hbase-table-writer.cc M be/src/exec/hdfs-rcfile-scanner.cc M be/src/exec/hdfs-scanner-ir.cc M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-table-writer.cc M be/src/exec/orc-metadata-utils.cc M be/src/exec/text-converter.cc M be/src/exec/text-converter.h M be/src/exec/text-converter.inline.h M be/src/exprs/expr-test.cc M be/src/exprs/utility-functions-ir.cc M be/src/exprs/utility-functions.h M be/src/runtime/descriptors.cc M be/src/runtime/descriptors.h M be/src/runtime/types.cc M be/src/runtime/types.h M be/src/service/hs2-util.cc M be/src/service/hs2-util.h M be/src/service/impala-beeswax-server.cc M be/src/service/impala-hs2-server.cc M be/src/service/query-result-set.cc M be/src/testutil/test-udfs.cc M be/src/util/coding-util.cc M be/src/util/coding-util.h M common/function-registry/impala_functions.py M fe/src/main/java/org/apache/impala/analysis/CastExpr.java M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java M fe/src/main/java/org/apache/impala/analysis/InPredicate.java M fe/src/main/java/org/apache/impala/analysis/LikePredicate.java M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java M fe/src/main/java/org/apache/impala/catalog/Function.java M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java M fe/src/main/java/org/apache/impala/catalog/ScalarType.java M fe/src/main/java/org/apache/impala/catalog/Type.java M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java M fe/src/main/java/org/apache/impala/util/AvroSchemaParser.java M fe/src/main/java/org/apache/impala/util/AvroSchemaUtils.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzeSubqueriesTest.java M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java M fe/src/test/java/org/apache/impala/analysis/LiteralExprTest.java M fe/src/test/java/org/apache/impala/analysis/ToSqlTest.java M