Hello Gabor Kaszab, Zoltan Borok-Nagy, Attila Jeges, Tim Armstrong, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16066

to look at the new patch set (#3).

Change subject: WIP IMPALA-9482 Support for BINARY columns
......................................................................

WIP IMPALA-9482 Support for BINARY columns

This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

The reason for the WIP status is that:
- I plan to add more tests, mainly in FE.
- No successful exhaustive test run yet.
- There are a few differences compared to Hive:
  - In INSERT ... VALUES () string literals need to be explicitly cast
    to BINARY, while this is not needed in Hive.
  - UDF/UDAFs that expect STRING argument accept BINARY too, while in
    Hive explicit cast is needed in this case.
  - Hive doesn't calculate NDV during COMPUTE STATISTICS for BINARY
    columns. Impala still calculates it but throws away the results.

I think that review of most parts can start in this state, so it makes
sense to upload it without the missing pieces.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc.
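
Reviewer's illustration, not part of the patch: the note above that text-based formats store BINARY base64 encoded can be sketched as a simple round trip. Encoding keeps arbitrary bytes (including field delimiters and newlines) from corrupting a text row. The helper names below are hypothetical and do not correspond to functions in be/src/util/coding-util.cc; this is only a minimal sketch of the idea using Python's standard library.

```python
import base64


def encode_binary_for_text(value: bytes) -> str:
    """Hypothetical helper: base64-encode raw bytes for storage in a text file."""
    return base64.b64encode(value).decode("ascii")


def decode_binary_from_text(field: str) -> bytes:
    """Hypothetical helper: decode a base64 text field back to raw bytes.

    validate=True rejects characters outside the base64 alphabet instead of
    silently skipping them.
    """
    return base64.b64decode(field, validate=True)


# Bytes that would be unsafe to write verbatim into a delimited text row.
raw = b"\x00\xffa,b\n"
encoded = encode_binary_for_text(raw)

# The encoded form contains no delimiter or newline characters.
assert "," not in encoded and "\n" not in encoded
# Decoding restores the original bytes exactly.
assert decode_binary_from_text(encoded) == raw
```

A scanner for a text table would apply the decode step when materializing a BINARY slot, and a writer would apply the encode step; STRING columns would be read and written unchanged.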
Only the following parts of backend need to differentiate between
STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added a basic coverage in FE/EE tests, will continue adding new ones
  while the patch is in WIP state.
- Ran core tests.

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
---
M be/src/exec/hbase-scan-node.cc
M be/src/exec/hbase-scan-node.h
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hdfs-rcfile-scanner.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.h
M be/src/exec/text-converter.inline.h
M be/src/exprs/expr-test.cc
M be/src/runtime/descriptors.cc
M be/src/runtime/descriptors.h
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/query-result-set.cc
M be/src/testutil/test-udfs.cc
M be/src/util/coding-util.cc
M be/src/util/coding-util.h
M be/src/util/symbols-util.cc
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/InPredicate.java
M fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M fe/src/main/java/org/apache/impala/catalog/PrimitiveType.java
M fe/src/main/java/org/apache/impala/catalog/ScalarFunction.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/util/AvroSchemaConverter.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeDDLTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeExprsTest.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzerTest.java
M fe/src/test/java/org/apache/impala/analysis/AuditingTest.java
D testdata/UnsupportedTypes/data.csv
M testdata/bin/generate-schema-statements.py
A testdata/data/binary_tbl/000000_0.txt
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/binary-type.test
M testdata/workloads/functional-query/queries/QueryTest/misc.test
M testdata/workloads/functional-query/queries/QueryTest/udf.test
M tests/common/impala_connection.py
M tests/common/test_result_verifier.py
M tests/custom_cluster/test_permanent_udfs.py
M tests/hs2/test_fetch.py
M tests/hs2/test_hs2.py
M tests/query_test/test_scanners.py
M tests/query_test/test_udfs.py
56 files changed, 503 insertions(+), 309 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/66/16066/3
-- 
To view, visit http://gerrit.cloudera.org:8080/16066
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Gerrit-Change-Number: 16066
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>