[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char types

Quanlong Huang (Code Review) Wed, 18 May 2022 22:52:25 -0700

Hello Qifan Chen, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit


    http://gerrit.cloudera.org:8080/16909

to look at the new patch set (#15).

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types
......................................................................

IMPALA-5675: Support UTF-8 Varchar and Char types

This patch adds support for UTF-8 aware varchar and char types. In
UTF-8 mode, when truncating UTF-8 varchar(N) and char(N) strings,
lengths will be counted by UTF-8 characters instead of bytes. So the
result string will have up to N UTF-8 characters.

The UTF8_MODE query option is first detected in FE when analyzing the
query. An 'is_utf8' label is added in Exprs and SlotDescriptors. The
labels are used in generating thrift objects and computing the tuple
layouts. A char(N) slot will occupy 4 * N bytes in UTF-8 mode, because a
UTF-8 character is encoded in 1 to 4 bytes. The slot will store up to
N UTF-8 characters.
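
The sizing rule above can be sketched as follows. This is only an
illustration of the arithmetic; the function and constant names are
hypothetical and do not come from the patch:

```python
MAX_UTF8_BYTES_PER_CHAR = 4  # a UTF-8 character takes 1 to 4 bytes


def char_slot_size(n: int, is_utf8: bool) -> int:
    """Bytes reserved for a char(N) slot under the rule described above."""
    return n * MAX_UTF8_BYTES_PER_CHAR if is_utf8 else n


# Worst case: N characters that each need 4 bytes exactly fill the slot.
worst_case = ("\U0001F600" * 10).encode("utf-8")
fits = len(worst_case) <= char_slot_size(10, is_utf8=True)
```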

There is a gotcha that we should not add the label in Type.java, because
Type instances are shared across the FE. Query compilation reuses the
Type instances from the metadata. If we modify Type instances during
compilation, other queries in non-UTF8 mode will be affected.
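
A simplified Python sketch of why the label lives on per-query objects
rather than on the shared type. The classes SharedType and SlotDesc and
the function analyze are hypothetical stand-ins, not the FE classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedType:
    """Stand-in for a type instance shared by all queries (immutable)."""
    name: str
    length: int


@dataclass
class SlotDesc:
    """Stand-in for a per-query descriptor that carries the UTF-8 label."""
    col_type: SharedType
    is_utf8: bool = False


# One instance loaded from metadata and reused by every query.
CATALOG_CHAR_10 = SharedType("CHAR", 10)


def analyze(utf8_mode: bool) -> SlotDesc:
    # The label is set on the per-query descriptor; the shared type is
    # only referenced, never modified.
    return SlotDesc(col_type=CATALOG_CHAR_10, is_utf8=utf8_mode)


utf8_query = analyze(utf8_mode=True)
plain_query = analyze(utf8_mode=False)
```

Both descriptors point at the same type object, yet only the UTF-8
query sees is_utf8 set.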

However, in BE, we need the type-related classes (e.g. ColumnType,
TypeDesc) to carry the UTF-8 markers. It's impractical to check the
UTF8_MODE query option everywhere it is needed. E.g. in
AnyValUtil::SetAnyVal we can't access the query options. So we add the
'is_utf8' marker in TScalarType, ColumnType, TypeDesc to conveniently
distinguish char(N) and varchar(N) types in UTF-8 mode. When generating
thrift objects in FE, Exprs and SlotDescriptors deliver 'is_utf8'
markers to TScalarTypes. They finally land in ColumnType and TypeDesc
instances.

Once the UTF-8 mode is correctly propagated, we just need to
truncate/pad the char/varchar strings with their lengths counted in
UTF-8 characters.
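
A minimal Python sketch of padding by character count. It assumes
char(N) values are padded with spaces and that the input is valid
UTF-8; the helper names are hypothetical:

```python
def utf8_char_count(data: bytes) -> int:
    """Count UTF-8 characters by counting the bytes that start a character."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def pad_to_chars(data: bytes, n: int) -> bytes:
    """Append spaces until `data` holds `n` UTF-8 characters (assumed padding)."""
    missing = n - utf8_char_count(data)
    return data + b" " * max(missing, 0)


# "你好" is 2 characters but 6 bytes, so 3 spaces are added for char(5).
padded = pad_to_chars("你好".encode("utf-8"), 5)
```

Padding by byte length instead would add no spaces here, since the
value is already longer than 5 bytes.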

Since char(N) slots always occupy 4N bytes, when converting char(N) to
other string types, we need to re-calculate the actual length
corresponding to N UTF-8 characters. We can optimize this in later
patches, e.g. store the UTF-8 length in the slot, or deal with UTF-8
char(N) in the same way as varchar(N), i.e. reallocate the string space
and just store the pointer and length in the slot.
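
The re-calculation can be pictured with the Python sketch below. The
slot layout used here (value followed by space padding up to 4N bytes)
is an assumption made for illustration, and both helper names are
hypothetical:

```python
def write_char_slot(value: str, n: int) -> bytes:
    """Encode `value` into a fixed 4*n-byte slot, space-padded (assumed layout)."""
    if len(value) > n:
        raise ValueError("value has more than n characters")
    return value.encode("utf-8").ljust(4 * n, b" ")


def byte_len_of_n_chars(slot: bytes, n: int) -> int:
    """Return how many leading bytes of `slot` hold its first `n` UTF-8 characters."""
    chars = 0
    for pos, byte in enumerate(slot):
        # Any byte that is not a continuation byte starts a new character.
        if (byte & 0xC0) != 0x80:
            if chars == n:
                return pos
            chars += 1
    return len(slot)


slot = write_char_slot("你好", 3)          # 12-byte slot for char(3)
actual_len = byte_len_of_n_chars(slot, 3)  # bytes covering 3 characters
as_string = slot[:actual_len]
```

The scan is linear in the slot size, which is the cost that storing the
UTF-8 length in the slot would avoid.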

TODO:
Mention UTF-8 validation
Mention GetMaxStrLen()
Mention changed places

Tests:
 - Add tests for reading char(N) and varchar(N) columns in UTF8_MODE.
 - Add truncating/padding tests
 - Kudu only supports Varchar currently. Add special tests for Kudu.
 - Add tests for writing CHAR(N)/VARCHAR(N) in UTF-8 mode.

Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
---
M be/src/codegen/codegen-anyval.cc
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/llvm-codegen.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner-test.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-table-sink.cc
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-data-converter.h
M be/src/exec/parquet/parquet-plain-test.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.inline.h
M be/src/exprs/agg-fn-evaluator.cc
M be/src/exprs/anyval-util.cc
M be/src/exprs/anyval-util.h
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/scalar-expr-evaluator.cc
M be/src/exprs/scalar-fn-call.cc
M be/src/exprs/slot-ref.cc
M be/src/runtime/raw-value-ir.cc
M be/src/runtime/raw-value.cc
M be/src/runtime/raw-value.inline.h
M be/src/runtime/string-value.h
M be/src/runtime/tuple.cc
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/fe-support.cc
M be/src/service/hs2-util.cc
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M be/src/util/dict-encoding.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M be/src/util/tuple-row-compare.cc
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/kudu_create.test
A testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test
A testdata/workloads/functional-query/queries/QueryTest/utf8-chars-insert.test
A testdata/workloads/functional-query/queries/QueryTest/utf8-chars.test
M tests/query_test/test_utf8_strings.py
64 files changed, 1,035 insertions(+), 223 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/09/16909/15
--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
