Hello Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16908

to look at the new patch set (#10).

Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
......................................................................

IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

A Unicode character can be encoded into 1-4 bytes in UTF-8. String
functions return undesired results when the input contains multi-byte
Unicode characters, because they treat a string as a byte array. For
instance, length() returns the length in bytes, not in Unicode
characters.

UTF-8 is the dominant Unicode encoding used in the Hadoop ecosystem.
This patch adds UTF-8 support to some string functions so they can have
UTF-8 aware behavior. For compatibility with older versions, a new
query option, UTF8_MODE, is added for turning the UTF-8 aware behavior
on or off. Currently, only length(), substring() and reverse() support
it. Support for other functions will be added in later patches.

String functions check the query option and switch to the desired
implementation. This is similar to how the decimal_v2 query option is
used in builtin functions.

For easy testing, the UTF-8 aware versions of the string functions are
also exposed as builtin functions (named utf8_*, e.g. utf8_length).

Tests:
 - Add BE tests for utf8 functions.
 - Add e2e tests for the UTF8_MODE query option.

Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
---
M be/src/codegen/llvm-codegen.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M be/src/runtime/runtime-state.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/util/bit-util.h
M common/function-registry/impala_functions.py
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M testdata/datasets/functional/functional_schema_template.sql
A testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
A tests/query_test/test_utf8_strings.py
16 files changed, 361 insertions(+), 5 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/08/16908/10
--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
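
The byte-versus-character distinction described in the commit message can be illustrated with a minimal Python sketch. This is not the code in the patch (which is C++ in be/src/exprs/string-functions-ir.cc); the helper `_char_starts` and the Python functions below are hypothetical and only borrow the utf8_* naming for readability. The sketch assumes the input is valid UTF-8 and that substring positions are 1-based and counted in characters.

```python
def _char_starts(data: bytes) -> list:
    """Byte offsets at which a UTF-8 character begins (hypothetical helper)."""
    # Continuation bytes look like 10xxxxxx; any other byte starts a character.
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


def utf8_length(data: bytes) -> int:
    """Length in characters rather than bytes."""
    return len(_char_starts(data))


def utf8_substring(data: bytes, pos: int, length: int) -> bytes:
    """Substring with a 1-based start and a length, both in characters."""
    bounds = _char_starts(data) + [len(data)]  # add the end-of-string boundary
    num_chars = len(bounds) - 1
    if pos < 1 or pos > num_chars or length <= 0:
        return b""
    begin = pos - 1
    end = min(begin + length, num_chars)
    return data[bounds[begin]:bounds[end]]


def utf8_reverse(data: bytes) -> bytes:
    """Reverse character by character, keeping each multi-byte sequence intact."""
    bounds = _char_starts(data) + [len(data)]
    chars = [data[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
    return b"".join(reversed(chars))


if __name__ == "__main__":
    s = "你好, Impala".encode("utf-8")
    print(len(s), utf8_length(s))            # byte length vs. character length
    print(utf8_substring(s, 1, 2).decode())  # first two characters
    print(utf8_reverse(s).decode())          # still valid UTF-8 after reversal
```

A plain byte-wise reverse of the same input would scramble the three-byte sequences of the Chinese characters and produce invalid UTF-8, which is the kind of undesired result the UTF8_MODE option is meant to avoid.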