Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#3).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin string functions doing case conversion: upper,
lower, and initcap. Previously they only converted English alphabetic
characters. This patch adds support for Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.

1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
  +-------+--------+---------+
  |       | Dotted | Dotless |
  +-------+--------+---------+
  | Upper | İ      | I       |
  +-------+--------+---------+
  | Lower | i      | ı       |
  +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper
case: the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch currently uses Boost.Locale for case conversion. ICU
(International Components for Unicode) is not integrated yet since the
Boost in our native-toolchain is not built with ICU. So currently the
localization backend of Boost.Locale is iconv, and the above corner
cases are not handled. We will consider integrating ICU in a follow-up
JIRA.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/mask-functions-ir.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
7 files changed, 327 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/3
--
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
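
For readers who want to try the three corner cases from the commit message, here is a minimal sketch in Python. It is an illustration only: it uses Python's built-in Unicode case mapping, not Impala's upper()/lower()/initcap() and not Boost.Locale, and the helper ascii_upper is a hypothetical stand-in for the old English-only behaviour. Strings are written with escapes so the exact code points are unambiguous.

```python
# Illustrative sketch only: Python's built-in case mapping, not Impala code.


def ascii_upper(s: str) -> str:
    """Hypothetical helper: uppercase only a-z and leave everything else
    untouched, roughly what an English-only conversion would do."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


# 1) Locale sensitivity: Python's str.upper() takes no locale, so "i" maps
#    to the dotless "I". A Turkish-aware conversion would give U+0130.
turkish_upper = "i".upper()

# 2) Length change: U+00DF (sharp s) uppercases to the two letters "SS".
german = "gr\u00fc\u00dfen"          # the German word from the commit message
german_upper = german.upper()
ascii_only = ascii_upper(german)     # non-ASCII letters are left as-is

# 3) Context sensitivity: capital sigma lowercases to U+03C3 inside a word
#    and to U+03C2 (final sigma) at the end of a word.
greek = "\u1f48\u0394\u03a5\u03a3\u03a3\u0395\u038e\u03a3"
greek_lower = greek.lower()

if __name__ == "__main__":
    print(turkish_upper)
    print(german, "->", german_upper, "(ASCII-only:", ascii_only + ")")
    print(greek, "->", greek_lower)
```

Case 1 shows why a locale-unaware conversion is wrong for Turkish text, while cases 2 and 3 show the length change and the final-sigma rule that the commit message says need ICU before Impala can handle them.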