Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17785

to look at the new patch set (#6).

Change subject: IMPALA-2019(part-4): Add UTF-8 support for case conversion 
functions
......................................................................

IMPALA-2019(part-4): Add UTF-8 support for case conversion functions

There are 3 builtin string functions doing case conversion: upper,
lower, and initcap. Previously they only convert English alphabetic
characters. This patch adds support to deal with unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

This patch uses the simple wchar_t conversions of std::towupper and
stdd::towlower so the above cases are not handled. We will try
supporting them in follow-up JIRAs.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/string-functions.h
M common/function-registry/impala_functions.py
M 
testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
5 files changed, 164 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/85/17785/6
--
To view, visit http://gerrit.cloudera.org:8080/17785
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Gerrit-Change-Number: 17785
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>

Reply via email to