[impala] branch master updated: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

tarmstrong Mon, 25 Jan 2021 16:57:37 -0800

This is an automated email from the ASF dual-hosted git repository.

tarmstrong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git



The following commit(s) were added to refs/heads/master by this push:
     new e8720b4  IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
e8720b4 is described below

commit e8720b40f1b04712442dd9eb69cd603855eb6b8d
Author: stiga-huang <huangquanl...@gmail.com>
AuthorDate: Mon Jan 4 13:10:49 2021 +0800

    IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
    
    A Unicode character is encoded as 1-4 bytes in UTF-8. String functions
    return unexpected results when the input contains multi-byte characters,
    because they treat a string as a byte array. For instance, length()
    returns the length in bytes, not in Unicode characters.
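
The gap between byte length and character count can be illustrated with a minimal standalone sketch. It assumes well-formed UTF-8 input and uses the same start-byte test that the patch adds to BitUtil; the name CountUtf8Chars is hypothetical and is not part of Impala.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// A byte starts a UTF-8 character unless it is a continuation byte (10xxxxxx).
// Same bit test as BitUtil::IsUtf8StartByte in this patch.
constexpr bool IsUtf8StartByte(uint8_t b) { return (b & 0xC0) != 0x80; }

// Hypothetical helper: counts code points by counting start bytes.
// Assumes the input is well-formed UTF-8.
inline int CountUtf8Chars(const std::string& s) {
  int count = 0;
  for (unsigned char c : s) {
    if (IsUtf8StartByte(c)) ++count;
  }
  return count;
}
```

For the two-character string used in the tests (bytes e4 bd a0 e5 a5 bd), std::string::size() reports 6 while CountUtf8Chars reports 2.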
    
    UTF-8 is the dominant Unicode encoding used in the Hadoop ecosystem.
    This patch adds UTF-8 support to some string functions so they can have
    UTF-8 aware behavior. For compatibility with older versions, a new
    query option, UTF8_MODE, is added for turning the UTF-8 aware behavior
    on or off. Currently, only length(), substring() and reverse() support
    it. Support for other functions will be added in later patches.
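
To show why reverse() needs a UTF-8 aware variant, here is a rough standalone sketch of code-point reversal. It follows the same two-step idea as the patch (reverse all bytes, then restore the byte order inside each character) but scans forward with std::reverse, which is a simplification and not the patch's exact loop. Utf8ReverseSketch is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch: reverses a UTF-8 string by code points.
inline std::string Utf8ReverseSketch(const std::string& s) {
  // Step 1: reverse every byte. Each multi-byte character now has its
  // bytes in the wrong order, with its start byte last in the group.
  std::string out(s.rbegin(), s.rend());
  // Step 2: re-reverse each group that ends at a start byte.
  std::size_t group_begin = 0;
  for (std::size_t i = 0; i < out.size(); ++i) {
    const unsigned char b = static_cast<unsigned char>(out[i]);
    if ((b & 0xC0) != 0x80) {  // start byte closes the current group
      std::reverse(out.begin() + group_begin, out.begin() + i + 1);
      group_begin = i + 1;
    }
  }
  return out;
}
```

Like the patch, this operates on code points rather than grapheme clusters, so a combining mark ends up before its base letter after reversal.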
    
    String functions check the query option and switch to the matching
    implementation. This is similar to how the decimal_v2 query option is
    used in builtin functions.
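
The dispatch pattern can be sketched in isolation as follows. This is a simplified illustration only: SketchQueryOptions and the *Sketch functions are hypothetical stand-ins, whereas the patch itself reads the flag through FunctionContextImpl::GetConstFnAttr(FunctionContextImpl::UTF8_MODE).

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the per-query options; not Impala's TQueryOptions.
struct SketchQueryOptions {
  bool utf8_mode = false;  // mirrors the default of the new query option
};

// Legacy behavior: length in bytes.
inline int ByteLengthSketch(const std::string& s) {
  return static_cast<int>(s.size());
}

// UTF-8 aware behavior: length in code points (assumes well-formed input).
inline int Utf8LengthSketch(const std::string& s) {
  int count = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++count;  // count non-continuation bytes
  }
  return count;
}

// One entry point that picks the implementation from the option.
inline int LengthSketch(const SketchQueryOptions& opts, const std::string& s) {
  return opts.utf8_mode ? Utf8LengthSketch(s) : ByteLengthSketch(s);
}
```

With the default options the sketch returns the byte length, so existing queries keep their results unless the option is turned on.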
    
    For easy testing, the UTF-8 aware versions of the string functions are
    also exposed as builtin functions (named utf8_*, e.g. utf8_length).
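
The semantics that the tests expect from utf8_substring (1-indexed positions, negative positions counted from the end, lengths in characters) can be approximated by the sketch below. It is an assumption-laden illustration: it precomputes character offsets in a vector for clarity, whereas the patch scans the bytes in place without allocating, and Utf8SubstrSketch and CodePointOffsets are hypothetical names. Input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Byte offset of every code point start, followed by s.size() as a sentinel.
inline std::vector<std::size_t> CodePointOffsets(const std::string& s) {
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) offsets.push_back(i);
  }
  offsets.push_back(s.size());
  return offsets;
}

// Hypothetical sketch: pos is 1-indexed, a negative pos counts from the end,
// and len is measured in code points.
inline std::string Utf8SubstrSketch(const std::string& s, long long pos, long long len) {
  if (s.empty() || pos == 0 || len <= 0) return "";
  const std::vector<std::size_t> offsets = CodePointOffsets(s);
  const long long n = static_cast<long long>(offsets.size()) - 1;  // code point count
  const long long start = pos > 0 ? pos - 1 : n + pos;             // zero-based index
  if (start < 0 || start >= n) return "";                          // not enough characters
  const long long end = (len > n - start) ? n : start + len;       // clamp to the end
  const std::size_t from = offsets[static_cast<std::size_t>(start)];
  const std::size_t to = offsets[static_cast<std::size_t>(end)];
  return s.substr(from, to - from);
}
```

The offset table makes the index arithmetic easy to check against the test cases, at the cost of an extra allocation that the in-place scan in the patch avoids.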
    
    Tests:
     - Add BE tests for utf8 functions.
     - Add e2e tests for the UTF8_MODE query option.
    
    Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
    Reviewed-on: http://gerrit.cloudera.org:8080/16908
    Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
    Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
---
 be/src/codegen/llvm-codegen.cc                     |   3 +-
 be/src/exprs/expr-test.cc                          |  80 ++++++++++++++
 be/src/exprs/string-functions-ir.cc                | 103 ++++++++++++++++++
 be/src/exprs/string-functions.h                    |   6 ++
 be/src/runtime/runtime-state.h                     |   1 +
 be/src/service/query-options.cc                    |   4 +
 be/src/service/query-options.h                     |   3 +-
 be/src/udf/udf-internal.h                          |   4 +-
 be/src/udf/udf.cc                                  |  12 ++-
 be/src/util/bit-util.h                             |   5 +
 common/function-registry/impala_functions.py       |   6 ++
 common/thrift/ImpalaInternalService.thrift         |   3 +
 common/thrift/ImpalaService.thrift                 |   4 +
 .../functional/functional_schema_template.sql      |  15 +++
 .../queries/QueryTest/utf8-string-functions.test   | 116 +++++++++++++++++++++
 tests/query_test/test_utf8_strings.py              |  42 ++++++++
 16 files changed, 402 insertions(+), 5 deletions(-)

diff --git a/be/src/codegen/llvm-codegen.cc b/be/src/codegen/llvm-codegen.cc
index ac3db8f..98e5bb4 100644
--- a/be/src/codegen/llvm-codegen.cc
+++ b/be/src/codegen/llvm-codegen.cc
@@ -1010,7 +1010,8 @@ int LlvmCodeGen::InlineConstFnAttrs(const FunctionContext::TypeDesc& ret_type,
     DCHECK(state_ != nullptr);
     // All supported constants are currently integers.
     call_instr->replaceAllUsesWith(GetI32Constant(FunctionContextImpl::GetConstFnAttr(
-        state_->query_options().decimal_v2, ret_type, arg_types, t_val, i_val)));
+        state_->query_options().decimal_v2, state_->query_options().utf8_mode, ret_type,
+        arg_types, t_val, i_val)));
     call_instr->eraseFromParent();
     ++replaced;
   }
diff --git a/be/src/exprs/expr-test.cc b/be/src/exprs/expr-test.cc
index 58d98d0..9e801a4 100644
--- a/be/src/exprs/expr-test.cc
+++ b/be/src/exprs/expr-test.cc
@@ -10539,6 +10539,86 @@ TEST_P(ExprTest, MaskHashTest) {
   TestIsNull("mask_hash(cast('2016-04-20' as timestamp))", TYPE_TIMESTAMP);
 }
 
+TEST_P(ExprTest, Utf8Test) {
+  // Verifies utf8_length() counts length by UTF-8 characters instead of bytes.
+  // '你' and '好' are both encoded into 3 bytes.
+  TestIsNull("utf8_length(NULL)", TYPE_INT);
+  TestValue("utf8_length('你好')", TYPE_INT, 2);
+  TestValue("utf8_length('你好hello')", TYPE_INT, 7);
+  TestValue("utf8_length('你好 hello 你好')", TYPE_INT, 11);
+  TestValue("utf8_length('hello')", TYPE_INT, 5);
+
+  // Verifies position and length of utf8_substring() are UTF-8 aware.
+  // '你' and '好' are both encoded into 3 bytes.
+  TestStringValue("utf8_substring('Hello', 1)", "Hello");
+  TestStringValue("utf8_substring('Hello', -2)", "lo");
+  TestStringValue("utf8_substring('Hello', cast(0 as bigint))", "");
+  TestStringValue("utf8_substring('Hello', -5)", "Hello");
+  TestStringValue("utf8_substring('Hello', cast(-6 as bigint))", "");
+  TestStringValue("utf8_substring('Hello', 100)", "");
+  TestStringValue("utf8_substring('Hello', -100)", "");
+  TestIsNull("utf8_substring(NULL, 100)", TYPE_STRING);
+  TestIsNull("utf8_substring('Hello', NULL)", TYPE_STRING);
+  TestIsNull("utf8_substring(NULL, NULL)", TYPE_STRING);
+  TestStringValue("utf8_substring('Hello', 1, 1)", "H");
+  TestStringValue("utf8_substring('Hello', cast(2 as bigint), 100)", "ello");
+  TestStringValue("utf8_substring('Hello', -3, cast(2 as bigint))", "ll");
+  TestStringValue("utf8_substring('Hello', 1, 0)", "");
+  TestStringValue("utf8_substring('Hello', cast(1 as bigint), cast(-1 as bigint))", "");
+  TestIsNull("utf8_substring(NULL, 1, 100)", TYPE_STRING);
+  TestIsNull("utf8_substring('Hello', NULL, 100)", TYPE_STRING);
+  TestIsNull("utf8_substring('Hello', 1, NULL)", TYPE_STRING);
+  TestIsNull("utf8_substring(NULL, NULL, NULL)", TYPE_STRING);
+  TestStringValue("utf8_substring('你好', 0)", "");
+  TestStringValue("utf8_substring('你好', 1)", "你好");
+  TestStringValue("utf8_substring('你好', 2)", "好");
+  TestStringValue("utf8_substring('你好', 3)", "");
+  TestStringValue("utf8_substring('你好', 0, 1)", "");
+  TestStringValue("utf8_substring('你好', 1, 0)", "");
+  TestStringValue("utf8_substring('你好', 1, 1)", "你");
+  TestStringValue("utf8_substring('你好', 1, -1)", "");
+  TestStringValue("utf8_substring('你好hello', 1, 4)", "你好he");
+  TestStringValue("utf8_substring('hello你好', 2, 5)", "ello你");
+  TestStringValue("utf8_substring('你好hello你好', -1)", "好");
+  TestStringValue("utf8_substring('你好hello你好', -2)", "你好");
+  TestStringValue("utf8_substring('你好hello你好', -3)", "o你好");
+  TestStringValue("utf8_substring('你好hello你好', -7)", "hello你好");
+  TestStringValue("utf8_substring('你好hello你好', -8)", "好hello你好");
+  TestStringValue("utf8_substring('你好hello你好', -9)", "你好hello你好");
+  TestStringValue("utf8_substring('你好hello你好', -10)", "");
+  TestStringValue("utf8_substring('你好hello你好', -3, cast(2 as bigint))", "o你");
+  TestStringValue("utf8_substring('你好hello你好', -1, -1)", "");
+
+  // Verifies utf8_reverse() reverses the UTF-8 characters (code points).
+  // '你' and '好' are both encoded into 3 bytes.
+  TestIsNull("utf8_reverse(NULL)", TYPE_STRING);
+  TestStringValue("utf8_reverse('hello')", "olleh");
+  TestStringValue("utf8_reverse('')", "");
+  TestStringValue("utf8_reverse('你好')", "好你");
+  TestStringValue("utf8_reverse('你好hello')", "olleh好你");
+  TestStringValue("utf8_reverse('hello你好')", "好你olleh");
+  TestStringValue("utf8_reverse('你好hello你好')", "好你olleh好你");
+  TestStringValue("utf8_reverse('hello你好hello')", "olleh好你olleh");
+  // '🙂' is encoded into 1 code point (U+1F642) and finally encoded into 4 bytes.
+  TestStringValue("utf8_reverse('hello🙂')", "🙂olleh");
+  // Verifies utf8_reverse() reverses code points instead of grapheme clusters.
+  // 'ñ' can be encoded into 1-2 code points depending on the normalization.
+  // In NFC, it's U+00F1 which is encoded into 2 bytes (0xc3 0xb1).
+  // In NFD, it's 'n' and U+0303 which are finally encoded into 3 bytes (0x6e for 'n' and
+  // 0xcc 0x83 for the '~'). Here "n\u0303" is a grapheme cluster.
+  TestStringValue("utf8_reverse('ma\u00f1ana')", "ana\u00f1am");
+  TestStringValue("utf8_reverse('man\u0303ana')", "ana\u0303nam");
+  // NFC(default in Linux) is used in this file so the following test can pass.
+  TestStringValue("utf8_reverse('mañana')", "anañam");
+  // "\u0928\u0940" is a grapheme cluster, same as "\u0ba8\u0bbf" and
+  // "\U0001f468\u200d\U0001f468\u200d\U0001f467\u200d\U0001f467".
+  // The string is reversed in code points.
+  TestStringValue("utf8_reverse('\u0928\u0940\u0ba8\u0bbf"
+      "\U0001f468\u200d\U0001f468\u200d\U0001f467\u200d\U0001f467')",
+      "\U0001f467\u200d\U0001f467\u200d\U0001f468\u200d\U0001f468"
+      "\u0bbf\u0ba8\u0940\u0928");
+}
+
 } // namespace impala
 
 INSTANTIATE_TEST_CASE_P(Instantiations, ExprTest, ::testing::Values(
diff --git a/be/src/exprs/string-functions-ir.cc b/be/src/exprs/string-functions-ir.cc
index 55fe239..cc85eb0 100644
--- a/be/src/exprs/string-functions-ir.cc
+++ b/be/src/exprs/string-functions-ir.cc
@@ -55,6 +55,9 @@ const char* ERROR_CHARACTER_LIMIT_EXCEEDED =
 StringVal StringFunctions::Substring(FunctionContext* context,
     const StringVal& str, const BigIntVal& pos, const BigIntVal& len) {
   if (str.is_null || pos.is_null || len.is_null) return StringVal::null();
+  if (context->impl()->GetConstFnAttr(FunctionContextImpl::UTF8_MODE)) {
+    return Utf8Substring(context, str, pos, len);
+  }
   int fixed_pos = pos.val;
   if (fixed_pos < 0) fixed_pos = str.len + fixed_pos + 1;
   int max_len = str.len - fixed_pos + 1;
@@ -72,6 +75,60 @@ StringVal StringFunctions::Substring(FunctionContext* context,
   return Substring(context, str, pos, BigIntVal(INT32_MAX));
 }
 
+StringVal StringFunctions::Utf8Substring(FunctionContext* context, const StringVal& str,
+    const BigIntVal& pos) {
+  return Utf8Substring(context, str, pos, BigIntVal(INT32_MAX));
+}
+
+StringVal StringFunctions::Utf8Substring(FunctionContext* context, const StringVal& str,
+    const BigIntVal& pos, const BigIntVal& len) {
+  if (str.is_null || pos.is_null || len.is_null) return StringVal::null();
+  if (str.len == 0 || pos.val == 0 || len.val <= 0) return StringVal();
+
+  int byte_pos;
+  int utf8_cnt = 0;
+  // pos.val starts at 1 (1-indexed positions).
+  if (pos.val > 0) {
+    // Seek to the start byte of the pos-th UTF-8 character.
+    for (byte_pos = 0; utf8_cnt < pos.val && byte_pos < str.len; ++byte_pos) {
+      if (BitUtil::IsUtf8StartByte(str.ptr[byte_pos])) ++utf8_cnt;
+    }
+    // Not enough UTF-8 characters.
+    if (utf8_cnt < pos.val) return StringVal();
+    // Back to the start byte of the pos-th UTF-8 character.
+    --byte_pos;
+    int byte_start = byte_pos;
+    // Seek to the end until we get enough UTF-8 characters.
+    for (utf8_cnt = 0; utf8_cnt < len.val && byte_pos < str.len; ++byte_pos) {
+      if (BitUtil::IsUtf8StartByte(str.ptr[byte_pos])) ++utf8_cnt;
+    }
+    if (utf8_cnt == len.val) {
+      // We are now at the middle byte of the last UTF-8 character. Seek to the end of it.
+      while (byte_pos < str.len && !BitUtil::IsUtf8StartByte(str.ptr[byte_pos])) {
+        ++byte_pos;
+      }
+    }
+    return StringVal(str.ptr + byte_start, byte_pos - byte_start);
+  }
+  // pos.val is negative. Seek from the end of the string.
+  int byte_end = str.len;
+  utf8_cnt = 0;
+  byte_pos = str.len - 1;
+  while (utf8_cnt < -pos.val && byte_pos >= 0) {
+    if (BitUtil::IsUtf8StartByte(str.ptr[byte_pos])) {
+      ++utf8_cnt;
+      // Remember the end of the substring's last UTF-8 character.
+      if (utf8_cnt > 0 && utf8_cnt == -pos.val - len.val) byte_end = byte_pos;
+    }
+    --byte_pos;
+  }
+  // Not enough UTF-8 characters.
+  if (utf8_cnt < -pos.val) return StringVal();
+  // Back to the start byte of the substring's first UTF-8 character.
+  ++byte_pos;
+  return StringVal(str.ptr + byte_pos, byte_end - byte_pos);
+}
+
 // This behaves identically to the mysql implementation.
 StringVal StringFunctions::Left(
     FunctionContext* context, const StringVal& str, const BigIntVal& len) {
@@ -195,6 +252,9 @@ StringVal StringFunctions::Rpad(FunctionContext* context, const StringVal& str,
 
 IntVal StringFunctions::Length(FunctionContext* context, const StringVal& str) {
   if (str.is_null) return IntVal::null();
+  if (context->impl()->GetConstFnAttr(FunctionContextImpl::UTF8_MODE, 0)) {
+    return Utf8Length(context, str);
+  }
   return IntVal(str.len);
 }
 
@@ -205,6 +265,15 @@ IntVal StringFunctions::CharLength(FunctionContext* context, const StringVal& st
   return StringValue::UnpaddedCharLength(reinterpret_cast<char*>(str.ptr), t->len);
 }
 
+IntVal StringFunctions::Utf8Length(FunctionContext* context, const StringVal& str) {
+  if (str.is_null) return IntVal::null();
+  int len = 0;
+  for (int i = 0; i < str.len; ++i) {
+    if (BitUtil::IsUtf8StartByte(str.ptr[i])) ++len;
+  }
+  return IntVal(len);
+}
+
 StringVal StringFunctions::Lower(FunctionContext* context, const StringVal& str) {
   if (str.is_null) return StringVal::null();
   StringVal result(context, str.len);
@@ -407,12 +476,46 @@ StringVal StringFunctions::Replace(FunctionContext* context, const StringVal& st
 
 StringVal StringFunctions::Reverse(FunctionContext* context, const StringVal& str) {
   if (str.is_null) return StringVal::null();
+  if (context->impl()->GetConstFnAttr(FunctionContextImpl::UTF8_MODE)) {
+    return Utf8Reverse(context, str);
+  }
   StringVal result(context, str.len);
   if (UNLIKELY(result.is_null)) return StringVal::null();
   BitUtil::ByteSwap(result.ptr, str.ptr, str.len);
   return result;
 }
 
+static inline void InPlaceReverse(uint8_t* ptr, int len) {
+  for (int i = 0, j = len - 1; i < j; ++i, --j) {
+    uint8_t tmp = ptr[i];
+    ptr[i] = ptr[j];
+    ptr[j] = tmp;
+  }
+}
+
+// Returns a string with the UTF-8 characters (code points) in reverse order. Note that
+// this function operates on Unicode code points and not user visible characters (or
+// grapheme clusters). This is consistent with other systems, e.g. Hive, SparkSQL.
+StringVal StringFunctions::Utf8Reverse(FunctionContext* context, const StringVal& str) {
+  if (str.is_null) return StringVal::null();
+  if (str.len == 0) return StringVal();
+  StringVal result(context, str.len);
+  if (UNLIKELY(result.is_null)) return StringVal::null();
+  // First make a copy of the reversed string.
+  BitUtil::ByteSwap(result.ptr, str.ptr, str.len);
+  // Then reverse bytes inside each UTF-8 character.
+  int last = result.len;
+  for (int i = result.len - 1; i >= 0; --i) {
+    if (BitUtil::IsUtf8StartByte(result.ptr[i])) {
+      // Only reverse bytes of a UTF-8 character
+      if (last - i > 1) InPlaceReverse(result.ptr + i + 1, last - i);
+      last = i;
+    }
+  }
+  if (last > 0) InPlaceReverse(result.ptr, last + 1);
+  return result;
+}
+
 StringVal StringFunctions::Translate(FunctionContext* context, const StringVal& str,
     const StringVal& src, const StringVal& dst) {
   if (str.is_null || src.is_null || dst.is_null) return StringVal::null();
diff --git a/be/src/exprs/string-functions.h b/be/src/exprs/string-functions.h
index 09a16b5..aa0544a 100644
--- a/be/src/exprs/string-functions.h
+++ b/be/src/exprs/string-functions.h
@@ -58,6 +58,10 @@ class StringFunctions {
       const BigIntVal& len);
   static StringVal Substring(FunctionContext*, const StringVal& str,
       const BigIntVal& pos);
+  static StringVal Utf8Substring(FunctionContext*, const StringVal& str,
+      const BigIntVal& pos, const BigIntVal& len);
+  static StringVal Utf8Substring(FunctionContext*, const StringVal& str,
+      const BigIntVal& pos);
   static StringVal SplitPart(FunctionContext* context, const StringVal& str,
       const StringVal& delim, const BigIntVal& field);
   static StringVal Left(FunctionContext*, const StringVal& str, const BigIntVal& len);
@@ -70,6 +74,7 @@ class StringFunctions {
       const StringVal& pad);
   static IntVal Length(FunctionContext*, const StringVal& str);
   static IntVal CharLength(FunctionContext*, const StringVal& str);
+  static IntVal Utf8Length(FunctionContext*, const StringVal& str);
   static StringVal Lower(FunctionContext*, const StringVal& str);
   static StringVal Upper(FunctionContext*, const StringVal& str);
   static StringVal InitCap(FunctionContext*, const StringVal& str);
@@ -78,6 +83,7 @@ class StringFunctions {
   static StringVal Replace(FunctionContext*, const StringVal& str,
       const StringVal& pattern, const StringVal& replace);
   static StringVal Reverse(FunctionContext*, const StringVal& str);
+  static StringVal Utf8Reverse(FunctionContext*, const StringVal& str);
   static StringVal Translate(FunctionContext*, const StringVal& str, const StringVal& src,
       const StringVal& dst);
   static StringVal Trim(FunctionContext*, const StringVal& str);
diff --git a/be/src/runtime/runtime-state.h b/be/src/runtime/runtime-state.h
index fce5a95..948fb52 100644
--- a/be/src/runtime/runtime-state.h
+++ b/be/src/runtime/runtime-state.h
@@ -102,6 +102,7 @@ class RuntimeState {
   int batch_size() const { return query_options().batch_size; }
   bool abort_on_error() const { return query_options().abort_on_error; }
   bool strict_mode() const { return query_options().strict_mode; }
+  bool utf8_mode() const { return query_options().utf8_mode; }
   bool decimal_v2() const { return query_options().decimal_v2; }
   const TQueryCtx& query_ctx() const;
   const TPlanFragment& fragment() const { return *fragment_; }
diff --git a/be/src/service/query-options.cc b/be/src/service/query-options.cc
index 96c8809..c7b3a3c 100644
--- a/be/src/service/query-options.cc
+++ b/be/src/service/query-options.cc
@@ -1017,6 +1017,10 @@ Status impala::SetQueryOption(const string& key, const string& value,
         query_options->__set_join_rows_produced_limit(join_rows_produced_limit);
         break;
       }
+      case TImpalaQueryOptions::UTF8_MODE: {
+        query_options->__set_utf8_mode(IsTrue(value));
+        break;
+      }
       default:
         if (IsRemovedQueryOption(key)) {
           LOG(WARNING) << "Ignoring attempt to set removed query option '" << key << "'";
diff --git a/be/src/service/query-options.h b/be/src/service/query-options.h
index e23805c..bd787aa 100644
--- a/be/src/service/query-options.h
+++ b/be/src/service/query-options.h
@@ -47,7 +47,7 @@ typedef std::unordered_map<string, beeswax::TQueryOptionLevel::type>
 // time we add or remove a query option to/from the enum TImpalaQueryOptions.
 #define QUERY_OPTS_TABLE\
   DCHECK_EQ(_TImpalaQueryOptions_VALUES_TO_NAMES.size(),\
-      TImpalaQueryOptions::JOIN_ROWS_PRODUCED_LIMIT + 1);\
+      TImpalaQueryOptions::UTF8_MODE + 1);\
   REMOVED_QUERY_OPT_FN(abort_on_default_limit_exceeded, ABORT_ON_DEFAULT_LIMIT_EXCEEDED)\
   QUERY_OPT_FN(abort_on_error, ABORT_ON_ERROR, TQueryOptionLevel::REGULAR)\
   REMOVED_QUERY_OPT_FN(allow_unsupported_formats, ALLOW_UNSUPPORTED_FORMATS)\
@@ -231,6 +231,7 @@ typedef std::unordered_map<string, beeswax::TQueryOptionLevel::type>
       TQueryOptionLevel::ADVANCED)\
   QUERY_OPT_FN(join_rows_produced_limit, JOIN_ROWS_PRODUCED_LIMIT,\
       TQueryOptionLevel::ADVANCED)\
+  QUERY_OPT_FN(utf8_mode, UTF8_MODE, TQueryOptionLevel::DEVELOPMENT)\
   ;
 
 /// Enforce practical limits on some query options to avoid undesired query state.
diff --git a/be/src/udf/udf-internal.h b/be/src/udf/udf-internal.h
index 0c820a1..9a212d1 100644
--- a/be/src/udf/udf-internal.h
+++ b/be/src/udf/udf-internal.h
@@ -154,6 +154,8 @@ class FunctionContextImpl {
     ARG_TYPE_SCALE, // int[]
     /// True if decimal_v2 query option is set.
     DECIMAL_V2,
+    /// True if utf8_mode query option is set.
+    UTF8_MODE,
   };
 
   /// This function returns the various static attributes of the UDF/UDA. Calls to this
@@ -170,7 +172,7 @@ class FunctionContextImpl {
   int GetConstFnAttr(ConstFnAttr t, int i = -1);
 
   /// Return the function attribute 't' defined in ConstFnAttr above.
-  static int GetConstFnAttr(bool uses_decimal_v2,
+  static int GetConstFnAttr(bool uses_decimal_v2, bool is_utf8_mode,
       const impala_udf::FunctionContext::TypeDesc& return_type,
       const std::vector<impala_udf::FunctionContext::TypeDesc>& arg_types, ConstFnAttr t,
       int i = -1);
diff --git a/be/src/udf/udf.cc b/be/src/udf/udf.cc
index 7865a90..8410c0f 100644
--- a/be/src/udf/udf.cc
+++ b/be/src/udf/udf.cc
@@ -99,6 +99,11 @@ class RuntimeState {
     return false;
   }
 
+  bool utf8_mode() const {
+    assert(false);
+    return false;
+  }
+
   bool LogError(const std::string& error) {
     assert(false);
     return false;
@@ -558,10 +563,11 @@ static int GetTypeByteSize(const FunctionContext::TypeDesc& type) {
 }
 
 int FunctionContextImpl::GetConstFnAttr(FunctionContextImpl::ConstFnAttr t, int i) {
-  return GetConstFnAttr(state_->decimal_v2(), return_type_, arg_types_, t, i);
+  return GetConstFnAttr(state_->decimal_v2(), state_->utf8_mode(), return_type_,
+      arg_types_, t, i);
 }
 
-int FunctionContextImpl::GetConstFnAttr(bool uses_decimal_v2,
+int FunctionContextImpl::GetConstFnAttr(bool uses_decimal_v2, bool is_utf8_mode,
     const FunctionContext::TypeDesc& return_type,
     const vector<FunctionContext::TypeDesc>& arg_types, ConstFnAttr t, int i) {
   switch (t) {
@@ -592,6 +598,8 @@ int FunctionContextImpl::GetConstFnAttr(bool uses_decimal_v2,
       return arg_types[i].scale;
     case DECIMAL_V2:
       return uses_decimal_v2;
+    case UTF8_MODE:
+      return is_utf8_mode;
     default:
       assert(false);
       return -1;
diff --git a/be/src/util/bit-util.h b/be/src/util/bit-util.h
index 2ab783b..94b9a86 100644
--- a/be/src/util/bit-util.h
+++ b/be/src/util/bit-util.h
@@ -120,6 +120,11 @@ class BitUtil {
   /// Returns the rounded down to 64 multiple.
   constexpr static inline uint32_t RoundDownNumi64(uint32_t bits) { return bits >> 6; }
 
+  /// Returns whether the given byte is the start byte of a UTF-8 character.
+  constexpr static inline bool IsUtf8StartByte(uint8_t b) {
+    return (b & 0xC0) != 0x80;
+  }
+
   /// Non hw accelerated pop count.
   /// TODO: we don't use this in any perf sensitive code paths currently.  There
   /// might be a much faster way to implement this.
diff --git a/common/function-registry/impala_functions.py b/common/function-registry/impala_functions.py
index 77e057e..d95424f 100644
--- a/common/function-registry/impala_functions.py
+++ b/common/function-registry/impala_functions.py
@@ -493,6 +493,10 @@ visible_functions = [
    'impala::StringFunctions::Substring'],
   [['substr', 'substring'], 'STRING', ['STRING', 'BIGINT', 'BIGINT'],
    'impala::StringFunctions::Substring'],
+  [['utf8_substr', 'utf8_substring'], 'STRING', ['STRING', 'BIGINT'],
+   'impala::StringFunctions::Utf8Substring'],
+  [['utf8_substr', 'utf8_substring'], 'STRING', ['STRING', 'BIGINT', 'BIGINT'],
+   'impala::StringFunctions::Utf8Substring'],
   [['split_part'], 'STRING', ['STRING', 'STRING', 'BIGINT'],
    'impala::StringFunctions::SplitPart'],
   [['base64encode'], 'STRING', ['STRING'], 'impala::StringFunctions::Base64Encode'],
@@ -507,6 +511,7 @@ visible_functions = [
   [['length'], 'INT', ['CHAR'], 'impala::StringFunctions::CharLength'],
   [['char_length'], 'INT', ['STRING'], 'impala::StringFunctions::Length'],
   [['character_length'], 'INT', ['STRING'], 'impala::StringFunctions::Length'],
+  [['utf8_length'], 'INT', ['STRING'], 'impala::StringFunctions::Utf8Length'],
   [['lower', 'lcase'], 'STRING', ['STRING'], 'impala::StringFunctions::Lower'],
   [['upper', 'ucase'], 'STRING', ['STRING'], 'impala::StringFunctions::Upper'],
   [['initcap'], 'STRING', ['STRING'], 'impala::StringFunctions::InitCap'],
@@ -514,6 +519,7 @@ visible_functions = [
    '_ZN6impala15StringFunctions14ReplacePrepareEPN10impala_udf15FunctionContextENS2_18FunctionStateScopeE',
    '_ZN6impala15StringFunctions12ReplaceCloseEPN10impala_udf15FunctionContextENS2_18FunctionStateScopeE'],
   [['reverse'], 'STRING', ['STRING'], 'impala::StringFunctions::Reverse'],
+  [['utf8_reverse'], 'STRING', ['STRING'], 'impala::StringFunctions::Utf8Reverse'],
   [['translate'], 'STRING', ['STRING', 'STRING', 'STRING'],
    'impala::StringFunctions::Translate'],
   [['trim'], 'STRING', ['STRING'], 'impala::StringFunctions::Trim',
diff --git a/common/thrift/ImpalaInternalService.thrift b/common/thrift/ImpalaInternalService.thrift
index 6b63b97..781e307 100644
--- a/common/thrift/ImpalaInternalService.thrift
+++ b/common/thrift/ImpalaInternalService.thrift
@@ -477,6 +477,9 @@ struct TQueryOptions {
 
   // See comment in ImpalaService.thrift
   120: optional i64 join_rows_produced_limit = 0;
+
+  // See comment in ImpalaService.thrift
+  121: optional bool utf8_mode = false;
 }
 
 // Impala currently has two types of sessions: Beeswax and HiveServer2
diff --git a/common/thrift/ImpalaService.thrift b/common/thrift/ImpalaService.thrift
index 0561ae5..521a604 100644
--- a/common/thrift/ImpalaService.thrift
+++ b/common/thrift/ImpalaService.thrift
@@ -617,6 +617,10 @@ enum TImpalaQueryOptions {
   // canceled if the query is still executing after this limit is hit. A value
   // of 0 means there is no limit on the number of join rows produced.
   JOIN_ROWS_PRODUCED_LIMIT = 119
+
+  // If true, strings are processed in a UTF-8 aware way, e.g. counting lengths by UTF-8
+  // characters instead of bytes.
+  UTF8_MODE = 120
 }
 
 // The summary of a DML statement.
diff --git a/testdata/datasets/functional/functional_schema_template.sql b/testdata/datasets/functional/functional_schema_template.sql
index d36dd94..c00cd88 100644
--- a/testdata/datasets/functional/functional_schema_template.sql
+++ b/testdata/datasets/functional/functional_schema_template.sql
@@ -3075,3 +3075,18 @@ AS SELECT * FROM {db_name}{db_suffix}.alltypes_date_partition_2 [convert_limit_t
 where [always_true] date_col = cast(timestamp_col as date) and int_col in (select int_col from {db_name}{db_suffix}.alltypessmall);
 ---- LOAD
 ====
+---- DATASET
+functional
+---- BASE_TABLE_NAME
+utf8_str_tiny
+---- COLUMNS
+id int
+name string
+---- DEPENDENT_LOAD_HIVE
+INSERT OVERWRITE TABLE {db_name}{db_suffix}.{table_name}
+SELECT id, name FROM {db_name}.{table_name};
+---- LOAD
+INSERT OVERWRITE TABLE {db_name}{db_suffix}.{table_name} VALUES
+  (1, "张三"), (2, "李四"), (3, "王五"), (4, "李小龙"), (5, "Alice"),
+  (6, "陈Bob"), (7, "Бopиc"), (8, "Jörg"), (9, "ひなた"), (10, "서연");
+====
diff --git a/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test b/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
new file mode 100644
index 0000000..8d6f070
--- /dev/null
+++ b/testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
@@ -0,0 +1,116 @@
+====
+---- QUERY
+set utf8_mode=true;
+select length('你好'), length('你好hello'), length('你好 hello 你好')
+---- RESULTS
+2,7,11
+---- TYPES
+INT,INT,INT
+====
+---- QUERY
+set utf8_mode=false;
+select length('你好'), length('你好hello'), length('你好 hello 你好')
+---- RESULTS
+6,11,19
+---- TYPES
+INT,INT,INT
+====
+---- QUERY
+set utf8_mode=true;
+select substring('你好hello', 1, 3)
+---- RESULTS: RAW_STRING
+'你好h'
+---- TYPES
+STRING
+====
+---- QUERY
+set utf8_mode=false;
+select substring('你好hello', 1, 3)
+---- RESULTS: RAW_STRING
+'你'
+---- TYPES
+STRING
+====
+---- QUERY
+set utf8_mode=true;
+select reverse('你好hello你好');
+---- RESULTS: RAW_STRING
+'好你olleh好你'
+---- TYPES
+STRING
+====
+---- QUERY
+set utf8_mode=off;
+select id, length(name), substring(name, 1, 3), length(substring(name, 1, 3)) from utf8_str_tiny
+---- RESULTS: RAW_STRING
+1,6,'张',3
+2,6,'李',3
+3,6,'王',3
+4,9,'李',3
+5,5,'Ali',3
+6,6,'陈',3
+7,7,'Бo',3
+8,5,'Jö',3
+9,9,'ひ',3
+10,6,'서',3
+---- TYPES
+INT,INT,STRING,INT
+====
+---- QUERY
+set utf8_mode=true;
+select id, length(name), substring(name, 1, 2), reverse(name) from utf8_str_tiny
+---- RESULTS: RAW_STRING
+1,2,'张三','三张'
+2,2,'李四','四李'
+3,2,'王五','五王'
+4,3,'李小','龙小李'
+5,5,'Al','ecilA'
+6,4,'陈B','boB陈'
+7,5,'Бo','cиpoБ'
+8,4,'Jö','gröJ'
+9,3,'ひな','たなひ'
+10,2,'서연','연서'
+---- TYPES
+INT,INT,STRING,STRING
+====
+---- QUERY
+# Test utf8 functions in where clause.
+set utf8_mode=true;
+select id, name from functional.utf8_str_tiny
+where length(name) = 2 and substring(name, 1, 1) = '李';
+---- RESULTS: RAW_STRING
+2,'李四'
+---- TYPES
+INT,STRING
+====
+---- QUERY
+# Test utf8 functions in group by clause. group_concat() may produce undetermined results
+# due to the order. Here we wrap it with length().
+set utf8_mode=true;
+select substring(name, 1, 1), length(group_concat(name)) from functional.utf8_str_tiny
+group by substring(name, 1, 1);
+---- RESULTS: RAW_STRING
+'A',5
+'ひ',3
+'陈',4
+'王',2
+'张',2
+'서',2
+'J',4
+'Б',5
+'李',7
+---- TYPES
+STRING,INT
+====
+---- QUERY
+# Test utf8 functions in group by and having clauses. group_concat() may produce
+# undetermined results due to the order. Here we wrap it with length().
+set utf8_mode=true;
+select substring(name, 1, 1), length(group_concat(name)) from functional.utf8_str_tiny
+group by substring(name, 1, 1)
+having length(group_concat(name)) = 7;
+---- RESULTS: RAW_STRING
+'李',7
+---- TYPES
+STRING,INT
+====
diff --git a/tests/query_test/test_utf8_strings.py b/tests/query_test/test_utf8_strings.py
new file mode 100644
index 0000000..3221eb8
--- /dev/null
+++ b/tests/query_test/test_utf8_strings.py
@@ -0,0 +1,42 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from tests.common.impala_test_suite import ImpalaTestSuite
+from tests.common.test_dimensions import (create_exec_option_dimension,
+    create_client_protocol_dimension, hs2_parquet_constraint)
+
+
+class TestUtf8StringFunctions(ImpalaTestSuite):
+  @classmethod
+  def get_workload(cls):
+    return 'functional-query'
+
+  @classmethod
+  def add_test_dimensions(cls):
+    super(TestUtf8StringFunctions, cls).add_test_dimensions()
+    cls.ImpalaTestMatrix.add_dimension(
+      create_exec_option_dimension(disable_codegen_options=[False, True]))
+    cls.ImpalaTestMatrix.add_constraint(lambda v:
+        v.get_value('table_format').file_format in ['parquet'] and
+        v.get_value('table_format').compression_codec in ['none'])
+    # Run these queries through both beeswax and HS2 to get coverage of CHAR/VARCHAR
+    # returned via both protocols.
+    cls.ImpalaTestMatrix.add_dimension(create_client_protocol_dimension())
+    cls.ImpalaTestMatrix.add_constraint(hs2_parquet_constraint)
+
+  def test_string_functions(self, vector):
+    self.run_test_case('QueryTest/utf8-string-functions', vector)
