Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 1: (16 comments) Looks good. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@273 PS1, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt; nit. A better algorithm would be to use utf8CodePointLen() to skip non-starting bytes. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@673 PS1, Line 673: int search_start_pos = start_position.val - 1; : if (utf8_mode) { : search_start_pos = FindUtf8Pos(str.ptr, str.len, search_start_pos); : } nit. may combine into one expression to avoid extra computation when in UTF8 mode: int serach_start_pos = (utf8_mode) ? FindUtf8Pos(str.ptr, str.len, search_start_pos) : start_position.val - 1; http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@688 PS1, Line 688: int search_start_pos = haystack.len + start_position.val; : if (utf8_mode) { : search_start_pos = FindLastUtf8Pos(str.ptr, str.len, -start_position.val); : } nit. same as above (to combine). http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@708 PS1, Line 708: if (!utf8_mode) return IntVal(match_pos + 1); nit. to combine. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@710 PS1, Line 710: chars (code points) nit. Unicode characters in UTF8 encoding. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt File be/src/util/CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt@221 PS1, Line 221: ADD_UNIFIED_BE_LSAN_TEST(string-util-test "TruncateDownTest.*:TruncateUpTest.*:CommaSeparatedContainsTest.*:FindUtf8PosTest.*:FindLastUtf8PosTest.*") Good to know! http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h File be/src/util/bit-util.h: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132 PS1, Line 132: Utf8CodePointLen nit. Probably should be named as UTF8EncodeLen() as it returns the length of a Unicode character in UTF8 encoding. Code point normally implies the logical value of a Unicode character. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@127 PS1, Line 127: EXPECT_EQ(10, FindUtf8Pos("李小龙 \x93 ", 12, 4)); nit. Add a new test case FindUtf8Pos("??? \x93 ", 12, 6) http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@128 PS1, Line 128: } nit. Add a couple test cases with a combination of UTF8 characters in 1, 2, and 4 bytes. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@144 PS1, Line 144: } nit. Add a couple test cases with a combination of UTF8 characters in 1,2 and 4 bytes. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h File be/src/util/string-util.h: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@56 PS1, Line 56: /// position of it. If there are not enough characters, return 'str_len'. E.g. nit. May need to describe 'index' here as follows. The input string 'str_ptr' is viewed logically as sequence of Unicode characters encoded in UTF8. 'index' is the index (0-based) of Unicode character whose byte position in 'str_ptr' is to be found. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@60 PS1, Line 60: /// - If the string is "你好" and index > 1, returns 6. May add one more example. "??Bob" and index is 4. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@63 PS1, Line 63: FindUtf8Pos nit. Maybe named as FindUTF8PosForwared(). http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@67 PS1, Line 67: 'index' must be positive and 1 nit. Making the index 0-based is consistent with the above function. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@75 PS1, Line 75: FindLastUtf8Pos nit. Maybe named as FindUTF8PosBackward() http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc@99 PS1, Line 99: pos += BitUtil::Utf8CodePointLen(ptr[pos]); We may need to call IsUtf8StartByte() first to handle non-UTF8 character situation. -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 1 Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Comment-Date: Mon, 21 Jun 2021 15:49:25 +0000 Gerrit-HasComments: Yes