Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 1:

(16 comments)

Looks good.

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@273
PS1, Line 273:  if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt;
nit. A better algorithm would be to use utf8CodePointLen() to skip non-starting 
bytes.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@673
PS1, Line 673:     int search_start_pos = start_position.val - 1;
             :     if (utf8_mode) {
             :       search_start_pos = FindUtf8Pos(str.ptr, str.len, 
search_start_pos);
             :     }
nit. may combine into one expression to avoid extra computation when in UTF8 
mode:

int serach_start_pos = (utf8_mode) ? FindUtf8Pos(str.ptr, str.len, 
search_start_pos) : start_position.val - 1;


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@688
PS1, Line 688:  int search_start_pos = haystack.len + start_position.val;
             :     if (utf8_mode) {
             :       search_start_pos = FindLastUtf8Pos(str.ptr, str.len, 
-start_position.val);
             :     }
nit. same as above (to combine).


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@708
PS1, Line 708: if (!utf8_mode) return IntVal(match_pos + 1);
nit. to combine.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@710
PS1, Line 710: chars (code points)
nit. Unicode characters in UTF8 encoding.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt
File be/src/util/CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt@221
PS1, Line 221: ADD_UNIFIED_BE_LSAN_TEST(string-util-test 
"TruncateDownTest.*:TruncateUpTest.*:CommaSeparatedContainsTest.*:FindUtf8PosTest.*:FindLastUtf8PosTest.*")
Good to know!


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h
File be/src/util/bit-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132
PS1, Line 132: Utf8CodePointLen
nit. Probably should be named as UTF8EncodeLen() as it returns the length of a 
Unicode character in UTF8 encoding. Code point normally implies the logical 
value of a Unicode character.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@127
PS1, Line 127: EXPECT_EQ(10, FindUtf8Pos("李小龙 \x93 ", 12, 4));
nit. Add a new test case FindUtf8Pos("??? \x93 ", 12, 6)


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@128
PS1, Line 128: }
nit. Add a couple test cases with a combination of UTF8 characters in 1, 2, and 
4 bytes.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@144
PS1, Line 144: }
nit. Add a couple test cases with a combination of UTF8 characters in 1,2 and 4 
bytes.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@56
PS1, Line 56: /// position of it. If there are not enough characters, return 
'str_len'. E.g.
nit. May need to describe 'index' here as follows.

The input string 'str_ptr' is viewed logically as sequence of Unicode 
characters encoded in UTF8. 'index' is the index (0-based) of Unicode character 
whose byte position in 'str_ptr' is to be found.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@60
PS1, Line 60: ///  - If the string is "你好" and index > 1, returns 6.
May add one more example.

"??Bob" and index is 4.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@63
PS1, Line 63: FindUtf8Pos
nit. Maybe named as FindUTF8PosForwared().


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@67
PS1, Line 67: 'index' must be positive and 1
nit. Making the index 0-based is consistent with the above function.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@75
PS1, Line 75: FindLastUtf8Pos
nit. Maybe named as FindUTF8PosBackward()


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc@99
PS1, Line 99: pos += BitUtil::Utf8CodePointLen(ptr[pos]);
We may need to call IsUtf8StartByte() first to handle non-UTF8 character 
situation.



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Comment-Date: Mon, 21 Jun 2021 15:49:25 +0000
Gerrit-HasComments: Yes

Reply via email to