[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Quanlong Huang (Code Review) Mon, 21 Jun 2021 23:54:40 -0700

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )


Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 1:

(16 comments)

Thank Qifan's feekback! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@273
PS1, Line 273:  if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt;
> nit. A better algorithm would be to use utf8CodePointLen() to skip non-star
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@673
PS1, Line 673:     int search_start_pos = start_position.val - 1;
             :     if (utf8_mode) {
             :       search_start_pos = FindUtf8Pos(str.ptr, str.len, 
search_start_pos);
             :     }
> nit. may combine into one expression to avoid extra computation when in UTF
We always need to calculate 'start_position.val - 1' since it's the last 
argument for FindUtf8Pos. So I think the current code is more readable.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@688
PS1, Line 688:  int search_start_pos = haystack.len + start_position.val;
             :     if (utf8_mode) {
             :       search_start_pos = FindLastUtf8Pos(str.ptr, str.len, 
-start_position.val);
             :     }
> nit. same as above (to combine).
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@708
PS1, Line 708: if (!utf8_mode) return IntVal(match_pos + 1);
> nit. to combine.
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@710
PS1, Line 710: chars (code points)
> nit. Unicode characters in UTF8 encoding.
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt
File be/src/util/CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt@221
PS1, Line 221: ADD_UNIFIED_BE_LSAN_TEST(string-util-test 
"TruncateDownTest.*:TruncateUpTest.*:CommaSeparatedContainsTest.*:FindUtf8PosTest.*:FindLastUtf8PosTest.*")
> Good to know!
Thanks!


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h
File be/src/util/bit-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132
PS1, Line 132: Utf8CodePointLen
> nit. Probably should be named as UTF8EncodeLen() as it returns the length o
I think I read it as Utf8CodePoint-Len but not Utf8-CodePointLen so don't get 
confused. It seems other systems also use this name, e.g. 
https://www.freepascal.org/docs-html/rtl/system/utf8codepointlen.html
If we mean the number of code points, the function name will be 
numUtf8CodePoints. Is it ok for you to keep the original name given we have 
method comments mentioning byte length above?


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@127
PS1, Line 127: EXPECT_EQ(10, FindUtf8Pos("李小龙 \x93 ", 12, 4));
> nit. Add a new test case FindUtf8Pos("??? \x93 ", 12, 6)
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@128
PS1, Line 128: }
> nit. Add a couple test cases with a combination of UTF8 characters in 1, 2,
Good point!


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@144
PS1, Line 144: }
> nit. Add a couple test cases with a combination of UTF8 characters in 1,2 a
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@56
PS1, Line 56: /// position of it. If there are not enough characters, return 
'str_len'. E.g.
> nit. May need to describe 'index' here as follows.
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@60
PS1, Line 60: ///  - If the string is "你好" and index > 1, returns 6.
> May add one more example.
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@63
PS1, Line 63: FindUtf8Pos
> nit. Maybe named as FindUTF8PosForwared().
Done. Change to FindUtf8PosForward since we use camel case.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@67
PS1, Line 67: 'index' must be positive and 1
> nit. Making the index 0-based is consistent with the above function.
OK, I was thinking negative index previously so just make them positive to ease 
the use. Change it to 0-based.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@75
PS1, Line 75: FindLastUtf8Pos
> nit. Maybe named as FindUTF8PosBackward()
Done


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc@99
PS1, Line 99: pos += BitUtil::Utf8CodePointLen(ptr[pos]);
> We may need to call IsUtf8StartByte() first to handle non-UTF8 character si
Done. I think the better way is doing validation in the boundary. Then we can 
do this more efficiently. The desired behaviors can be replacing illegal UTF-8 
characters with U+FFFD (REPLACEMENT CHARACTER), or just ignoring them, or 
reporting errors. Filed IMPALA-10761 for this.



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Comment-Date: Tue, 22 Jun 2021 06:54:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Reply via email to