Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 1: (16 comments) Thank Qifan's feekback! Addressed the comments. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@273 PS1, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt; > nit. A better algorithm would be to use utf8CodePointLen() to skip non-star Done http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@673 PS1, Line 673: int search_start_pos = start_position.val - 1; : if (utf8_mode) { : search_start_pos = FindUtf8Pos(str.ptr, str.len, search_start_pos); : } > nit. may combine into one expression to avoid extra computation when in UTF We always need to calculate 'start_position.val - 1' since it's the last argument for FindUtf8Pos. So I think the current code is more readable. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@688 PS1, Line 688: int search_start_pos = haystack.len + start_position.val; : if (utf8_mode) { : search_start_pos = FindLastUtf8Pos(str.ptr, str.len, -start_position.val); : } > nit. same as above (to combine). Done http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@708 PS1, Line 708: if (!utf8_mode) return IntVal(match_pos + 1); > nit. to combine. Done http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@710 PS1, Line 710: chars (code points) > nit. Unicode characters in UTF8 encoding. Done http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt File be/src/util/CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/CMakeLists.txt@221 PS1, Line 221: ADD_UNIFIED_BE_LSAN_TEST(string-util-test "TruncateDownTest.*:TruncateUpTest.*:CommaSeparatedContainsTest.*:FindUtf8PosTest.*:FindLastUtf8PosTest.*") > Good to know! Thanks! http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h File be/src/util/bit-util.h: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132 PS1, Line 132: Utf8CodePointLen > nit. Probably should be named as UTF8EncodeLen() as it returns the length o I think I read it as Utf8CodePoint-Len but not Utf8-CodePointLen so don't get confused. It seems other systems also use this name, e.g. https://www.freepascal.org/docs-html/rtl/system/utf8codepointlen.html If we mean the number of code points, the function name will be numUtf8CodePoints. Is it ok for you to keep the original name given we have method comments mentioning byte length above? http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@127 PS1, Line 127: EXPECT_EQ(10, FindUtf8Pos("李小龙 \x93 ", 12, 4)); > nit. Add a new test case FindUtf8Pos("??? \x93 ", 12, 6) Done http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@128 PS1, Line 128: } > nit. Add a couple test cases with a combination of UTF8 characters in 1, 2, Good point! http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util-test.cc@144 PS1, Line 144: } > nit. Add a couple test cases with a combination of UTF8 characters in 1,2 a Done http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h File be/src/util/string-util.h: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@56 PS1, Line 56: /// position of it. If there are not enough characters, return 'str_len'. E.g. > nit. May need to describe 'index' here as follows. Done http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@60 PS1, Line 60: /// - If the string is "你好" and index > 1, returns 6. > May add one more example. Done http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@63 PS1, Line 63: FindUtf8Pos > nit. Maybe named as FindUTF8PosForwared(). Done. Change to FindUtf8PosForward since we use camel case. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@67 PS1, Line 67: 'index' must be positive and 1 > nit. Making the index 0-based is consistent with the above function. OK, I was thinking negative index previously so just make them positive to ease the use. Change it to 0-based. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.h@75 PS1, Line 75: FindLastUtf8Pos > nit. Maybe named as FindUTF8PosBackward() Done http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/string-util.cc@99 PS1, Line 99: pos += BitUtil::Utf8CodePointLen(ptr[pos]); > We may need to call IsUtf8StartByte() first to handle non-UTF8 character si Done. I think the better way is doing validation in the boundary. Then we can do this more efficiently. The desired behaviors can be replacing illegal UTF-8 characters with U+FFFD (REPLACEMENT CHARACTER), or just ignoring them, or reporting errors. Filed IMPALA-10761 for this. -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 1 Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Comment-Date: Tue, 22 Jun 2021 06:54:29 +0000 Gerrit-HasComments: Yes