Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16908 )
Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16908/7/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16908/7/be/src/exprs/string-functions-ir.cc@498
PS7, Line 498: StringVal StringFunctions::Utf8Reverse(FunctionContext* context, const StringVal& str) {
> We might need to be careful with reverse, cause I think reversing the unico
Good point! Currently, we just reverse code points, which doesn't guarantee that the original grapheme clusters are preserved. I tried Hive, Spark and PostgreSQL. They seem to have the same issue:

hive> select 'abc\u0303def', reverse('abc\u0303def');
OK
abc?def fed?cba

scala> spark.sql("select 'abc\u0303def', reverse('abc\u0303def')").show();
+-------+----------------+
|abc?def|reverse(abc?def)|
+-------+----------------+
|abc?def|         fed?cba|
+-------+----------------+

postgres=# select E'abc\u0303def', reverse(E'abc\u0303def');
 ?column? | reverse
----------+---------
 abc?def  | fed?cba
(1 row)

Note that Gerrit can't display unicode characters correctly, so they are replaced by "?". Filed HIVE-24620 for Hive.

I think a more general question is what the unit of a string should be: code point or grapheme. This will also affect the results of substring() and length(). Maybe we need another query option for this. Do you mind if we follow up on this in another JIRA? I think grapheme boundary detection is more complex than detecting code points. We need some investigation to find an efficient library, e.g. https://github.com/ruoso/u5e

BTW, Presto (now called Trino) explicitly mentions that it won't deal with grapheme clusters:
"Additionally, the functions operate on Unicode code points and not user visible characters (or grapheme clusters)."
https://trino.io/docs/351/functions/string.html


http://gerrit.cloudera.org:8080/#/c/16908/7/fe/src/main/java/org/apache/impala/catalog/ScalarType.java
File fe/src/main/java/org/apache/impala/catalog/ScalarType.java:

http://gerrit.cloudera.org:8080/#/c/16908/7/fe/src/main/java/org/apache/impala/catalog/ScalarType.java@47
PS7, Line 47:   private boolean isUtf8_ = false;
> I was thinking about how to think about this. I think isUtf8_ == false can
Sure. It means "the expression has legacy string semantics". FE won't check whether the expression's behavior is the same with or without utf-8 semantics. Will add comments in the next patch.


--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Jan 2021 04:06:32 +0000
Gerrit-HasComments: Yes
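
Illustration for the Utf8Reverse comment above: a minimal C++ sketch of what "reversing code points" does to 'abc\u0303def'. This is not the code in the patch. The names Utf8CharLen, CodePointLength and ReverseCodePoints are hypothetical, and the handling of invalid bytes (each treated as a one-byte character) is an assumption made only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte length of the UTF-8 sequence introduced by lead byte `b`.
// Assumption: stray continuation bytes and invalid lead bytes count as
// one-byte characters.
static size_t Utf8CharLen(unsigned char b) {
  if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;
}

// Length of the character starting at offset `i`, clamped so that a
// truncated trailing sequence never reads past the end of `s`.
static size_t CharLenAt(const std::string& s, size_t i) {
  size_t len = Utf8CharLen(static_cast<unsigned char>(s[i]));
  return (i + len > s.size()) ? s.size() - i : len;
}

// Counts code points rather than bytes.
size_t CodePointLength(const std::string& s) {
  size_t n = 0;
  for (size_t i = 0; i < s.size(); i += CharLenAt(s, i)) ++n;
  return n;
}

// Reverses the order of code points while keeping the bytes of each
// code point in their original order. Combining marks are ordinary code
// points here, so they end up attached to a different base character.
std::string ReverseCodePoints(const std::string& s) {
  std::string out(s.size(), '\0');
  size_t dst = s.size();
  for (size_t i = 0; i < s.size();) {
    size_t len = CharLenAt(s, i);
    dst -= len;
    s.copy(&out[dst], len, i);  // copy one whole code point
    i += len;
  }
  assert(dst == 0);  // every input byte was copied exactly once
  return out;
}
```

With the input "abc\xCC\x83" "def" (U+0303 COMBINING TILDE is encoded as the bytes CC 83), CodePointLength returns 7 for an 8-byte string, and ReverseCodePoints returns "fed\xCC\x83" "cba": the tilde that followed 'c' now follows 'd'. That matches the fed?cba output shown for Hive, Spark and PostgreSQL. A grapheme-aware reverse would have to keep 'c' and the tilde together, which needs grapheme boundary detection on top of this.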