[Impala-ASF-CR] IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions

Quanlong Huang (Code Review) Mon, 11 Jan 2021 20:06:42 -0800

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16908 )


Change subject: IMPALA-2019(Part-1): Provide UTF-8 support in length, substring 
and reverse functions
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16908/7/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16908/7/be/src/exprs/string-functions-ir.cc@498
PS7, Line 498: StringVal StringFunctions::Utf8Reverse(FunctionContext* context, 
const StringVal& str) {
> We might need to be careful with reverse, cause I think reversing the unico
Good point! Currently, we just reverse code points, which can't guarantee 
preserving the original grapheme clusters. I tried Hive, Spark and PostgreSQL. 
They seem to have the same issue:

hive> select 'abc\u0303def', reverse('abc\u0303def');
OK
abc?def fed?cba

scala> spark.sql("select 'abc\u0303def', reverse('abc\u0303def')").show();
+-------+----------------+
|abc?def|reverse(abc?def)|
+-------+----------------+
|abc?def|         fed?cba|
+-------+----------------+

postgres=# select E'abc\u0303def', reverse(E'abc\u0303def');
 ?column? | reverse
----------+---------
 abc?def   | fed?cba
(1 row)

Note that Gerrit can't display Unicode characters correctly, so they are 
replaced by "?". Filed HIVE-24620 for Hive.
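
To make the effect visible despite the "?" placeholders, here is a minimal 
Python sketch (not Impala code; the helper name show() is made up for 
illustration) that reverses the same test string by code point and prints the 
code points before and after. After the reversal, the combining tilde U+0303 
follows 'd' instead of 'c', so it attaches to a different base character:

```python
# Illustration only: Python's str is a sequence of code points, so slicing
# with [::-1] reverses code points, like the engines tried above.
s = "abc\u0303def"  # 'c' followed by U+0303 COMBINING TILDE

reversed_cp = s[::-1]


def show(text):
    """Hypothetical helper: render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(show(s))            # original order: ... 'c' (U+0063) then U+0303 ...
print(show(reversed_cp))  # reversed order: ... 'd' (U+0064) then U+0303 ...
```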

I think a more general question is what the unit of a string should be: code 
point or grapheme cluster. This will also affect the results of substring() and 
length(). Maybe we need another query option for this. Do you mind if we 
follow up on this in another JIRA? I think grapheme boundary detection is more 
complex than detecting code points. We need some investigation to find an 
efficient library, e.g. https://github.com/ruoso/u5e
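
As a rough illustration of how the choice of unit changes the results, the 
Python sketch below counts the same string in bytes, code points, and 
approximate grapheme clusters. The simple_clusters() helper is hypothetical 
and deliberately simplified: it only glues combining marks onto the preceding 
base character, which is an assumption for this example and not the full 
Unicode segmentation rules (UAX #29) that a real library would implement.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified approximation for illustration, not full UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters


s = "abc\u0303def"  # 'c' followed by U+0303 COMBINING TILDE

byte_len = len(s.encode("utf-8"))       # legacy, byte-based length
cp_len = len(s)                         # code point length
cluster_len = len(simple_clusters(s))   # approximate grapheme length

# Reversing cluster by cluster keeps the tilde on 'c'.
cluster_reversed = "".join(reversed(simple_clusters(s)))

print(byte_len, cp_len, cluster_len)
```

With this string the three counts differ, and only the cluster-based reversal 
keeps the tilde attached to 'c'.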

BTW, Presto (now called Trino) explicitly mentions that it won't deal with 
grapheme clusters: "Additionally, the functions operate on Unicode code points 
and not user visible characters (or grapheme clusters)." 
https://trino.io/docs/351/functions/string.html


http://gerrit.cloudera.org:8080/#/c/16908/7/fe/src/main/java/org/apache/impala/catalog/ScalarType.java
File fe/src/main/java/org/apache/impala/catalog/ScalarType.java:

http://gerrit.cloudera.org:8080/#/c/16908/7/fe/src/main/java/org/apache/impala/catalog/ScalarType.java@47
PS7, Line 47:   private boolean isUtf8_ = false;
> I was thinking about how to think about this. I think isUtf8_ == false can
Sure. It means "the expression has legacy string semantics". The FE won't check 
whether the expression's behavior is the same with or without UTF-8 semantics.

Will add comments in the next patch.



--
To view, visit http://gerrit.cloudera.org:8080/16908
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Gerrit-Change-Number: 16908
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Jan 2021 04:06:32 +0000
Gerrit-HasComments: Yes
