Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16909 )

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types
......................................................................


Patch Set 14:

(5 comments)

Thanks for your review, Tim! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@7
PS13, Line 7: IMPALA-5675: Support UTF-8 Varchar and Char types
> One minor thing : I think we should comment the len field, e.g. in StringVa
Done.

In StringValue and StringVal, it's still the length in bytes of the string.

In ScalarType(FE), ColumnType and TypeDesc, it's now the logical length of the 
type. So CHAR(3) will always has len=3 regradless whether we are in UTF-8 mode. 
The difference happens in how we use the len field. In UTF-8 mode, we have 
special logics like allocating 4*N bytes for a CHAR(N) slot.

Added comments in ScalarType.java, ColumnType(types.h), TypeDesc(udf.h) and 
StringValue(string-value.h).


http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@45
PS13, Line 45:
> In the backend, how did you identify all the places you needed to change th
Yeah, this is the difficulty of this work. Thanks for listing out the places. 
They are covered in this patch.

In developing this patch, I checked all places where type.len is used. 
Previously they are treated as length in bytes. Truncating and padding happen 
there. Now in UTF-8 mode, they should be treated as length in characters so 
need code changes.

Hopefully, I didn't miss any places. I'll try to add more tests to improve our 
confidence.


http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc@392
PS13, Line 392:     // it is too long.
> Should we add a DCHECK in UTF-8 mode to verify that the strings returned fr
Done


http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc
File be/src/exprs/cast-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc@218
PS13, Line 218:   // The format string can have non-ASCII characters using 
double quoted nested strings.
> I don't think they could? Custom timestamp formats could have them, but tha
Yes, unfortunately, after some investigation, I found the format can have 
customized strings surrounded with double quotes [1].

When casting TIMESTAMP to CHAR(n), FE will split it into two castings: 
TIMESTAMP -> STRING and STRING -> CHAR(n). (Codes are in CastExpr#analyze()). 
So casting to CHAR(n) is currently correct since STRING to CHAR(n) casting is 
correct.
However, when casting TIMESTAMP to VARCHAR(n), the current result is wrong if 
there are non-ascii characters in the format. Example queries are in [2]. I'll 
fix the varchar issue.

Gerrit can't show non-ascii characters in comments. I uploaded them in 
snapshots:
[1] 
https://drive.google.com/file/d/1sgjSTBgAGCGZU3nQGfOSUBFw35lrvzoq/view?usp=sharing
[2] 
https://drive.google.com/file/d/1cirgYIEnssf2CsoaSpRD1m9pxN78VNua/view?usp=sharing


http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc
File be/src/exprs/scalar-expr-evaluator.cc:

http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc@316
PS13, Line 316:       // TODO: consider returning reference of the StringVal in 
UTF-8 mode.
> You might have to be careful here with UDFs, since they could return a len
Yeah, but it's ok for now since we don't support UDFs that returns CHAR or 
VARCHAR:
https://github.com/apache/impala/blob/d271baa33da1a02aa6ffc47b0380dc62239107b4/fe/src/main/java/org/apache/impala/analysis/CreateUdfStmt.java#L75-L80

So only builtin cast functions can go here.



--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Feb 2021 08:22:17 +0000
Gerrit-HasComments: Yes

Reply via email to