Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16909 )
Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types ...................................................................... Patch Set 14: (5 comments) Thanks for your review, Tim! Addressed the comments. http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@7 PS13, Line 7: IMPALA-5675: Support UTF-8 Varchar and Char types > One minor thing : I think we should comment the len field, e.g. in StringVa Done. In StringValue and StringVal, it's still the length in bytes of the string. In ScalarType(FE), ColumnType and TypeDesc, it's now the logical length of the type. So CHAR(3) will always has len=3 regradless whether we are in UTF-8 mode. The difference happens in how we use the len field. In UTF-8 mode, we have special logics like allocating 4*N bytes for a CHAR(N) slot. Added comments in ScalarType.java, ColumnType(types.h), TypeDesc(udf.h) and StringValue(string-value.h). http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@45 PS13, Line 45: > In the backend, how did you identify all the places you needed to change th Yeah, this is the difficulty of this work. Thanks for listing out the places. They are covered in this patch. In developing this patch, I checked all places where type.len is used. Previously they are treated as length in bytes. Truncating and padding happen there. Now in UTF-8 mode, they should be treated as length in characters so need code changes. Hopefully, I didn't miss any places. I'll try to add more tests to improve our confidence. http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc File be/src/exec/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc@392 PS13, Line 392: // it is too long. > Should we add a DCHECK in UTF-8 mode to verify that the strings returned fr Done http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc File be/src/exprs/cast-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc@218 PS13, Line 218: // The format string can have non-ASCII characters using double quoted nested strings. > I don't think they could? Custom timestamp formats could have them, but tha Yes, unfortunately, after some investigation, I found the format can have customized strings surrounded with double quotes [1]. When casting TIMESTAMP to CHAR(n), FE will split it into two castings: TIMESTAMP -> STRING and STRING -> CHAR(n). (Codes are in CastExpr#analyze()). So casting to CHAR(n) is currently correct since STRING to CHAR(n) casting is correct. However, when casting TIMESTAMP to VARCHAR(n), the current result is wrong if there are non-ascii characters in the format. Example queries are in [2]. I'll fix the varchar issue. Gerrit can't show non-ascii characters in comments. I uploaded them in snapshots: [1] https://drive.google.com/file/d/1sgjSTBgAGCGZU3nQGfOSUBFw35lrvzoq/view?usp=sharing [2] https://drive.google.com/file/d/1cirgYIEnssf2CsoaSpRD1m9pxN78VNua/view?usp=sharing http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc File be/src/exprs/scalar-expr-evaluator.cc: http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc@316 PS13, Line 316: // TODO: consider returning reference of the StringVal in UTF-8 mode. > You might have to be careful here with UDFs, since they could return a len Yeah, but it's ok for now since we don't support UDFs that returns CHAR or VARCHAR: https://github.com/apache/impala/blob/d271baa33da1a02aa6ffc47b0380dc62239107b4/fe/src/main/java/org/apache/impala/analysis/CreateUdfStmt.java#L75-L80 So only builtin cast functions can go here. -- To view, visit http://gerrit.cloudera.org:8080/16909 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f Gerrit-Change-Number: 16909 Gerrit-PatchSet: 14 Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Mon, 22 Feb 2021 08:22:17 +0000 Gerrit-HasComments: Yes