Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/13760 )
Change subject: KUDU-1938 Add support for CHAR/VARCHAR pt 1
......................................................................

Patch Set 22:

(4 comments)

While double-checking maximum lengths for CHAR and VARCHAR in other RDBMSs I
noticed that there's a difference between the 'standard' approach to padding
CHARs and Apache Impala's (and now ours).

Originally I implemented the padding of CHARs *before* persisting, which seems
to be what other databases (e.g. MySQL[1], Oracle[2] and PostgreSQL[3]) do.
IIRC this was originally meant to give fixed-width rows, but with UTF-8 they
still wouldn't be fixed-width, as UTF-8 itself is a variable-length encoding.
In MySQL's case the trailing spaces are even removed by default when scanned:

> The length of a CHAR column is fixed to the length that you declare when you
> create the table. The length can be any value from 0 to 255. When CHAR values
> are stored, they are right-padded with spaces to the specified length. When
> CHAR values are retrieved, trailing spaces are removed unless the
> PAD_CHAR_TO_FULL_LENGTH SQL mode is enabled.

Impala[4], on the other hand, stores the data without trailing spaces and pads
it upon retrieval:

> If you store a CHAR value containing trailing spaces in a table, those
> trailing spaces are not stored in the data file. When the value is retrieved
> by a query, the result could have a different number of trailing spaces. That
> is, the value includes however many spaces are needed to pad it to the
> specified length of the column.

Due to the variable-length nature of UTF-8 and the columnar format, I believe
it makes the most sense to implement it the same way Impala did; I only wanted
to bring this discrepancy to your attention.
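To make the Impala-style behaviour described above concrete, here is a minimal
sketch of stripping trailing spaces on write and padding on read, counting
length in UTF-8 characters rather than bytes. The function names
(Utf8Length, StripTrailingSpaces, PadToLength) are hypothetical and are not
taken from the patch; the sketch assumes valid UTF-8 input and does not cover
truncating values that exceed the declared length.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count UTF-8 characters (code points) by skipping
// continuation bytes, which have the bit pattern 10xxxxxx.
// Assumes `s` is valid UTF-8.
size_t Utf8Length(const std::string& s) {
  size_t n = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++n;
  }
  return n;
}

// Write path (pad-on-read approach): drop trailing spaces before persisting,
// so the stored value carries no padding.
std::string StripTrailingSpaces(const std::string& s) {
  const size_t end = s.find_last_not_of(' ');
  return end == std::string::npos ? std::string() : s.substr(0, end + 1);
}

// Read path: append spaces until the value is `length` characters long.
// The byte size of the result depends on the content, because a character
// may take more than one byte in UTF-8.
std::string PadToLength(const std::string& stored, size_t length) {
  const size_t chars = Utf8Length(stored);
  if (chars >= length) return stored;
  return stored + std::string(length - chars, ' ');
}
```

For a CHAR(3) column, PadToLength("\xC3\xA9", 3) (the single character "é")
yields a 4-byte string while PadToLength("a", 3) yields a 3-byte one, which is
why padding before persisting would not give fixed-width rows anyway.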
[1] https://docs.oracle.com/cd/E17952_01/mysql-5.1-en/char.html
[2] https://docs.oracle.com/cd/B28359_01/server.111/b28318/datatype.htm#CNCPT1821
[3] https://www.postgresql.org/docs/9.0/datatype-character.html
[4] https://impala.apache.org/docs/build/html/topics/impala_char.html

http://gerrit.cloudera.org:8080/#/c/13760/22//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13760/22//COMMIT_MSG@15
PS22, Line 15: The maximum length for VARCHAR is 65,535 and 255 for CHAR
> Could you add _why_ these maximum lengths make sense?
Done


http://gerrit.cloudera.org:8080/#/c/13760/22/src/kudu/common/partial_row.h
File src/kudu/common/partial_row.h:

http://gerrit.cloudera.org:8080/#/c/13760/22/src/kudu/common/partial_row.h@415
PS22, Line 415:   /// Get the string/binary value for a column by its name.
> This sentence should also be updated.
Done


http://gerrit.cloudera.org:8080/#/c/13760/22/src/kudu/common/partial_row.h@438
PS22, Line 438:   /// Get the string/binary value for a column by its index.
> Likewise.
Done


http://gerrit.cloudera.org:8080/#/c/13760/20/src/kudu/common/schema.h
File src/kudu/common/schema.h:

http://gerrit.cloudera.org:8080/#/c/13760/20/src/kudu/common/schema.h@124
PS20, Line 124:   // Maximum value of the length is 65,535 for compatibility reasons as it's
              :   // used by VARCHAR type which can be set to a maximum of 65,535 in case of
              :   // MySQL and less for other major RDMBMS implementations.
> Thanks for the clarification. Could you update the comment to reflect that
Done



--
To view, visit http://gerrit.cloudera.org:8080/13760
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I998982dba93831db91c43a97ce30d3e68c2a4a54
Gerrit-Change-Number: 13760
Gerrit-PatchSet: 22
Gerrit-Owner: Attila Bukor <abu...@apache.org>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <abu...@apache.org>
Gerrit-Reviewer: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>
Gerrit-Comment-Date: Sat, 20 Jul 2019 10:37:12 +0000
Gerrit-HasComments: Yes