[ https://issues.apache.org/jira/browse/HIVE-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
BELUGA BEHR updated HIVE-16826:
-------------------------------
    Comment: was deleted

(was: Interestingly, there seems to be an issue with the current code. When I instruct beeline to use quote {{disable.quoting.for.sv}}, my changes provide the same output as the current implementation. However, when no quotes are specified, there is a difference.
\\
\\
* theFileWhereToStoreTheData.csv = current implementation
* theFileWhereToStoreTheData.csv.mod = with my changes

{code}
[root@host ~]# md5sum theFileWhereToStoreTheData.csv*
6bfb928df7d2a7d778930bb972bc23c5  theFileWhereToStoreTheData.csv
fb3972fe583a4e1565a4fddb81dc8d62  theFileWhereToStoreTheData.csv.mod
{code}

For the first 20,000 outputs, we are good, but then it gets weird...

{code}
[root@host ~]# head -n 20000 theFileWhereToStoreTheData.csv | xxd | md5sum
280b418c87ed701b509f4cbbdfe8fa29  -
[root@host ~]# head -n 20000 theFileWhereToStoreTheData.csv.mod | xxd | md5sum
280b418c87ed701b509f4cbbdfe8fa29  -
[root@host ~]# head -n 21000 theFileWhereToStoreTheData.csv | xxd | md5sum
3b1eb5b7b63a5255c8e1539230d190a9  -
[root@host ~]# head -n 21000 theFileWhereToStoreTheData.csv.mod | xxd | md5sum
7de5ae6604e91a42a388c9826174ee30  -
{code}

Everything in the file starts fine...

{code}
[root@host ~]# head -n 4 theFileWhereToStoreTheData.csv | tail -n 2 | xxd
0000000: 3030 2d30 3030 302c 416c 6c20 4f63 6375  00-0000,All Occu
0000010: 7061 7469 6f6e 732c 3133 3433 3534 3235  pations,13435425
0000020: 302c 3430 3639 300a 3030 2d30 3030 302c  0,40690.00-0000,
0000030: 416c 6c20 4f63 6375 7061 7469 6f6e 732c  All Occupations,
0000040: 3133 3433 3534 3235 302c 3430 3639 300a  134354250,40690.
{code}

But then it changes behavior.
We see that strings are being quoted with NUL bytes "00":

{code}
[root@nightly513-unsecure-1 ~]# head -n 100000 theFileWhereToStoreTheData.csv | tail -n 2 | xxd
0000000: 3135 2d31 3031 312c 0043 6f6d 7075 7465  15-1011,.Compute
0000010: 7220 616e 6420 696e 666f 726d 6174 696f  r and informatio
0000020: 6e20 7363 6965 6e74 6973 7473 2c20 7265  n scientists, re
0000030: 7365 6172 6368 002c 3238 3732 302c 3130  search.,28720,10
0000040: 3036 3430 0a31 352d 3130 3131 2c00 436f  0640.15-1011,.Co
0000050: 6d70 7574 6572 2061 6e64 2069 6e66 6f72  mputer and infor
0000060: 6d61 7469 6f6e 2073 6369 656e 7469 7374  mation scientist
0000070: 732c 2072 6573 6561 7263 6800 2c32 3837  s, research.,287
0000080: 3230 2c31 3030 3634 300a                 20,100640.
{code}

I can't figure out how these NUL bytes are being introduced in the current implementation, but my changes seem to address this issue and do not include these erroneous extra bytes.)

> Improvements for SeparatedValuesOutputFormat
> --------------------------------------------
>
>                 Key: HIVE-16826
>                 URL: https://issues.apache.org/jira/browse/HIVE-16826
>             Project: Hive
>          Issue Type: Bug
>          Components: Beeline
>    Affects Versions: 2.1.1, 3.0.0
>            Reporter: BELUGA BEHR
>            Assignee: BELUGA BEHR
>            Priority: Minor
>         Attachments: HIVE-16826.1.patch, HIVE-16826.2.patch
>
>
> Proposing changes to class {{org.apache.hive.beeline.SeparatedValuesOutputFormat}}.
> # Simplify the code
> # Code currently creates and destroys {{CsvListWriter}}, which contains a buffer, for every line printed
> # Use Apache Commons libraries for certain actions
> # Prefer non-synchronized {{StringBuilderWriter}} to Java's synchronized {{StringWriter}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
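
The buffer-reuse idea in the issue description (stop creating and destroying a buffered writer for every line, and avoid a synchronized writer) can be illustrated with a minimal, standard-library-only sketch. This is not the Beeline class and not the attached patch: the class name {{ReusableRowFormatter}} is hypothetical, a plain {{StringBuilder}} stands in for {{StringBuilderWriter}}, and the quoting rule (quote a field only when it contains the separator, the quote character, or a line break, and double any embedded quote) is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical illustration only; not org.apache.hive.beeline.SeparatedValuesOutputFormat.
 * One unsynchronized buffer lives as long as the formatter and is reset per row,
 * instead of a new writer and buffer being allocated for every line printed.
 */
final class ReusableRowFormatter {
    private final char separator;
    private final char quote;
    // Allocated once; StringBuilder is unsynchronized, unlike StringWriter's StringBuffer.
    private final StringBuilder buffer = new StringBuilder(256);

    ReusableRowFormatter(char separator, char quote) {
        this.separator = separator;
        this.quote = quote;
    }

    /** Formats one row; the internal buffer is cleared, not reallocated. */
    String format(List<String> values) {
        buffer.setLength(0);
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                buffer.append(separator);
            }
            appendField(values.get(i));
        }
        return buffer.toString();
    }

    // Assumed quoting rule: quote only when needed, double embedded quote characters.
    private void appendField(String value) {
        if (value == null) {
            return; // assumption: a null column is written as an empty field
        }
        boolean needsQuoting = value.indexOf(separator) >= 0
                || value.indexOf(quote) >= 0
                || value.indexOf('\n') >= 0
                || value.indexOf('\r') >= 0;
        if (!needsQuoting) {
            buffer.append(value);
            return;
        }
        buffer.append(quote);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == quote) {
                buffer.append(quote);
            }
            buffer.append(c);
        }
        buffer.append(quote);
    }

    public static void main(String[] args) {
        ReusableRowFormatter formatter = new ReusableRowFormatter(',', '"');
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("00-0000", "All Occupations", "134354250", "40690"),
                Arrays.asList("15-1011", "Computer and information scientists, research",
                        "28720", "100640"));
        // The same formatter (and the same buffer) is reused for every row.
        for (List<String> row : rows) {
            System.out.println(formatter.format(row));
        }
    }
}
```

The sample rows are the two records visible in the hex dumps above. Because the formatter holds mutable state, a sketch like this is only safe when a single thread writes the output, which is the trade-off implied by preferring an unsynchronized writer.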