[ https://issues.apache.org/jira/browse/DRILL-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arina Ielchiieva updated DRILL-6842: ------------------------------------ Fix Version/s: 1.17.0 > Export to CSV using CREATE TABLE AS (CTAS) wrong parsed > ------------------------------------------------------- > > Key: DRILL-6842 > URL: https://issues.apache.org/jira/browse/DRILL-6842 > Project: Apache Drill > Issue Type: Improvement > Components: Storage - Text & CSV, Storage - Writer > Affects Versions: 1.14.0 > Environment: - Tested with latest version *Apache Drill* 1.14.0, and > building the latest version from master (Github repo), commit > ad61c6bc1dd24994e50fe7dfed043d5e57dba8f9 at _Nov 5, 2018_. > - *Linux* x64, Ubuntu 16.04 > - *OpenJDK* Runtime Environment (build > 1.8.0_171-8u171-b11-0ubuntu0.17.10.1-b11) > - Apache *Maven* 3.5.0 > Reporter: Mariano Ruiz > Priority: Minor > Labels: csv, export > Fix For: 1.17.0 > > Attachments: Screenshot from 2018-11-09 14-18-43.png > > > When you export to a CSV using CTAS the result of a query, most of the time > the generated file is OK, but if you have in the results text columns with > "," characters, the resulting CSV file is broken, because does not enclose > the cells with commas inside with the " character. > Steps to reproduce the bug: > Lets say you have the following table in some source of data, maybe a CSV > file too: > {code:title=/tmp/input.csv} > product_ean,product_name,product_brand > 12345678900,IPhone X,Apple > 99999911100,"Samsung S9, Black",Samsung > 11111223456,Smartwatch XY,Some Brand > {code} > Note that the second row of data, in the column "product_name", it has a > value with a comma inside (_Samsung S9, Black_), so all the cell value is > enclosed with " characters, while the rest of the column cells aren't, > despite they could be enclosed too. > So if you query this file, Drill will interpret correctly the file and does > not interpret that comma inside the cell as a separator like the rest of the > commas in the file: > {code} > 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/input.csv`; > +--------------+--------------------+----------------+ > | product_ean | product_name | product_brand | > +--------------+--------------------+----------------+ > | 12345678900 | IPhone X | Apple | > | 99999911100 | Samsung S9, Black | Samsung | > | 11111223456 | Smartwatch XY | Some Brand | > +--------------+--------------------+----------------+ > 3 rows selected (1.874 seconds) > {code} > But now, if you want to query the file and export the result as CSV using the > CTAS feature, using the following steps: > {code} > 0: jdbc:drill:zk=local> USE dfs.tmp; > +-------+--------------------------------------+ > | ok | summary | > +-------+--------------------------------------+ > | true | Default schema changed to [dfs.tmp] | > +-------+--------------------------------------+ > 1 row selected (0.13 seconds) > 0: jdbc:drill:zk=local> ALTER SESSION SET `store.format`='csv'; > +-------+------------------------+ > | ok | summary | > +-------+------------------------+ > | true | store.format updated. | > +-------+------------------------+ > 1 row selected (0.094 seconds) > 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.my_output AS SELECT * FROM > dfs.`/tmp/input.csv`; > +-----------+----------------------------+ > | Fragment | Number of records written | > +-----------+----------------------------+ > | 0_0 | 3 | > +-----------+----------------------------+ > 1 row selected (0.453 seconds) > {code} > The output file is this: > {code:title=/tmp/my_output/0_0_0.csv} > product_ean,product_name,product_brand > 12345678900,IPhone X,Apple > 99999911100,Samsung S9, Black,Samsung > 11111223456,Smartwatch XY,Some Brand > {code} > The text _Samsung S9, Black_ in the cell is not quoted, so any CSV > interpreter like an office tool, a Java/Python/... library will interpret it > as two cell instead of one. Even Apache Drill will interpret it wrong: > {code} > 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/my_output/0_0_0.csv`; > +--------------+----------------+----------------+ > | product_ean | product_name | product_brand | > +--------------+----------------+----------------+ > | 12345678900 | IPhone X | Apple | > | 99999911100 | Samsung S9 | Black | > | 11111223456 | Smartwatch XY | Some Brand | > +--------------+----------------+----------------+ > 3 rows selected (0.175 seconds) > {code} > Note that the ending part _ Black_ was interpreted as a following cell, and > the real following cell is not showed, but it's not an error in the Drill > interpreter, it's an error of how Drill exported the result that now in the > last query was used as input. > Here is how the file is interpreted by LibreOffice Calc: > !Screenshot from 2018-11-09 14-18-43.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)