[ 
https://issues.apache.org/jira/browse/DRILL-6842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-6842:
------------------------------------
    Fix Version/s: 1.17.0

> Export to CSV using CREATE TABLE AS (CTAS) wrong parsed
> -------------------------------------------------------
>
>                 Key: DRILL-6842
>                 URL: https://issues.apache.org/jira/browse/DRILL-6842
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Text & CSV, Storage - Writer
>    Affects Versions: 1.14.0
>         Environment: - Tested with latest version *Apache Drill* 1.14.0, and 
> building the latest version from master (Github repo), commit 
> ad61c6bc1dd24994e50fe7dfed043d5e57dba8f9 at _Nov 5, 2018_.
> - *Linux* x64, Ubuntu 16.04
> - *OpenJDK* Runtime Environment (build 
> 1.8.0_171-8u171-b11-0ubuntu0.17.10.1-b11)
> - Apache *Maven* 3.5.0
>            Reporter: Mariano Ruiz
>            Priority: Minor
>              Labels: csv, export
>             Fix For: 1.17.0
>
>         Attachments: Screenshot from 2018-11-09 14-18-43.png
>
>
> When you export to a CSV using CTAS the result of a query, most of the time 
> the generated file is OK, but if you have in the results text columns with 
> "," characters, the resulting CSV file is broken, because does not enclose 
> the cells with commas inside with the " character.
> Steps to reproduce the bug:
> Lets say you have the following table in some source of data, maybe a CSV 
> file too:
> {code:title=/tmp/input.csv}
> product_ean,product_name,product_brand
> 12345678900,IPhone X,Apple
> 99999911100,"Samsung S9, Black",Samsung
> 11111223456,Smartwatch XY,Some Brand
> {code}
> Note that the second row of data, in the column "product_name", it has a 
> value with a comma inside (_Samsung S9, Black_), so all the cell value is 
> enclosed with " characters, while the rest of the column cells aren't, 
> despite they could be enclosed too.
> So if you query this file, Drill will interpret correctly the file and does 
> not interpret that comma inside the cell as a separator like the rest of the 
> commas in the file:
> {code}
> 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/input.csv`;
> +--------------+--------------------+----------------+
> | product_ean  |    product_name    | product_brand  |
> +--------------+--------------------+----------------+
> | 12345678900  | IPhone X           | Apple          |
> | 99999911100  | Samsung S9, Black  | Samsung        |
> | 11111223456  | Smartwatch XY      | Some Brand     |
> +--------------+--------------------+----------------+
> 3 rows selected (1.874 seconds)
> {code}
> But now, if you want to query the file and export the result as CSV using the 
> CTAS feature, using the following steps:
> {code}
> 0: jdbc:drill:zk=local> USE dfs.tmp;
> +-------+--------------------------------------+
> |  ok   |               summary                |
> +-------+--------------------------------------+
> | true  | Default schema changed to [dfs.tmp]  |
> +-------+--------------------------------------+
> 1 row selected (0.13 seconds)
> 0: jdbc:drill:zk=local> ALTER SESSION SET `store.format`='csv';
> +-------+------------------------+
> |  ok   |        summary         |
> +-------+------------------------+
> | true  | store.format updated.  |
> +-------+------------------------+
> 1 row selected (0.094 seconds)
> 0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.my_output AS SELECT * FROM 
> dfs.`/tmp/input.csv`;
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 3                          |
> +-----------+----------------------------+
> 1 row selected (0.453 seconds)
> {code}
> The output file is this:
> {code:title=/tmp/my_output/0_0_0.csv}
> product_ean,product_name,product_brand
> 12345678900,IPhone X,Apple
> 99999911100,Samsung S9, Black,Samsung
> 11111223456,Smartwatch XY,Some Brand
> {code}
> The text _Samsung S9, Black_ in the cell is not quoted, so any CSV 
> interpreter like an office tool, a Java/Python/... library will interpret it 
> as two cell instead of one. Even Apache Drill will interpret it wrong:
> {code}
> 0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/my_output/0_0_0.csv`;
> +--------------+----------------+----------------+
> | product_ean  |  product_name  | product_brand  |
> +--------------+----------------+----------------+
> | 12345678900  | IPhone X       | Apple          |
> | 99999911100  | Samsung S9     |  Black         |
> | 11111223456  | Smartwatch XY  | Some Brand     |
> +--------------+----------------+----------------+
> 3 rows selected (0.175 seconds)
> {code}
> Note that the ending part _ Black_ was interpreted as a following cell, and 
> the real following cell is not showed, but it's not an error in the Drill 
> interpreter, it's an error of how Drill exported the result that now in the 
> last query was used as input.
> Here is how the file is interpreted by LibreOffice Calc:
> !Screenshot from 2018-11-09 14-18-43.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to