Mariano Ruiz created DRILL-6842:
-----------------------------------

             Summary: Export to CSV using CREATE TABLE AS (CTAS) wrong parsed
                 Key: DRILL-6842
                 URL: https://issues.apache.org/jira/browse/DRILL-6842
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Text & CSV, Storage - Writer
    Affects Versions: 1.14.0
         Environment: - Tested with latest version *Apache Drill* 1.14.0, and 
building the latest version from master (Github repo), commit 
ad61c6bc1dd24994e50fe7dfed043d5e57dba8f9 at _Nov 5, 2018_.
- *Linux* x64, Ubuntu 16.04
- *OpenJDK* Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.17.10.1-b11)
- Apache *Maven* 3.5.0
            Reporter: Mariano Ruiz
         Attachments: Screenshot from 2018-11-09 14-18-43.png

When you export to a CSV using CTAS the result of a query, most of the time the 
generated file is OK, but if you have in the results text columns with "," 
characters, the resulting CSV file is broken, because does not enclose the 
cells with commas inside with the " character.

Steps to reproduce the bug:

Lets say you have the following table in some source of data, maybe a CSV file 
too:

{code|title=/tmp/input.csv}
product_ean,product_name,product_brand
12345678900,IPhone X,Apple
99999911100,"Samsung S9, Black",Samsung
11111223456,Smartwatch XY,Some Brand
{code}

Note that the second row of data, in the column "product_name", it has a value 
with a comma inside (_Samsung S9, Black_), so all the cell value is enclosed 
with " characters, while the rest of the column cells aren't, despite they 
could be enclosed too.

So if you query this file, Drill will interpret correctly the file and does not 
interpret that comma inside the cell as a separator like the rest of the commas 
in the file:

{code}
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/input.csv`;
+--------------+--------------------+----------------+
| product_ean  |    product_name    | product_brand  |
+--------------+--------------------+----------------+
| 12345678900  | IPhone X           | Apple          |
| 99999911100  | Samsung S9, Black  | Samsung        |
| 11111223456  | Smartwatch XY      | Some Brand     |
+--------------+--------------------+----------------+
3 rows selected (1.874 seconds)
{code}

But now, if you want to query the file and export the result as CSV using the 
CTAS feature, using the following steps:

{code}
0: jdbc:drill:zk=local> USE dfs.tmp;
+-------+--------------------------------------+
|  ok   |               summary                |
+-------+--------------------------------------+
| true  | Default schema changed to [dfs.tmp]  |
+-------+--------------------------------------+
1 row selected (0.13 seconds)
0: jdbc:drill:zk=local> ALTER SESSION SET `store.format`='csv';
+-------+------------------------+
|  ok   |        summary         |
+-------+------------------------+
| true  | store.format updated.  |
+-------+------------------------+
1 row selected (0.094 seconds)
0: jdbc:drill:zk=local> CREATE TABLE dfs.tmp.my_output AS SELECT * FROM 
dfs.`/tmp/input.csv`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 3                          |
+-----------+----------------------------+
1 row selected (0.453 seconds)
{code}

The output file is this:

{code|title=/tmp/my_output/0_0_0.csv}
product_ean,product_name,product_brand
12345678900,IPhone X,Apple
99999911100,Samsung S9, Black,Samsung
11111223456,Smartwatch XY,Some Brand
{code}

The text _Samsung S9, Black_ in the cell is not quoted, so any CSV interpreter 
like an office tool, a Java/Python/... library will interpret it as two cell 
instead of one. Even Apache Drill will interpret it wrong:

{code}
0: jdbc:drill:zk=local> SELECT * FROM dfs.`/tmp/my_output/0_0_0.csv`;
+--------------+----------------+----------------+
| product_ean  |  product_name  | product_brand  |
+--------------+----------------+----------------+
| 12345678900  | IPhone X       | Apple          |
| 99999911100  | Samsung S9     |  Black         |
| 11111223456  | Smartwatch XY  | Some Brand     |
+--------------+----------------+----------------+
3 rows selected (0.175 seconds)
{code}

Note that the ending part _ Black_ was interpreted as a following cell, and the 
real following cell is not showed, but it's not an error in the Drill 
interpreter, it's an error of how Drill exported the result that now in the 
last query was used as input.

Here is how the file is interpreted by LibreOffice Calc:

!Screenshot from 2018-11-09 14-18-43.png!

CC [~arina] that we discussed this issue in another ticket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to