[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-05-07 Thread Steven Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533322#comment-14533322
 ] 

Steven Phillips commented on DRILL-2849:


I think we should close this bug. The query is failing due to the malformed 
data.

> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Khurram Faraaz
>Priority: Critical
> Fix For: 1.0.0
>
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions-0-of-00020.csv.gz` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (8.899 seconds)
> {code}
> Notice that count and sum results are different (from those above) when query 
> is executed over the CSV file created by CTAS.
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 245| 1.349670663E12 | 1.165768779027E12 | 1.2930281335065144E12 | 
> null   |
> ++--

[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-05-07 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533190#comment-14533190
 ] 

Khurram Faraaz commented on DRILL-2849:
---

I ran the test on b856d0dd50a4495f95ee5d32ea3d396df56842ed | DRILL-2826: 
Simplify and centralize Operator Cleanup | 05.05.2015 @ 20:28:37 EDT

> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Khurram Faraaz
>Priority: Critical
> Fix For: 1.0.0
>
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions-0-of-00020.csv.gz` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (8.899 seconds)
> {code}
> Notice that count and sum results are different (from those above) when query 
> is executed over the CSV file created by CTAS.
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 245| 1.349670663E12 | 1.165768779027E12 | 1.293028

[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-05-07 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533188#comment-14533188
 ] 

Khurram Faraaz commented on DRILL-2849:
---

I tried the tests, and I see an error message, earlier we did see the queries 
returned results. Is this expected behavior ? Should we allow new line 
characters inside of quoted strings ?

{code}
0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] as 
double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
columns[7] from `deletions/deletions-0-of-00020.csv` group by columns[7];
Error: SYSTEM ERROR: Error processing input: Cannot use newline character 
within quoted string, line=4071, char=513510. Content parsed: [ ]

Fragment 2:0

[Error Id: fe72951a-7a1e-41fe-abd7-c48dfdc74038 on centos-01.qa.lab:31010] 
(state=,code=0)
{code}

{code}
0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] as 
double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
Error: SYSTEM ERROR: Error processing input: Cannot use newline character 
within quoted string, line=4071, char=513515. Content parsed: [ ]

Fragment 2:1

[Error Id: 9db065d1-04e8-422f-962e-0056e117da98 on centos-01.qa.lab:31010] 
(state=,code=0)
{code}

{code}
0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] as 
double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
Error: SYSTEM ERROR: Error processing input: Cannot use newline character 
within quoted string, line=4071, char=513510. Content parsed: [ ]

Fragment 0:0

[Error Id: 45cf2b9e-26df-4a38-a328-8c9bf16a28b0 on centos-01.qa.lab:31010] 
(state=,code=0)
{code}


> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Khurram Faraaz
>Priority: Critical
> Fix For: 1.0.0
>
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1

[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-05-07 Thread Jacques Nadeau (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532302#comment-14532302
 ] 

Jacques Nadeau commented on DRILL-2849:
---

[~khfaraaz], with DRILL-2006 merged, can you reattempt this?

> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Khurram Faraaz
>Priority: Critical
> Fix For: 1.0.0
>
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions-0-of-00020.csv.gz` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (8.899 seconds)
> {code}
> Notice that count and sum results are different (from those above) when query 
> is executed over the CSV file created by CTAS.
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 245| 1.349670663E12 | 1.165768779027E12 | 1.2930281335065144E12 | 
> null   |
> ++++

[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-05-04 Thread Steven Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527558#comment-14527558
 ] 

Steven Phillips commented on DRILL-2849:


The problem is that we don't really have a good concept of null in Text files. 
In the original text file, some rows have 7 entries, and some have 8. So when 
selecting columns[7], for the rows where it is missing, we return null.

The CSV writer, on the other hand, write the string "null" when it sees a null 
value. But since there is no concept of null in text files as far as the reader 
is concerned, this is simply returned as a string "null".

Also of significance, is the fact that the original csv file doesn't appear to 
be valid. I see lines like this scattered throughout:

1348755809001,/user/ovguide,1363510469000,/user/turtlewax_bot,/m/0jkdvpx,/common/topic/description,"Guests:
 Columnist Thomas Friedman, writer David Frum, singer Natalie Mames, Gen. 
Wesley Clark and Sen. Barbara Boxer.

Note how there is no closing quotation mark for the string beginning with 
"Guests". This is causing unexpected behavior because the reader currently 
treats the quotations mark as part of the string, and thus faithfully writes it 
to the new file when doing CTAS. The null value that comes after is then 
assumed by the reader to part of the same string.

> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Steven Phillips
>Priority: Critical
> Fix For: 1.0.0
>
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions-0-of-00020.csv.gz` where columns[7] is null 
> group by columns[7];
> +++++--

[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-05-04 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527535#comment-14527535
 ] 

Khurram Faraaz commented on DRILL-2849:
---

I have sent the files to you  in an email.

> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Steven Phillips
>Priority: Critical
> Fix For: 1.0.0
>
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions-0-of-00020.csv.gz` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (8.899 seconds)
> {code}
> Notice that count and sum results are different (from those above) when query 
> is executed over the CSV file created by CTAS.
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 245| 1.349670663E12 | 1.165768779027E12 | 1.2930281335065144E12 | 
> null   |
> +++++

[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-04-28 Thread Steven Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518256#comment-14518256
 ] 

Steven Phillips commented on DRILL-2849:


could you please attach the file you used in the query.

> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Steven Phillips
>Priority: Critical
> Fix For: 1.0.0
>
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions-0-of-00020.csv.gz` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (8.899 seconds)
> {code}
> Notice that count and sum results are different (from those above) when query 
> is executed over the CSV file created by CTAS.
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 245| 1.349670663E12 | 1.165768779027E12 | 1.2930281335065144E12 | 
> null   |
> ++++--

[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-04-22 Thread Venki Korukanti (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508122#comment-14508122
 ] 

Venki Korukanti commented on DRILL-2849:


Nevermind the above comment. Looks like that is not the issue.

> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Steven Phillips
>Priority: Critical
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions-0-of-00020.csv.gz` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (8.899 seconds)
> {code}
> Notice that count and sum results are different (from those above) when query 
> is executed over the CSV file created by CTAS.
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 245| 1.349670663E12 | 1.165768779027E12 | 1.2930281335065144E12 | 
> null   |
> ++++++
> 1 row

[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-04-22 Thread Venki Korukanti (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508115#comment-14508115
 ] 

Venki Korukanti commented on DRILL-2849:


CSV writer creates a header row from the column names. Is that why you are 
seeing an extra row? Can you open the output file and check?

> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Steven Phillips
>Priority: Critical
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions-0-of-00020.csv.gz` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (8.899 seconds)
> {code}
> Notice that count and sum results are different (from those above) when query 
> is executed over the CSV file created by CTAS.
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 245| 1.349670663E12 | 1.165768779027E12 | 1.2930281335065144E12 | 
> null   |
> 

[jira] [Commented] (DRILL-2849) Difference in query results over CSV file created by CTAS, compared to results over original CSV file

2015-04-22 Thread Khurram Faraaz (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508056#comment-14508056
 ] 

Khurram Faraaz commented on DRILL-2849:
---

CTAS query used was

create table csvToCSV_0_of_00020 as select columns[0], columns[1], 
columns[2], columns[3], columns[4], columns[5], columns[6], columns[7] from 
`deletions-0-of-00020.csv.gz`;

> Difference in query results over CSV file created by CTAS, compared to 
> results over original CSV file 
> --
>
> Key: DRILL-2849
> URL: https://issues.apache.org/jira/browse/DRILL-2849
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 0.9.0
> Environment: 64e3ec52b93e9331aa5179e040eca19afece8317 | DRILL-2611: 
> value vectors should report valid value count | 16.04.2015 @ 13:53:34 EDT
>Reporter: Khurram Faraaz
>Assignee: Steven Phillips
>
> Different results are seen for the same query over CSV data file and another 
> CSV data file created by CTAS using the same CSV file.
> Tests were executed on 4 node cluster on CentOS.
> I got rid of the header information that is written by CTAS into the new CSV 
> file that CTAS creates, and then ran my queries over CTAS' CSV file.
> query over uncompressed CSV file, deletions/deletions-0-of-00020.csv
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions/deletions-0-of-00020.csv` group by 
> > columns[7];
> 88 rows selected (6.893 seconds)
> =
> {code}
> query over CSV file that was created by CTAS. (input to CTAS was 
> deletions/deletions-0-of-00020.csv)
> Notice there is one more record returned.
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` group by columns[7];
>  
> 89 rows selected (6.623 seconds)
> ==
> {code}
> query over compressed CSV file
> {code}
> > select count(cast(columns[0] as double)),max(cast(columns[0] as 
> > double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> > columns[7] from `deletions-0-of-00020.csv.gz` group by columns[7];
> 88 rows selected (10.526 seconds)
> ==
> {code}
> In the below cases, the count and sum results are different when query is 
> executed over CSV file that was created by CTAS. ( this may explain why we 
> see the difference in results in the above queries ? )
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions/deletions-0-of-00020.csv` where columns[7] is 
> null group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (6.013 seconds)
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `deletions-0-of-00020.csv.gz` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 252| 1.362983396001E12 | 1.165768779027E12 | 1.293794515595635E12 | 
> null   |
> ++++++
> 1 row selected (8.899 seconds)
> {code}
> Notice that count and sum results are different (from those above) when query 
> is executed over the CSV file created by CTAS.
> {code}
> 0: jdbc:drill:> select count(cast(columns[0] as double)),max(cast(columns[0] 
> as double)),min(cast(columns[0] as double)),avg(cast(columns[0] as double)), 
> columns[7] from `csvToCSV_0_of_00020/0_0_0.csv` where columns[7] is null 
> group by columns[7];
> ++++++
> |   EXPR$0   |   EXPR$1   |   EXPR$2   |   EXPR$3   |   EXPR$4   |
> ++++++
> | 245| 1.349670663E12 | 1.165768779027E12 | 1.293028