[jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file

2015-12-02 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036572#comment-15036572
 ] 

John Omernik commented on DRILL-4145:
-

Looks like Steven pushed a change. Steven, does that one-line addition fix this? 
That's awesome if that's all it took! I did confirm that I have the same issue 
on MapRFS as well. 

Peter, the other issue I saw you mention was that adding extractHeader to 
the csv format didn't actually have the desired effect. That may be a bug too; do you 
want to open a Jira on that as well? (It should work, and when I did my testing, it 
didn't either.) 

Thanks for your work on this Peter, it's great to find bugs like this. Helps 
everyone!

John

> IndexOutOfBoundsException raised during select * query on S3 csv file
> -
>
> Key: DRILL-4145
> URL: https://issues.apache.org/jira/browse/DRILL-4145
> Project: Apache Drill
> Issue Type: Bug
> Components: Functions - Drill
> Affects Versions: 1.3.0
> Environment: Drill 1.3.0 on a 3 node distributed-mode cluster on AWS.
> Data files on S3.
> S3 storage plugin configuration:
> {
>   "type": "file",
>   "enabled": true,
>   "connection": "s3a://",
>   "workspaces": {
>     "root": {
>       "location": "/",
>       "writable": false,
>       "defaultInputFormat": null
>     },
>     "views": {
>       "location": "/processed",
>       "writable": true,
>       "defaultInputFormat": null
>     },
>     "tmp": {
>       "location": "/tmp",
>       "writable": true,
>       "defaultInputFormat": null
>     }
>   },
>   "formats": {
>     "psv": {
>       "type": "text",
>       "extensions": [
>         "tbl"
>       ],
>       "delimiter": "|"
>     },
>     "csv": {
>       "type": "text",
>       "extensions": [
>         "csv"
>       ],
>       "extractHeader": true,
>       "delimiter": ","
>     },
>     "tsv": {
>       "type": "text",
>       "extensions": [
>         "tsv"
>       ],
>       "delimiter": "\t"
>     },
>     "parquet": {
>       "type": "parquet"
>     },
>     "json": {
>       "type": "json"
>     },
>     "avro": {
>       "type": "avro"
>     },
>     "sequencefile": {
>       "type": "sequencefile",
>       "extensions": [
>         "seq"
>       ]
>     },
>     "csvh": {
>       "type": "text",
>       "extensions": [
>         "csvh",
>         "csv"
>       ],
>       "extractHeader": true,
>       "delimiter": ","
>     }
>   }
> }
> Reporter: Peter McTaggart
> Assignee: Jacques Nadeau
> Attachments: apps1-bad.csv, apps1.csv
>
>
> When trying to query (via sqlline or WebUI) a .csv file I am getting an 
> IndexOutOfBoundsException:
> {noformat}
> 0: jdbc:drill:> select * from s3data.root.`staging/data/apps1-bad.csv` limit 1;
> Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))
> Fragment 0:0
> [Error Id: be9856d2-0b80-4b9c-94a4-a1ca38ec5db0 on ip-X.compute.internal:31010] (state=,code=0)
> 0: jdbc:drill:> select * from s3data.root.`staging/data/apps1.csv` limit 1;
> +---------+---------------------+---------+---------+---------+-----------+---------+-----------+---------+-------------+----------+---------------------+----------+----------+----------+----------+----------+----------+----------+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
> | FIELD_1 | FIELD_2             | FIELD_3 | FIELD_4 | FIELD_5 | FIELD_6   | FIELD_7 | FIELD_8   | FIELD_9 | FIELD_10    | FIELD_11 | FIELD_12            | FIELD_13 | FIELD_14 | FIELD_15 | FIELD_16 | FIELD_17 | FIELD_18 | FIELD_19 | FIELD_20            | FIELD_21 | FIELD_22 | FIELD_23 | FIELD_24 | FIELD_25 | FIELD_26 | FIELD_27 | FIELD_28 | FIELD_29 | FIELD_30 | FIELD_31 | FIELD_32 | FIELD_33 | FIELD_34 | FIELD_35 |
> +---------+---------------------+---------+---------+---------+-----------+---------+-----------+---------+-------------+----------+---------------------+----------+----------+----------+----------+----------+----------+----------+---------------------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
> | 489517  | 27/10/2015 02:05:27 | 261     | 1130232 | 0       | 925630488 | 0       | 925630488 | -1      | 19531580547 |          | 27/10/2015 02:00:00 |          | 30       | 300      | 0        | 0        |          |          | 27/10/2015 02:05:27 | 0        | 1        | 0        | 35.0     |          |          |          | 505      | 872.0    |          | aBc      |          |          |          |          |

[jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file

2015-12-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036816#comment-15036816
 ] 

ASF GitHub Bot commented on DRILL-4145:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/287



[jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file

2015-12-02 Thread Steven Phillips (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035589#comment-15035589
 ] 

Steven Phillips commented on DRILL-4145:


There is a bug in the case where there is an empty string for the last field. 
Basically, when the parser sees a delimiter immediately followed by a newline, 
it calls the "endEmptyField()" method of the TextOutput. This was OK 
when using the RepeatedVarCharOutput, because calling this method resulted in an 
empty string element being added to the array. But in the FieldVarCharOutput, 
ending the field doesn't do anything unless you first start the field.
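
The behaviour described above can be illustrated with a small Python sketch. The class and method names below are hypothetical stand-ins, not Drill's Java classes, and the `start_first` switch is only a guess at the general shape of a fix, not the actual patch. The toy parser reports a trailing empty field through `end_empty_field()`, which one output records and the other silently drops.

```python
class RepeatedOutputSketch:
    """Stand-in for an output that collects each record into one array."""

    def __init__(self):
        self.records = []    # finished records
        self._fields = []    # fields of the record being built
        self._buf = None     # characters of the open field, None if no field is open

    def start_field(self):
        self._buf = []

    def append(self, ch):
        self._buf.append(ch)

    def end_field(self):
        self._fields.append("".join(self._buf))
        self._buf = None

    def end_empty_field(self):
        # Always contributes an empty string element.
        self._fields.append("")

    def finish_record(self):
        self.records.append(self._fields)
        self._fields = []


class PerFieldOutputSketch(RepeatedOutputSketch):
    """Stand-in for an output where ending a field is a no-op unless one was started."""

    def __init__(self, start_first=False):
        super().__init__()
        self.start_first = start_first   # hypothetical switch for the guessed fix

    def end_field(self):
        if self._buf is None:
            return                       # nothing was started, so nothing is recorded
        super().end_field()

    def end_empty_field(self):
        if self.start_first:
            self.start_field()
        self.end_field()


def parse(text, out, delimiter=","):
    """Toy parser: a trailing empty field is reported via end_empty_field()."""
    for line in text.splitlines():
        pieces = line.split(delimiter)
        for i, piece in enumerate(pieces):
            if piece == "" and i == len(pieces) - 1:
                out.end_empty_field()
                continue
            out.start_field()
            for ch in piece:
                out.append(ch)
            out.end_field()
        out.finish_record()
    return out.records


sample = "a,b,\n1,2,\n"
print(parse(sample, RepeatedOutputSketch()))
print(parse(sample, PerFieldOutputSketch()))
print(parse(sample, PerFieldOutputSketch(start_first=True)))
```

In this sketch the per-field output loses the final column of every record until the field is started before it is ended. How that missing value leads to the IndexOutOfBoundsException in Drill is not spelled out in this thread.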


[jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file

2015-12-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035590#comment-15035590
 ] 

ASF GitHub Bot commented on DRILL-4145:
---

GitHub user StevenMPhillips opened a pull request:

https://github.com/apache/drill/pull/287

DRILL-4145: Handle empty final field in Text reader correctly



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/StevenMPhillips/drill drill-4145

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/287.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #287


commit 8f56250aeb29d5d21bcdc6c727cec89607150224
Author: Steven Phillips 
Date:   2015-12-02T10:09:20Z

DRILL-4145: Handle empty final field in Text reader correctly





[jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file

2015-12-01 Thread Peter McTaggart (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034772#comment-15034772
 ] 

Peter McTaggart commented on DRILL-4145:


I have posted my storage plugin config for S3 in the 'environment' field.

Basically, I added "csv" to the extensions list in the "csvh" format section. 
This has extractHeader set to true and parses out the first line.  (I also 
tried to set extractHeader in the "csv" format section, but it didn't seem to 
work and I didn't pursue it further.)

{noformat}
"csvh": {
  "type": "text",
  "extensions": [ "csvh", "csv" ],
  "extractHeader": true,
  "delimiter": ","
}
{noformat}
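
In the plugin configuration posted in the environment field, the "csv" extension is listed under both the "csv" and the "csvh" format. A minimal Python sketch for spotting such overlaps follows; the function name is hypothetical, and how Drill chooses between two formats that claim the same extension is not stated in this thread.

```python
import json
from collections import defaultdict


def extensions_claimed_twice(plugin_config):
    """Map each file extension to the formats that list it, keeping only overlaps."""
    claims = defaultdict(list)
    for name, fmt in plugin_config.get("formats", {}).items():
        for ext in fmt.get("extensions", []):
            claims[ext].append(name)
    return {ext: names for ext, names in claims.items() if len(names) > 1}


# Only the two relevant formats from the posted configuration are reproduced here.
config = json.loads("""
{
  "formats": {
    "csv":  {"type": "text", "extensions": ["csv"], "extractHeader": true, "delimiter": ","},
    "csvh": {"type": "text", "extensions": ["csvh", "csv"], "extractHeader": true, "delimiter": ","}
  }
}
""")

print(extensions_claimed_twice(config))
```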

I am using the official 1.3.0 release from an apache mirror site.

{noformat}
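0: jdbc:drill:> select * from sys.version;
+---------+------------------------------------------+----------------------------------------------------+---------------------------+--------------------+---------------------------+
| version | commit_id                                | commit_message                                     | commit_time               | build_email        | build_time                |
+---------+------------------------------------------+----------------------------------------------------+---------------------------+--------------------+---------------------------+
| 1.3.0   | cc127ff4ac6272d2cb1b602890c0b7c503ea2062 | [maven-release-plugin] prepare release drill-1.3.0 | 17.11.2015 @ 22:05:19 PST | jacq...@apache.org | 17.11.2015 @ 22:09:19 PST |
+---------+------------------------------------------+----------------------------------------------------+---------------------------+--------------------+---------------------------+
1 row selected (0.975 seconds)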
{noformat}

Note: I have 6 files that contain the same type of data and are roughly the 
same size (I think the only difference, apart from the data values, is that the 
columns may be in different orders in the files). Three of these files work 
fine and three seem to have this problem, which is weird.

For the files that cause this problem, I have narrowed two of them (I haven't 
tried the third yet) down to a threshold of 4096 lines: they work when cut to 
4096 lines, and both fail when the number of lines is increased to 4097 or more.
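
The 4096-line threshold lines up arithmetically with the reported error (index: 16384, length: 4, expected range(0, 16384)). The short Python sketch below only shows that arithmetic; reading it as one 4-byte entry per row in a 16384-byte buffer is an assumption, not something confirmed in this thread, and the function name is hypothetical.

```python
def first_slot_out_of_range(capacity_bytes, slot_width):
    """Return (slot_number, byte_index) of the first fixed-width slot that no longer fits."""
    slot = capacity_bytes // slot_width   # slots 0 .. slot-1 fit entirely in the buffer
    return slot, slot * slot_width


print(first_slot_out_of_range(16384, 4))
```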




[jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file

2015-12-01 Thread Peter McTaggart (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034799#comment-15034799
 ] 

Peter McTaggart commented on DRILL-4145:


I have just tried the 3rd 'broken' file and it has the same problem.

e.g.
head -n 4096 file_A.csv > data_file.csv  ==> select * from s3.data_file.csv works!
head -n 4097 file_A.csv > data_file.csv  ==> select * from s3.data_file.csv throws IndexOutOfBoundsException
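
The manual narrowing above can be automated with a bisection. The following is a minimal Python sketch under stated assumptions: `fails` is a hypothetical callback that writes a prefix of the file, runs the query, and reports whether it raised the exception, and the search assumes that once a prefix fails, every longer prefix fails too.

```python
from itertools import islice


def write_prefix(src_path, dst_path, n_lines):
    """Copy the first n_lines of src_path to dst_path (the equivalent of `head -n`)."""
    with open(src_path, "r", newline="") as src, open(dst_path, "w", newline="") as dst:
        dst.writelines(islice(src, n_lines))


def smallest_failing_prefix(total_lines, fails):
    """Return the smallest line count n (1..total_lines) for which fails(n) is true, or None."""
    if not fails(total_lines):
        return None
    lo, hi = 1, total_lines
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid        # the threshold is at mid or below
        else:
            lo = mid + 1    # the threshold is above mid
    return lo


# Stand-in predicate mirroring the observation in this comment.
print(smallest_failing_prefix(10000, lambda n: n > 4096))
```

In real use, `fails` would call `write_prefix`, upload the result, and run the `select *` query against it.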


[jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file

2015-12-01 Thread Peter McTaggart (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035040#comment-15035040
 ] 

Peter McTaggart commented on DRILL-4145:


OK, trying a few other things, it seems that the problem is related to the 
extractHeader functionality / "csvh".

I copied the test files from S3 to MapRFS.

On MapRFS:
If I do a normal select * as a csv file on either file, it works (returning 
each row as a single list value in the columns field).

If I change the filename to give it a csvh extension, the shorter (4096-line) 
file works and the other (4097-line) file throws the IndexOutOfBoundsException. 
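
The two result shapes described above (one `columns` list per row versus named fields taken from the first line) can be sketched in Python with the standard `csv` module. This only illustrates the difference in output shape; it is not how Drill's text reader is implemented, and the function names are hypothetical.

```python
import csv
import io


def read_as_columns(text):
    """No header extraction: every line, including the first, becomes one `columns` list."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]


def read_with_header(text):
    """Header extraction: the first line names the fields of every following row."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


sample = "FIELD_1,FIELD_2,FIELD_3\n489517,27/10/2015 02:05:27,261\n"
print(read_as_columns(sample)[0])
print(read_with_header(sample)[0])
```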


[jira] [Commented] (DRILL-4145) IndexOutOfBoundsException raised during select * query on S3 csv file

2015-12-01 Thread John Omernik (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033713#comment-15033713
 ] 

John Omernik commented on DRILL-4145:
-

I just tested the apps1-bad.csv on a MapRFS-based DFS plugin (so perhaps we can 
focus on S3 here). Basically, when I ran the same query as you, I had no 
issues at all. I am running the Developer release (based on the 1.3 release 
from Apache) of MapR Drill; thus, other than some additions for MapR Tables, we 
should be on the same code base (if you are running 1.3). 

This is interesting to me though, because when I ran the query, instead of 
interpreting the fields as your setup did, mine returned one field, 
"columns", containing an array. Thus my "limit 1" query data started out like this:

| ["FIELD_1","FIELD_2","FIELD_3","

I.e., in your query it parsed the header line into fields; in mine it returned 
them all as an array. The reason I bring this up is that I am curious about the 
differences in our setups. If we are both running 1.3, it should return the same 
result, right? Can you share the formats section of your S3 plugin? I tried to use 
"extractHeader": true on mine, but got the same result, so I am curious about your 
configuration there. 

I want to find out whether we can home in on the S3 difference or eliminate 
configuration or version differences. 

Additionally, can you do select * from sys.version and share the commit_time 
and build_time on yours? That may be helpful as well for me. I have a commit 
time of 20.11.2015 @ 01:34:54 UTC and a build time of 21.11.2015 @ 05:21:04 
UTC. Are you using the official release or are you using a snapshot from 
GitHub?

Thanks!

