[ 
https://issues.apache.org/jira/browse/DRILL-4145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034772#comment-15034772
 ] 

Peter McTaggart commented on DRILL-4145:
----------------------------------------

I have posted my storage plugin config for S3 in the 'Environment' field.

Basically, I added "csv" to the extensions list in the "csvh" format section. 
That section has extractHeader set to true, so the first line is parsed out as 
the header. (I also tried setting extractHeader in the "csv" format section, 
but it didn't seem to work and I didn't pursue it further.)

{noformat}
"csvh": {
  "type": "text",
  "extensions": [ "csvh", "csv" ],
  "extractHeader": true,
  "delimiter": ","
}
{noformat}

I am using the official 1.3.0 release from an apache mirror site.

{noformat}
0: jdbc:drill:> select * from sys.version;
+----------+-------------------------------------------+-----------------------------------------------------+----------------------------+---------------------+----------------------------+
| version  |                 commit_id                 |                   commit_message                    |        commit_time         |     build_email     |         build_time         |
+----------+-------------------------------------------+-----------------------------------------------------+----------------------------+---------------------+----------------------------+
| 1.3.0    | cc127ff4ac6272d2cb1b602890c0b7c503ea2062  | [maven-release-plugin] prepare release drill-1.3.0  | 17.11.2015 @ 22:05:19 PST  | jacq...@apache.org  | 17.11.2015 @ 22:09:19 PST  |
+----------+-------------------------------------------+-----------------------------------------------------+----------------------------+---------------------+----------------------------+
1 row selected (0.975 seconds)
{noformat}

Note: I have 6 files that contain the same type of data and are roughly the 
same size. (I think the only difference, apart from the data values, is that 
the columns may be in different orders in the files.) Three of these files 
work fine and three seem to have this problem, which is weird.
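
One way to check the column-order guess is to compare the header rows of a working file and a failing file. The following is a minimal sketch in Python; the helper names and the sample data are made up for illustration, and the real inputs would be the first lines of the attached files.

```python
import csv
import io


def header_of(csv_text):
    """Return the first row of CSV text as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)), [])


def same_columns_different_order(header_a, header_b):
    """True when both headers hold the same names but in a different order."""
    return sorted(header_a) == sorted(header_b) and header_a != header_b


# Made-up sample data; the real files are the JIRA attachments.
good = "id,ts,value\n1,27/10/2015 02:05:27,35.0\n"
bad = "ts,id,value\n27/10/2015 02:05:27,1,35.0\n"

print(same_columns_different_order(header_of(good), header_of(bad)))
```

If this reports no difference in column order for a good/bad pair, the ordering is probably not what separates the working files from the failing ones.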

For two of the files that cause this problem (I haven't tried the third yet), 
I have narrowed the threshold down to 4096 lines: both work at 4096 lines and 
both fail when the number of lines is increased to 4097 or more.
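
The narrowing described above can be automated with a binary search over the line count. This is a minimal sketch in Python, not a Drill tool: `smallest_failing_count` and the `fails` callback are hypothetical names, and in practice `fails(n)` would have to write the first n lines to a test file, upload it, and run the query. It assumes that once a file fails at some line count it also fails at every larger one.

```python
def smallest_failing_count(total_lines, fails):
    """Binary-search the smallest n in [1, total_lines] for which fails(n) is True.

    Assumes fails is monotonic (False up to some threshold, True after it).
    Returns None when even the full file does not fail.
    """
    if not fails(total_lines):
        return None
    lo, hi = 1, total_lines
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid          # failure at mid, so the threshold is mid or lower
        else:
            lo = mid + 1      # mid still works, so the threshold is above mid
    return lo


# Stand-in for the real check, mirroring the behaviour reported above:
# files work at 4096 lines and fail at 4097 or more.
def reported_behaviour(n):
    return n > 4096


print(smallest_failing_count(10000, reported_behaviour))
```

As a side observation, 4096 * 4 = 16384, which matches the "index: 16384, length: 4" in the error message. That might point to a buffer of 4-byte entries sized for 4096 rows, but this is only an inference from the numbers and has not been checked against the Drill source.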



> IndexOutOfBoundsException raised during select * query on S3 csv file
> ---------------------------------------------------------------------
>
>                 Key: DRILL-4145
>                 URL: https://issues.apache.org/jira/browse/DRILL-4145
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.3.0
>         Environment: Drill 1.3.0 on a 3 node distributed-mode cluster on AWS.
> Data files on S3.
> S3 storage plugin configuration:
> {
>   "type": "file",
>   "enabled": true,
>   "connection": "s3a://<bucket-name-was-here>",
>   "workspaces": {
>     "root": {
>       "location": "/",
>       "writable": false,
>       "defaultInputFormat": null
>     },
>     "views": {
>       "location": "/processed",
>       "writable": true,
>       "defaultInputFormat": null
>     },
>     "tmp": {
>       "location": "/tmp",
>       "writable": true,
>       "defaultInputFormat": null
>     }
>   },
>   "formats": {
>     "psv": {
>       "type": "text",
>       "extensions": [
>         "tbl"
>       ],
>       "delimiter": "|"
>     },
>     "csv": {
>       "type": "text",
>       "extensions": [
>         "csv"
>       ],
>       "extractHeader": true,
>       "delimiter": ","
>     },
>     "tsv": {
>       "type": "text",
>       "extensions": [
>         "tsv"
>       ],
>       "delimiter": "\t"
>     },
>     "parquet": {
>       "type": "parquet"
>     },
>     "json": {
>       "type": "json"
>     },
>     "avro": {
>       "type": "avro"
>     },
>     "sequencefile": {
>       "type": "sequencefile",
>       "extensions": [
>         "seq"
>       ]
>     },
>     "csvh": {
>       "type": "text",
>       "extensions": [
>         "csvh",
>         "csv"
>       ],
>       "extractHeader": true,
>       "delimiter": ","
>     }
>   }
> }
>            Reporter: Peter McTaggart
>         Attachments: apps1-bad.csv, apps1.csv
>
>
> When trying to query (via sqlline or WebUI) a .csv file I am getting an 
> IndexOutOfBoundsException:
> {noformat}
> 0: jdbc:drill:> select * from s3data.root.`staging/data/apps1-bad.csv` limit 1;
> Error: SYSTEM ERROR: IndexOutOfBoundsException: index: 16384, length: 4 (expected: range(0, 16384))
> Fragment 0:0
> [Error Id: be9856d2-0b80-4b9c-94a4-a1ca38ec5db0 on ip-XXXXX.compute.internal:31010] (state=,code=0)
> 0: jdbc:drill:> select * from s3data.root.`staging/data/apps1.csv` limit 1;
> +----------+----------------------+----------+----------+----------+------------+----------+------------+----------+--------------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | FIELD_1  |       FIELD_2        | FIELD_3  | FIELD_4  | FIELD_5  |  FIELD_6   | FIELD_7  |  FIELD_8   | FIELD_9  |   FIELD_10   | FIELD_11  |       FIELD_12       | FIELD_13  | FIELD_14  | FIELD_15  | FIELD_16  | FIELD_17  | FIELD_18  | FIELD_19  |       FIELD_20       | FIELD_21  | FIELD_22  | FIELD_23  | FIELD_24  | FIELD_25  | FIELD_26  | FIELD_27  | FIELD_28  | FIELD_29  | FIELD_30  | FIELD_31  | FIELD_32  | FIELD_33  | FIELD_34  | FIELD_35  |
> +----------+----------------------+----------+----------+----------+------------+----------+------------+----------+--------------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> | 489517   | 27/10/2015 02:05:27  | 261      | 1130232  | 0        | 925630488  | 0        | 925630488  | -1       | 19531580547  | 00000000  | 27/10/2015 02:00:00  |           | 30        | 300       | 0         | 0         | 00000000  | 00000000  | 27/10/2015 02:05:27  | 0         | 1         | 0         | 35.0      |           |           |           | 505       | 872.0     |           | aBc       |           |           |           |           |
> +----------+----------------------+----------+----------+----------+------------+----------+------------+----------+--------------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
> 1 row selected (1.094 seconds)
> 0: jdbc:drill:>
> {noformat}
> Good file (apps1.csv) and bad file (apps1-bad.csv) attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
