[ 
https://issues.apache.org/jira/browse/DRILL-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779443#comment-17779443
 ] 

ASF GitHub Bot commented on DRILL-8457:
---------------------------------------

ztomanek-dw opened a new pull request, #2840:
URL: https://github.com/apache/drill/pull/2840

   # [DRILL-8457](https://issues.apache.org/jira/browse/DRILL-8457): Allow 
configuring csv parser in http storage plugin configuration
   
   ## Description
   
   `HttpApiConfiguration` was extended with a `csvOptions` field, which allows 
setting the following properties:
   
   ```json
   {
     "csvOptions": {
       "delimiter": ",",
       "quote": "\"",
       "quoteEscape": "\"",
       "lineSeparator": "\n",
       "headerExtractionEnabled": null,
       "numberOfRowsToSkip": 0,
       "numberOfRecordsToRead": -1,
       "lineSeparatorDetectionEnabled": true,
       "maxColumns": 512,
       "maxCharsPerColumn": 4096,
       "skipEmptyLines": true,
       "ignoreLeadingWhitespaces": true,
       "ignoreTrailingWhitespaces": true,
       "nullValue": null
     }
   }
   ```
   
   This provides greater CSV parsing flexibility, since the user can set a 
different delimiter, maximum number of columns, or maximum column size.
   
   Backward compatibility is preserved: the parser works the same as before if 
`csvOptions` is null.
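
The fallback described above can be sketched as follows. This is only an illustration in Python, not the code from the PR: the helper `effective_csv_options` is hypothetical, and treating the values from the example above as the parser defaults is an assumption.

```python
from typing import Any, Dict, Optional

# Assumption: the values from the example above are treated as the defaults
# that apply when nothing is configured. They are not read from Drill's source.
DEFAULT_CSV_OPTIONS: Dict[str, Any] = {
    "delimiter": ",",
    "quote": "\"",
    "quoteEscape": "\"",
    "lineSeparator": "\n",
    "headerExtractionEnabled": None,
    "numberOfRowsToSkip": 0,
    "numberOfRecordsToRead": -1,
    "lineSeparatorDetectionEnabled": True,
    "maxColumns": 512,
    "maxCharsPerColumn": 4096,
    "skipEmptyLines": True,
    "ignoreLeadingWhitespaces": True,
    "ignoreTrailingWhitespaces": True,
    "nullValue": None,
}


def effective_csv_options(csv_options: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: return the options a parser would run with."""
    # A missing (null) csvOptions block leaves every setting at its default,
    # which is what keeps existing plugin configurations working unchanged.
    if csv_options is None:
        return dict(DEFAULT_CSV_OPTIONS)

    # Reject misspelled keys instead of silently ignoring them.
    unknown = set(csv_options) - set(DEFAULT_CSV_OPTIONS)
    if unknown:
        raise ValueError("unknown csvOptions keys: " + ", ".join(sorted(unknown)))

    # Only the keys the user supplied override the defaults.
    merged = dict(DEFAULT_CSV_OPTIONS)
    merged.update(csv_options)
    return merged
```

With this shape, a config that sets only `delimiter` still gets `maxColumns` and the other settings from the defaults.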
   
   ## Documentation
   
   Add the following paragraph to 
https://drill.apache.org/docs/http-storage-plugin/#configuring-the-api-connections
   
   ````
   ##### CSV parser options
   
   The CSV parser of the HTTP storage plugin can be configured using `csvOptions`.
   
   ```json
   {
     "csvOptions": {
       "delimiter": ",",
       "quote": "\"",
       "quoteEscape": "\"",
       "lineSeparator": "\n",
       "headerExtractionEnabled": null,
       "numberOfRowsToSkip": 0,
       "numberOfRecordsToRead": -1,
       "lineSeparatorDetectionEnabled": true,
       "maxColumns": 512,
       "maxCharsPerColumn": 4096,
       "skipEmptyLines": true,
       "ignoreLeadingWhitespaces": true,
       "ignoreTrailingWhitespaces": true,
       "nullValue": null
     }
   }
   ```
   
   For example, to parse `.tsv` files you can use the following config:
   
   ```json
   {
     "csvOptions": {
       "delimiter": "\t"
     }
   }
   ```
   
   ````
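
To illustrate why the delimiter matters for `.tsv` input, here is a small sketch using Python's standard `csv` module. It is not Drill's parser, and the sample data is made up for the illustration; it only shows how the same text splits under a tab delimiter versus the default comma.

```python
import csv
import io

# Made-up sample: three tab-separated columns, with a comma inside one value.
SAMPLE = "id\tname\tnote\n1\tAda\thello, world\n"


def parse(text, delimiter):
    """Split text into rows of fields using the given single-character delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter, quotechar='"'))


# With a tab delimiter each line splits into the three intended columns.
rows_tsv = parse(SAMPLE, "\t")

# With a comma delimiter the header stays one field and the data line is
# split at the comma inside the note, so the columns no longer line up.
rows_csv = parse(SAMPLE, ",")
```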
   
   ## Testing
   
   Create the following storage plugin with the name `github`:
   
   
   ```json
   {
     "type": "http",
     "connections": {
       "test-data": {
         "url": "https://raw.githubusercontent.com/semantic-web-company/wic-tsv/master/data/de/Test/test_examples.txt",
         "requireTail": false,
         "method": "GET",
         "authType": "none",
         "inputType": "csv",
         "xmlDataLevel": 1,
         "postParameterLocation": "QUERY_STRING",
         "csvOptions": {
           "delimiter": "\t",
           "quote": "\"",
           "quoteEscape": "\"",
           "lineSeparator": "\n",
           "numberOfRecordsToRead": -1,
           "lineSeparatorDetectionEnabled": true,
           "maxColumns": 512,
           "maxCharsPerColumn": 4096,
           "skipEmptyLines": true,
           "ignoreLeadingWhitespaces": true,
           "ignoreTrailingWhitespaces": true
         },
         "verifySSLCert": true
       }
     },
     "timeout": 5,
     "retryDelay": 1000,
     "proxyType": "direct",
     "authMode": "SHARED_USER",
     "enabled": true
   }
   ```
   
   Then query the TSV file with:
   
   ```sql
   SELECT * FROM github.`test-data`
   ```
   
   You should see a result set containing three columns.
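
The same check can also be scripted against Drill's REST API instead of the web console. The sketch below assumes a Drillbit reachable at `http://localhost:8047` with authentication disabled; the helper names `build_query_request` and `run_query` are made up for this example.

```python
import json
import urllib.request

# Assumption: local Drillbit on the default web port, no authentication.
DEFAULT_BASE_URL = "http://localhost:8047"


def build_query_request(sql, base_url=DEFAULT_BASE_URL):
    """Build a POST request for Drill's /query.json endpoint (hypothetical helper)."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url=DEFAULT_BASE_URL):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_request(sql, base_url)) as response:
        return json.load(response)
```

Calling ``run_query("SELECT * FROM github.`test-data`")`` against a running Drillbit should return a JSON object whose `columns` entry can be checked for the three expected columns.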
   




> Allow configuring csv parser in http storage plugin configuration
> -----------------------------------------------------------------
>
>                 Key: DRILL-8457
>                 URL: https://issues.apache.org/jira/browse/DRILL-8457
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - HTTP
>    Affects Versions: Future
>            Reporter: Zbigniew Tomanek
>            Priority: Minor
>             Fix For: Future
>
>
> Currently there is no way to configure the CSV parser when the HTTP plugin is 
> used. Because of that, some files cannot be parsed (e.g. when any column has 
> more than 4096 chars or the file uses a delimiter other than `,`).
> Since we use the HTTP plugin quite often at DataWalk, we've changed our 
> internal fork of Drill so that the following parser/format properties can be 
> configured using an additional `csvOptions` field:
>  
> {code:json}
> {
>   "csvOptions": {
>     "delimiter": "\t",
>     "quote": "\"",
>     "quote_escape": "\"",
>     "line_separator": "\n",
>     "header_extraction_enabled": null,
>     "number_of_rows_to_skip": 0,
>     "number_of_records_to_read": -1,
>     "line_separator_detection_enabled": true,
>     "max_columns": 512,
>     "max_chars_per_column": 4096,
>     "skip_empty_lines": true,
>     "ignore_leading_whitespaces": true,
>     "ignore_trailing_whitespaces": true,
>     "null_value": null
>   }
> }{code}
> I'd be glad to get feedback on whether creating a PR with these changes would 
> bring any value to Drill.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
