[ https://issues.apache.org/jira/browse/DRILL-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arina Ielchiieva updated DRILL-7279:
------------------------------------
    Labels: ready-to-commit  (was: )

> Support provided schema for CSV without headers
> -----------------------------------------------
>
>                 Key: DRILL-7279
>                 URL: https://issues.apache.org/jira/browse/DRILL-7279
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.16.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.17.0
>
>
> Extend the provided-schema support added to the text reader in Drill 1.16 
> so that a schema can also be provided for files without headers. Behavior:
> * If the file is configured to not extract headers, and a schema is provided, 
> and the schema has at least one column, then use the provided schema to 
> create individual columns. Otherwise, continue to use {{columns}} as in 
> previous versions.
> * The columns in the schema are assumed to match left-to-right with those in 
> the file.
> * If the schema contains more columns than the file, the extra columns take 
> their default values. (This occurs in schema evolution when a column is added 
> to newer files.)
> * If the file contains more columns than the schema, the extra columns at 
> the end of each line are ignored. This matches the behavior for files that 
> contain headers.
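The column-matching rules above can be modeled with a short Python sketch. This is an illustration only, not Drill's Java implementation: the function name `map_headerless_row` and the representation of a schema as a list of `(name, default)` pairs are assumptions made for the example.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical schema representation: ordered (column name, default value) pairs.
Schema = List[Tuple[str, Any]]


def map_headerless_row(fields: List[str], schema: Optional[Schema]) -> Dict[str, Any]:
    """Map one line of a headerless text file onto a provided schema."""
    # No schema, or a schema with no columns: fall back to the `columns` array.
    if not schema:
        return {"columns": list(fields)}

    row: Dict[str, Any] = {}
    # Schema columns match file columns left to right.
    for i, (name, default) in enumerate(schema):
        # A schema column beyond the end of the line takes its default value.
        row[name] = fields[i] if i < len(fields) else default
    # Fields beyond the last schema column are never visited, so they are ignored.
    return row
```

For example, a two-field line read against a three-column schema fills the third column with its default, while a three-field line read against a two-column schema drops the last field.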
> h4. Table Properties
> This change also adds four table properties for text files. These 
> properties, if present, override those defined in the format plugin 
> configuration. They let the user keep a single "csv" config yet have many 
> tables with the "csv" suffix, each with different properties. That is, the 
> user need not define a new plugin config and a new file extension just to 
> change one file format property. For example, a ".csv" file can have 
> headers without the user defining a different suffix (usually ".csvh" in 
> Drill) for that case.
> || Table Property || Equivalent Plugin Config Property ||
> | {{drill.headers}} | {{extractHeader}} |
> | {{drill.skipFirstLine}} | {{skipFirstLine}} |
> | {{drill.delimiter}} | {{fieldDelimiter}} |
> | {{drill.commentChar}} | {{comment}} |
> For each, the rules are:
> * If the table property is not set, then the plugin property is used.
> * If the table property is set, then the property value replaces the plugin 
> property value for that one specific table.
> * For the delimiter, if the property value is an empty string, then this is 
> the same as an unset property.
> * For the comment, if the property value is an empty string, then the comment 
> character is set to ASCII NUL, which will never match. This effectively 
> turns off the comment feature for this one table.
> * If the delimiter or comment value is longer than a single character, only 
> the first character is used.
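The override rules for the two single-character properties can be summarized in a small Python sketch. It is a model of the rules listed above, not Drill's code: the function names are hypothetical, and table properties are assumed to arrive as a plain dict of strings. The two boolean properties follow only the first two rules and are omitted here.

```python
from typing import Dict

ASCII_NUL = "\0"  # never matches a character in the text, so comments are disabled


def resolve_delimiter(table_props: Dict[str, str], plugin_value: str) -> str:
    """Pick the field delimiter for one table."""
    value = table_props.get("drill.delimiter")
    # Unset, or set to the empty string: use the plugin config value.
    if not value:
        return plugin_value
    # Values longer than one character are truncated to the first character.
    return value[0]


def resolve_comment(table_props: Dict[str, str], plugin_value: str) -> str:
    """Pick the comment character for one table."""
    value = table_props.get("drill.commentChar")
    # Unset: use the plugin config value.
    if value is None:
        return plugin_value
    # Empty string: turn the comment feature off for this table.
    if value == "":
        return ASCII_NUL
    # Values longer than one character are truncated to the first character.
    return value[0]
```

The asymmetry is the point of the sketch: an empty delimiter falls back to the plugin value, whereas an empty comment character disables comments.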
> The table properties can be used without specifying a "provided" schema. 
> Just omit all columns from the schema:
> {noformat}
> create schema () for table `dfs.data`.`example`
> PROPERTIES ('drill.headers'='false', 'drill.skipFirstLine'='false', 
> 'drill.delimiter'='|')
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
