I think that these would be significant improvements.

The current behavior is painful in most cases. Better defaults and a little
inference could pay off in a big way. I even think that the presence of
headers could be inferred fairly reliably.
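
To make the header inference idea concrete, here is a minimal Python sketch.
It is only an illustration under stated assumptions, not Drill code: the
function names (`looks_numeric`, `probably_has_header`) and the heuristic
itself are hypothetical. The first row is treated as a header when none of
its cells are empty or numeric, its names are unique, and at least one
column is numeric in every sampled data row.

```python
import csv
import io


def looks_numeric(cell: str) -> bool:
    """Return True if the cell parses as a float."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def probably_has_header(text: str, sample_rows: int = 20) -> bool:
    """Guess whether the first row of CSV text is a header row.

    Hypothetical heuristic (an assumption, not Drill's logic): the first
    row is a header when no cell in it is empty or numeric, its names are
    unique, and at least one column is numeric in every sampled data row.
    """
    rows = list(csv.reader(io.StringIO(text)))[: sample_rows + 1]
    if len(rows) < 2:
        return False  # not enough evidence to decide
    first, data = rows[0], rows[1:]
    if any(cell.strip() == "" or looks_numeric(cell) for cell in first):
        return False
    if len(set(first)) != len(first):
        return False
    # A text label sitting above an all-numeric column suggests a header.
    for col in range(len(first)):
        if all(len(row) > col and looks_numeric(row[col]) for row in data):
            return True
    return False
```

A file whose columns are all text stays ambiguous under this heuristic, so
the sketch answers False there; a real implementation would still need an
explicit option to override the guess.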



On Wed, Nov 17, 2021 at 4:31 PM Charles Givre <cgi...@gmail.com> wrote:

> Hello Drill Community,
> I would like to put forward some thoughts I've had relating to the CSV
> reader in Drill.  I would like to propose a few changes which could
> actually be breaking changes, so I wanted to see if there are any strongly
> held opinions in the community.  Here goes:
>
> The Problems:
> 1.  The default behavior for Drill is to leave the extractHeader
> option as false.  When a user queries a CSV file this way, the results are
> returned in a list of columns called columns.  Thus if a user wants the
> first column, they would project columns[0].  I have never been a fan of
> this behavior.  Even though Drill ships with the csvh file extension which
> enables the header extraction, this is not a commonly used file format.
> Furthermore, the returned results (the column list) do not work well with
> BI tools.
>
> 2.  The CSV reader does not attempt to do any kind of data type discovery.
>
> Proposed Changes:
> The overall goal is to make it easier to query CSV data and also to make
> the behavior more consistent across format plugins.
> 1.  Change the default behavior and set extractHeader to true.
> 2.  Other formats, like the Excel reader, read tables directly into
> columns.  If the header is not known, Drill assigns a name of field_n.  I
> would propose replacing the `columns` array with a model similar to the
> Excel reader.
> 3.  Implement schema discovery (data types) with an allTextMode option
> similar to the JSON reader.  When allTextMode is disabled, the CSV
> reader would attempt to infer data types.
>
> Since there are some breaking changes here, I'd like to ask if people have
> any strong feelings on this topic or suggestions.
> Thanks!
> -- C
>
>
>
>
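
On point 3 of the proposal (schema discovery with an allTextMode option),
the following Python sketch shows one way the inference could behave. It is
an assumption-laden illustration, not the Drill reader: `infer_column_type`
and `infer_schema` are hypothetical names, and the choice of BIGINT, DOUBLE,
and VARCHAR as the only candidate types is made up for the example.

```python
from typing import List


def infer_column_type(values: List[str]) -> str:
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v.strip() for v in values if v.strip() != ""]
    if not non_empty:
        return "VARCHAR"  # nothing to infer from, so fall back to text
    # Try the narrowest type first and widen only when parsing fails.
    for type_name, parse in (("BIGINT", int), ("DOUBLE", float)):
        try:
            for v in non_empty:
                parse(v)
            return type_name
        except ValueError:
            continue
    return "VARCHAR"


def infer_schema(rows: List[List[str]], all_text_mode: bool = False) -> List[str]:
    """Infer one type name per column from already-parsed CSV data rows.

    With all_text_mode enabled, every column is reported as VARCHAR,
    mirroring the idea of an allTextMode switch (hypothetical behavior).
    """
    width = max((len(row) for row in rows), default=0)
    if all_text_mode:
        return ["VARCHAR"] * width
    # Short rows are padded with empty strings, which inference ignores.
    columns = [[row[i] if i < len(row) else "" for row in rows]
               for i in range(width)]
    return [infer_column_type(col) for col in columns]
```

For example, `infer_schema([["alice", "30", "1.5"], ["bob", "25", "2"]])`
yields one text column, one integer column, and one floating-point column,
while passing `all_text_mode=True` reports all three as VARCHAR.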
