I think that these would be significant improvements. The current behavior is pretty painful on average. Better defaults and just a bit of deduction could pay off big. I even think that the presence of headers might be pretty reliably inferred.
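
To make the header-inference idea a little more concrete, here is a rough Python sketch of one possible heuristic. It is not anything Drill does today: the function names (`looks_numeric`, `probably_has_header`), the `sample_rows` parameter, and the rule itself are assumptions for illustration. The rule treats the first row as a header when every cell in it is non-empty, non-numeric, and unique, and at least one column below it is entirely numeric.

```python
import csv
import io


def looks_numeric(value: str) -> bool:
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def probably_has_header(text: str, sample_rows: int = 20) -> bool:
    """Guess whether the first row of some CSV text is a header row.

    Hypothetical heuristic (not Drill's behavior): the first row counts as a
    header if all of its cells are non-empty, non-numeric and unique, and at
    least one column in the sampled rows below it is entirely numeric.
    """
    rows = list(csv.reader(io.StringIO(text)))[: sample_rows + 1]
    if len(rows) < 2:
        return False  # nothing to compare the first row against

    first, body = rows[0], rows[1:]

    # A header cell should be a name, not a blank or a number.
    if any(cell.strip() == "" or looks_numeric(cell) for cell in first):
        return False

    # Column names are normally unique.
    if len(set(first)) != len(first):
        return False

    # Look for a column whose sampled data values are all numeric while its
    # first-row cell is not: strong evidence that the first row is a header.
    for col in range(len(first)):
        values = [r[col] for r in body if col < len(r) and r[col].strip() != ""]
        if values and all(looks_numeric(v) for v in values):
            return True

    return False
```

A file in which every column is text cannot be decided by this rule and falls through to `False`, so a real implementation would still need an explicit override. Python's standard library ships a comparable heuristic as `csv.Sniffer().has_header`.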
On Wed, Nov 17, 2021 at 4:31 PM Charles Givre <cgi...@gmail.com> wrote:

> Hello Drill Community,
> I would like to put forward some thoughts I've had relating to the CSV
> reader in Drill. I would like to propose a few changes which could
> actually be breaking changes, so I wanted to see if there are any strongly
> held opinions in the community. Here goes:
>
> The Problems:
> 1. The default behavior for Drill is to leave the extractColumnHeaders
> option as false. When a user queries a CSV file this way, the results are
> returned in a list of columns called columns. Thus if a user wants the
> first column, they would project columns[0]. I have never been a fan of
> this behavior. Even though Drill ships with the csvh file extension which
> enables the header extraction, this is not a commonly used file format.
> Furthermore, the returned results (the column list) does not work well with
> BI tools.
>
> 2. The CSV reader does not attempt to do any kind of data type discovery.
>
> Proposed Changes:
> The overall goal is to make it easier to query CSV data and also to make
> the behavior more consistent across format plugins.
> 1. Change the default behavior and set the extractHeaders to true.
> 2. Other formats, like the excel reader, read tables directly into
> columns. If the header is not known, Drill assigns a name of field_n. I
> would propose replacing the `columns` array with a model similar to the
> Excel reader.
> 3. Implement schema discovery (data types) with an allTextMode option
> similar to the JSON reader. When the allTextMode is disabled, the CSV
> reader would attempt to infer data types.
>
> Since there are some breaking changes here, I'd like to ask if people have
> any strong feelings on this topic or suggestions.
> Thanks!,
> -- C
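
For proposed changes 2 and 3 in the quoted message, the following Python sketch shows roughly how `field_n` naming and an `allTextMode`-style switch could interact. Everything in it is an assumption for illustration rather than Drill code: the function names (`infer_type`, `build_schema`), the three-type ladder (BIGINT, DOUBLE, VARCHAR), and the zero-based numbering of `field_n` are all made up for the example.

```python
from typing import Dict, List, Optional


def infer_type(values: List[str]) -> str:
    """Return the narrowest of BIGINT, DOUBLE, VARCHAR that fits every value.

    Empty cells are ignored; a column with no non-empty cells is VARCHAR.
    """
    non_empty = [v.strip() for v in values if v.strip() != ""]
    if not non_empty:
        return "VARCHAR"

    # Try the narrowest type first and widen only when a value fails to parse.
    for type_name, parser in (("BIGINT", int), ("DOUBLE", float)):
        try:
            for v in non_empty:
                parser(v)
            return type_name
        except ValueError:
            continue
    return "VARCHAR"


def build_schema(
    rows: List[List[str]],
    header: Optional[List[str]] = None,
    all_text_mode: bool = False,
) -> Dict[str, str]:
    """Map column names to inferred types for a sample of data rows.

    Columns without a usable header name are called field_<n>
    (zero-based here, which is an assumption). With all_text_mode=True,
    no inference is attempted and every column is VARCHAR.
    """
    width = max(
        len(header or []),
        max((len(r) for r in rows), default=0),
    )
    schema: Dict[str, str] = {}
    for i in range(width):
        has_name = header is not None and i < len(header) and header[i] != ""
        name = header[i] if has_name else f"field_{i}"
        if all_text_mode:
            schema[name] = "VARCHAR"
        else:
            schema[name] = infer_type([r[i] for r in rows if i < len(r)])
    return schema
```

For example, `build_schema([["1", "2.5", "x"], ["2", "3", "y"]])` yields `field_0` as BIGINT, `field_1` as DOUBLE, and `field_2` as VARCHAR, while passing `all_text_mode=True` returns VARCHAR for every column.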