On Thu, Apr 5, 2018 at 9:02 AM, Aman Sinha <amansi...@apache.org> wrote:

> All good discussions in this thread.  It clearly shows that Drill's
> schema-on-read is not only a nice-to-have but for applications like IOT, it
> is a must-have.
>

Absolutely. It must be possible to have schema on read even if that isn't
necessarily the most common path.


> For other types of data that is slowly changing,  in order to improve
> overall user experience where the user is willing to run offline commands
> to discover schema (as opposed to doing it while querying),
> we should consider doing sampling of the files with different sampling
> percentages.


I am not even convinced that sampling is necessary. Just have [select *
from foo.file] store the types and statistics that it discovers. If you
want the stats to be available before most people run a query, just run
that query on each new file in a directory.
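
As a rough illustration of the "each new file in a directory" idea, here is a
Python sketch (not Drill code). The names refresh_directory, scan_file, and
the cache layout are hypothetical, and the modification-time check is just
one assumed way to decide that a file has not been seen yet.

```python
import os
from typing import Callable, Dict, List, Tuple

# Hypothetical cache layout: path -> (mtime when scanned, metadata from the scan).
MetadataCache = Dict[str, Tuple[float, dict]]


def refresh_directory(directory: str,
                      scan_file: Callable[[str], dict],
                      cache: MetadataCache) -> List[str]:
    """Run a full scan only on files that are new or changed.

    scan_file stands in for whatever full-file scan discovers types and
    statistics. Returns the paths that were scanned on this call.
    """
    scanned = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        mtime = os.path.getmtime(path)
        cached = cache.get(path)
        if cached is not None and cached[0] == mtime:
            continue  # already scanned and unchanged since then
        cache[path] = (mtime, scan_file(path))
        scanned.append(path)
    return scanned
```

Calling this periodically means each file gets one full scan, and later
queries can read the stored metadata instead of rediscovering it.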



> This would be similar to collecting statistics through
> sampling.


I don't see why there needs to be anything special to make this happen. Any
full file scan for new files should suffice.



> In fact,
> the two things (schema discovery and stats) can be done in a single pass
> over the data.
>

Indeed.
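
To make the single-pass point concrete, here is a minimal Python sketch (not
Drill code). It assumes records arrive as dictionaries, and the function name
and the particular statistics kept (observed types, null count, numeric
min/max) are illustrative choices only.

```python
from typing import Any, Dict, Iterable


def discover_schema_and_stats(records: Iterable[Dict[str, Any]]) -> Dict[str, dict]:
    """Collect per-column types and basic statistics in one pass."""
    columns: Dict[str, dict] = {}
    row_count = 0
    for record in records:
        row_count += 1
        for name, value in record.items():
            col = columns.setdefault(
                name, {"types": set(), "non_null": 0, "min": None, "max": None})
            if value is None:
                continue
            # Schema discovery: remember every type seen for this column.
            col["types"].add(type(value).__name__)
            col["non_null"] += 1
            # Statistics: track min/max for numeric values (bool excluded).
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                col["min"] = value if col["min"] is None else min(col["min"], value)
                col["max"] = value if col["max"] is None else max(col["max"], value)
    # A field that is absent from a record counts as null for that record.
    for col in columns.values():
        col["null_count"] = row_count - col["non_null"]
    return columns
```

A column that ends up with more than one entry in "types" is exactly the
schema-on-read case where the type changes within or across files.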
