On Thu, Apr 5, 2018 at 9:02 AM, Aman Sinha <amansi...@apache.org> wrote:
> All good discussions in this thread. It clearly shows that Drill's
> schema-on-read is not only a nice-to-have but for applications like IOT, it
> is a must-have.

Absolutely. It must be possible to have schema on read even if that isn't
necessarily the most common path.

> For other types of data that is slowly changing, in order to improve
> overall user experience where the user is willing to run offline commands
> to discover schema (as opposed to doing it while querying),
> we should consider doing sampling of the files with different sampling
> percentages.

I am not even convinced that sampling is necessary. Just have
[select * from foo.file] store the types and statistics that it discovers.
If you want to discover the stats before most people do a query, just do
that on each new file in a directory.

> This would be similar to collecting statistics through
> sampling.

I don't see why there needs to be anything special to make this happen. Any
full file scan for new files should suffice.

> In fact,
> the two things (schema discovery and stats) can be done in a single pass
> over the data.

Indeed.
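
To make the single-pass point concrete, here is a rough sketch of what one full scan could record as it goes. It is not Drill code: the function name `scan_records`, the newline-delimited JSON input, and the particular statistics kept (count, nulls, min, max) are all assumptions chosen for illustration.

```python
import json
from collections import defaultdict


def scan_records(lines):
    """Collect observed types and basic stats per column in one pass.

    Hypothetical helper: assumes each line is one JSON object.
    """
    schema = defaultdict(set)  # column -> set of observed type names
    stats = defaultdict(
        lambda: {"count": 0, "nulls": 0, "min": None, "max": None}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for column, value in record.items():
            types = schema[column]  # registers the column even if all null
            col = stats[column]
            col["count"] += 1
            if value is None:
                col["nulls"] += 1
                continue
            types.add(type(value).__name__)
            # Track min/max for numeric values only (bool is excluded).
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                col["min"] = value if col["min"] is None else min(col["min"], value)
                col["max"] = value if col["max"] is None else max(col["max"], value)
    return {c: sorted(t) for c, t in schema.items()}, dict(stats)


# Two made-up records standing in for a newly arrived file.
sample = [
    '{"device": "a1", "temp": 21.5}',
    '{"device": "b2", "temp": 19, "status": null}',
]
schema, stats = scan_records(sample)
```

A column that shows up with more than one type (here `temp` is seen as both `float` and `int`) is simply recorded as such. What to do about the conflict is left to whoever reads the stored result, which is the schema-on-read part.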