@Paul, I really appreciate the statement `Effort can go into new features rather than fighting an unwinnable battle to use non-relational data in a relational engine.`
At Ant Financial (known as Alipay, an Alibaba-related company), we now use Drill to support most of our analysis work, and our business and data are complex. Our strategy is to let users design their schema first, then load their data and query it later. This workflow runs smoothly.

But if you dig into Drill's internals and look at the JIRA bugs, you will see that much of the non-intuitive code exists to handle schema change, yet it does not help most actual use cases. I think this also makes the storage plugin interface harder to implement than it should be. We are spending most of our effort for little return. Users don't really mind defining a schema first; what they care about is whether their queries are fast enough.

Probing the data to guess the schema and caching the result is, to me, a compromise strategy, but still not clean enough. So I hope we can move the messy schema-resolution logic out of Drill and make the code cleaner by having the schema defined first with DDL statements. If we agree on this, the work should be a sub-task of DRILL-6552.

On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi Ted,
>
> I like the "schema auto-detect" idea.
>
> As we discussed in a prior thread, caching of schema is a nice add-on once
> we have defined the schema-on-read mechanism. Maybe we first get it to work
> with a user-provided schema. Then, as an enhancement, we offer to infer the
> schema by scanning data.
>
> There are some ambiguities that schema inference can't resolve: in
> {x: "1002"} {x: 1003}, should x be an Int or a Varchar?
>
> Still if Drill could provide a guess at the schema, and the user could
> refine it, we'd have a very elegant solution.
>
> Thanks,
> - Paul
>
> On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning
> <ted.dunn...@gmail.com> wrote:
>
> This is a bold statement.
>
> And there are variants of it that could give users nearly the same
> experience that we have now. For instance, if we cache discovered schemas
> for old files and discover the schema for any new file that we see (and
> cache it) before actually running a query. That gives us pretty much the
> flexibility of schema on read without as much of the burden.
>
> On Wed, Aug 15, 2018 at 5:02 PM weijie tong <tongweijie...@gmail.com>
> wrote:
>
> > Hi all:
> > Hope the statement not seems too dash to you.
> > Drill claims be a schema-free distributed SQL engine. It pays lots of
> > work to make the execution engine to support it to support JSON file like
> > storage format. It is easier to make bugs and let the code logic ugly. I
> > wonder do we still insist on this ,since we are designing the metadata
> > system with DRILL-6552.
> > Traditionally, people is used to design its table schema firstly before
> > firing a SQL query. I don't think this saves people too much time. Other
> > system like Spark is popular not due to lack the schema claiming. I think
> > we should be brave enough to take the right decision whether to still
> > insist on this feature which seems not so important but a burden.
> > Thanks.
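
To make the ambiguity Paul mentions concrete ({x: "1002"} followed by {x: 1003}), here is a minimal Python sketch of "guess the schema, then let the user refine it". It is only an illustration under stated assumptions: the function names (`sql_type`, `infer_schema`, `resolve`) and the type labels are hypothetical and are not Drill's API, and the records are written as valid JSON with quoted keys.

```python
import json

# Hypothetical type labels, loosely modeled on SQL types; not Drill's actual API.
def sql_type(value):
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "VARCHAR"
    return "UNKNOWN"              # null, nested maps, and lists are not handled here

def infer_schema(lines):
    """Collect the set of types observed for each top-level field."""
    seen = {}
    for line in lines:
        for field, value in json.loads(line).items():
            seen.setdefault(field, set()).add(sql_type(value))
    return seen

def resolve(seen, declared=None):
    """Prefer a user-declared type; otherwise accept only unambiguous guesses."""
    declared = declared or {}
    schema, conflicts = {}, {}
    for field, types in seen.items():
        if field in declared:
            schema[field] = declared[field]
        elif len(types) == 1:
            schema[field] = next(iter(types))
        else:
            conflicts[field] = sorted(types)  # inference alone cannot decide
    return schema, conflicts

records = ['{"x": "1002"}', '{"x": 1003}']
seen = infer_schema(records)
guess, conflicts = resolve(seen)                           # x is left unresolved
refined, remaining = resolve(seen, declared={"x": "INT"})  # the user settles it
```

With inference alone, `x` ends up in `conflicts` because both INT and VARCHAR were observed; once the user declares a type for `x`, the conflict disappears. That is the division of labor discussed in the thread: the engine offers a guess and the user has the final say.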