btw: In one project that I'm currently working on (an IoT-related application), I'm leveraging Drill's schema-on-read ability, without requiring users to predefine table DDL.
On Wed, Aug 15, 2018 at 11:15 PM, Jinfeng Ni <j...@apache.org> wrote:

> The use case Weijie described seems to fall into the category of a
> traditional data warehouse, i.e., schemas are predefined by users and data
> strictly conforms to the schema. Certainly this is one important use case,
> and I agree that the schema-on-read logic in Drill's run-time is indeed a
> disadvantage there, compared with other SQL query engines like
> Impala/Presto.
>
> The question we want to ask is whether that's the only use case Drill
> wants to target. We probably want to hear more cases from the Drill
> community before we can decide what's the best strategy going forward.
>
> In the examples Paul listed, why would two sets of data have different
> schemas? In many cases, it's because the application generating the data
> changed: either adding/deleting one field, or modifying an existing field.
> ETL is the typical approach to clean up such data with differing schemas.
> Drill's argument, a couple of years ago when the project was started, was
> that ETL is too time-consuming; a query engine would provide great value
> if it could query directly against such datasets.
>
> I feel Paul's suggestion of letting the user provide a schema, or having
> Drill scan/probe and learn the schema, falls in the middle of the
> spectrum: ETL is one extreme, and Drill's current schema-on-read is the
> other. Personally, I would prefer letting Drill scan/probe the schema, as
> it might not be easy for a user to provide a schema in the case of nested
> data (would they have to provide type information for every nested
> field?).
>
> To Weijie's comment about the complexity of the code dealing with schema:
> in theory we should refactor/rewrite the majority of the run-time
> operators, separating the logic of handling schema from the logic of
> handling the regular data flow. That would clean up the current mess.
>
> ps1: IMHO, "schema-less" is purely a PR word. The more appropriate term
> for Drill would be schema-on-read.
> ps2: I would not call it a battle between non-relational data and a
> relational engine. The extended relational model has array/composite
> types, similar to what Drill has.
>
> On Wed, Aug 15, 2018 at 7:27 PM, weijie tong <tongweijie...@gmail.com>
> wrote:
>
>> @Paul I really appreciate the statement "Effort can go into new features
>> rather than fighting an unwinnable battle to use non-relational data in
>> a relational engine."
>>
>> At Ant Financial (known for Alipay, an Alibaba-related company) we now
>> use Drill to support most of our analysis work. Our business and data
>> are complex enough. Our strategy is to let users design their schema
>> first, then dump in their data, and query it later. This workflow runs
>> smoothly. But digging into Drill's code internals and looking at the
>> JIRA bugs, we see that most of the non-intuitive code exists to handle
>> schema change, yet it is of little help to most actual use cases. I
>> think this also makes the storage plugin interface less intuitive to
>> implement.
>>
>> We are spending a great deal of effort for little return. Users really
>> don't mind defining a schema first; what they care about is whether
>> their queries are fast enough. Probing the data to guess the schema and
>> caching it is, to me, a compromise strategy, but still not clean enough.
>> So I hope we move the messy schema-resolution logic out of Drill to make
>> the code cleaner, by defining the schema first with DDL statements. If
>> we agree on this, the work should be a sub-task of DRILL-6552.
>>
>> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <par0...@yahoo.com.invalid>
>> wrote:
>>
>>> Hi Ted,
>>>
>>> I like the "schema auto-detect" idea.
>>>
>>> As we discussed in a prior thread, caching of schemas is a nice add-on
>>> once we have defined the schema-on-read mechanism. Maybe we first get
>>> it to work with a user-provided schema.
>>> Then, as an enhancement, we offer to infer the schema by scanning
>>> data.
>>>
>>> There are some ambiguities that schema inference can't resolve: in
>>> {x: "1002"} {x: 1003}, should x be an Int or a Varchar?
>>>
>>> Still, if Drill could provide a guess at the schema, and the user could
>>> refine it, we'd have a very elegant solution.
>>>
>>> Thanks,
>>> - Paul
>>>
>>> On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <
>>> ted.dunn...@gmail.com> wrote:
>>>
>>> This is a bold statement.
>>>
>>> And there are variants of it that could give users nearly the same
>>> experience that we have now. For instance, we could cache discovered
>>> schemas for old files, and discover (and cache) the schema for any new
>>> file that we see before actually running a query. That gives us pretty
>>> much the flexibility of schema-on-read without as much of the burden.
>>>
>>> On Wed, Aug 15, 2018 at 5:02 PM weijie tong <tongweijie...@gmail.com>
>>> wrote:
>>>
>>>> Hi all:
>>>> I hope this statement does not seem too rash to you.
>>>> Drill claims to be a schema-free distributed SQL engine. It takes a
>>>> lot of work to make the execution engine support storage formats like
>>>> JSON files, and that work makes it easy to introduce bugs and leaves
>>>> the code logic ugly. I wonder whether we should still insist on this,
>>>> since we are designing the metadata system with DRILL-6552.
>>>> Traditionally, people are used to designing their table schemas
>>>> before firing a SQL query, and I don't think schema-freedom saves
>>>> people much time. Other systems like Spark are popular not because
>>>> they lack schema declarations. I think we should be brave enough to
>>>> make the right decision on whether to keep insisting on a feature
>>>> that seems not so important, but is a burden.
>>>> Thanks.
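To make the "scan/probe and learn the schema" idea discussed above concrete, here is a minimal sketch of probing newline-delimited JSON, including the nested fields Jinfeng worries about. The type names (`INT`, `MAP<...>`, etc.) and the probing approach are illustrative assumptions, not Drill's actual inference code.

```python
import json

def infer_type(value):
    """Map a JSON value to a rough Drill-style type name (hypothetical names)."""
    if isinstance(value, bool):   # check bool before int: bool subclasses int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "VARCHAR"
    if value is None:
        return "NULL"
    if isinstance(value, list):
        inner = sorted({infer_type(v) for v in value}) or ["NULL"]
        return "ARRAY<%s>" % "|".join(inner)
    # Nested objects: recurse so the user need not spell out nested types.
    inner = ", ".join("%s: %s" % (k, infer_type(v)) for k, v in sorted(value.items()))
    return "MAP<%s>" % inner

def probe_schema(ndjson_lines):
    """Scan NDJSON records and collect every type observed for each field."""
    schema = {}
    for line in ndjson_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(infer_type(value))
    return {field: sorted(types) for field, types in schema.items()}
```

A field whose probe result contains more than one type (e.g. `temp` seen as both `INT` and `DOUBLE`) is exactly the case where the guessed schema would need user refinement.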
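Paul's ambiguity example ({x: "1002"} vs {x: 1003}) can also be sketched as a resolution step after sampling: numeric types widen among themselves, while a number/string clash is flagged as ambiguous and falls back to VARCHAR so no data is lost. The widening table below is an assumed policy for illustration, not Drill's actual rule set.

```python
# Hypothetical widening rules: which sets of sampled types have an
# unambiguous single column type.
WIDENING = {
    frozenset({"INT"}): "INT",
    frozenset({"DOUBLE"}): "DOUBLE",
    frozenset({"VARCHAR"}): "VARCHAR",
    frozenset({"INT", "DOUBLE"}): "DOUBLE",  # every INT fits in a DOUBLE
}

def resolve(observed_types):
    """Return (column_type, is_ambiguous) for the types seen while sampling."""
    key = frozenset(observed_types)
    if key in WIDENING:
        return WIDENING[key], False
    # Conflicting samples, e.g. x seen as both "1002" (VARCHAR) and 1003
    # (INT): widen to VARCHAR and flag it so the user can refine the guess.
    return "VARCHAR", True
```

The `is_ambiguous` flag is where Ted's caching idea and Paul's "user refines the guess" idea meet: unambiguous columns could be cached silently, while flagged ones are surfaced for a user-provided override.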