I think there is no such thing as schema-free data. For any one ad-hoc query against one file, the schema is already defined. It is just discovered by Drill rather than defined explicitly by the user.
On Thu, Aug 16, 2018 at 2:29 PM Jinfeng Ni <j...@apache.org> wrote:

> btw: In one project that I'm currently working (an application related to IOT), I'm leveraging Drill's schema-on-read ability, without requiring user to predefine table DDL.
>
> On Wed, Aug 15, 2018 at 11:15 PM, Jinfeng Ni <j...@apache.org> wrote:
>
> > The use case Weijie described seems to fall into the category of traditional data warehouse, i.e, schemas are predefined by users, data strictly conforms to schema. Certainly this is one important uses, and I agreed that the schema-on-read logic in Drill run-time indeed is a disadvantage for such use case, compared with other SQL query engine like Impala/Presto.
> >
> > The question we want to ask is whether that's the only use case Drill wants to target. We probably want to hear more cases from Drill community, before we can decide what's the best strategy going forward.
> >
> > In examples Paul listed, why would two sets of data have different schema? In many cases, that's because application generating the data is changed; either adding/deleting one field, or modifying one existing field. ETL is a typical approach to clean up such data with different schema. Drill's argument, couple of years ago when the project was started, was that ETL is too time-consuming. it would provide great value if a query engine could query directly against such datasets.
> >
> > I feel Paul's suggestion of letting user provide schema, or Drill scan/probe and learn the schema seems to fall in the middle of spectrum; ETL is one extreme, and Drill's current schema-on-read is the other extreme. Personally, I would prefer letting Drill scan/probe the schema, as it might not be easy for user to provide schema in the case of nested data (will they have to provide type information for any nested field?).
> >
> > To Weijie's comment about complexity of code of dealing schema, in theory we should refactor/rewrite majority run-time operator, separating the logic of handling schema and handling regular data flow. That would clean up the current mess.
> >
> > ps1: IMHO, schema-less is purely PR word. The more appropriate word for Drill would be schema-on-read.
> > 2: I would not call it a battle between non-relational data and relational engine. The extended relational model has type of array/composite types, similar to what Drill has.
> >
> > On Wed, Aug 15, 2018 at 7:27 PM, weijie tong <tongweijie...@gmail.com> wrote:
> >
> >> @Paul I really appreciate the statement ` Effort can go into new features rather than fighting an unwinnable battle to use non-relational data in a relational engine.` .
> >>
> >> At AntFinancial( known as Alipay an Alibaba related company ) we now use Drill to support most of our analysis work. Our business and data is complex enough. Our strategy is to let users design their schema first, then dump in their data , query their data later. This work flow runs fluently. But by deep inside into the Drill's code internal and see the JIRA bugs, we will see most of the non-intuitive codes to solve the schema change but really no help to most of the actual use case. I think this also make the storage plugin interface not so intuitive to implement.
> >>
> >> We are sacrificing most of our work to pay for little income. Users really don't care about defining a schema first, but pay attention whether their query is fast enough. By probing the data to guess the schema and cache them , to me ,is a compromise strategy but still not clean enough. So I hope we move the mess schema solving logic out of Drill to let the code cleaner by defining the schema firstly with DDL statements. If we agree on this, the work should be a sub work of DRILL-6552.
> >>
> >> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <par0...@yahoo.com.invalid> wrote:
> >>
> >> > Hi Ted,
> >> >
> >> > I like the "schema auto-detect" idea.
> >> >
> >> > As we discussed in a prior thread, caching of schema is a nice-add on once we have defined the schema-on-read mechanism. Maybe we first get it to work with a user-provided schema. Then, as an enhancement, we offer to infer the schema by scanning data.
> >> >
> >> > There are some ambiguities that schema inference can't resolve: in {x: "1002"} {x: 1003}, should x be an Int or a Varchar?
> >> >
> >> > Still if Drill could provide a guess at the schema, and the user could refine it, we'd have a very elegant solution.
> >> >
> >> > Thanks,
> >> > - Paul
> >> >
> >> > On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >> >
> >> > This is a bold statement.
> >> >
> >> > And there are variants of it that could give users nearly the same experience that we have now. For instance, if we cache discovered schemas for old files and discover the schema for any new file that we see (and cache it) before actually running a query. That gives us pretty much the flexibility of schema on read without as much of the burden.
> >> >
> >> > On Wed, Aug 15, 2018 at 5:02 PM weijie tong <tongweijie...@gmail.com> wrote:
> >> >
> >> > > Hi all:
> >> > > Hope the statement not seems too dash to you.
> >> > > Drill claims be a schema-free distributed SQL engine. It pays lots of work to make the execution engine to support it to support JSON file like storage format. It is easier to make bugs and let the code logic ugly. I wonder do we still insist on this ,since we are designing the metadata system with DRILL-6552.
> >> > > Traditionally, people is used to design its table schema firstly before firing a SQL query. I don't think this saves people too much time. Other system like Spark is popular not due to lack the schema claiming. I think we should be brave enough to take the right decision whether to still insist on this feature which seems not so important but a burden.
> >> > > Thanks.