Re: [DISCUSSION] Does schema-free really need

weijie tong Wed, 15 Aug 2018 23:47:44 -0700

I think there is no such thing as schema-free data. For any one ad-hoc query
against any one file, the schema is already defined. It is just discovered by
Drill rather than defined explicitly by the user.
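
As a rough illustration of "discovered, not defined", the sketch below scans newline-delimited JSON records and collects the value types seen for each top-level field. It is not how Drill's readers actually work; `infer_schema` and `conflicting_fields` are hypothetical names, and the sample records adapt the x: "1002" / x: 1003 example quoted further down the thread (with quoted keys so that they parse as JSON).

```python
import json


def infer_schema(lines):
    """Map each top-level field to the set of Python type names seen for it."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def conflicting_fields(schema):
    """Return the fields that were seen with more than one type."""
    return sorted(field for field, types in schema.items() if len(types) > 1)


# Hypothetical sample data: "x" arrives once as a string and once as a number.
records = ['{"x": "1002", "y": 1.5}', '{"x": 1003, "y": 2.5}']
schema = infer_schema(records)
ambiguous = conflicting_fields(schema)
```

A field that lands in `ambiguous` is exactly the case where discovery alone cannot pick a type and a user-supplied hint would be needed.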


On Thu, Aug 16, 2018 at 2:29 PM Jinfeng Ni <[email protected]> wrote:

> btw: In one project that I'm currently working on (an application related
> to IoT), I'm leveraging Drill's schema-on-read ability, without requiring
> users to predefine table DDL.
>
>
> On Wed, Aug 15, 2018 at 11:15 PM, Jinfeng Ni <[email protected]> wrote:
>
> > The use case Weijie described seems to fall into the category of
> > traditional data warehouse, i.e., schemas are predefined by users and data
> > strictly conforms to the schema. Certainly this is one important use case,
> > and I agree that the schema-on-read logic in Drill's run-time is indeed a
> > disadvantage for such a use case, compared with other SQL query engines
> > like Impala/Presto.
> >
> > The question we want to ask is whether that's the only use case Drill
> > wants to target. We probably want to hear more cases from the Drill
> > community before we can decide on the best strategy going forward.
> >
> > In the examples Paul listed, why would two sets of data have different
> > schemas? In many cases, that's because the application generating the data
> > has changed: a field was added, deleted, or modified. ETL is a typical
> > approach to cleaning up such data with differing schemas. Drill's argument,
> > a couple of years ago when the project was started, was that ETL is too
> > time-consuming, and that it would provide great value if a query engine
> > could query directly against such datasets.
> >
> > I feel Paul's suggestion of letting the user provide a schema, or having
> > Drill scan/probe and learn the schema, falls in the middle of the spectrum;
> > ETL is one extreme, and Drill's current schema-on-read is the other
> > extreme. Personally, I would prefer letting Drill scan/probe the schema,
> > as it might not be easy for users to provide a schema in the case of nested
> > data (will they have to provide type information for every nested field?).
> >
> > To Weijie's comment about the complexity of the schema-handling code: in
> > theory we should refactor/rewrite the majority of run-time operators,
> > separating the logic for handling schema from the logic for handling
> > regular data flow. That would clean up the current mess.
> >
> > ps1: IMHO, schema-less is purely a PR word. The more appropriate word for
> > Drill would be schema-on-read.
> >     2: I would not call it a battle between non-relational data and a
> > relational engine. The extended relational model has array and composite
> > types, similar to what Drill has.
> >
> >
> >
> >
> >
> > On Wed, Aug 15, 2018 at 7:27 PM, weijie tong <[email protected]>
> > wrote:
> >
> >> @Paul I really appreciate the statement `Effort can go into new features
> >> rather than fighting an unwinnable battle to use non-relational data in a
> >> relational engine.`
> >>
> >> At Ant Financial (known as Alipay, an Alibaba-related company) we now use
> >> Drill to support most of our analysis work. Our business and data are
> >> complex enough. Our strategy is to let users design their schema first,
> >> then load their data, and query it later. This workflow runs smoothly.
> >> But looking deep into Drill's internals and the JIRA bugs, we see that
> >> most of the non-intuitive code exists to handle schema changes, yet it
> >> does not help most actual use cases. I think this also makes the storage
> >> plugin interface less intuitive to implement.
> >>
> >> We are spending most of our effort for little return. Users really don't
> >> mind defining a schema first; what they care about is whether their
> >> queries are fast enough. Probing the data to guess the schema and caching
> >> it is, to me, a compromise but still not clean enough. So I hope we can
> >> move the messy schema-resolution logic out of Drill and make the code
> >> cleaner by defining the schema up front with DDL statements. If we agree
> >> on this, the work should be a subtask of DRILL-6552.
> >>
> >> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <[email protected]>
> >> wrote:
> >>
> >> > Hi Ted,
> >> >
> >> > I like the "schema auto-detect" idea.
> >> >
> >> > As we discussed in a prior thread, caching of schema is a nice add-on
> >> > once we have defined the schema-on-read mechanism. Maybe we first get it
> >> > to work with a user-provided schema. Then, as an enhancement, we offer to
> >> > infer the schema by scanning data.
> >> >
> >> > There are some ambiguities that schema inference can't resolve: in {x:
> >> > "1002"} {x: 1003}, should x be an Int or a Varchar?
> >> >
> >> > Still, if Drill could provide a guess at the schema, and the user could
> >> > refine it, we'd have a very elegant solution.
> >> >
> >> >
> >> > Thanks,
> >> > - Paul
> >> >
> >> >
> >> >
> >> >     On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <
> >> > [email protected]> wrote:
> >> >
> >> >  This is a bold statement.
> >> >
> >> > And there are variants of it that could give users nearly the same
> >> > experience that we have now. For instance, we could cache discovered
> >> > schemas for old files, and discover (and cache) the schema for any new
> >> > file that we see before actually running a query. That gives us pretty
> >> > much the flexibility of schema on read without as much of the burden.
> >> >
> >> >
> >> >
> >> > On Wed, Aug 15, 2018 at 5:02 PM weijie tong <[email protected]>
> >> > wrote:
> >> >
> >> > > Hi all:
> >> > >  I hope this statement does not seem too rash to you.
> >> > >  Drill claims to be a schema-free distributed SQL engine. A lot of
> >> > > work goes into making the execution engine support this, in order to
> >> > > handle JSON-like storage formats. This makes bugs more likely and the
> >> > > code logic ugly. I wonder whether we should still insist on this,
> >> > > since we are designing the metadata system in DRILL-6552.
> >> > >    Traditionally, people are used to designing their table schema
> >> > > before firing a SQL query. I don't think going schema-free saves
> >> > > people much time. Other systems such as Spark are not popular because
> >> > > they lack schema declaration. I think we should be brave enough to
> >> > > make the right decision on whether to keep insisting on this feature,
> >> > > which seems more of a burden than an important capability.
> >> > >    Thanks.
> >> > >
> >> >
> >>
> >
> >
>

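Ted's suggestion in the thread above (cache discovered schemas for old files, and discover the schema of any new file before running a query) could be sketched roughly as follows. This is only an illustration under stated assumptions, not Drill code: `SchemaCache` and `discover_json_schema` are hypothetical names, files are assumed to be newline-delimited JSON, and a file counts as "new" whenever its modification time or size changes.

```python
import json
import os


def discover_json_schema(path):
    """Scan a newline-delimited JSON file; map field -> set of type names."""
    schema = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                for field, value in json.loads(line).items():
                    schema.setdefault(field, set()).add(type(value).__name__)
    return schema


class SchemaCache:
    """Remember discovered schemas; rescan only new or changed files."""

    def __init__(self, discover):
        self._discover = discover
        self._entries = {}  # path -> ((mtime_ns, size), schema)
        self.scans = 0      # number of full discovery passes performed

    def schema_for(self, path):
        stat = os.stat(path)
        fingerprint = (stat.st_mtime_ns, stat.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == fingerprint:
            return cached[1]  # old file: reuse the cached schema
        self.scans += 1       # new or changed file: discover before querying
        schema = self._discover(path)
        self._entries[path] = (fingerprint, schema)
        return schema
```

A planner would call `schema_for(path)` for every file a query touches, so the discovery cost is paid once per file version rather than once per query.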