The use case Weijie described seems to fall into the category of the
traditional data warehouse, i.e., schemas are predefined by users and data
strictly conforms to the schema. Certainly this is one important use case,
and I agree that the schema-on-read logic in Drill's run time is indeed a
disadvantage there, compared with other SQL query engines like
Impala/Presto.

The question we want to ask is whether that's the only use case Drill wants
to target. We probably want to hear more cases from the Drill community
before we can decide what the best strategy is going forward.

In the examples Paul listed, why would two sets of data have different
schemas? In many cases, it's because the application generating the data
changed: a field was added, deleted, or modified. ETL is the typical
approach to clean up such data with differing schemas. Drill's argument, a
couple of years ago when the project was started, was that ETL is too
time-consuming, and that a query engine would provide great value if it
could query such datasets directly.

I feel Paul's suggestion of letting the user provide a schema, or letting
Drill scan/probe and learn the schema, falls in the middle of a spectrum:
ETL is one extreme, and Drill's current schema-on-read is the other.
Personally, I would prefer letting Drill scan/probe the schema, since it
might not be easy for users to provide a schema for nested data (would they
have to provide type information for every nested field?).
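
To make the nested-data concern concrete, here is a minimal sketch (a
hypothetical, hand-rolled schema description in Java, not Drill's actual
metadata API; the Field class, type names, and sample record are all
illustrative) of what a user-provided schema for one nested JSON record
would have to spell out:

// Hypothetical, hand-rolled schema description -- not Drill's metadata API.
// The point: every nested field needs an explicit type, so "just let the
// user provide the schema" gets verbose quickly for deeply nested data.
import java.util.List;

public class NestedSchemaSketch {

  static final class Field {
    final String name;
    final String type;            // e.g. "INT", "VARCHAR", "MAP", "ARRAY<VARCHAR>"
    final List<Field> children;   // non-empty only for MAP-typed fields

    Field(String name, String type, List<Field> children) {
      this.name = name;
      this.type = type;
      this.children = children;
    }
  }

  public static void main(String[] args) {
    // Schema for: {"id": 1, "address": {"city": "x", "zip": "94025"}, "tags": ["a", "b"]}
    Field row = new Field("row", "MAP", List.of(
        new Field("id", "INT", List.of()),
        new Field("address", "MAP", List.of(
            new Field("city", "VARCHAR", List.of()),
            new Field("zip", "VARCHAR", List.of()))),
        new Field("tags", "ARRAY<VARCHAR>", List.of())));
    System.out.println("top-level fields: " + row.children.size());
  }
}

Even this tiny record needs a type for every leaf and every intermediate
map, which is why I suspect scan/probe would be more practical for deeply
nested data.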

On Weijie's comment about the complexity of the code dealing with schema:
in theory we should refactor/rewrite the majority of the run-time
operators, separating the logic that handles schema from the logic that
handles the regular data flow. That would clean up the current mess.
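
As a rough illustration of the separation I have in mind (hypothetical Java
interfaces, not Drill's current operator contract), schema handling and the
per-batch data flow could be two distinct callbacks:

// Hypothetical sketch of the separation -- not Drill's operator interfaces.
import java.util.List;

interface SchemaAwareOperator {
  // Called once per schema (or per real schema change): all schema-dependent
  // setup -- column resolution, type checks, allocation -- lives here.
  void onSchema(List<String> columnNames, List<String> columnTypes);

  // Called for every batch: by the time we get here the schema is fixed, so
  // the per-batch path carries no schema-change branches at all.
  void onBatch(List<Object[]> rows);
}

class AdultFilter implements SchemaAwareOperator {
  private int ageIndex = -1;

  @Override
  public void onSchema(List<String> columnNames, List<String> columnTypes) {
    ageIndex = columnNames.indexOf("age");   // resolve the column once
  }

  @Override
  public void onBatch(List<Object[]> rows) {
    rows.removeIf(row -> ((Integer) row[ageIndex]) < 18);  // pure data flow
  }
}

With a split like this, whatever policy we choose (user-provided schema,
probed schema, or tolerated schema change) only touches the schema side,
and the data-flow side stays untouched.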

ps 1: IMHO, "schema-less" is purely a PR term. The more appropriate term
for Drill would be schema-on-read.
ps 2: I would not call it a battle between non-relational data and a
relational engine. The extended relational model has array/composite types,
similar to what Drill has.

On Wed, Aug 15, 2018 at 7:27 PM, weijie tong <tongweijie...@gmail.com>
wrote:

> @Paul I really appreciate the statement ` Effort can go into new features
> rather than fighting an unwinnable battle to use non-relational data in a
> relational engine.` .
>
> At AntFinancial( known as Alipay  an Alibaba related company ) we now use
> Drill to support most of our analysis work. Our business and data is
> complex enough. Our strategy is to let users design their schema first,
> then dump in their data , query their data later. This work flow runs
> fluently.  But by deep inside into the Drill's code internal and see the
> JIRA bugs, we will see most of the non-intuitive codes to solve the schema
> change but really no help to most of the actual use case. I think this also
> make the storage plugin interface not so intuitive to implement.
>
> We are sacrificing most of our work to pay for little income. Users really
> don't care about defining a schema first, but pay attention whether their
> query is fast enough. By probing the data to guess the schema and cache
> them , to me ,is a compromise strategy but still not clean enough. So I
> hope we move the mess schema solving logic out of Drill to let the code
> cleaner by defining the schema firstly with DDL statements. If we agree on
> this, the work should be a sub work of DRILL-6552.
>
> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi Ted,
> >
> > I like the "schema auto-detect" idea.
> >
> > As we discussed in a prior thread, caching of schema is a nice-add on
> > once we have defined the schema-on-read mechanism. Maybe we first get it
> > to work with a user-provided schema. Then, as an enhancement, we offer to
> > infer the schema by scanning data.
> >
> > There are some ambiguities that schema inference can't resolve: in {x:
> > "1002"} {x: 1003}, should x be an Int or a Varchar?
> >
> > Still if Drill could provide a guess at the schema, and the user could
> > refine it, we'd have a very elegant solution.
> >
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >     On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <
> > ted.dunn...@gmail.com> wrote:
> >
> >  This is a bold statement.
> >
> > And there are variants of it that could give users nearly the same
> > experience that we have now. For instance, if we cache discovered schemas
> > for old files and discover the schema for any new file that we see (and
> > cache it) before actually running a query. That gives us pretty much the
> > flexibility of schema on read without as much of the burden.
> >
> >
> >
> > On Wed, Aug 15, 2018 at 5:02 PM weijie tong <tongweijie...@gmail.com>
> > wrote:
> >
> > > Hi all:
> > >  Hope the statement not seems too dash to you.
> > >  Drill claims be a schema-free distributed SQL engine. It pays lots of
> > > work to make the execution engine to support it to support JSON file
> > > like storage format. It is easier to make bugs and let the code logic
> > > ugly. I wonder do we still insist on this ,since we are designing the
> > > metadata system with DRILL-6552.
> > >    Traditionally, people is used to design its table schema firstly
> > > before firing a SQL query. I don't think this saves people too much
> > > time. Other system like Spark is popular not due to lack the schema
> > > claiming. I think we should be brave enough to take the right decision
> > > whether to still insist on this feature which seems not so important
> > > but a burden.
> > >    Thanks.
> > >
> >
>
