btw: In one project that I'm currently working on (an IoT-related application), I'm leveraging Drill's schema-on-read ability, without requiring users to predefine table DDL.
On Wed, Aug 15, 2018 at 11:15 PM, Jinfeng Ni <j...@apache.org> wrote:

> The use case Weijie described seems to fall into the category of a
> traditional data warehouse, i.e., schemas are predefined by users and data
> strictly conforms to the schema. Certainly this is one important use case,
> and I agree that the schema-on-read logic in Drill's run-time is indeed a
> disadvantage there, compared with other SQL query engines like
> Impala/Presto.
>
> The question we want to ask is whether that's the only use case Drill
> wants to target. We probably want to hear more cases from the Drill
> community before we can decide what's the best strategy going forward.
>
> In the examples Paul listed, why would two sets of data have different
> schemas? In many cases, it's because the application generating the data
> changed: either adding/deleting one field, or modifying an existing field.
> ETL is the typical approach to clean up such data with differing schemas.
> Drill's argument, a couple of years ago when the project was started, was
> that ETL is too time-consuming; a query engine would provide great value
> if it could query directly against such datasets.
>
> I feel Paul's suggestion of letting the user provide a schema, or having
> Drill scan/probe and learn the schema, falls in the middle of the
> spectrum: ETL is one extreme, and Drill's current schema-on-read is the
> other. Personally, I would prefer letting Drill scan/probe the schema, as
> it might not be easy for a user to provide a schema in the case of nested
> data (would they have to provide type information for every nested
> field?).
>
> To Weijie's comment about the complexity of the code dealing with schema:
> in theory we should refactor/rewrite the majority of the run-time
> operators, separating the logic of handling schema from the logic of
> handling the regular data flow. That would clean up the current mess.
>
> ps1: IMHO, "schema-less" is purely a PR word. The more appropriate term
> for Drill would be schema-on-read.
> ps2: I would not call it a battle between non-relational data and a
> relational engine. The extended relational model has array/composite
> types, similar to what Drill has.
>
> On Wed, Aug 15, 2018 at 7:27 PM, weijie tong <tongweijie...@gmail.com>
> wrote:
>
>> @Paul I really appreciate the statement "Effort can go into new features
>> rather than fighting an unwinnable battle to use non-relational data in
>> a relational engine."
>>
>> At Ant Financial (known for Alipay, an Alibaba-related company) we now
>> use Drill to support most of our analysis work. Our business and data
>> are complex enough. Our strategy is to let users design their schema
>> first, then dump in their data, and query it later. This workflow runs
>> smoothly. But digging into Drill's code internals and looking at the
>> JIRA bugs, we see that most of the non-intuitive code exists to handle
>> schema change, yet it is of little help to most actual use cases. I
>> think this also makes the storage plugin interface less intuitive to
>> implement.
>>
>> We are spending a great deal of effort for little return. Users really
>> don't mind defining a schema first; what they care about is whether
>> their queries are fast enough. Probing the data to guess the schema and
>> caching it is, to me, a compromise strategy, but still not clean enough.
>> So I hope we move the messy schema-resolution logic out of Drill to make
>> the code cleaner, by defining the schema first with DDL statements. If
>> we agree on this, the work should be a sub-task of DRILL-6552.
>>
>> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <par0...@yahoo.com.invalid>
>> wrote:
>>
>>> Hi Ted,
>>>
>>> I like the "schema auto-detect" idea.
>>>
>>> As we discussed in a prior thread, caching of schemas is a nice add-on
>>> once we have defined the schema-on-read mechanism. Maybe we first get
>>> it to work with a user-provided schema.
>>> Then, as an enhancement, we offer to infer the schema by scanning
>>> data.
>>>
>>> There are some ambiguities that schema inference can't resolve: in
>>> {x: "1002"} {x: 1003}, should x be an Int or a Varchar?
>>>
>>> Still, if Drill could provide a guess at the schema, and the user could
>>> refine it, we'd have a very elegant solution.
>>>
>>> Thanks,
>>> - Paul
>>>
>>> On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <
>>> ted.dunn...@gmail.com> wrote:
>>>
>>> This is a bold statement.
>>>
>>> And there are variants of it that could give users nearly the same
>>> experience that we have now. For instance, we could cache discovered
>>> schemas for old files, and discover (and cache) the schema for any new
>>> file that we see before actually running a query. That gives us pretty
>>> much the flexibility of schema-on-read without as much of the burden.
>>>
>>> On Wed, Aug 15, 2018 at 5:02 PM weijie tong <tongweijie...@gmail.com>
>>> wrote:
>>>
>>>> Hi all:
>>>> I hope this statement does not seem too rash to you.
>>>> Drill claims to be a schema-free distributed SQL engine. It takes a
>>>> lot of work to make the execution engine support storage formats like
>>>> JSON files, and that work makes it easy to introduce bugs and leaves
>>>> the code logic ugly. I wonder whether we should still insist on this,
>>>> since we are designing the metadata system with DRILL-6552.
>>>> Traditionally, people are used to designing their table schemas
>>>> before firing a SQL query, and I don't think schema-freedom saves
>>>> people much time. Other systems like Spark are popular not because
>>>> they lack schema declarations. I think we should be brave enough to
>>>> make the right decision on whether to keep insisting on a feature
>>>> that seems not so important, but is a burden.
>>>> Thanks.
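To make the "scan/probe and learn the schema" idea discussed above concrete, here is a minimal sketch of probing newline-delimited JSON, including the nested fields Jinfeng worries about. The type names (`INT`, `MAP<...>`, etc.) and the probing approach are illustrative assumptions, not Drill's actual inference code.

```python
import json

def infer_type(value):
    """Map a JSON value to a rough Drill-style type name (hypothetical names)."""
    if isinstance(value, bool):   # check bool before int: bool subclasses int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "VARCHAR"
    if value is None:
        return "NULL"
    if isinstance(value, list):
        inner = sorted({infer_type(v) for v in value}) or ["NULL"]
        return "ARRAY<%s>" % "|".join(inner)
    # Nested objects: recurse so the user need not spell out nested types.
    inner = ", ".join("%s: %s" % (k, infer_type(v)) for k, v in sorted(value.items()))
    return "MAP<%s>" % inner

def probe_schema(ndjson_lines):
    """Scan NDJSON records and collect every type observed for each field."""
    schema = {}
    for line in ndjson_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(infer_type(value))
    return {field: sorted(types) for field, types in schema.items()}
```

A field whose probe result contains more than one type (e.g. `temp` seen as both `INT` and `DOUBLE`) is exactly the case where the guessed schema would need user refinement.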
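Paul's ambiguity example ({x: "1002"} vs {x: 1003}) can also be sketched as a resolution step after sampling: numeric types widen among themselves, while a number/string clash is flagged as ambiguous and falls back to VARCHAR so no data is lost. The widening table below is an assumed policy for illustration, not Drill's actual rule set.

```python
# Hypothetical widening rules: which sets of sampled types have an
# unambiguous single column type.
WIDENING = {
    frozenset({"INT"}): "INT",
    frozenset({"DOUBLE"}): "DOUBLE",
    frozenset({"VARCHAR"}): "VARCHAR",
    frozenset({"INT", "DOUBLE"}): "DOUBLE",  # every INT fits in a DOUBLE
}

def resolve(observed_types):
    """Return (column_type, is_ambiguous) for the types seen while sampling."""
    key = frozenset(observed_types)
    if key in WIDENING:
        return WIDENING[key], False
    # Conflicting samples, e.g. x seen as both "1002" (VARCHAR) and 1003
    # (INT): widen to VARCHAR and flag it so the user can refine the guess.
    return "VARCHAR", True
```

The `is_ambiguous` flag is where Ted's caching idea and Paul's "user refines the guess" idea meet: unambiguous columns could be cached silently, while flagged ones are surfaced for a user-provided override.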