Re: [DISCUSSION] Does schema-free really need

weijie tong Wed, 15 Aug 2018 19:28:57 -0700

@Paul I really appreciate the statement ` Effort can go into new features
rather than fighting an unwinnable battle to use non-relational data in a
relational engine.` .

At AntFinancial (known as Alipay, an Alibaba-related company) we now use
Drill to support most of our analysis work. Our business and data are
complex enough. Our strategy is to let users design their schema first,
then load their data and query it later. This workflow runs smoothly.
But looking deep into Drill's internals and the JIRA bugs, we see that
much of the non-intuitive code exists to handle schema change, yet it is
of little help to most actual use cases. I think this also makes the
storage plugin interface less intuitive to implement.

We are spending most of our effort for little return. Users really don't
mind defining a schema first; what they care about is whether their
query is fast enough. Probing the data to guess the schema and caching
it is, to me, a compromise strategy, but still not clean enough. So I
hope we move the messy schema-resolution logic out of Drill and make the
code cleaner by defining the schema first with DDL statements. If we
agree on this, the work should be a sub-task of DRILL-6552.
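
To make the ambiguity from Paul's example in the quoted message concrete,
here is a minimal Python sketch. It is not Drill code: the helper names
infer_types and apply_schema are hypothetical, and the records use quoted
JSON keys so the standard json module can parse them. It only shows that
probing the data yields two candidate types for x, while a declared
schema settles the question up front.

```python
import json

# Hypothetical helpers for illustration only; none of this is Drill's API.

def infer_types(records, field):
    """Return the set of Python type names observed for `field`."""
    return {type(r[field]).__name__ for r in records if field in r}

def apply_schema(records, schema):
    """Coerce each declared field using the converter the user declared."""
    return [
        {k: schema[k](v) if k in schema else v for k, v in r.items()}
        for r in records
    ]

# The two records from the example, written as valid JSON.
lines = ['{"x": "1002"}', '{"x": 1003}']
records = [json.loads(line) for line in lines]

# Inference sees both a string and an integer and cannot choose by itself.
seen = infer_types(records, "x")

# A user-declared schema resolves the conflict either way.
as_int = apply_schema(records, {"x": int})
as_str = apply_schema(records, {"x": str})

print(sorted(seen), as_int, as_str)
```

With the schema declared first, the engine never has to guess or handle a
type change in the middle of a scan; the user's choice decides it.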

On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <[email protected]>
wrote:

> Hi Ted,
>
> I like the "schema auto-detect" idea.
>
> As we discussed in a prior thread, caching of schema is a nice add-on once
> we have defined the schema-on-read mechanism. Maybe we first get it to work
> with a user-provided schema. Then, as an enhancement, we offer to infer the
> schema by scanning data.
>
> There are some ambiguities that schema inference can't resolve: in {x:
> "1002"} {x: 1003}, should x be an Int or a Varchar?
>
> Still, if Drill could provide a guess at the schema, and the user could
> refine it, we'd have a very elegant solution.
>
>
> Thanks,
> - Paul
>
>
>
>     On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning <
> [email protected]> wrote:
>
>  This is a bold statement.
>
> And there are variants of it that could give users nearly the same
> experience that we have now. For instance, if we cache discovered schemas
> for old files and discover the schema for any new file that we see (and
> cache it) before actually running a query. That gives us pretty much the
> flexibility of schema on read without as much of the burden.
>
>
>
> On Wed, Aug 15, 2018 at 5:02 PM weijie tong <[email protected]>
> wrote:
>
> > Hi all:
> >  I hope this statement does not seem too rash to you.
> >  Drill claims to be a schema-free distributed SQL engine. It takes a lot
> > of work to make the execution engine support this for JSON-like storage
> > formats. That makes it easier to introduce bugs and makes the code logic
> > ugly. I wonder whether we should still insist on this, since we are
> > designing the metadata system with DRILL-6552.
> >    Traditionally, people are used to designing their table schema first,
> > before firing a SQL query. I don't think skipping that step saves people
> > much time. Other systems like Spark are popular, and not because they
> > lack schema declaration. I think we should be brave enough to make the
> > right decision on whether to still insist on this feature, which seems
> > not so important but is a burden.
> >    Thanks.
> >
>
