Thanks for the background, Jinfeng. Your explanation brings us back to the
topic Arina raised: the state and direction of the Drill project.

For several years now, Drill has said, essentially, "dabble with raw data all 
you like early in the project, but for production, ETL your data into Parquet." 
One view is that we double down on this idea. Parquet has many advantages: it
carries its own schema. As Jinfeng noted, it is the product of an ETL process 
that cleans up and normalizes data, removing the variations that creep in 
during schema evolution.

Further, Drill already has Ted's mechanism to pre-scan the data: Drill does it 
to capture the Parquet footer and directory metadata. There is a 
not-yet-committed project to gather stats on Parquet files. Parquet lets Drill
be "schema-free" (yes, a marketing term; really schema-on-read) while resting
on a good, solid schema defined by the files themselves. Of course, even
Parquet is subject to ambiguities due to newly added columns, but ETL should
clean up such issues.
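
As a quick illustration of "carries its own schema": the schema lives in the
Parquet file footer, and a reader can pull it out without scanning the data.
A minimal Python sketch using pyarrow (the path is made up):

    # Parquet stores its schema in the file footer; a reader can fetch it
    # without touching the data pages. The path is hypothetical.
    import pyarrow.parquet as pq

    schema = pq.read_schema("/data/warehouse/orders/part-00000.parquet")
    print(schema)  # column names and types, straight from the footer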

Extend that to other types: "schema-free" could mean, "Drill does not do 
schemas; each data source must provide its own CONSISTENT schema." Parquet does 
this today, as does Hive. Allow external hints for CSV and JSON (see the
sketch below for what such a hint could look like). Require that CSV and JSON
have clean, extended-relational schemas to avoid messy ambiguities and schema
changes.
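
To make "external hints" concrete: Spark already does this for CSV. The user
hands the reader an explicit schema, and the reader never guesses. This is a
sketch only (the file path and column names are invented), not a proposal for
Drill syntax:

    # Illustrative only: an "external schema hint" for a schema-free format,
    # using Spark's existing CSV reader. Path and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("csv-schema-hint").getOrCreate()

    hint = StructType([
        StructField("customer_name", StringType(), nullable=True),
        StructField("balance", DoubleType(), nullable=True),
    ])

    # With an explicit schema, every batch comes out typed the same way;
    # there is no record-by-record guessing.
    df = spark.read.schema(hint).csv("/data/raw/customers.csv")
    df.printSchema()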

An argument can be made that many fine ETL tools exist: NiFi, Spark, MR, Hive, 
StreamSets, ... Drill need not try to solve this problem. Further, those other 
products generally allow the user to use code to handle tough cases; Drill only 
has SQL, and often SQL is simply not sufficiently expressive. Here is a 
favorite example:

    {"fields": [
      {"id": 1, "name": "customer-name", "type": "string", "value": "fred"},
      {"id": 2, "name": "balance", "type": "float", "value": "123.45"}
    ]}

It is unlikely that Drill can add enough UDFs to parse the above, let alone do
so efficiently. But parsing it in code (in Spark, say) is easy.
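
To back the "easy in code" claim, here is a rough PySpark sketch (all names
invented): explode the field array, pivot name into columns, then cast each
column using its declared type. Doing the same generically in pure SQL is far
harder:

    # Rough sketch: flatten the generic field-list encoding shown above
    # into ordinary typed columns. All names here are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("field-list-flatten").getOrCreate()

    # In practice this would be spark.read.json("/path/to/files").
    doc = ('{"fields": ['
           '{"id": 1, "name": "customer-name", "type": "string", "value": "fred"}, '
           '{"id": 2, "name": "balance", "type": "float", "value": "123.45"}]}')
    raw = spark.read.json(spark.sparkContext.parallelize([doc]))

    # One output row per input record: explode the array of fields.
    rows = raw.withColumn("row_id", F.monotonically_increasing_id())
    exploded = (rows.select("row_id", F.explode("fields").alias("f"))
                    .select("row_id",
                            F.col("f.name").alias("name"),
                            F.col("f.type").alias("type"),
                            F.col("f.value").alias("value")))

    # Pivot name -> value, giving one column per declared field.
    flat = exploded.groupBy("row_id").pivot("name").agg(F.first("value"))

    # The schema is data: collect the declared (name, type) pairs and cast
    # each pivoted column to its declared type.
    for d in exploded.select("name", "type").distinct().collect():
        flat = flat.withColumn(d["name"], flat[d["name"]].cast(d["type"]))

    flat.show()  # row_id, balance (float), customer-name (string)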

So, perhaps Drill becomes a Parquet-focused query engine, requiring an
unambiguous schema defined by the data source itself (which includes, for
Parquet, Drill's metadata and stats files). Drill stays "schema-free", but the
data is required to provide a clean, unambiguous schema.

The only problem is that this big data niche is already occupied by an 
entrenched leader called Impala. Impala is faster than Drill (even if it is
hard to build and very hard to tune). Drill can't win by saying, "we're
almost as fast, have fewer bugs, and are easier to build." Spark didn't win by
being almost as good as MR; it won because it was far better on many
dimensions. Marketing 101 says that a product can never win a head-to-head
battle with an entrenched leader.

So, what niche can Drill fill that is not already occupied? Spark is great for 
complex data transforms of the kind shown above, but is not good at all for 
interactive SQL. (Spark is not a persistent engine; it is a batch system like
MR. As noted in an earlier message, Spark shuffles data in stages, not in
batches like Drill and Impala.) But Spark has a huge community; maybe someone
will solve these issues. So, Drill as a better Spark is not appealing.

One idea that has come up from time to time is Drill as an open source
Splunk-like tool. Splunk can ingest zillions of file formats using adapters,
akin to Drill's storage and format plugins. Ingesting arbitrary files requires
some schema cleansing on read. Drill's answer could be to add ad-hoc metadata,
allow data-cleaning plugins, and add UDFs. That is, double down on the idea
that Drill reads many formats, and solve the remaining issues to do so well.
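
For flavor only, not a design: a "data-cleaning plugin" could be as small as a
per-column fix-up hook applied as the reader decodes each record. Everything
below is invented to show the shape of the idea:

    # Purely hypothetical sketch of a per-column "data-cleaning plugin"
    # hook; none of these names exist in Drill.
    def clean_balance(raw):
        """Coerce a messy currency string to a float; treat blanks as NULL."""
        if raw in (None, "", "N/A"):
            return None
        return float(str(raw).replace("$", "").replace(",", ""))

    CLEANERS = {"balance": clean_balance}

    def clean_record(record):
        """Run each column through its cleaner (if any) during the read."""
        return {col: CLEANERS[col](val) if col in CLEANERS else val
                for col, val in record.items()}

    print(clean_record({"name": "fred", "balance": "$1,234.50"}))
    # -> {'name': 'fred', 'balance': 1234.5}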

In short, the big question is, "what does Drill want to do now that it's grown
up?" Compete with Impala (Parquet only)? Compete with Spark (better code-based
query engine)? Compete with Splunk (query any file format)? Something else?

Whatever we do, to Weijie's point, we should do it in a way that is stable: 
today's approach to handling messy schemas can't ever work completely because
it requires that Drill predict the future: a reader must decide on record 1 how 
to handle a field that won't actually appear until file (or block) 100. Do we 
need that? How do we maintain code (union vectors, list vectors, schema change) 
that never worked and probably never can? What is the better solution?
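
A toy illustration of the prediction problem (plain Python; the files and the
guessing rule are invented, not Drill's actual code). A streaming reader emits
batches as it goes, so by the time a late-arriving column shows up, the
earlier batches are long gone and cannot be retyped:

    import json

    # "b" does not exist until file 100.
    files = ['{"a": 1}'] * 99 + ['{"a": 100, "b": "x"}']

    column_types = {}
    for file_num, text in enumerate(files, start=1):
        record = json.loads(text)
        for col, val in record.items():
            guessed = column_types.setdefault(col, type(val).__name__)
            if guessed != type(val).__name__:
                raise ValueError(f"schema change in file {file_num}: {col} "
                                 f"was {guessed}, now {type(val).__name__}")
        # Each batch ships downstream immediately. The first 99 batches
        # either omitted "b" or guessed a type for it; when file 100 reveals
        # the truth, those batches cannot be recalled.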

Thanks,
- Paul

 

On Wednesday, August 15, 2018, 11:15:10 PM PDT, Jinfeng Ni <j...@apache.org> wrote:

The use case Weijie described seems to fall into the category of a
traditional data warehouse, i.e., schemas are predefined by users and data
strictly conforms to the schema. Certainly this is one important use case,
and I agree that the schema-on-read logic in the Drill run-time is indeed a
disadvantage for such a use case, compared with other SQL query engines like
Impala/Presto.

The question we want to ask is whether that's the only use case Drill wants
to target. We probably want to hear more cases from the Drill community
before we can decide on the best strategy going forward.

In the examples Paul listed, why would two sets of data have different
schemas? In many cases, it's because the application generating the data has
changed: a field was added or deleted, or an existing field was modified. ETL
is the typical approach to cleaning up such data. Drill's argument, a couple
of years ago when the project was started, was that ETL is too
time-consuming; a query engine that could query directly against such
datasets would provide great value.

I feel Paul's suggestion of letting the user provide a schema, or having
Drill scan/probe and learn the schema, falls in the middle of the spectrum:
ETL is one extreme, and Drill's current schema-on-read is the other.
Personally, I would prefer letting Drill scan/probe the schema, as it might
not be easy for users to provide a schema in the case of nested data (would
they have to provide type information for every nested field?).

On Weijie's comment about the complexity of the code dealing with schema: in
theory we should refactor/rewrite the majority of the run-time operators,
separating the logic that handles schema from the logic that handles the
regular data flow. That would clean up the current mess.

PS 1: IMHO, schema-less is purely a PR term. The more appropriate term for
Drill would be schema-on-read.
PS 2: I would not call it a battle between non-relational data and a
relational engine. The extended relational model has array/composite types,
similar to what Drill has.





On Wed, Aug 15, 2018 at 7:27 PM, weijie tong <tongweijie...@gmail.com>
wrote:

> @Paul I really appreciate the statement `Effort can go into new features
> rather than fighting an unwinnable battle to use non-relational data in a
> relational engine.`
>
> At Ant Financial (known as Alipay, an Alibaba-affiliated company) we now use
> Drill to support most of our analysis work. Our business and data are
> complex enough. Our strategy is to let users design their schema first,
> then dump in their data, and query it later. This workflow runs smoothly.
> But digging into Drill's code internals and the JIRA bugs, we see that most
> of the non-intuitive code exists to handle schema change, yet it offers
> little help for most actual use cases. I think this also makes the storage
> plugin interface less intuitive to implement.
>
> We are spending a great deal of effort for little return. Users really
> don't mind defining a schema first; what they care about is whether their
> query is fast enough. Probing the data to guess the schema and caching it
> is, to me, a compromise strategy, but still not clean enough. So I hope we
> move the messy schema-handling logic out of Drill and make the code cleaner
> by defining the schema up front with DDL statements. If we agree on this,
> the work should be a sub-task of DRILL-6552.
>
> On Thu, Aug 16, 2018 at 8:51 AM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi Ted,
> >
> > I like the "schema auto-detect" idea.
> >
> > As we discussed in a prior thread, caching of schema is a nice add-on
> > once we have defined the schema-on-read mechanism. Maybe we first get it
> > to work with a user-provided schema. Then, as an enhancement, we offer to
> > infer the schema by scanning data.
> >
> > There are some ambiguities that schema inference can't resolve: in {x:
> > "1002"} {x: 1003}, should x be an Int or a Varchar?
> >
> > Still if Drill could provide a guess at the schema, and the user could
> > refine it, we'd have a very elegant solution.
> >
> >
> > Thanks,
> > - Paul
> >
> >
> >
> > On Wednesday, August 15, 2018, 5:35:06 PM PDT, Ted Dunning
> > <ted.dunn...@gmail.com> wrote:
> >
> >  This is a bold statement.
> >
> > And there are variants of it that could give users nearly the same
> > experience that we have now. For instance, we could cache discovered
> > schemas for old files and discover (and cache) the schema for any new
> > file we see before actually running a query. That gives us pretty much
> > the flexibility of schema on read without as much of the burden.
> >
> >
> >
> > On Wed, Aug 15, 2018 at 5:02 PM weijie tong <tongweijie...@gmail.com>
> > wrote:
> >
> > > Hi all:
> > >  Hope the statement does not seem too rash to you.
> > >  Drill claims to be a schema-free distributed SQL engine. A lot of work
> > > has gone into making the execution engine support storage formats like
> > > JSON files. That makes it easy to introduce bugs and leaves the code
> > > logic ugly. I wonder whether we should still insist on this, since we
> > > are designing the metadata system with DRILL-6552.
> > >  Traditionally, people are used to designing their table schemas before
> > > firing a SQL query. I don't think being schema-free saves people that
> > > much time. Other systems like Spark are popular for reasons other than
> > > lacking schema declarations. I think we should be brave enough to make
> > > the right decision on whether to keep insisting on this feature, which
> > > seems not so important but is a burden.
> > >  Thanks.
> > >
> >
>
  
