Just to clarify: the article seemed to indicate that Comcast has an Avro schema
file for each of their Kafka data sources, and that file contains the metadata.
The analogy in Drill would be if we had an ambiguous JSON file, along with an
Avro file (say) that defined the columns: their data types, their names, and so
on. The exact Comcast design probably wouldn't fit Drill; it is the concept
that is thought-provoking.

Today, we can do a pretty good job with JSON as long as it has a very clean
schema:

* No null or missing fields in the first record.
* Consistent data types.
* The same set of fields in every file.

The easiest problem to visualize is when something is missing: Drill has to
guess, there are multiple choices, and Drill guesses wrong. The classic
example: a field is missing, so we guess "Nullable Int" when it is, in fact, a
VarChar. Yes, we could drop this "dangling" field, but doing so might be
somewhat surprising to the user.
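
To make that concrete, consider a made-up pair of JSON files (names and values
invented purely for illustration):

  file1.json:
    {"id": 1, "name": null}
    {"id": 2, "name": "widget"}

  file2.json:
    {"id": 3, "name": "gadget", "price": 1.99}

If Drill scans file1.json first, nothing in the data says what type "name" is
(its first value is null), and "price" does not appear at all, so whatever
Drill guesses can conflict with the values that show up in file2.json,
producing exactly the kind of surprise described above.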

For Parquet, we have to deal only with the missing-field problem (i.e., schema
evolution). With CSV, we have the problem of knowing the actual data type of a
column (is the column "price" really text, or is it text that represents a
number? Of what type?). And so it goes.
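
For instance, a made-up CSV fragment (again, purely illustrative):

  price,qty,sku
  10.50,3,A-100
  n/a,2,B-200

Nothing in the file says whether "price" is a number (with "n/a" as a bad or
null value) or text that merely looks numeric in most rows; a human can guess,
but the engine needs a hint.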

With regard to the two modes: we could even have a single mode in which we use
metadata when it is present, and we guess otherwise. With clean schemas
(Parquet files with identical schemas), Drill's guesses are sufficient. But
when the situation is ambiguous, we could allow the user to specify just enough
metadata to resolve the ambiguity. (In Parquet, say, if field "x" was added
later, then the metadata could simply say: when column "x" is missing, assume
it is a Date.)
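
As a sketch of what "just enough" metadata might look like, imagine a small
per-table hint file (the file format, property names, and syntax below are
purely hypothetical; nothing like this exists in Drill today):

  {
    "columns": [
      { "name": "x", "type": "DATE" }
    ]
  }

Columns not listed would be discovered exactly as they are today; only the
ambiguous column is pinned down.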


If the user wants to specify everything, then that is just a special case of 
the "just enough schema" model.

I believe Drill can use the Hive metastore, but only for the Hive readers. So,
a possible first step is to combine the Hive metastore schema information with
the Drill native readers.

In an ideal world, Drill would have a "schema plugin" alongside its storage
and format plugins, so Drill could integrate with a variety of metadata systems
(including Comcast's unique Avro schema files). It would be even better if the
schema hints could also be provided via Drill's existing table functions for
ad-hoc use.
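
For instance, something like the following (hypothetical syntax: today's table
functions accept only format-plugin options, so the "schema" property below is
invented just to illustrate the idea):

  SELECT x, price
  FROM table(dfs.`sales/2018`(
      schema => 'x DATE, price DOUBLE'));

The table function would supply only the hints needed to resolve the ambiguity
for that one query, leaving everything else to Drill's normal type discovery.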

All of this is just something to keep in the back of our minds as we think 
about how to resolve schema change issues.

Thanks,
- Paul

 

    On Monday, April 2, 2018, 10:54:23 AM PDT, Aman Sinha <amansi...@apache.org> wrote:
It is certainly a huge advantage to have embedded data type information in
the data, such as that provided by the Avro format. In the past, XML also had
schemas and DTDs. One may argue, though, that XML died under the weight of the
extra structure added to it, and people just gravitated toward JSON.
In that respect, Avro provides a good middle ground. A similar approach is
taken by the MapR-DB JSON database, which has data type information for the
fields of a JSON document.

That said, we still have to (a) deal with JSON data, which is one of the most
prevalent formats in the big data space, and (b) handle schema changes even
with Avro-like formats.
Comcast's viewpoint suggests a one-size-fits-all approach, but there are
counter-points to that, for instance as mentioned here [1]. It would be very
useful to have a survey of other users/companies that are dealing with schema
evolution issues, to get a better understanding of whether Comcast's
experience is a broader trend.

From Drill's perspective, we have in the past discussed the need for two
modes:
 - A fixed-schema mode, which operates in a manner similar to the RDBMSs.
This is needed not just to resolve ambiguities but also for performance: why
treat a column as nullable when the data is non-nullable?
 - A variable-schema mode, which is what Drill does today, but this part needs
to be enhanced to be *declarative* so that ambiguities are removed. A user may
choose not to create any declaration, in which case Drill would default to a
documented set of rules that perform type conversions.


[1] https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/


-Aman


On Sun, Apr 1, 2018 at 10:46 PM, Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-
> death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328
