It is certainly a huge advantage to have data type information embedded in
the data, as the Avro format provides. In the past, XML also had schemas
and DTDs, although one may argue that XML died under the weight of that
extra structure and people simply gravitated towards JSON. In that
respect, Avro provides a good middle ground. A similar approach is taken
by the MapR-DB JSON database, which keeps data type information for the
fields of a JSON document.
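
As a concrete illustration, here is a minimal sketch using the fastavro
Python library (the file name, record name, and fields are my own
assumptions, not anything from this thread):

from fastavro import parse_schema, reader, writer

# An Avro schema carries both types and documentation with the data.
schema = {
    "type": "record",
    "name": "User",
    "doc": "A user record; every field is typed and documented.",
    "fields": [
        {"name": "name", "type": "string", "doc": "Login name."},
        {"name": "age", "type": ["null", "int"], "default": None,
         "doc": "Age in years; null when unknown."},
    ],
}

with open("users.avro", "wb") as out:
    writer(out, parse_schema(schema), [{"name": "alice", "age": 30}])

# Any reader can recover the types and docs straight from the file.
with open("users.avro", "rb") as f:
    print(reader(f).writer_schema)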

That said, we still have to (a) deal with JSON data, which is one of the
most prevalent formats in the big data space, and (b) handle schema
changes even with Avro-like formats.
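For point (b), Avro's schema resolution rules do soften the blow; a
minimal sketch, again with fastavro (the added "email" field and the file
name are assumptions for illustration):

from fastavro import parse_schema, reader

# Evolved reader schema: adds a field with a default, so files written
# with the old schema remain readable.
new_schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},
        # Added after the fact; the default fills it in for old records.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

with open("users.avro", "rb") as f:
    for rec in reader(f, reader_schema=new_schema):
        print(rec)  # old records come back with email=None

Renames or incompatible type changes, though, still require coordination;
schema evolution is eased, not eliminated.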
Comcast's viewpoint suggests a one-size-fits-all approach, but there are
counter-points to that, for instance as discussed in [1]. It would be very
useful to have a survey of other users/companies that are dealing with
schema evolution issues, to get a better understanding of whether
Comcast's experience reflects a broader trend.

From Drill's perspective, we have in the past discussed the need for 2
modes:
 - A fixed schema mode, which operates in a manner similar to the RDBMSs.
This is needed not just to resolve ambiguities but also for performance:
why treat a column as nullable when the data is non-nullable?
 - A variable schema mode, which is what Drill does today, but enhanced to
be 'declarative' so that ambiguities are removed. A user may choose not to
create any declaration, in which case Drill would fall back to a
documented set of default rules for type conversions (a hypothetical
sketch of such rules follows below).
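
To make that second mode concrete, here is a purely hypothetical sketch
(in Python; this is not Drill's actual code or behavior) of what a
documented set of default type-conversion rules could look like:

# Hypothetical default rules for resolving a JSON column's type from the
# non-null values observed during a scan. Illustrative only.
def resolve_column_type(observed):
    types = {type(v).__name__ for v in observed if v is not None}
    if not types:
        return "VARCHAR"          # all-null column: fall back to text
    if types <= {"int", "float"}:
        return "DOUBLE" if "float" in types else "BIGINT"
    if types == {"bool"}:
        return "BOOLEAN"
    if types == {"str"}:
        return "VARCHAR"
    return "VARCHAR"              # mixed types: degrade to text

print(resolve_column_type([None, None]))    # VARCHAR
print(resolve_column_type([1, 2.5, None]))  # DOUBLE

With rules like these written down, the all-nulls ambiguity Paul mentions
below gets a deterministic, documented answer rather than a schema change
exception.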


[1] https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/


-Aman


On Sun, Apr 1, 2018 at 10:46 PM, Paul Rogers <[email protected]>
wrote:

> ...is the name of a provocative blog post [1].
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
> Food for thought.
> Thanks,
> - Paul
> [1] https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328
