This, of course, begs the question [1], doesn't it?

If you have the schema, then you have either a) spent time designing and
documenting your data (both the schema and the data dictionary that
captures its semantics) or b) spent time "finding, interpreting, and
cleaning data" to discover the schema and data dictionary.

Data that has "no documentation beyond attribute names, which may be
inscrutable, vacuous, or even misleading" will continue to be so even after
you specify the schema.

Asking users to design their schemas when they have already accumulated
data that is unclean and undocumented is asking them to do the work that
they use your software for in the first place.

The goal of schema on read is to facilitate the task of interpreting the
data that already exists, is mutating, and is undocumented (or documented
badly).
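
To make the ambiguity from the quoted message below concrete (a JSON field
that is nothing but nulls), here is a minimal sketch of naive schema
inference on read. It is only an illustration, not how Drill does it; the
infer_schema helper and the two sample batches are hypothetical.

```python
import json


def infer_schema(lines):
    """Map each field name to the set of type names seen for it."""
    schema = {}
    for line in lines:
        record = json.loads(line)
        for field, value in record.items():
            # A JSON null carries no type information of its own.
            type_name = "null" if value is None else type(value).__name__
            schema.setdefault(field, set()).add(type_name)
    return schema


# Hypothetical batches of the same undocumented data set.
batch_1 = ['{"id": 1, "note": null}', '{"id": 2, "note": null}']
batch_2 = ['{"id": 3, "note": "late"}', '{"id": "4", "note": null}']

schema_1 = infer_schema(batch_1)
schema_2 = infer_schema(batch_2)
```

After the first batch, "note" has only ever been null, so any type the
reader picks is a guess. The second batch shows it was a string, and also
that "id" switches from an integer to a string, which is the kind of change
that surfaces only when the data is actually read.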


[1] https://en.wikipedia.org/wiki/Begging_the_question


On Mon, Apr 2, 2018 at 11:16 AM, Paul Rogers wrote:

> ...is the name of a provocative blog post [1].
>
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
>
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
>
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
>
> Food for thought.
>
> Thanks,
> - Paul
>
> [1] https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328
