...is the name of a provocative blog post [1].
Quote: "Once found, diverse data sets are very hard to integrate, since the
data typically contains no documentation on the semantics of its attributes.
... The rule of thumb is that data scientists spend 70% of their time finding,
interpreting, and cleaning data, and only 30% actually analyzing it. Schema on
read offers no help in these tasks, because data gives up none of its secrets
until actually read, and even when read has no documentation beyond attribute
names, which may be inscrutable, vacuous, or even misleading."
This quote relates to a discussion Salim and I have been having: Drill
struggles to extract a usable schema directly from anything but the cleanest of
data sets, leading to unwanted and unexpected schema-change exceptions due to
inherent ambiguities in how to interpret the data. (For example, in JSON, if a
column contains nothing but nulls, what type should we assign to it?)
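To make the ambiguity concrete, here is a minimal Python sketch (not Drill's
actual inference code) of naive schema-on-read type inference. The helper
names (`infer_type`, the "UNKNOWN"/"CONFLICT" markers) are illustrative
assumptions; the point is that an all-null column gives the reader no type
information at all.

```python
import json

def infer_type(values):
    """Infer a column type from sample values; nulls carry no information."""
    types = {type(v).__name__ for v in values if v is not None}
    if not types:
        return "UNKNOWN"   # all nulls: the type is inherently ambiguous
    if len(types) > 1:
        return "CONFLICT"  # mixed types across records: schema change
    return types.pop()

records = [json.loads(line) for line in [
    '{"a": 1, "b": null}',
    '{"a": 2, "b": null}',
]]
columns = {key for record in records for key in record}
schema = {c: infer_type([r.get(c) for r in records]) for c in columns}
# column "a" infers cleanly as int; column "b" stays UNKNOWN
```

A reader making a one-pass scan over this data has to guess a type for "b" or
defer the decision, and either choice can trigger a schema-change error if a
later batch reveals the real type.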
A possible answer is further down in the post: "At Comcast, for instance, Kafka
topics are associated with Apache Avro schemas that include non-trivial
documentation on every attribute and use common subschemas to capture commonly
used data... 'Schema on read' using Avro files thus includes rich documentation
and common structures and naming conventions."
Food for thought.
Thanks,
- Paul
[1]
https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328