...is the name of a provocative blog post [1].
Quote: "Once found, diverse data sets are very hard to integrate, since the
data typically contains no documentation on the semantics of its attributes.
... The rule of thumb is that data scientists spend 70% of their time finding,
interpreting, and cleaning data, and only 30% actually analyzing it. Schema on
read offers no help in these tasks, because data gives up none of its secrets
until actually read, and even when read has no documentation beyond attribute
names, which may be inscrutable, vacuous, or even misleading."
This quote relates to a discussion Salim and I have been having: Drill
struggles to extract a usable schema directly from anything but the cleanest of
data sets, leading to unwanted and unexpected schema-change exceptions due to
inherent ambiguities in how to interpret the data. (For example, in JSON, if a
column contains nothing but nulls, what type should we assign to it?)
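To make the ambiguity concrete, here is a minimal Python sketch (not Drill's
actual inference code) of naive schema-on-read type inference. The helper
names (`infer_type`, the "UNKNOWN"/"CONFLICT" markers) are illustrative
assumptions; the point is that an all-null column gives the reader no type
information at all.

```python
import json

def infer_type(values):
    """Infer a column type from sample values; nulls carry no information."""
    types = {type(v).__name__ for v in values if v is not None}
    if not types:
        return "UNKNOWN"   # all nulls: the type is inherently ambiguous
    if len(types) > 1:
        return "CONFLICT"  # mixed types across records: schema change
    return types.pop()

records = [json.loads(line) for line in [
    '{"a": 1, "b": null}',
    '{"a": 2, "b": null}',
]]
columns = {key for record in records for key in record}
schema = {c: infer_type([r.get(c) for r in records]) for c in columns}
# column "a" infers cleanly as int; column "b" stays UNKNOWN
```

A reader making a one-pass scan over this data has to guess a type for "b" or
defer the decision, and either choice can trigger a schema-change error if a
later batch reveals the real type.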
A possible answer is further down in the post: "At Comcast, for instance, Kafka
topics are associated with Apache Avro schemas that include non-trivial
documentation on every attribute and use common subschemas to capture commonly
used data... 'Schema on read' using Avro files thus includes rich documentation
and common structures and naming conventions."
Food for thought.
Thanks,
- Paul
[1]
https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328