It is certainly a huge advantage to have data type information embedded in the data, as the Avro format provides. In the past, XML also had schemas and DTDs, although one may argue that XML died under the weight of that extra structure and people simply gravitated toward JSON. In that respect, Avro provides a good middle ground. The MapR-DB JSON database takes a similar approach by keeping data type information for the fields of a JSON document.
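To illustrate (a minimal, hypothetical sketch; the record and field names are invented, not taken from Comcast's actual schemas), an Avro schema is itself JSON and carries both the types and per-field documentation along with the data:

  {
    "type": "record",
    "name": "PageView",
    "doc": "One page-view event emitted by the web tier.",
    "fields": [
      {"name": "user_id", "type": "long",
       "doc": "Stable numeric account id, never reused."},
      {"name": "referrer", "type": ["null", "string"], "default": null,
       "doc": "HTTP referrer; null when the visit was direct."}
    ]
  }

Note that nullability is explicit here: a field is nullable only if its type is a union with "null", which is exactly the kind of distinction a reader can exploit.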
That said, we still have to (a) deal with JSON data, which is one of the most prevalent formats in the big data space, and (b) handle schema changes even with Avro-like formats. Comcast's viewpoint suggests a one-size-fits-all approach, but there are counter-points to that, for instance as mentioned here [1]. It would be very useful to have a survey of other users/companies that are dealing with schema evolution issues, to get a better understanding of whether Comcast's experience reflects a broader trend.

From Drill's perspective, we have in the past discussed the need for 2 modes:

- A fixed schema mode, which operates in a manner similar to RDBMSs. This is needed not just to resolve ambiguities but also for performance: why treat a column as nullable when the data is non-nullable?

- A variable schema mode, which is what Drill does today, but enhanced to be *declarative*, such that ambiguities are removed. A user may choose not to create any declaration, in which case Drill would fall back to a documented set of rules for type conversions. (A sketch of the kind of ambiguity a declaration would resolve appears after the quoted message below.)

[1] https://www.marklogic.com/blog/schema-on-read-vs-schema-on-write/

-Aman

On Sun, Apr 1, 2018 at 10:46 PM, Paul Rogers <[email protected]> wrote:

> ...is the name of a provocative blog post [1].
>
> Quote: "Once found, diverse data sets are very hard to integrate, since
> the data typically contains no documentation on the semantics of its
> attributes. ... The rule of thumb is that data scientists spend 70% of
> their time finding, interpreting, and cleaning data, and only 30% actually
> analyzing it. Schema on read offers no help in these tasks, because data
> gives up none of its secrets until actually read, and even when read has no
> documentation beyond attribute names, which may be inscrutable, vacuous, or
> even misleading."
>
> This quote relates to a discussion Salim & I have been having: that Drill
> struggles to extract a usable schema directly from anything but the
> cleanest of data sets, leading to unwanted and unexpected schema change
> exceptions due to inherent ambiguities in how to interpret the data. (E.g.
> in JSON, if we see nothing but nulls, what type is the null?)
>
> A possible answer is further down in the post: "At Comcast, for instance,
> Kafka topics are associated with Apache Avro schemas that include
> non-trivial documentation on every attribute and use common subschemas to
> capture commonly used data... 'Schema on read' using Avro files thus
> includes rich documentation and common structures and naming conventions."
>
> Food for thought.
>
> Thanks,
> - Paul
>
> [1] https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read?imm_mid=0fc3c6&cmp=em-data-na-na-newsltr_20180328
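To make the null ambiguity Paul mentions concrete, here is a minimal, hypothetical JSON sample (invented for illustration, not taken from the thread). After the first two records, a reader has seen only nulls for "referrer" and must guess its type; the third record may then contradict that guess and trigger a schema change:

  {"user_id": 101, "referrer": null}
  {"user_id": 102, "referrer": null}
  {"user_id": 103, "referrer": "https://example.com"}

A declaration such as the Avro schema sketched earlier pins "referrer" to a nullable string up front, so no guess, and no unexpected schema change, is needed.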
