On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff <[email protected]> wrote:
> Hello,
>
> A lot of versioning problems arise when trying to share data through
> Kafka between multiple applications with different lifecycles and
> maintainers, since by default a single message in Kafka is just a blob.
> One way to solve that is to agree on a single serialization format that
> is friendly to record-per-record storage (like Avro) and, to avoid
> serializing the schema with every message, to reference an entry in the
> Avro Schema Registry instead (this flow is described here:
> https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
> ).
> On top of the schema registry, specific client libs validate the
> message structure before it is injected into Kafka.
> So while Comcast mentions the usage of an Avro schema to describe its
> feeds, it does not directly mention the usage of Avro files (to
> describe the schema).

This is all good except for the assumption of a single schema for all
time. You can mutate schemas in Avro (or JSON) in a future-proof manner,
but it is important to recognize the simple truth that the data in a
stream will not necessarily be uniform (and is even unlikely to be
uniform). Sketches of both the registry flow and a compatible schema
change are at the end of this mail.

> .... But the usage of CSV/JSON is still problematic. I like the idea
> of having an optional way to describe the expected types somewhere
> (either in a central meta-store, or in a structured file next to the
> dataset).

Central meta-stores are a serious problem and are the single biggest
nightmare in trying to upgrade Hive users. Let's avoid that if possible.

Writing meta-data next to the file is also problematic if it needs to be
written by the process doing a query (the directory may not be
writable). Having a convention for redirecting the meta-data cache to a
parallel directory might solve the problem of non-writable local
locations (a sketch of such a convention is also below). In the worst
case, where Drill can't find any place to persist what it has learned
but wants to do a restart, there needs to be SOME place to cache
meta-data, or else restarts will get no further than the original failed
query.
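For concreteness, here is a minimal sketch of the registry flow
described above, using Confluent's KafkaAvroSerializer with generic
records. The broker address, registry URL, topic name, and schema are
placeholders, not anything from the Comcast post:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RegistryProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");   // assumed broker
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    // Confluent's serializer registers the schema (if new) and writes
    // only a short schema id plus the Avro-encoded body to each
    // message, never the schema itself.
    props.put("value.serializer",
        "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("schema.registry.url", "http://localhost:8081");

    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");

    try (KafkaProducer<String, GenericRecord> producer =
             new KafkaProducer<>(props)) {
      // Serialization fails here if the record does not match the
      // registered schema -- the pre-injection validation mentioned
      // in the quoted mail.
      producer.send(new ProducerRecord<>("users", "k1", user));
    }
  }
}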
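And a sketch of the kind of future-proof mutation I mean: a v2 reader
schema that adds a nullable field with a default can still decode v1
records through Avro's standard schema resolution. The record type and
field names here are illustrative only:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionSketch {
  public static void main(String[] args) throws Exception {
    // v1: the schema the data was actually written with.
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    // v2: adds a nullable field WITH a default, so v2 readers can
    // still decode v1 records.
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null}]}");

    GenericRecord v1 = new GenericData.Record(writer);
    v1.put("name", "alice");

    // Encode with the old (writer) schema.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(v1, enc);
    enc.flush();

    // Decode with the new (reader) schema; schema resolution fills in
    // the missing field from its default.
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord v2 =
        new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    System.out.println(v2);  // {"name": "alice", "email": null}
  }
}

Note that this only works record by record, which is exactly why a
stream accumulated over time cannot be assumed to be uniform.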
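Finally, to make the parallel-directory idea concrete, a purely
hypothetical sketch (this is not existing Drill code, and the cache
root and file name are invented): cache beside the data when the
dataset directory is writable, otherwise mirror the data path under a
writable cache root.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MetaCacheSketch {
  // Hypothetical convention: prefer a cache file next to the data;
  // fall back to nesting the data path under a parallel cache root.
  static Path metaCachePath(Path dataDir, Path parallelRoot) {
    if (Files.isWritable(dataDir)) {
      return dataDir.resolve(".meta_cache");
    }
    // Strip the leading "/" so the absolute data path can be nested
    // under the cache root.
    Path relative = dataDir.getRoot() != null
        ? dataDir.getRoot().relativize(dataDir)
        : dataDir;
    return parallelRoot.resolve(relative).resolve(".meta_cache");
  }

  public static void main(String[] args) {
    Path data = Paths.get("/warehouse/events/2018/04"); // read-only dataset
    Path root = Paths.get("/var/cache/drill-meta");     // writable cache root
    System.out.println(metaCachePath(data, root));
    // -> /var/cache/drill-meta/warehouse/events/2018/04/.meta_cache
  }
}

The point is not the particular layout, but that the redirection is a
pure convention, so a query can always find SOME writable place to
persist what it has learned before a restart.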
