On Thu, Apr 5, 2018 at 7:24 AM, Joel Pfaff <[email protected]> wrote:
> Hello,
>
> A lot of versioning problems arise when trying to share data through
> Kafka between multiple applications with different lifecycles and
> maintainers, since by default a single message in Kafka is just a blob.
> One way to solve that is to agree on a single serialization format that
> is friendly to record-per-record storage (like Avro) and, to avoid
> serializing the schema with every message, to reference an entry in the
> Avro Schema Registry instead (this flow is described here:
> https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321
> ).
> On top of the schema registry, specific client libs validate the
> message structure before it is injected into Kafka.
> So while Comcast mentions the usage of an Avro schema to describe its
> feeds, it does not directly mention the usage of Avro files (to
> describe the schema).

This is all good except for the assumption of a single schema for all
time. You can mutate schemas in Avro (or JSON) in a future-proof manner,
but it is important to recognize the simple truth that the data in a
stream will not necessarily be uniform (and is even unlikely to be
uniform). Sketches of both the registry flow and a compatible schema
change are at the end of this mail.

> .... But the usage of CSV/JSON is still problematic. I like the idea
> of having an optional way to describe the expected types somewhere
> (either in a central meta-store, or in a structured file next to the
> dataset).

Central meta-stores are a serious problem and are the single biggest
nightmare in trying to upgrade Hive users. Let's avoid that if possible.

Writing meta-data next to the file is also problematic if it needs to be
written by the process doing a query (the directory may not be
writable). Having a convention for redirecting the meta-data cache to a
parallel directory might solve the problem of non-writable local
locations (a sketch of such a convention is also below). In the worst
case, where Drill can't find any place to persist what it has learned
but wants to do a restart, there needs to be SOME place to cache
meta-data, or else restarts will get no further than the original failed
query.
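For concreteness, here is a minimal sketch of the registry flow
described above, using Confluent's KafkaAvroSerializer with generic
records. The broker address, registry URL, topic name, and schema are
placeholders, not anything from the Comcast post:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RegistryProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");   // assumed broker
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer");
    // Confluent's serializer registers the schema (if new) and writes
    // only a short schema id plus the Avro-encoded body to each
    // message, never the schema itself.
    props.put("value.serializer",
        "io.confluent.kafka.serializers.KafkaAvroSerializer");
    props.put("schema.registry.url", "http://localhost:8081");

    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "alice");

    try (KafkaProducer<String, GenericRecord> producer =
             new KafkaProducer<>(props)) {
      // Serialization fails here if the record does not match the
      // registered schema -- the pre-injection validation mentioned
      // in the quoted mail.
      producer.send(new ProducerRecord<>("users", "k1", user));
    }
  }
}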
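And a sketch of the kind of future-proof mutation I mean: a v2 reader
schema that adds a nullable field with a default can still decode v1
records through Avro's standard schema resolution. The record type and
field names here are illustrative only:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class EvolutionSketch {
  public static void main(String[] args) throws Exception {
    // v1: the schema the data was actually written with.
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    // v2: adds a nullable field WITH a default, so v2 readers can
    // still decode v1 records.
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null}]}");

    GenericRecord v1 = new GenericData.Record(writer);
    v1.put("name", "alice");

    // Encode with the old (writer) schema.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writer).write(v1, enc);
    enc.flush();

    // Decode with the new (reader) schema; schema resolution fills in
    // the missing field from its default.
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord v2 =
        new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    System.out.println(v2);  // {"name": "alice", "email": null}
  }
}

Note that this only works record by record, which is exactly why a
stream accumulated over time cannot be assumed to be uniform.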
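Finally, to make the parallel-directory idea concrete, a purely
hypothetical sketch (this is not existing Drill code, and the cache
root and file name are invented): cache beside the data when the
dataset directory is writable, otherwise mirror the data path under a
writable cache root.

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MetaCacheSketch {
  // Hypothetical convention: prefer a cache file next to the data;
  // fall back to nesting the data path under a parallel cache root.
  static Path metaCachePath(Path dataDir, Path parallelRoot) {
    if (Files.isWritable(dataDir)) {
      return dataDir.resolve(".meta_cache");
    }
    // Strip the leading "/" so the absolute data path can be nested
    // under the cache root.
    Path relative = dataDir.getRoot() != null
        ? dataDir.getRoot().relativize(dataDir)
        : dataDir;
    return parallelRoot.resolve(relative).resolve(".meta_cache");
  }

  public static void main(String[] args) {
    Path data = Paths.get("/warehouse/events/2018/04"); // read-only dataset
    Path root = Paths.get("/var/cache/drill-meta");     // writable cache root
    System.out.println(metaCachePath(data, root));
    // -> /var/cache/drill-meta/warehouse/events/2018/04/.meta_cache
  }
}

The point is not the particular layout, but that the redirection is a
pure convention, so a query can always find SOME writable place to
persist what it has learned before a restart.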
