Florian Leibert wrote:
I just figured out that I can just use the GenericDatumWriter instead of the DataFileWriter - the former doesn't store the schema in the file while the latter does.

Florian,

It sounds like you worked this one out for yourself. Different DatumWriter implementations encode equivalent data identically. They differ in how the data is represented in Java, not in how it is serialized.

The best practice with Avro is to store the schema with the serialized data, so that later, even if the schema in your application has changed, you can still read that data. Avro's data file stores the schema once per file. Avro RPC clients pass the MD5 hash of their schema with each request, and, when a server has not seen that version of the schema, the client must resubmit the request with the full schema. If, for example, you might store different versions of a record in a database, consider annotating each entry with the hash of its schema and separately maintaining a table that maps hashes to schemas, so that applications can always find the schema that was used to write the data when processing it.
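
As a rough sketch of the hash-to-schema table idea, here is one way it might look using only the JDK. The class SchemaTable and its methods register and lookup are hypothetical names made up for illustration and are not part of Avro. The sketch also assumes the schema is available as a JSON string and hashes that text as-is, so two schemas that differ only in whitespace would get different hashes; a real application would probably want to hash a normalized form of the schema instead.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not an Avro class): an in-memory stand-in for a
// database table that maps schema hashes to schema JSON text.
class SchemaTable {
    private final Map<String, String> schemasByHash = new HashMap<>();

    // Returns the MD5 hash of the schema text as 32 lowercase hex characters.
    // Assumption: the raw JSON text is hashed without any normalization.
    static String md5Hex(String schemaJson) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(schemaJson.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Records the schema if it is new and returns its hash. The caller would
    // store this hash alongside each serialized entry.
    String register(String schemaJson) {
        String hash = md5Hex(schemaJson);
        schemasByHash.putIfAbsent(hash, schemaJson);
        return hash;
    }

    // Finds the schema that was used to write an entry, given the entry's hash.
    Optional<String> lookup(String hash) {
        return Optional.ofNullable(schemasByHash.get(hash));
    }

    public static void main(String[] args) {
        SchemaTable table = new SchemaTable();
        String v1 = "{\"type\":\"record\",\"name\":\"User\",\"fields\":"
                + "[{\"name\":\"id\",\"type\":\"long\"}]}";
        String hash = table.register(v1);
        // A reader that later sees this hash on an entry can recover the
        // writer's schema, even if the application's own schema has changed.
        System.out.println(hash + " -> " + table.lookup(hash).orElse("(unknown schema)"));
    }
}
```

Each database entry would then carry the hash returned by register next to its serialized bytes, and a reader would call lookup with that hash to get the writer's schema before decoding.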

I hope this helps!

Cheers,

Doug
