Florian Leibert wrote:
> I just figured out that I can just use the GenericDatumWriter instead of
> the DataFileWriter - the former doesn't store the schema in the file
> while the latter does.

Florian,
It sounds like you worked this one out for yourself. Different
DatumWriter implementations encode equivalent data identically. They
differ in how the data is represented in Java, not in how it is serialized.
The best practice with Avro is to store the schema with serialized data,
so that later, even if the schema in your application has changed, you
can still read that data. Avro's data file stores the schema once per
file. Avro RPC clients pass the MD5 hash of their schema with each
request, and, when a server has not seen that version of the schema, the
client must resubmit the request with the full schema. If, for example,
you are storing different versions of a record in a database, then
you might consider annotating each entry with the hash of its schema and
separately maintaining a table mapping hashes to schemas, so that
applications can always find the schema that was used to write the data
when processing it.
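
A minimal sketch of that hash-to-schema table, in plain Java, might look like the following. None of this is Avro API: the class SchemaTable and its methods register and lookup are made up for illustration, the table is just an in-memory map standing in for a real database table, and hashing the raw schema JSON text is a simplifying assumption (two schemas that differ only in whitespace would get different hashes, so a real system would probably hash some canonical form instead).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Avro: maps schema hashes to schema text.
class SchemaTable {
    // Stand-in for a database table keyed by schema hash.
    private final Map<String, String> schemasByHash = new HashMap<>();

    // MD5 of the schema's JSON text, as a 32-character lowercase hex string.
    // Assumption: the raw text is hashed as-is, with no canonicalization.
    static String md5Hex(String schemaJson) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(schemaJson.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Record a schema (if not already known) and return the hash that
    // each stored entry should be annotated with.
    String register(String schemaJson) {
        String hash = md5Hex(schemaJson);
        schemasByHash.putIfAbsent(hash, schemaJson);
        return hash;
    }

    // Find the schema that was used to write an entry, or null if unknown.
    String lookup(String hash) {
        return schemasByHash.get(hash);
    }
}
```

A writer would call register with its current schema and store the returned hash next to each serialized record. A reader would call lookup with that hash to recover the writer's schema before decoding.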
I hope this helps!
Cheers,
Doug