Hi folks, a high-level question.

Say we have readers and writers in different projects. The writer project
dumps some data into a directory (or stores it in a common store, etc.),
and the reader project picks that data up and deserializes it using its
own reader schema plus the published writer schema (say we have a way to
ship writer schemas along with the dataset).
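For concreteness, here is a hypothetical writer schema pair (the `Event`
record and its fields are invented for illustration). Suppose the writer
project starts with:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "long"}
  ]
}
```

and later evolves to:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "source", "type": "string"}
  ]
}
```

As long as the writer schema ships with the data, an unchanged reader
schema still resolves against the new version: Avro simply skips the
writer-only field. (The reverse case, a reader schema adding a field,
needs a default value in the reader schema to resolve against old data.)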

In that kind of setup, where reader and writer schemas change at their
own rate in their own projects, and the data is shipped over the wire,
how would you compare using SpecificRecord vs. GenericRecord?

1. At what point would the reader project be forced to re-generate its
SpecificRecord classes from the schemas? Every time the writer schema
changes in any way? Every time a new field is added to the writer schema?
When schema evolution support is critical and multiple projects are
writing and reading data over the wire, is the static typing provided by
SpecificRecord going to become a bottleneck, or is that not a concern
regardless of Generic vs. Specific?
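To make the resolution mechanics concrete, here is a minimal sketch of
the generic-record path, assuming the Avro Java library is on the
classpath. The `Event` record, its fields, and the class name are all
invented for illustration; the point is that `GenericDatumReader` takes
both schemas and reconciles them at read time, with no code generation:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaResolutionSketch {

    // Hypothetical writer schema, as published alongside the dataset.
    static final Schema WRITER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"source\",\"type\":\"string\"}]}");

    // Hypothetical reader schema owned by the reader project: it drops
    // 'source' and adds 'tag' with a default so old data still resolves.
    static final Schema READER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"tag\",\"type\":\"string\",\"default\":\"none\"}]}");

    // Serialize with the writer schema, then deserialize with BOTH
    // schemas; Avro's schema resolution skips 'source' and fills 'tag'
    // from its default.
    static GenericRecord roundTrip() throws Exception {
        GenericRecord written = new GenericData.Record(WRITER);
        written.put("id", 42L);
        written.put("source", "writer-project");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(WRITER).write(written, enc);
        enc.flush();

        BinaryDecoder dec =
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        return new GenericDatumReader<GenericRecord>(WRITER, READER)
            .read(null, dec);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());
    }
}
```

With SpecificRecord the reading side is the same idea (a
SpecificDatumReader also accepts writer and reader schemas); the
difference is that the reader schema is baked into generated classes, so
picking up a reader-schema change means re-running codegen.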

2. In terms of efficiency and performance, have you noticed one
performing better than the other, either in serialized storage space or
in CPU utilization during serialization/deserialization?

We are interested in using SpecificRecord because it offers static
compile-time checks and ensures we are writing code against the correct
field names, datatypes, and so on, but we would like to hear the
community's thoughts on this.

Thanks!

-- 
Arvind Kalyan
