Excellent point about placing unions at too high a level; I had never thought about that. The best practice is definitely to add the new column with a default instead of versioning the entire record! I wonder if there is something we can do about that.

rb

On 04/08/2015 06:03 PM, Jacques Nadeau wrote:
I agree with what Ryan said.  In terms of effort of implementation, using
the existing object models is great.

However, as you try to tune your application,  you may find suboptimal
transformation patterns to the physical format.  This is always a possible
risk when working through an abstraction.  The example I've seen previously
is that people might create a union at a level higher than is necessary.
For example, imagine

old: {
   first:string
   last:string
}

new: {
   first:string
   last:string
   twitter_handle:string
}

People are inclined to union (old,new).  Last I checked, the default Avro
behavior in this situation would be to create five columns: old_first,
old_last, new_first, new_last, and new_twitter_handle (names are actually
nested as group0.x, group1.x or something similar).  Depending on what is
being done, this can
be suboptimal as a logical query of "select table.first from table" now has
to read two columns, manage two possibly different encoding schemes, etc.
This will be even more impactful as we implement things like indices in the
physical layer.
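
To make the column count concrete, here is a rough sketch in Python that walks two Avro-style schemas (written as plain dicts, no Avro library involved) and lists the primitive leaves, which stand in for physical columns. The `leaf_paths` helper, the record names, and the `memberN` path segments are all hypothetical and only for illustration; the names a real writer produces will differ, and the actual mapping depends on the converter in use.

```python
# Hypothetical schemas mirroring the "old" and "new" records in the example.
OLD = {"type": "record", "name": "Old", "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
]}
NEW = {"type": "record", "name": "New", "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "twitter_handle", "type": "string"},
]}

# Option A: union of the two record versions at the top level.
UNION_AT_TOP = {"type": "record", "name": "Row", "fields": [
    {"name": "person", "type": [OLD, NEW]},
]}

# Option B: one record; the new field is optional and has a default.
EVOLVED = {"type": "record", "name": "Person", "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "twitter_handle", "type": ["null", "string"], "default": None},
]}


def leaf_paths(schema, prefix=""):
    """Return dotted paths of primitive leaves (a stand-in for columns)."""
    if isinstance(schema, list):  # an Avro union is a JSON array
        branches = [s for s in schema if s != "null"]
        if len(branches) == 1:
            # ["null", X] is just an optional X: still a single column.
            return leaf_paths(branches[0], prefix)
        paths = []
        for i, branch in enumerate(branches):
            # "memberN" is an assumed naming scheme, not a real writer's.
            name = f"{prefix}.member{i}" if prefix else f"member{i}"
            paths += leaf_paths(branch, name)
        return paths
    if isinstance(schema, dict) and schema.get("type") == "record":
        paths = []
        for field in schema["fields"]:
            name = f"{prefix}.{field['name']}" if prefix else field["name"]
            paths += leaf_paths(field["type"], name)
        return paths
    return [prefix]  # primitive type


def columns_for(paths, field):
    """Columns a logical read of `field` has to touch."""
    return [p for p in paths if p == field or p.endswith("." + field)]


if __name__ == "__main__":
    for label, schema in (("union", UNION_AT_TOP), ("evolved", EVOLVED)):
        paths = leaf_paths(schema)
        print(label, len(paths), "columns;",
              "'first' lives in", columns_for(paths, "first"))
```

Under these assumptions the union layout yields five leaves and a read of "first" touches two of them, while the single record with an optional field yields three leaves and touches one.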

In short, if you are using an abstraction, be aware that the physical
layout may not be as optimal as it would have been if you had hand-tuned
the schema with your particular application in mind.  The flip-side is you
save time and aggravation in implementation.

Make sense?


--
Ryan Blue
Software Engineer
Cloudera, Inc.
