Honestly, though, don't bother unless you have actually identified the Avro
layer as a significant bottleneck. This is going to take some work, and while
it may be fun and educational, I worry about hearsay-based perf optimizations...

On Thursday, April 16, 2015, Karthikeyan Muthukumar <[email protected]>
wrote:

> Thanks Jacques and Ryan for the insights!
> I'm going to try something based on the RecordConsumer model.
> MK
>
> On Thu, Apr 9, 2015 at 12:57 PM, Ryan Blue <[email protected]> wrote:
>
> > Excellent point about unions at too high of a level, which I never
> > thought about. The best practice is definitely to add the new column
> > with a default instead of versioning the entire record! I wonder if
> > there is something we can do about that.
> >
> > rb
> >
> > On 04/08/2015 06:03 PM, Jacques Nadeau wrote:
> >
> >> I agree with what Ryan said.  In terms of effort of implementation,
> >> using the existing object models is great.
> >>
> >> However, as you try to tune your application, you may find suboptimal
> >> transformation patterns to the physical format.  This is always a
> >> possible risk when working through an abstraction.  The example I've
> >> seen previously is that people might create a union at a level higher
> >> than is necessary. For example, imagine
> >>
> >> old: {
> >>    first:string
> >>    last:string
> >> }
> >>
> >> new: {
> >>    first:string
> >>    last:string
> >>    twitter_handle:string
> >> }
> >>
> >> People are inclined to union (old,new).  Last I checked, the default
> >> Avro behavior in this situation would be to create five columns:
> >> old_first, old_last, new_first, new_last, and new_twitter_handle
> >> (names are actually nested as group0.x, group1.x or something
> >> similar).  Depending on what is being done, this can be suboptimal, as
> >> a logical query of "select table.first from table" now has to read two
> >> columns, manage two possibly different encoding schemes, etc. This
> >> will be even more impactful as we implement things like indices in the
> >> physical layer.
> >>
> >> In short, if you are using an abstraction, be aware that the physical
> >> layout may not be as optimal as it would have been if you had
> >> hand-tuned the schema with your particular application in mind.  The
> >> flip-side is you save time and aggravation in implementation.
> >>
> >> Make sense?
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
> >
>
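
To make the union(old, new) example quoted above concrete, here is a rough
sketch in plain Python (no Avro or Parquet library involved). It models the
two layouts as Avro-style JSON schemas and counts the leaf columns each one
would produce. The helper leaf_columns, the record names, and the
"group0"/"group1" prefixes are assumptions for illustration only; the thread
itself says the real nested names are "group0.x, group1.x or something
similar", and the actual converter may name and shape things differently.

```python
# Sketch only: a simplified model of how leaf columns multiply when whole
# records are unioned, versus adding one optional field with a default.

OLD = {"type": "record", "name": "Old", "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
]}

NEW = {"type": "record", "name": "New", "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "twitter_handle", "type": "string"},
]}

# Alternative layout: one record, with the new field optional and defaulted.
EVOLVED = {"type": "record", "name": "Person", "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "twitter_handle", "type": ["null", "string"], "default": None},
]}


def leaf_columns(schema, prefix=""):
    """Return dotted leaf-column paths for a schema (hypothetical helper)."""
    if isinstance(schema, list):
        # A union: ignore "null"; a single remaining branch is just optional.
        branches = [b for b in schema if b != "null"]
        if len(branches) == 1:
            return leaf_columns(branches[0], prefix)
        cols = []
        for i, branch in enumerate(branches):
            # Assumed naming: one nested group per union branch.
            cols += leaf_columns(branch, f"{prefix}group{i}.")
        return cols
    if isinstance(schema, dict) and schema.get("type") == "record":
        cols = []
        for field in schema["fields"]:
            cols += leaf_columns(field["type"], f"{prefix}{field['name']}.")
        return cols
    # Primitive type: the accumulated path is the column name.
    return [prefix.rstrip(".")]


def columns_for(field_name, columns):
    """Columns a logical read of field_name would have to touch."""
    return [c for c in columns if c.split(".")[-1] == field_name]


union_cols = leaf_columns([OLD, NEW])
evolved_cols = leaf_columns(EVOLVED)

print(union_cols)
print(evolved_cols)
print(columns_for("first", union_cols))
print(columns_for("first", evolved_cols))
```

Under this model the unioned layout yields five leaf columns and a read of
"first" touches two of them, while the single evolved record yields three
columns and a read of "first" touches one.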
