Honestly, though, don't bother unless you have actually identified the Avro layer as a significant bottleneck. This is going to take some work, and while it may be fun and educational, I worry about hearsay-based perf optimizations...
On Thursday, April 16, 2015, Karthikeyan Muthukumar <[email protected]> wrote:

> Thanks Jacques and Ryan for the insights!
> I'm going to try something based on the RecordConsumer model.
> MK
>
> On Thu, Apr 9, 2015 at 12:57 PM, Ryan Blue <[email protected]> wrote:
>
> > Excellent point about unions at too high of a level, which I never thought
> > about. The best practice is definitely to add the new column with a default
> > instead of versioning the entire record! I wonder if there is something we
> > can do about that.
> >
> > rb
> >
> > On 04/08/2015 06:03 PM, Jacques Nadeau wrote:
> >
> >> I agree with what Ryan said. In terms of effort of implementation, using
> >> the existing object models is great.
> >>
> >> However, as you try to tune your application, you may find suboptimal
> >> transformation patterns to the physical format. This is always a possible
> >> risk when working through an abstraction. The example I've seen previously
> >> is that people might create a union at a level higher than is necessary.
> >> For example, imagine
> >>
> >> old: {
> >>   first:string
> >>   last:string
> >> }
> >>
> >> new: {
> >>   first:string
> >>   last:string
> >>   twitter_handle:string
> >> }
> >>
> >> People are inclined to union (old,new). Last I checked, the default Avro
> >> behavior in this situation would be to create five columns: old_first,
> >> old_last, new_first, new_last, and new_twitter_handle (names are actually
> >> nested as group0.x, group1.x or something similar). Depending on what is
> >> being done, this can be suboptimal, as a logical query of
> >> "select table.first from table" now has to read two columns, manage two
> >> possibly different encoding schemes, etc. This will be even more impactful
> >> as we implement things like indices in the physical layer.
> >>
> >> In short, if you are using an abstraction, be aware that the physical
> >> layout may not be as optimal as it would have been if you had hand-tuned
> >> the schema with your particular application in mind. The flip-side is you
> >> save time and aggravation in implementation.
> >>
> >> Make sense?
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Cloudera, Inc.
> >
>
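
To make the union example from the quoted message concrete, here is a minimal Python sketch. It is a toy model only: it does not call Avro or Parquet, and the `leaf_columns` helper, the `EVOLVED` schema, and the `group0`/`group1` prefixes are assumptions for illustration (the thread itself only says the real names are "group0.x, group1.x or something similar"). It counts the leaf columns produced by union(old, new) versus a single record that adds the new field with a default.

```python
# Toy model only: this does NOT call parquet-avro. It mimics the layout
# described in the thread, where each record branch of a union becomes
# its own group of columns.

OLD = {"type": "record", "name": "old", "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
]}

NEW = {"type": "record", "name": "new", "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "twitter_handle", "type": "string"},
]}

# Hypothetical evolved record: the new field is optional with a default,
# instead of versioning the entire record.
EVOLVED = {"type": "record", "name": "person", "fields": [
    {"name": "first", "type": "string"},
    {"name": "last", "type": "string"},
    {"name": "twitter_handle", "type": ["null", "string"], "default": None},
]}


def leaf_columns(schema, prefix=""):
    """Return leaf column paths for an Avro-style schema (toy model)."""
    if isinstance(schema, list):  # a union
        branches = [b for b in schema if b != "null"]
        if len(branches) == 1:
            # ["null", X] is just an optional X: no extra nesting.
            return leaf_columns(branches[0], prefix)
        cols = []
        for i, branch in enumerate(branches):
            # Assumed naming (group0, group1, ...), one group per branch.
            cols += leaf_columns(branch, f"{prefix}group{i}.")
        return cols
    if isinstance(schema, dict) and schema.get("type") == "record":
        cols = []
        for field in schema["fields"]:
            cols += leaf_columns(field["type"], f"{prefix}{field['name']}.")
        return cols
    # Primitive type: the path built so far is one column.
    return [prefix.rstrip(".")]


def columns_for(cols, field_name):
    """Columns a logical 'select <field_name>' would have to read."""
    return [c for c in cols if c.split(".")[-1] == field_name]


union_cols = leaf_columns([OLD, NEW])
evolved_cols = leaf_columns(EVOLVED)

print(union_cols)
print(evolved_cols)
print(columns_for(union_cols, "first"))
print(columns_for(evolved_cols, "first"))
```

In this model the union layout yields five columns and a query on `first` touches two of them, while the evolved record yields three columns and the same query touches one. Whether the real converter behaves exactly this way should be checked against the version in use.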
