Thanks Jacques and Ryan for the insights! I'm going to try something based on the RecordConsumer model.

MK

On Thu, Apr 9, 2015 at 12:57 PM, Ryan Blue <[email protected]> wrote:

> Excellent point about unions at too high a level, which I never thought
> about. The best practice is definitely to add the new column with a default
> instead of versioning the entire record! I wonder if there is something we
> can do about that.
>
> rb
>
> On 04/08/2015 06:03 PM, Jacques Nadeau wrote:
>
>> I agree with what Ryan said. In terms of implementation effort, using
>> the existing object models is great.
>>
>> However, as you try to tune your application, you may find suboptimal
>> transformation patterns to the physical format. This is always a possible
>> risk when working through an abstraction. The example I've seen previously
>> is that people might create a union at a level higher than is necessary.
>> For example, imagine
>>
>> old: {
>>   first:string
>>   last:string
>> }
>>
>> new: {
>>   first:string
>>   last:string
>>   twitter_handle:string
>> }
>>
>> People are inclined to union (old,new). Last I checked, the default Avro
>> behavior in this situation would be to create five columns: old_first,
>> old_last, new_first, new_last, and new_twitter_handle (names are actually
>> nested as group0.x, group1.x or something similar). Depending on what is
>> being done, this can be suboptimal, as a logical query of "select
>> table.first from table" now has to read two columns, manage two possibly
>> different encoding schemes, etc. This will be even more impactful as we
>> implement things like indices in the physical layer.
>>
>> In short, if you are using an abstraction, be aware that the physical
>> layout may not be as optimal as it would have been if you had hand-tuned
>> the schema with your particular application in mind. The flip side is that
>> you save time and aggravation in implementation.
>>
>> Make sense?
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
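
As a rough illustration of the union example above, here is a minimal sketch in plain Python. It is a toy model only: it does not call Avro or Parquet, the function names are made up for this example, and the group0/group1 column names are an assumption based on the "group0.x, group1.x or something similar" remark. It compares the leaf columns produced by union (old,new) with those produced by adding twitter_handle to a single record.

```python
# Toy model only: no Avro or Parquet calls. It mimics how a union of
# whole records multiplies the leaf columns in the physical layout.

OLD = {"first": "string", "last": "string"}
NEW = {"first": "string", "last": "string", "twitter_handle": "string"}


def union_columns(*branches):
    """Leaf columns when each union branch becomes its own group.

    The group<i> prefix is a hypothetical naming scheme, not the
    confirmed output of any real schema converter.
    """
    return [f"group{i}.{name}" for i, rec in enumerate(branches) for name in rec]


def evolved_columns(record):
    """Leaf columns when the new field is added to one record instead."""
    return list(record)


def columns_to_read(columns, field):
    """Columns a query on one logical field would have to scan."""
    return [c for c in columns if c.split(".")[-1] == field]


if __name__ == "__main__":
    union = union_columns(OLD, NEW)
    evolved = evolved_columns(NEW)
    print("union layout:  ", union)
    print("evolved layout:", evolved)
    print("read for 'first' (union):  ", columns_to_read(union, "first"))
    print("read for 'first' (evolved):", columns_to_read(evolved, "first"))
```

In this model the union layout has five leaf columns and a query on "first" touches two of them, while the single evolved record has three columns and the same query touches one.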
