The overhead of checking the union is not that high. Still, for a variety of use cases it would be useful to be able to map different Avro schemas to different source paths. I am not sure to what extent the current Avro mapreduce API allows that.
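
To make the schema-per-path idea concrete, here is a minimal sketch in plain Java. It is not an existing Avro API: the `SchemaByPath` class, its methods, and the `/in/users` and `/in/orders` directories are all hypothetical, and schemas are carried as plain JSON strings so the example has no dependencies. In a real job the path would presumably come from the input split (for example `FileSplit.getPath()` in Hadoop), and the string would be parsed into an Avro schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not part of Avro): picks a schema by input path. */
final class SchemaByPath {
    // Registered input directory -> schema JSON text.
    private final Map<String, String> schemaByDir = new LinkedHashMap<>();

    /** Registers the schema to use for every file under the given directory. */
    SchemaByPath register(String dir, String schemaJson) {
        schemaByDir.put(stripTrailingSlash(dir), schemaJson);
        return this;
    }

    /** Returns the schema of the longest registered directory containing path. */
    Optional<String> resolve(String path) {
        String best = null;
        for (String dir : schemaByDir.keySet()) {
            // Match on a directory boundary so /in/users does not claim /in/users2.
            boolean inside = path.equals(dir) || path.startsWith(dir + "/");
            if (inside && (best == null || dir.length() > best.length())) {
                best = dir;
            }
        }
        return Optional.ofNullable(best).map(schemaByDir::get);
    }

    private static String stripTrailingSlash(String dir) {
        boolean trailing = dir.length() > 1 && dir.endsWith("/");
        return trailing ? dir.substring(0, dir.length() - 1) : dir;
    }

    public static void main(String[] args) {
        // Illustrative schemas and paths only.
        SchemaByPath schemas = new SchemaByPath()
            .register("/in/users", "{\"type\":\"record\",\"name\":\"User\",\"fields\":[]}")
            .register("/in/orders", "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[]}");

        System.out.println(schemas.resolve("/in/users/part-00000.avro"));
        System.out.println(schemas.resolve("/in/other/part-00000.avro"));
    }
}
```

With a lookup like this, each mapper could pick the single schema for its own input instead of resolving every record against the union, which is roughly what MultipleInputs gives a non-Avro join.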
There are some folks working on making improved Avro mapreduce/mapred APIs with the intention of eventually contributing it back to Avro. You might get some good ideas from there:
https://issues.apache.org/jira/browse/AVRO-593
https://github.com/wibidata/odiago-avro

On 12/13/11 8:46 AM, "Andrew Kenworthy" <adwkenwor...@yahoo.com> wrote:

> I'm currently using a UNION-schema to map two different types of data (read
> from two different input paths) in my reducer to a common record. This works
> fine, but - if I have understood the mechanism correctly - it would mean that
> Avro is having to check each and every record against my UNION schema. With a
> "normal" reduce-side join, I could use MultipleInputs to specify a mapper for
> each input, thus letting them run independently (since each mapper knows its
> input) with presumably less overhead.
>
> Is it possible with Avro to avoid the overhead of checking each input row
> against the union schema?
>
> Thanks,
>
> Andrew
>
>> From: Scott Carey <scottca...@apache.org>
>> To: "user@avro.apache.org" <user@avro.apache.org>; Andrew Kenworthy
>> <adwkenwor...@yahoo.com>
>> Sent: Wednesday, December 7, 2011 7:40 PM
>> Subject: Re: Reduce-side joins in Avro M/R
>>
>> This should be conceptually the same as a normal map-reduce join of the same
>> type. Avro handles the serialization, but not the map-reduce algorithm or
>> strategy.
>>
>> On 12/6/11 8:43 AM, "Andrew Kenworthy" <adwkenwor...@yahoo.com> wrote:
>>
>>> Hi,
>>>
>>> I'd like to use reduce-side joins in an avro M/R job, and am not sure how to
>>> do it: are there any best-practice tips or outlines of what one would have
>>> to implement in order to make this possible?
>>>
>>> Thanks,
>>>
>>> Andrew Kenworthy