Isn't also of some help using Cascading (http://www.cascading.org/) ?
*Fabio Pitzolu* Consultant - BI & Infrastructure Mob. +39 3356033776 Telefono 02 87157239 Fax. 02 93664786 *Gruppo Consulenza Innovazione - http://www.gr-ci.com* 2012/10/8 Bertrand Dechoux <[email protected]> > Have you looked at graph processing for Hadoop? Like Hama ( > http://hama.apache.org/) or Giraph (http://incubator.apache.org/giraph/). > I can't say for sure it would help you but it seems to be in the same > problem domain. > > With regard to the chaining reducer issue this is indeed a general > implementation decision of Hadoop 1. > From a purely functional point of view, regardless of performance, I guess > it could be shown that a map/reduce/map can be done with a reduce only and > that a sequence of map can be done with a single map. Of course, with > Hadoop the picture is bit more complex due to the sort phase. > > map -> sort -> reduce : operations in map/reduce can not generally be > transferred due to the sort 'blocking' them when they are related to the > sort key > reduce -> map : all operations can be performed in the reduce > So > map -> sort -> reduce -> map -> sort -> reduce -> map -> sort -> reduce > can generally be implemented as > map -> sort -> reduce -> sort -> reduce -> sort -> reduce > if you are willing to let the possibility of having different scaling > options for maps and reduces > > And that's what you are asking. But with hadoop 1 the map phase is not an > option (even though you could use the identify but that's not a wise option > with regards to performance like you said). The picture might be changing > with Hadoop 2/YARN. I can't provide the details but it may be worth it to > look at it. > > Regards > > Bertrand > > > On Fri, Oct 5, 2012 at 8:02 PM, Jim Twensky <[email protected]> wrote: > >> Hi Harsh, >> >> The hidden map operation which is applied to the reduced partition at >> one stage can generate keys that are outside of the range covered by >> that particular reducer. I still need to have the many-to-many >> communication from reduce step k to reduce step k+1. Otherwise, I >> think the ChainReducer would do the job and apply multiple maps to >> each isolated partition produced by the reducer. >> >> Jim >> >> On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[email protected]> wrote: >> > Would it then be right to assume that the keys produced by the reduced >> > partition at one stage would be isolated to its partition alone and >> > not occur in any of the other partition outputs? I'm guessing not, >> > based on the nature of your data? >> > >> > I'm trying to understand why shuffling is good to be avoided here, and >> > if it can be in some ways, given the data. As I see it, you need >> > re-sort based on the new key per partition, but not the shuffle? Or am >> > I wrong? >> > >> > On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[email protected]> >> wrote: >> >> Hi Harsh, >> >> >> >> Yes, there is actually a "hidden" map stage, that generates new >> >> <key,value> pairs based on the last reduce output but I can create >> >> those records during the reduce step instead and get rid of the >> >> intermediate map computation completely. The idea is to apply the map >> >> function to each output of the reduce inside the reduce class and emit >> >> the result as the output of the reducer. >> >> >> >> Jim >> >> >> >> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <[email protected]> wrote: >> >>> Hey Jim, >> >>> >> >>> Are you looking to re-sort or re-partition your data by a different >> >>> key or key combo after each output from reduce? >> >>> >> >>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <[email protected]> >> wrote: >> >>>> Hi, >> >>>> >> >>>> I have a complex Hadoop job that iterates over large graph data >> >>>> multiple times until some convergence condition is met. I know that >> >>>> the map output goes to the local disk of each particular mapper >> first, >> >>>> and then fetched by the reducers before the reduce tasks start. I can >> >>>> see that this is an overhead, and it theory we can ship the data >> >>>> directly from mappers to reducers, without serializing on the local >> >>>> disk first. I understand that this step is necessary for fault >> >>>> tolerance and it is an essential building block of MapReduce. >> >>>> >> >>>> In my application, the map process consists of identity mappers which >> >>>> read the input from HDFS and ship it to reducers. Essentially, what I >> >>>> am doing is applying chains of reduce jobs until the algorithm >> >>>> converges. My question is, can I bypass the serialization of the >> local >> >>>> data and ship it from mappers to reducers immediately (as soon as I >> >>>> call context.write() in my mapper class)? If not, are there any other >> >>>> MR platforms that can do this? I've been searching around and >> couldn't >> >>>> see anything similar to what I need. Hadoop On Line is a prototype >> and >> >>>> has some similar functionality but it hasn't been updated for a >> while. >> >>>> >> >>>> Note: I know about ChainMapper and ChainReducer classes but I don't >> >>>> want to chain multiple mappers in the same local node. I want to >> chain >> >>>> multiple reduce functions globally so the data flow looks like: Map >> -> >> >>>> Reduce -> Reduce -> Reduce, which means each reduce operation is >> >>>> followed by a shuffle and sort essentially bypassing the map >> >>>> operation. >> >>> >> >>> >> >>> >> >>> -- >> >>> Harsh J >> > >> > >> > >> > -- >> > Harsh J >> > > > > -- > Bertrand Dechoux >
