Reduce

Fabio Pitzolu Mon, 08 Oct 2012 03:45:37 -0700

Isn't also of some help using Cascading (http://www.cascading.org/) ?


*Fabio Pitzolu*
Consultant - BI & Infrastructure

Mob. +39 3356033776
Telefono 02 87157239
Fax. 02 93664786

*Gruppo Consulenza Innovazione - http://www.gr-ci.com*



2012/10/8 Bertrand Dechoux <[email protected]>

> Have you looked at graph processing for Hadoop? Like Hama (
> http://hama.apache.org/) or Giraph (http://incubator.apache.org/giraph/).
> I can't say for sure it would help you but it seems to be in the same
> problem domain.
>
> With regard to the chaining reducer issue this is indeed a general
> implementation decision of Hadoop 1.
> From a purely functional point of view, regardless of performance, I guess
> it could be shown that a map/reduce/map can be done with a reduce only and
> that a sequence of map can be done with a single map. Of course, with
> Hadoop the picture is bit more complex due to the sort phase.
>
> map -> sort -> reduce : operations in map/reduce can not generally be
> transferred due to the sort 'blocking' them when they are related to the
> sort key
> reduce -> map : all operations can be performed in the reduce
> So
> map -> sort -> reduce -> map -> sort -> reduce -> map -> sort -> reduce
> can generally be implemented as
> map -> sort -> reduce -> sort -> reduce -> sort -> reduce
> if you are willing to let the possibility of having different scaling
> options for maps and reduces
>
> And that's what you are asking. But with hadoop 1 the map phase is not an
> option (even though you could use the identify but that's not a wise option
> with regards to performance like you said). The picture might be changing
> with Hadoop 2/YARN. I can't provide the details but it may be worth it to
> look at it.
>
> Regards
>
> Bertrand
>
>
> On Fri, Oct 5, 2012 at 8:02 PM, Jim Twensky <[email protected]> wrote:
>
>> Hi Harsh,
>>
>> The hidden map operation which is applied to the reduced partition at
>> one stage can generate keys that are outside of the range covered by
>> that particular reducer. I still need to have the many-to-many
>> communication from reduce step k to reduce step k+1. Otherwise, I
>> think the ChainReducer would do the job and apply multiple maps to
>> each isolated partition produced by the reducer.
>>
>> Jim
>>
>> On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[email protected]> wrote:
>> > Would it then be right to assume that the keys produced by the reduced
>> > partition at one stage would be isolated to its partition alone and
>> > not occur in any of the other partition outputs? I'm guessing not,
>> > based on the nature of your data?
>> >
>> > I'm trying to understand why shuffling is good to be avoided here, and
>> > if it can be in some ways, given the data. As I see it, you need
>> > re-sort based on the new key per partition, but not the shuffle? Or am
>> > I wrong?
>> >
>> > On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[email protected]>
>> wrote:
>> >> Hi Harsh,
>> >>
>> >> Yes, there is actually a "hidden" map stage, that generates new
>> >> <key,value> pairs based on the last reduce output but I can create
>> >> those records during the reduce step instead and get rid of the
>> >> intermediate map computation completely. The idea is to apply the map
>> >> function to each output of the reduce inside the reduce class and emit
>> >> the result as the output of the reducer.
>> >>
>> >> Jim
>> >>
>> >> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <[email protected]> wrote:
>> >>> Hey Jim,
>> >>>
>> >>> Are you looking to re-sort or re-partition your data by a different
>> >>> key or key combo after each output from reduce?
>> >>>
>> >>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <[email protected]>
>> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I have a complex Hadoop job that iterates over  large graph data
>> >>>> multiple times until some convergence condition is met. I know that
>> >>>> the map output goes to the local disk of each particular mapper
>> first,
>> >>>> and then fetched by the reducers before the reduce tasks start. I can
>> >>>> see that this is an overhead, and it theory we can ship the data
>> >>>> directly from mappers to reducers, without serializing on the local
>> >>>> disk first. I understand that this step is necessary for fault
>> >>>> tolerance and it is an essential building block of MapReduce.
>> >>>>
>> >>>> In my application, the map process consists of identity mappers which
>> >>>> read the input from HDFS and ship it to reducers. Essentially, what I
>> >>>> am doing is applying chains of reduce jobs until the algorithm
>> >>>> converges. My question is, can I bypass the serialization of the
>> local
>> >>>> data and ship it from mappers to reducers immediately (as soon as I
>> >>>> call context.write() in my mapper class)? If not, are there any other
>> >>>> MR platforms that can do this? I've been searching around and
>> couldn't
>> >>>> see anything similar to what I need. Hadoop On Line is a prototype
>> and
>> >>>> has some similar functionality but it hasn't been updated for a
>> while.
>> >>>>
>> >>>> Note: I know about ChainMapper and ChainReducer classes but I don't
>> >>>> want to chain multiple mappers in the same local node. I want to
>> chain
>> >>>> multiple reduce functions globally so the data flow looks like: Map
>> ->
>> >>>> Reduce -> Reduce -> Reduce, which means each reduce operation is
>> >>>> followed by a shuffle and sort essentially bypassing the map
>> >>>> operation.
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Harsh J
>> >
>> >
>> >
>> > --
>> > Harsh J
>>
>
>
>
> --
> Bertrand Dechoux
>

Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce

Reply via email to