I guess if you emitted the key as task-id + key you would have more
overhead, but if the data "replayed" the reducer could detect dups.
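A minimal sketch of that idea in plain Java (not Hadoop's actual API; all names here are illustrative): each map task tags its output with its task id, so if a failed task re-runs and its output "replays" to the reducer, the reducer can recognize the duplicate (task-id, key) group and drop it:

```java
import java.util.*;

public class ReplayDedup {
    // One map task's output for one key, tagged with the emitting task's id.
    record TaggedGroup(String taskId, String key, List<Integer> values) {}

    // Merge groups by key, skipping any group whose (taskId, key) composite
    // has already been seen -- i.e. a replayed duplicate after task failure.
    static Map<String, List<Integer>> reduce(List<TaggedGroup> groups) {
        Set<String> seen = new HashSet<>();
        Map<String, List<Integer>> merged = new HashMap<>();
        for (TaggedGroup g : groups) {
            String composite = g.taskId() + "+" + g.key(); // the "task-id + key" key
            if (!seen.add(composite)) continue;            // duplicate: already applied
            merged.computeIfAbsent(g.key(), k -> new ArrayList<>())
                  .addAll(g.values());
        }
        return merged;
    }

    public static void main(String[] args) {
        List<TaggedGroup> incoming = List.of(
            new TaggedGroup("m-001", "apple", List.of(1, 2)),
            new TaggedGroup("m-002", "apple", List.of(3)),
            new TaggedGroup("m-001", "apple", List.of(1, 2)) // replayed after a failure
        );
        System.out.println(reduce(incoming)); // the replay of m-001 is ignored
    }
}
```

The overhead is the extra task-id bytes on every record plus the seen-set on the reducer side, which is the trade-off mentioned above.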

Ed

On 3/4/10, Scott Carey <sc...@richrelevance.com> wrote:
> Interesting article.  It claims to have the same fault tolerance but I don't
> see any explanation of how that can be.
>
> If a single mapper fails part-way through a task when it has transmitted
> partial results to a reducer, the whole job is corrupted.  With the current
> barrier between map and reduce, a job can recover from partially completed
> tasks and speculatively execute.
>
> I would imagine that small low latency tasks can benefit greatly from such
> an approach, but larger tasks need the barrier or will not be very fault
> tolerant.  However, there are still a lot of optimizations to do in Hadoop
> for low-latency tasks while maintaining the barrier.
>
>
> On Mar 4, 2010, at 2:18 PM, Jeff Hammerbacher wrote:
>
>> Also see "Breaking the MapReduce Stage Barrier" from UIUC:
>> http://www.ideals.illinois.edu/bitstream/handle/2142/14819/breaking.pdf
>>
>> On Thu, Mar 4, 2010 at 11:41 AM, Ashutosh Chauhan <
>> ashutosh.chau...@gmail.com> wrote:
>>
>>> Bharath,
>>>
>>> This idea has been kicking around in academia; it hasn't made it into Apache yet:
>>> https://issues.apache.org/jira/browse/MAPREDUCE-1211
>>>
>>> You can get a working prototype from:
>>> http://code.google.com/p/hop/
>>>
>>> Ashutosh
>>>
>>> On Thu, Mar 4, 2010 at 09:06, E. Sammer <e...@lifeless.net> wrote:
>>>> On 3/4/10 12:00 PM, bharath v wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Can we pipeline the map output directly into the reduce phase without
>>>>> storing it in the local filesystem (avoiding disk I/O)?
>>>>> If yes, how can we do that?
>>>>
>>>> Bharath:
>>>>
>>>> No, there's no way to avoid going to disk after the mappers.
>>>>
>>>> --
>>>> Eric Sammer
>>>> e...@lifeless.net
>>>> http://esammer.blogspot.com
>>>>
>>>
>
>
