I guess if you emitted the key as task-id + key you would have more overhead, but if the data were "replayed" the reducer could detect duplicates.
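To make that concrete, here is a minimal sketch (plain Python, not the actual Hadoop API) of the idea: the mapper emits a composite (task-id, key) and the reducer drops any record it has already seen for that pair, so replayed output from a re-executed map attempt is not double-counted. The function and record layout here are hypothetical illustrations, not code from the thread.

```python
from collections import defaultdict

def reduce_with_dedup(shuffled):
    """shuffled: iterable of ((task_id, key), value) tuples, possibly
    containing replayed records from a re-executed map attempt.
    Sums values per key, counting each (task_id, key) record only once."""
    seen = set()
    totals = defaultdict(int)
    for (task_id, key), value in shuffled:
        if (task_id, key) in seen:
            continue  # replayed record from a retried map attempt: drop it
        seen.add((task_id, key))
        totals[key] += value
    return dict(totals)

# Map attempt "m1" fails mid-stream and is retried, re-sending ("m1", "a"):
records = [(("m1", "a"), 2), (("m2", "a"), 3), (("m1", "a"), 2), (("m1", "b"), 1)]
print(reduce_with_dedup(records))  # {'a': 5, 'b': 1}
```

The overhead mentioned above is visible here: every key carries the task-id, and the reducer keeps a set of every (task-id, key) pair it has processed.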
Ed

On 3/4/10, Scott Carey <sc...@richrelevance.com> wrote:
> Interesting article. It claims to have the same fault tolerance, but I don't
> see any explanation of how that can be.
>
> If a single mapper fails part-way through a task when it has transmitted
> partial results to a reducer, the whole job is corrupted. With the current
> barrier between map and reduce, a job can recover from partially completed
> tasks and speculatively execute.
>
> I would imagine that small, low-latency tasks can benefit greatly from such
> an approach, but larger tasks need the barrier or will not be very fault
> tolerant. However, there are still a lot of optimizations to do in Hadoop
> for low-latency tasks while maintaining the barrier.
>
>
> On Mar 4, 2010, at 2:18 PM, Jeff Hammerbacher wrote:
>
>> Also see "Breaking the MapReduce Stage Barrier" from UIUC:
>> http://www.ideals.illinois.edu/bitstream/handle/2142/14819/breaking.pdf
>>
>> On Thu, Mar 4, 2010 at 11:41 AM, Ashutosh Chauhan <
>> ashutosh.chau...@gmail.com> wrote:
>>
>>> Bharath,
>>>
>>> This idea is kicking around in academia; it hasn't made it into Apache yet:
>>> https://issues.apache.org/jira/browse/MAPREDUCE-1211
>>>
>>> You can get a working prototype from:
>>> http://code.google.com/p/hop/
>>>
>>> Ashutosh
>>>
>>> On Thu, Mar 4, 2010 at 09:06, E. Sammer <e...@lifeless.net> wrote:
>>>> On 3/4/10 12:00 PM, bharath v wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Can we pipeline the map output directly into the reduce phase without
>>>>> storing it in the local filesystem (avoiding disk I/O)?
>>>>> If yes, how do we do that?
>>>>
>>>> Bharath:
>>>>
>>>> No, there's no way to avoid going to disk after the mappers.
>>>>
>>>> --
>>>> Eric Sammer
>>>> e...@lifeless.net
>>>> http://esammer.blogspot.com