Historically, many applications/frameworks wanted to take advantage of just the resource-management and failure-handling capabilities of Hadoop (via JobTracker/TaskTracker), but were forced to use MapReduce even though they didn't have to. Obvious examples are graph processing (Giraph), BSP (Hama), Storm/S4, and even a simple tool like DistCp.
There are issues even with map-only jobs:

- You have to fake key-value processing, periodic pings, and key-value outputs.
- You are limited to the map slot capacity in the cluster.
- The number of tasks is static, so you cannot grow and shrink your job.
- You are forced to sort data all the time (even though this has changed recently).
- You are tied to faking things like OutputCommit even if you don't need to.

That's just for starters. I can definitely think harder and list more ;)

YARN lets you move ahead without those limitations.

HTH
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On May 29, 2013, at 7:34 AM, Rahul Bhattacharjee wrote:

> Hi all,
>
> I was going through the motivation behind YARN. Splitting the responsibility
> of the JT is the major concern. Ultimately the base (YARN) was built in a generic
> way for building other generic distributed applications too.
>
> I am not able to think of any other parallel-processing use case that would
> be useful to build on top of YARN. I thought of a lot of use cases that would
> be beneficial when run in parallel, but again, we can do those using map-only
> jobs in MR.
>
> Can someone tell me a scenario where an application can utilize YARN
> features, or can be built on top of YARN, and at the same time cannot be
> done efficiently using MRv2 jobs?
>
> thanks,
> Rahul