Re: What else can be built on top of YARN.
John,

On Jun 1, 2013, at 7:02 AM, John Lilley wrote:

· Algorithms that are not well-suited to the MR model, such as transitive closure. They are more naturally expressed as MPI-like algorithms.

You might be interested in MPICH2 on YARN: https://github.com/clarkyzl/mpich2-yarn

Disclaimer: I haven't used it myself.

Arun
Re: What else can be built on top of YARN.
Thanks a lot for the responses. I now have a better understanding.

To answer Jay's question: I think ZK can be used as a coordination service for a distributed program (which you have to build yourself on top of the granular APIs it exposes), but it doesn't have features such as resource management across cluster nodes (including allocating resources based on requests), which YARN has.

Rahul

On Thu, May 30, 2013 at 5:59 PM, Jay Vyas jayunit...@gmail.com wrote:

What is the separation of concerns between YARN and ZooKeeper? That is, where does YARN leave off and where does ZooKeeper begin? Or is there some overlap?
RE: What else can be built on top of YARN.
Rahul,

This is a very good question, and one we are grappling with currently in our application port. I think there are a lot of legacy data-processing applications like ours that would benefit from a port to Hadoop. However, because we have a great load of C++, it is not necessarily a good fit for MR. There seem to be two main choices:

· Run under Hadoop “streams”
· Run as a custom ApplicationMaster

One of the selling points of our application is its performance and single-code efficiency. I have concerns about streams:

· We will lose performance, because of the extra layers of translation and I/O and because streams data is uncompressed
· The streams model is limited to single-in, single-out
· We have a very large number and size of files to make available locally, and it is unclear whether the -files option will recursively copy and cache all of it

In contrast, porting our application as a YARN ApplicationMaster appears to offer several benefits (which come at the expense of extra complexity):

· Negotiation for container resources and scheduling. Some of our operations are very heavy (load time and memory use), so they need larger containers and will benefit from larger data splits.
· Direct access to HDFS via JNI without translation layers.
· Algorithms that are not well-suited to the MR model, such as transitive closure. They are more naturally expressed as MPI-like algorithms.
· If warranted, the ability to replace MR shuffle with a C++ data partition (this could be a discussion thread in its own right).

Moving our processing into native Java for a more seamless MR integration is not an option due to the size and complexity of the code base.

It may be that I am completely wrong about the limitations of the streams interface; if so, please tell me why.

john

From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com]
Sent: Wednesday, May 29, 2013 8:34 AM
To: user@hadoop.apache.org
Subject: What else can be built on top of YARN.
Hi all,

I was going through the motivation behind YARN. Splitting the responsibilities of the JT is the major concern. Ultimately, the base (YARN) was built in a generic way so that other generic distributed applications can be built on it too. I am not able to think of any other parallel processing use case that would be useful to build on top of YARN. I thought of a lot of use cases that would benefit from running in parallel, but again, we can do those using map-only jobs in MR.

Can someone tell me a scenario where an application can utilize YARN features, or can be built on top of YARN, and at the same time cannot be done efficiently using MRv2 jobs?

thanks,
Rahul
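As a rough illustration of the transitive-closure point in John's message above: the sketch below is not from the thread, and the function name and the sample edges are made up for illustration. It computes reachability by repeatedly joining newly found paths with the edge list. The number of rounds depends on the data, and in plain MR each round would typically be its own job with its own startup and I/O cost, whereas a long-running, MPI-like process can keep the working set in memory between rounds.

```python
def transitive_closure(edges):
    """Return (closure, rounds): every reachable (a, c) pair and the
    number of join rounds needed before no new pairs appeared."""
    edges = set(edges)

    # Index edges by source node so each join step is a dictionary lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    closure = set(edges)
    frontier = set(edges)  # pairs discovered in the previous round
    rounds = 0
    while frontier:
        rounds += 1
        # Extend each newly found path (a, b) by one edge (b, c).
        candidates = {
            (a, c)
            for a, b in frontier
            for c in successors.get(b, ())
        }
        frontier = candidates - closure  # keep only pairs not seen before
        closure |= frontier
    return closure, rounds


if __name__ == "__main__":
    pairs, rounds = transitive_closure({(1, 2), (2, 3), (3, 4)})
    print(sorted(pairs), rounds)
```

Only the pairs found in the previous round are re-joined, so the loop stops as soon as a round adds nothing new, including on graphs with cycles.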
Re: What else can be built on top of YARN.
Hi Rahul,

The reasons that Vinod listed are, at the least, what make it easier for me to port my application to YARN instead of making it work in the MapReduce framework. My main purpose in using YARN is to exploit its resource management capabilities.

Thanks,
Kishore

On Wed, May 29, 2013 at 11:00 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote:

Thanks for the response, Krishna. I was wondering whether it would be possible to use MR to solve your problem instead of building the whole stack on top of YARN. Most likely it's not possible, which is why you are building it. I wanted to know why that is. I am just trying to find out why we might need to write an application on YARN.

Rahul
Re: What else can be built on top of YARN.
What is the separation of concerns between YARN and ZooKeeper? That is, where does YARN leave off and where does ZooKeeper begin? Or is there some overlap?

On Thu, May 30, 2013 at 2:42 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote:

Hi Rahul,

The reasons that Vinod listed are, at the least, what make it easier for me to port my application to YARN instead of making it work in the MapReduce framework. My main purpose in using YARN is to exploit its resource management capabilities.

Thanks,
Kishore

--
Jay Vyas
http://jayunit100.blogspot.com
Re: What else can be built on top of YARN.
Hi Rahul,

I am porting a distributed application that runs on a fixed set of given resources to YARN, with the aim of being able to run it on dynamically selected resources, whichever are available at the time the application runs.

Thanks,
Kishore

On Wed, May 29, 2013 at 8:04 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote:

Hi all,

I was going through the motivation behind YARN. Splitting the responsibilities of the JT is the major concern. Ultimately, the base (YARN) was built in a generic way so that other generic distributed applications can be built on it too. I am not able to think of any other parallel processing use case that would be useful to build on top of YARN. I thought of a lot of use cases that would benefit from running in parallel, but again, we can do those using map-only jobs in MR.

Can someone tell me a scenario where an application can utilize YARN features, or can be built on top of YARN, and at the same time cannot be done efficiently using MRv2 jobs?

thanks,
Rahul
Re: What else can be built on top of YARN.
Thanks for the response, Krishna. I was wondering whether it would be possible to use MR to solve your problem instead of building the whole stack on top of YARN. Most likely it's not possible, which is why you are building it. I wanted to know why that is. I am just trying to find out why we might need to write an application on YARN.

Rahul

On Wed, May 29, 2013 at 8:23 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote:

Hi Rahul,

I am porting a distributed application that runs on a fixed set of given resources to YARN, with the aim of being able to run it on dynamically selected resources, whichever are available at the time the application runs.

Thanks,
Kishore
Re: What else can be built on top of YARN.
Two scenarios I can think of are re-implementations of Twitter's Storm (http://storm-project.net/) and DryadLinq (http://research.microsoft.com/en-us/projects/dryadlinq/).

Storm, a distributed realtime computation framework used for analyzing realtime streams of data, doesn't really need to be ported. It's doing fine by itself, though I think it's a prime candidate for a YARN port.

DryadLinq is a (now closed) research project out of Microsoft Research that allowed the user to write standard LINQ code (in any .NET language); it built an execution DAG based on the structure of the LINQ statement and executed the DAG on an MS HPC cluster. The LINQ syntax is very much like Pig, though way more flexible, has full IDE support (in Visual Studio), and is used in standard single-process programming. That, to me, was the beauty behind DryadLinq: the programming language for distributed execution was exactly the same as a well-known language already used for standard single-process programming by hundreds of thousands of programmers, so the learning curve and acceptance debt are really low. But, like all good things that come out of MS Research, it was killed because they sat on it too long.

The interesting thing is that distributed DAG execution is one of the main examples given for the types of YARN applications that could be developed.

On Wed, May 29, 2013 at 10:30 AM, Rahul Bhattacharjee rahul.rec@gmail.com wrote:

Thanks for the response, Krishna. I was wondering whether it would be possible to use MR to solve your problem instead of building the whole stack on top of YARN. Most likely it's not possible, which is why you are building it. I wanted to know why that is. I am just trying to find out why we might need to write an application on YARN.

Rahul

--
Thanks,
John C
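To make the DAG-execution idea in John C's message a little more concrete, here is a minimal sketch. It is not from the thread and is not how DryadLinq or any YARN framework is implemented; the function name and the stage names are hypothetical. It groups the stages of a job graph into waves, where every stage in a wave has all its inputs ready and could, in principle, run in its own container at the same time as the others.

```python
def schedule_waves(deps):
    """deps maps each stage name to the set of stages it depends on.
    Returns a list of waves; stages within a wave are independent."""
    remaining = {stage: set(needs) for stage, needs in deps.items()}
    waves = []
    while remaining:
        # A stage is ready once none of its dependencies are outstanding.
        ready = sorted(s for s, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError("cycle or unknown dependency in stage graph")
        waves.append(ready)
        for stage in ready:
            del remaining[stage]
        # Mark the finished stages as satisfied for everything still waiting.
        for needs in remaining.values():
            needs.difference_update(ready)
    return waves


if __name__ == "__main__":
    job = {
        "read_a": set(),
        "read_b": set(),
        "join": {"read_a", "read_b"},
        "aggregate": {"join"},
        "write": {"aggregate"},
    }
    for i, wave in enumerate(schedule_waves(job), start=1):
        print(i, wave)
```

A fixed map-then-reduce pipeline is the special case of a graph like this with exactly two waves; a general DAG can have any number of waves and any fan-in or fan-out.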
Re: What else can be built on top of YARN.
There is a project at Yahoo which makes it possible to run Storm on YARN. I think the team behind it is going to give a talk at Hadoop Summit and plans to open source it after that.

-Viral

On Wed, May 29, 2013 at 11:04 AM, John Conwell j...@iamjohn.me wrote:

Storm, a distributed realtime computation framework used for analyzing realtime streams of data, doesn't really need to be ported. It's doing fine by itself, though I think it's a prime candidate for a YARN port.
Re: What else can be built on top of YARN.
Historically, many applications/frameworks wanted to take advantage of just the resource management capabilities and failure handling of Hadoop (via JobTracker/TaskTracker), but were forced to use MapReduce even though they didn't have to. Obvious examples are graph processing (Giraph), BSP (Hama), Storm/S4 and even a simple tool like DistCp.

There are issues even with map-only jobs.

- You have to fake key-value processing, periodic pings, key-value outputs
- You are limited to map slot capacity in the cluster
- The number of tasks is static, so you cannot grow and shrink your job
- You are forced to sort data all the time (even though this has changed recently)
- You are tied to faking things like OutputCommit even if you don't need to.

That's just for starters. I can definitely think harder and list more ;)

YARN lets you move ahead without those limitations.

HTH

+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On May 29, 2013, at 7:34 AM, Rahul Bhattacharjee wrote:

Hi all,

I was going through the motivation behind YARN. Splitting the responsibilities of the JT is the major concern. Ultimately, the base (YARN) was built in a generic way so that other generic distributed applications can be built on it too. I am not able to think of any other parallel processing use case that would be useful to build on top of YARN. I thought of a lot of use cases that would benefit from running in parallel, but again, we can do those using map-only jobs in MR.

Can someone tell me a scenario where an application can utilize YARN features, or can be built on top of YARN, and at the same time cannot be done efficiently using MRv2 jobs?

thanks,
Rahul
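Two items in Vinod's list above (being limited to map slot capacity, and not being able to grow and shrink a job) can be pictured with a toy model. The sketch below is an illustration only: ToyCluster, its methods, the node names and the memory figures are all invented here and are not the YARN API. The application asks for containers of whatever size it needs, gets them wherever they fit, and can hand them back and ask again while it is running.

```python
class ToyCluster:
    """Toy stand-in for a resource manager. NOT the YARN API."""

    def __init__(self, node_memory_mb):
        # Node name -> free memory in MB.
        self.free = dict(node_memory_mb)

    def allocate(self, memory_mb):
        """Grant a container on the first node (in name order) that has
        enough free memory. Returns (node, memory_mb), or None if no
        node can currently satisfy the request."""
        for node in sorted(self.free):
            if self.free[node] >= memory_mb:
                self.free[node] -= memory_mb
                return (node, memory_mb)
        return None

    def release(self, container):
        """Return a container's memory to its node."""
        node, memory_mb = container
        self.free[node] += memory_mb


if __name__ == "__main__":
    cluster = ToyCluster({"node1": 4096, "node2": 8192})
    heavy = cluster.allocate(6144)                        # one large container
    light = [cluster.allocate(1024) for _ in range(3)]    # several small ones
    print(heavy, light, cluster.free)
    cluster.release(heavy)                                # shrink, then ask again
    print(cluster.allocate(4096), cluster.free)
```

The contrast with fixed map slots is that request sizes are chosen per container rather than per cluster, and the set of containers held by the job can change over its lifetime.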