Re: What else can be built on top of YARN.

2013-06-06 Thread Arun C Murthy
John,

On Jun 1, 2013, at 7:02 AM, John Lilley wrote:

 
 · Algorithms that are not well-suited to the MR model, such as 
 transitive closure.  They are more naturally expressed as MPI-like algorithms.

You might be interested in MPICH2 on YARN:
https://github.com/clarkyzl/mpich2-yarn

Disclaimer: I haven't used it myself.

Arun



Re: What else can be built on top of YARN.

2013-06-01 Thread Rahul Bhattacharjee
Thanks a lot for the responses. I now have a better understanding.

To answer Jay's question, I think ZK can be used as a coordination
service for a distributed program (one that you build on top of its exposed
granular APIs), but it does not have features such as resource management
of cluster nodes (including allocating resources based on requests),
which YARN has.
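
As a rough illustration of the coordination side only, here is a minimal in-memory sketch of the well-known "lowest sequence number leads" election recipe. ToyCoordinator and its methods are hypothetical names invented for this example; this is not the ZooKeeper client API. Note what the sketch does not do: it never decides where a process runs or how much memory it gets, which is the part a resource manager such as YARN handles.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (hypothetical, not ZooKeeper)."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> member name

    def join(self, name):
        """Register an already-running member and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Drop a member, as would happen when its session ends."""
        self._members.pop(seq, None)

    def leader(self):
        """The member holding the lowest sequence number is the leader."""
        if not self._members:
            return None
        return self._members[min(self._members)]


coordinator = ToyCoordinator()
first = coordinator.join("worker-a")
second = coordinator.join("worker-b")
print(coordinator.leader())
coordinator.leave(first)
print(coordinator.leader())
```

The members here are assumed to be running already; something else has to start them and give them resources.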

Rahul


On Thu, May 30, 2013 at 5:59 PM, Jay Vyas jayunit...@gmail.com wrote:

 What is the separation of concerns between YARN and ZooKeeper?  That is,
 where does YARN leave off and where does ZooKeeper begin?  Or is there some
 overlap?





RE: What else can be built on top of YARN.

2013-06-01 Thread John Lilley
Rahul,

This is a very good question, and one we are currently grappling with in our 
application port.  I think there are a lot of legacy data-processing 
applications like ours that would benefit from a port to Hadoop.  However, 
because we have a great deal of C++, our application is not necessarily a good fit for MR.  
There seem to be two main choices:

· Run under Hadoop Streaming (“streams” below)

· Run as a custom ApplicationMaster

One of the selling points of our application is its performance and single-code 
efficiency.  I have concerns about streams:

· We will lose performance, because of the extra layers of translation 
and I/O and because streams data is uncompressed

· The streams model is limited to single-in, single-out

· We have a very large number and size of files to make available 
locally, and it is unclear whether the -files option will recursively copy and 
cache all of them

In contrast, porting our application as a YARN ApplicationMaster appears to 
offer several benefits (which come at the expense of extra complexity):

· Negotiation for container resources and scheduling.  Some of our 
operations are very heavy (load time and memory use), so they need larger 
containers and will benefit from larger data splits.

· Direct access to HDFS via JNI without translation layers.

· Algorithms that are not well-suited to the MR model, such as 
transitive closure.  They are more naturally expressed as MPI-like algorithms.

· If warranted, the ability to replace MR shuffle with a C++ data 
partition (this could be a discussion thread in its own right).
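
To make the transitive-closure point concrete, here is a minimal single-process sketch (the function name and the edge-set representation are assumptions made for this example, not anything from our code base). Each round joins the newly found pairs against the original edges, and the number of rounds depends on the data. Under MR each round would roughly correspond to a separate job with its intermediate result written out and read back, whereas a long-lived MPI-like process can keep the state in memory between rounds.

```python
def transitive_closure(edges):
    """Return (closure, rounds) for a set of directed edges (a, b).

    Repeats a join step until no new pairs appear (a fixed point).
    """
    # Index the original edges by source node for the join.
    by_source = {}
    for a, b in edges:
        by_source.setdefault(a, set()).add(b)

    closure = set(edges)
    frontier = set(edges)  # pairs discovered in the previous round
    rounds = 0
    while frontier:
        rounds += 1
        # Extend every newly found pair (a, b) with every edge (b, c).
        candidates = {(a, c) for a, b in frontier for c in by_source.get(b, ())}
        frontier = candidates - closure
        closure |= frontier
    return closure, rounds


closure, rounds = transitive_closure({(1, 2), (2, 3), (3, 4)})
print(sorted(closure), rounds)
```

The final round finds nothing new; it is only there to detect that the fixed point has been reached.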

Moving our processing into native Java for a more seamless MR integration is 
not an option due to the size and complexity of the code base.

It may be that I am completely wrong about the limitations of the streams 
interface; if so please tell me why.

john

From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com]
Sent: Wednesday, May 29, 2013 8:34 AM
To: user@hadoop.apache.org
Subject: What else can be built on top of YARN.

Hi all,
I was going through the motivation behind YARN. Splitting the responsibilities of 
the JT (JobTracker) is the major concern. Ultimately, the base (YARN) was built in a 
generic way so that other generic distributed applications can be built on it too.
I am not able to think of any other parallel processing use case that would be 
useful to build on top of YARN. I thought of a lot of use cases that would 
benefit from running in parallel, but again, we can do those using map-only 
jobs in MR.
Can someone tell me a scenario where an application can utilize YARN features 
or be built on top of YARN and, at the same time, cannot be done 
efficiently using MRv2 jobs?
thanks,
Rahul



Re: What else can be built on top of YARN.

2013-05-30 Thread Krishna Kishore Bonagiri
Hi Rahul,

  The reasons that Vinod listed are, at the very least, what make it easier
for me to port my application to YARN than to make it work in the
MapReduce framework. My main purpose in using YARN is to exploit its
resource management capabilities.

Thanks,
Kishore









Re: What else can be built on top of YARN.

2013-05-30 Thread Jay Vyas
What is the separation of concerns between YARN and ZooKeeper?  That is,
where does YARN leave off and where does ZooKeeper begin?  Or is there some
overlap?










-- 
Jay Vyas
http://jayunit100.blogspot.com


Re: What else can be built on top of YARN.

2013-05-29 Thread Krishna Kishore Bonagiri
Hi Rahul,

  I am porting to YARN a distributed application that currently runs on a
fixed set of given resources, with the aim of being able to run it on
dynamically selected resources, whichever are available at the time the
application runs.

Thanks,
Kishore







Re: What else can be built on top of YARN.

2013-05-29 Thread Rahul Bhattacharjee
Thanks for the response, Krishna.

I was wondering whether it would be possible to use MR to solve your problem
instead of building the whole stack on top of YARN.
Most likely it is not possible, which is why you are building it. I wanted to
know why that is.

I am just trying to find out why we might need to write an
application on YARN.

Rahul








Re: What else can be built on top of YARN.

2013-05-29 Thread John Conwell
Two scenarios I can think of are re-implementations of Twitter's Storm (
http://storm-project.net/) and DryadLINQ (
http://research.microsoft.com/en-us/projects/dryadlinq/).

Storm, a distributed realtime computation framework used for analyzing
realtime streams of data, doesn't really need to be ported.  It's doing fine
by itself, though I think it's a prime candidate for a YARN port.

DryadLINQ is a (now closed) research project out of Microsoft Research that
allowed the user to write standard LINQ code (in any .NET language); it
built a DAG-based execution plan from the LINQ statement and executed
the DAG on an MS HPC cluster.

The LINQ syntax is very much like Pig, though far more flexible, with
full IDE support (in Visual Studio), and it is already used in standard
single-process programming.  That, to me, was the beauty behind DryadLINQ:
the programming language for distributed execution was exactly the same as a
well-known language already used for single-process programming by
hundreds of thousands of programmers, so the learning curve and acceptance
debt are really low.  But, like all good things that come out of MS Research,
it was killed because they sat on it too long.

The interesting thing is that distributed DAG execution is one of the main
examples given for the types of YARN applications that could be developed.















-- 

Thanks,
John C


Re: What else can be built on top of YARN.

2013-05-29 Thread Viral Bajaria
There is a project at Yahoo that makes it possible to run Storm on YARN. I
think the team behind it is going to give a talk at Hadoop Summit and plans
to open source it after that.

-Viral

On Wed, May 29, 2013 at 11:04 AM, John Conwell j...@iamjohn.me wrote:

 Storm, a distributed realtime computation framework used for analyzing
 realtime streams of data, doesn't really need to be ported.  It's doing fine
 by itself, though I think it's a prime candidate for a YARN port.


Re: What else can be built on top of YARN.

2013-05-29 Thread Vinod Kumar Vavilapalli


Historically, many applications/frameworks wanted to take advantage of just the 
resource management capabilities and failure handling of Hadoop (via 
JobTracker/TaskTracker), but were forced to use MapReduce even though they 
did not need it. Obvious examples are graph processing (Giraph), BSP (Hama), 
Storm/S4, and even a simple tool like DistCp.

There are issues even with map-only jobs.
 - You have to fake key-value processing, periodic pings, key-value outputs
 - You are limited to map slot capacity in the cluster
 - The number of tasks is static, so you cannot grow and shrink your job
 - You are forced to sort data all the time (even though this has changed 
recently)
 - You are tied to faking things like OutputCommit even if you don't need to.

That's just for starters. I can definitely think harder and list more ;)

YARN lets you move ahead without those limitations.
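
To picture the "static number of tasks" and "map slot capacity" points, here is a toy model of an application that asks for differently sized containers in each phase and hands them back in between. ToyCluster, run_phases, and the memory figures are all hypothetical and exist only for this sketch; this is not the YARN client API, just the shape of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a resource manager; not the YARN API."""
    free_mb: int
    granted: dict = field(default_factory=dict)  # container id -> size in MB
    next_id: int = 0

    def request(self, mb):
        """Grant a container of the requested size, or return None if it does not fit."""
        if mb > self.free_mb:
            return None
        self.free_mb -= mb
        cid = self.next_id
        self.next_id += 1
        self.granted[cid] = mb
        return cid

    def release(self, cid):
        """Return a container's memory to the pool."""
        self.free_mb += self.granted.pop(cid)


def run_phases(cluster, phases):
    """Run phases given as (container_count, mb_each) pairs.

    Containers are released after each phase, so the application grows and
    shrinks. Returns how many containers were granted in each phase.
    """
    history = []
    for count, mb in phases:
        requested = (cluster.request(mb) for _ in range(count))
        held = [cid for cid in requested if cid is not None]
        history.append(len(held))
        for cid in held:
            cluster.release(cid)
    return history


cluster = ToyCluster(free_mb=8192)
print(run_phases(cluster, [(2, 4096), (8, 1024), (3, 4096)]))
```

In the last phase only two of the three large requests fit; a real application would presumably wait or retry rather than carry on with fewer containers.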

HTH
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/


On May 29, 2013, at 7:34 AM, Rahul Bhattacharjee wrote:

 Hi all,
 
 I was going through the motivation behind YARN. Splitting the responsibilities 
 of the JT (JobTracker) is the major concern. Ultimately, the base (YARN) was built in a 
 generic way so that other generic distributed applications can be built on it too.
 
 I am not able to think of any other parallel processing use case that would 
 be useful to build on top of YARN. I thought of a lot of use cases that would 
 benefit from running in parallel, but again, we can do those using map-only 
 jobs in MR.
 
 Can someone tell me a scenario where an application can utilize YARN 
 features or be built on top of YARN and, at the same time, cannot be 
 done efficiently using MRv2 jobs?
 
 thanks,
 Rahul