Re: Running MPI jobs on Mesos based clusters

2016-10-18 Thread Suresh Marru
Hi Mangirish,

This is interesting. Looking forward to seeing what you find out further on 
gang scheduling support. Since the compute nodes are getting bigger, even 
exploring single-node MPI (on Jetstream, using 22 cores) would help.

Suresh

P.S. Good to see the momentum on mailing list discussions on such topics. 

> On Oct 18, 2016, at 1:54 AM, Mangirish Wagle  wrote:
> 
> Hello Devs,
> 
> Here is an update on some new learnings and thoughts based on my interactions 
> with Mesos and Aurora devs.
> 
> MPI implementations in the Mesos repositories (like MPI Hydra) rely on obsolete 
> MPI platforms and are no longer supported by the developer community. Hence it is 
> not recommended that we use them for our purpose.
> 
> One of the known ways of running MPI jobs over Mesos is "gang 
> scheduling", which essentially distributes the MPI run over multiple jobs on 
> Mesos in place of multiple nodes. The challenge here is that the jobs need to be 
> scheduled as one task, and if any job errors out, the main program, including 
> all the distributed jobs, should collectively error out.
> 
> One of the Mesos developers (Niklas Nielsen) pointed me to his work on 
> gang scheduling: https://github.com/nqn. This code 
> may not be fully tested, but it is certainly a good starting point to explore 
> gang scheduling.
> 
> One of the Aurora developers (Stephen Erb) suggested using gang scheduling on 
> top of Aurora. The Aurora scheduler assumes that every job is independent. Hence, 
> there would be a need to develop some external scaffolding to coordinate and 
> schedule these jobs, which might not be trivial. One advantage of using 
> Aurora as a backend for gang scheduling is that we would inherit the 
> robustness of Aurora, which would otherwise be a key challenge if targeting 
> bare Mesos.
> 
> As an alternative to all the options above, I think we should be able to 
> run a 1-node MPI job through Aurora. A resource offer with CPUs and memory 
> from Mesos is abstracted as a single runtime, but is mapped to multiple nodes 
> underneath, which eventually would exploit distributed resource capabilities.
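For reference, a 1-node MPI job in Aurora's Python-style configuration DSL might look roughly like the following; the job name, role, and resource sizes here are hypothetical, and the exact DSL usage should be checked against the Aurora docs:

```python
# mpi_hello.aurora -- hypothetical single-node MPI job (names/sizes invented)
run_mpi = Process(
  name = 'run_mpi',
  cmdline = 'mpiexec -n 22 ./mpitest'   # all ranks on the one offered node
)

mpi_task = Task(
  processes = [run_mpi],
  resources = Resources(cpu = 22, ram = 16*GB, disk = 8*GB)
)

jobs = [
  Job(cluster = 'devcluster', role = 'centos',
      environment = 'devel', name = 'mpi_hello', task = mpi_task)
]
```

Since the whole `mpiexec` run lives inside one Aurora process, no cross-job coordination is needed, which is what makes this the easiest path to try first.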
> 
> I intend to try out the 1 node MPI job submission approach first and 
> simultaneously explore the gang scheduling approach.
> 
> Please let me know your thoughts/ suggestions.
> 
> Best Regards,
> Mangirish 
> 
> 
> 
> On Thu, Oct 13, 2016 at 12:39 PM, Mangirish Wagle wrote:
> Hi Marlon,
> Thanks for confirming and sharing the legal link.
> 
> -Mangirish
> 
> On Thu, Oct 13, 2016 at 12:13 PM, Pierce, Marlon wrote:
> BSD is ok: https://www.apache.org/legal/resolved.
> 
>  
> 
> From: Mangirish Wagle
> Reply-To: "dev@airavata.apache.org"
> Date: Thursday, October 13, 2016 at 12:03 PM
> To: "dev@airavata.apache.org"
> Subject: Re: Running MPI jobs on Mesos based clusters
> 
>  
> 
> Hello Devs,
> 
> I needed some advice on the license of the MPI libraries. The MPICH library 
> that I have been trying out claims to have a "BSD Like" license 
> (http://git.mpich.org/mpich.git/blob/HEAD:/COPYRIGHT).
> 
> I am aware that OpenMPI, which uses a BSD license, is currently used in our 
> application. I chose to start investigating MPICH because it claims to 
> be a highly portable, high-quality implementation of the latest MPI standard, 
> suitable for cloud-based clusters.
> 
> If anyone could please advise on the acceptability of the MPICH library's 
> "BSD Like" license for ASF, that would help.
> 
> Thank you.
> 
> Best Regards,
> 
> Mangirish Wagle
> 
>  
> 
> On Thu, Oct 6, 2016 at 1:48 AM, Mangirish Wagle  > wrote:
> 
> Hello Devs,
> 
>  
> 
> The network issue mentioned above now stands resolved. The problem was that 
> iptables had some conflicting rules which blocked the traffic. It was 
> resolved by a simple iptables flush (iptables -F).
> 
>  
> 
> Here is the test MPI program running on multiple machines:
> 
>  
> 
> [centos@mesos-slave-1 ~]$ mpiexec -f machinefile -n 2 ./mpitest
> 
> Hello world!  I am process number: 0 on host mesos-slave-1
> 
> Hello world!  I am process number: 1 on host mesos-slave-2
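The source of ./mpitest is not included in the thread; a minimal MPI "hello world" that would produce output of this shape (built with, e.g., `mpicc mpitest.c -o mpitest` against an MPICH or OpenMPI install) is a guess along these lines:

```c
/* mpitest.c -- hypothetical reconstruction of the test program above */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                      /* start the MPI runtime   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* this process's rank     */
    MPI_Get_processor_name(host, &len);          /* hostname of this rank   */
    printf("Hello world!  I am process number: %d on host %s\n", rank, host);
    MPI_Finalize();
    return 0;
}
```

With the machinefile listing mesos-slave-1 and mesos-slave-2, `mpiexec -f machinefile -n 2 ./mpitest` places one rank on each host, matching the two output lines above.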
> 
>  
> 
> The next step is to try invoking this through a framework like Marathon. 
> However, the job submission still does not run through Marathon. It seems to 
> get stuck in the 'waiting' state forever (for example, 
> http://149.165.170.245:8080/ui/#/apps/%2Fmaw-try). Further, I notice that 
> Marathon is listed under 'inactive frameworks' in the Mesos dashboard 
> (http://149.165.171.33:5050/#/frameworks).
> 
>  
> 
> I am trying to get this working, though any help/ clues with this would be 
> really helpful.
> 
> Thanks and Regards,
> 
> Mangirish Wagle

Re: Running MPI jobs on Mesos based clusters

2016-10-18 Thread Mangirish Wagle
Sure Suresh, will update my findings on the mailing list. Thanks!

On Tue, Oct 18, 2016 at 7:59 AM, Suresh Marru  wrote:
