Hi Gourav, Mangirish,

Did you check out SLURM on Mesos - https://github.com/nqn/slurm-mesos
Note that this is GPL-licensed code and incompatible with ASL v2. That does not preclude us from using it, but we need to watch out when integrating incompatibly licensed code.

Suresh

> On Oct 20, 2016, at 10:26 PM, Shenoy, Gourav Ganesh <goshe...@indiana.edu> wrote:
>
> Hi Mangirish, devs,
>
> The Aurora documentation for "Tasks" and "Processes" provides very good information which I felt would be helpful in implementing gang scheduling, as you mentioned.
>
> http://aurora.apache.org/documentation/latest/reference/configuration/
>
> From what I understood, there are these constraints:
> 1. If targeting single-node (multi-core) MPI, then a "JOB" will be broken down into multiple "PROCESSES", each of which will run on one of these cores.
> 2. If any one of these processes fails, the JOB should be marked as failed.
>
> As mentioned in my earlier email, Aurora provides a Job abstraction - "a job consists of multiple tasks, which in turn consist of multiple processes". This abstraction comes in extremely handy if we want to run MPI jobs on a single node.
>
> While submitting a job to Aurora, we can control the following parameters for a TASK:
>
> a. "max_failures" for a TASK - the number of failed processes required to mark a task as failed. Hence if we set max_failures = 1, then even if a single process in a task fails, Aurora will mark that task as failed. Note: since a JOB can have multiple tasks, and a JOB has its own "max_task_failures" parameter, we can set that to 1 as well.
>
> b. "max_concurrency" for a TASK - the number of processes to run in parallel. If a node has 16 cores, then we can limit the amount of parallelism to <= 16.
>
> I did not get much time to experiment with these parameters for job submission, but found this document handy and worth sharing. Hope this helps!
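For concreteness, the two task parameters above might be exercised in an Aurora config along these lines (a sketch only - the process/task/job names, resource sizes, and the cluster/role/environment values are all hypothetical):

```python
# Sketch of an Aurora config (.aurora, Python-based DSL) using the
# parameters discussed above. All names and sizes are hypothetical.
mpi_proc = Process(name = 'mpi_rank', cmdline = 'mpiexec -n 1 ./mpitest')

mpi_task = Task(
  name = 'mpi_task',
  processes = [mpi_proc],
  max_failures = 1,       # a single failed process marks the task failed
  max_concurrency = 16,   # cap parallel processes at the node's core count
  resources = Resources(cpu = 16, ram = 16*GB, disk = 8*GB))

jobs = [Job(
  cluster = 'devcluster', role = 'centos', environment = 'devel',
  name = 'mpi_job',
  task = mpi_task,
  max_task_failures = 1,  # a single failed task marks the job failed
  instances = 1)]
```

Submission would then be along the lines of `aurora job create devcluster/centos/devel/mpi_job mpi.aurora` (file name and job key hypothetical).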
> Thanks and Regards,
> Gourav Shenoy
>
> From: Mangirish Wagle <vaglomangir...@gmail.com>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Tuesday, October 18, 2016 at 11:48 AM
> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Subject: Re: Running MPI jobs on Mesos based clusters
>
> Sure Suresh, will update my findings on the mailing list. Thanks!
>
> On Tue, Oct 18, 2016 at 7:59 AM, Suresh Marru <sma...@apache.org> wrote:
> Hi Mangirish,
>
> This is interesting. Looking forward to seeing what you find out further on gang scheduling support. Since compute nodes are getting bigger, even exploring single-node MPI (on Jetstream, using 22 cores) will help.
>
> Suresh
>
> P.S. Good to see the momentum on mailing-list discussions of such topics.
>
> On Oct 18, 2016, at 1:54 AM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
>
> Hello Devs,
>
> Here is an update on some new learnings and thoughts based on my interactions with the Mesos and Aurora devs.
>
> The MPI implementations in the Mesos repositories (like MPI Hydra) rely on obsolete MPI platforms and are no longer supported by the developer community. Hence it is not recommended that we use them for our purpose.
>
> One of the known ways of running MPI jobs over Mesos is "gang scheduling", which is basically distributing the MPI run over multiple jobs on Mesos in place of multiple nodes. The challenge here is that the jobs need to be scheduled as one task, and any job that errors should collectively error out the main program, including all the distributed jobs.
>
> One of the Mesos developers (Niklas Nielsen) pointed me to his work on gang scheduling: https://github.com/nqn. This code may not be fully tested but is certainly a good starting point for exploring gang scheduling.
> One of the Aurora developers (Stephen Erb) suggested building gang scheduling on top of Aurora. The Aurora scheduler assumes that every job is independent, so we would need to develop some external scaffolding to coordinate and schedule these jobs, which might not be trivial. One advantage of using Aurora as a backend for gang scheduling is that we would inherit the robustness of Aurora, which would otherwise be a key challenge if targeting bare Mesos.
>
> As an alternative to the options above, I think we should be able to run a single-node MPI job through Aurora. A resource offer with CPUs and memory from Mesos is abstracted as a single runtime but is mapped to multiple nodes underneath, which would eventually exploit distributed resource capabilities.
>
> I intend to try the single-node MPI job submission approach first and simultaneously explore the gang scheduling approach.
>
> Please let me know your thoughts/suggestions.
>
> Best Regards,
> Mangirish
>
> On Thu, Oct 13, 2016 at 12:39 PM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
> Hi Marlon,
> Thanks for confirming and sharing the legal link.
>
> -Mangirish
>
> On Thu, Oct 13, 2016 at 12:13 PM, Pierce, Marlon <marpi...@iu.edu> wrote:
> BSD is ok: https://www.apache.org/legal/resolved
>
> From: Mangirish Wagle <vaglomangir...@gmail.com>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Thursday, October 13, 2016 at 12:03 PM
> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Subject: Re: Running MPI jobs on Mesos based clusters
>
> Hello Devs,
>
> I needed some advice on the license of the MPI libraries.
The MPICH library that I have been trying claims to have a "BSD-like" license (http://git.mpich.org/mpich.git/blob/HEAD:/COPYRIGHT).
>
> I am aware that OpenMPI, which uses a BSD license, is currently used in our application. I chose to start investigating MPICH because it claims to be a highly portable, high-quality implementation of the latest MPI standard, suitable for cloud-based clusters.
>
> If anyone could advise on the acceptability of MPICH's "BSD-like" license for ASF, that would help.
>
> Thank you.
>
> Best Regards,
> Mangirish Wagle
>
> On Thu, Oct 6, 2016 at 1:48 AM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
> Hello Devs,
>
> The network issue mentioned above now stands resolved. The problem was that iptables had some conflicting rules which blocked the traffic; it was resolved by a simple iptables flush.
>
> Here is the test MPI program running on multiple machines:
>
> [centos@mesos-slave-1 ~]$ mpiexec -f machinefile -n 2 ./mpitest
> Hello world! I am process number: 0 on host mesos-slave-1
> Hello world! I am process number: 1 on host mesos-slave-2
>
> The next step is to try invoking this through a framework like Marathon. However, job submission still does not run through Marathon; it seems to get stuck in the 'waiting' state forever (for example http://149.165.170.245:8080/ui/#/apps/%2Fmaw-try). Further, I notice that Marathon is listed under 'inactive frameworks' in the Mesos dashboard (http://149.165.171.33:5050/#/frameworks).
>
> I am trying to get this working, though any help/clues with this would be really helpful.
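The mpitest binary shown above is not included in the thread; a minimal MPI C program consistent with that output might look like the following sketch (assumes an MPI implementation such as MPICH is installed; built with mpicc):

```c
/* Sketch of a minimal mpitest.c consistent with the output shown above.
 * Build: mpicc -o mpitest mpitest.c   (requires an MPI install, e.g. MPICH) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, name_len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* this process's number */
    MPI_Get_processor_name(hostname, &name_len);  /* host running the process */

    printf("Hello world! I am process number: %d on host %s\n", rank, hostname);

    MPI_Finalize();
    return 0;
}
```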
> Thanks and Regards,
> Mangirish Wagle
>
> On Fri, Sep 30, 2016 at 9:21 PM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
> Hello Devs,
>
> I am currently running a sample MPI C program using 'mpiexec', provided by MPICH. I followed their installation guide (http://www.mpich.org/static/downloads/3.2/mpich-3.2-installguide.pdf) to install the libraries on the master and slave nodes of the Mesos cluster.
>
> The approach I am trying here is to equip the underlying nodes with MPI tooling and then use a Mesos framework like Marathon/Aurora to submit jobs that run MPI programs by invoking these tools.
>
> You can run an MPI program using mpiexec in the following manner:
>
> # mpiexec -f machinefile -n 2 ./mpitest
>
> machinefile -> file containing an inventory of machines to run the program on, and the number of processes on each machine.
> mpitest -> MPI program compiled in C using the mpicc compiler. The program prints the process number and the hostname of the machine running the process.
> -n -> the number of processes to spawn.
>
> Example of machinefile contents:
>
> # Entries in the format <hostname/IP>:<number of processes>
> mesos-slave-1:1
> mesos-slave-2:1
>
> The reason for choosing the slaves is that Mesos runs jobs on the slaves, managed by 'agents' pertaining to them.
>
> Output of the program with '-n 1':
>
> # mpiexec -f machinefile -n 1 ./mpitest
> Hello world!
> I am process number: 0 on host mesos-slave-1
>
> But when I try '-n 2', I hit the following error:
>
> # mpiexec -f machinefile -n 2 ./mpitest
> [proxy:0:1@mesos-slave-2] HYDU_sock_connect (/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172): unable to connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)
> [proxy:0:1@mesos-slave-2] main (/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189): unable to connect to server mesos-slave-1 at port 44788 (check for firewalls!)
>
> It seems the program cannot execute because network traffic is being blocked. I checked the security groups in the SciGaP OpenStack for the mesos-slave-1 and mesos-slave-2 nodes, and they are set to the 'wideopen' policy. Furthermore, I tried adding explicit rules to the policies to allow all TCP and UDP (currently I am not sure which protocol is used underneath), but it continues throwing this error.
>
> Any clues, suggestions, or comments about the error, or the approach as a whole, would be helpful.
>
> Thanks and Regards,
> Mangirish Wagle
>
> On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
> Hello Devs,
>
> Thanks Gourav and Shameera for all the work w.r.t. setting up the Mesos-Marathon cluster on Jetstream.
>
> I am currently evaluating MPICH (http://www.mpich.org/about/overview/) for launching MPI jobs on top of Mesos. MPICH version 1.2 supports Mesos-based MPI scheduling. I have also been trying to submit jobs to the cluster through Marathon. However, in either case I am currently facing issues which I am working to resolve.
>
> I am compiling my notes into the following google doc. You may please review it and let me know your comments and suggestions.
> https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing
>
> Thanks and Regards,
> Mangirish Wagle
>
> On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh <goshe...@indiana.edu> wrote:
> Hi Mangirish,
>
> I have set up a Mesos-Marathon cluster for you on Jetstream. I will share the cluster details with you in a separate email. Kindly note that there are 3 masters and 2 slaves in this cluster.
>
> I am also working on automating this process for Jetstream (similar to Shameera's Ansible script for EC2), and when that is ready, we can create clusters or add/remove slave machines from a cluster.
>
> Thanks and Regards,
> Gourav Shenoy
>
> From: Mangirish Wagle <vaglomangir...@gmail.com>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Wednesday, September 21, 2016 at 2:36 PM
> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Subject: Running MPI jobs on Mesos based clusters
>
> Hello All,
>
> I would like to make everybody aware of the study I am undertaking this fall: evaluating the various frameworks that would facilitate MPI jobs on Mesos-based clusters for Apache Airavata.
> Some of the options that I am looking at are:
> - MPI support framework bundled with Mesos
> - Apache Aurora
> - Marathon
> - Chronos
>
> Some of the criteria on which I plan to base my evaluation are:
> - Ease of setup
> - Documentation
> - Reliability features like HA
> - Scaling and fault recovery
> - Performance
> - Community support
>
> Gourav and Shameera are working on Ansible-based automation to spin up a Mesos-based cluster, and I am planning to use it to set up a cluster for experimentation.
>
> Any suggestions or information about prior work on this would be highly appreciated.
>
> Thank you.
>
> Best Regards,
> Mangirish Wagle