Hi Gourav, Mangirish,

Did you check out SLURM on Mesos - https://github.com/nqn/slurm-mesos
Note that this is GPL-licensed code and incompatible with ASL v2. That does not preclude us from using it, but we need to watch out when integrating incompatibly licensed code.

Suresh

> On Oct 20, 2016, at 10:26 PM, Shenoy, Gourav Ganesh <goshe...@indiana.edu> wrote:
>
> Hi Mangirish, devs,
>
> The Aurora documentation for "Tasks" and "Processes" provides very good information which I felt would be helpful in implementing gang scheduling, as you mentioned.
>
> http://aurora.apache.org/documentation/latest/reference/configuration/
>
> From what I understood, there are these constraints:
> 1. If targeting single-node (multi-core) MPI, then a "JOB" will be broken down into multiple "PROCESSES", each of which will run on one of these cores.
> 2. If any one of these processes fails, the JOB should be marked as failed.
>
> As mentioned in my earlier email, Aurora provides a Job abstraction - "a job consists of multiple tasks, which in turn consist of multiple processes". This abstraction comes in extremely handy if we want to run MPI jobs on a single node.
>
> While submitting a job to Aurora, we can control the following parameters for a TASK:
>
> a. "max_failures" for a TASK - the number of failed processes required to mark a task as failed. Hence if we set max_failures = 1, then even if a single process in a task fails, Aurora will mark that task as failed. Note: since a JOB can have multiple tasks, and a JOB has its own "max_task_failures" parameter, we can set that to 1 as well.
>
> b. "max_concurrency" for a TASK - the number of processes to run in parallel. If a node has 16 cores, then we can limit the amount of parallelism to <= 16.
>
> I did not get much time to experiment with these parameters for job submission, but found this document handy and worth sharing. Hope this helps!
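For concreteness, the two task parameters above might be exercised in an Aurora config along these lines (a sketch only - the process/task/job names, resource sizes, and the cluster/role/environment values are all hypothetical):

```python
# Sketch of an Aurora config (.aurora, Python-based DSL) using the
# parameters discussed above. All names and sizes are hypothetical.
mpi_proc = Process(name = 'mpi_rank', cmdline = 'mpiexec -n 1 ./mpitest')

mpi_task = Task(
  name = 'mpi_task',
  processes = [mpi_proc],
  max_failures = 1,       # a single failed process marks the task failed
  max_concurrency = 16,   # cap parallel processes at the node's core count
  resources = Resources(cpu = 16, ram = 16*GB, disk = 8*GB))

jobs = [Job(
  cluster = 'devcluster', role = 'centos', environment = 'devel',
  name = 'mpi_job',
  task = mpi_task,
  max_task_failures = 1,  # a single failed task marks the job failed
  instances = 1)]
```

Submission would then be along the lines of `aurora job create devcluster/centos/devel/mpi_job mpi.aurora` (file name and job key hypothetical).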
> Thanks and Regards,
> Gourav Shenoy
>
> From: Mangirish Wagle <vaglomangir...@gmail.com>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Tuesday, October 18, 2016 at 11:48 AM
> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Subject: Re: Running MPI jobs on Mesos based clusters
>
> Sure Suresh, will update my findings on the mailing list. Thanks!
>
> On Tue, Oct 18, 2016 at 7:59 AM, Suresh Marru <sma...@apache.org> wrote:
> Hi Mangirish,
>
> This is interesting. Looking forward to seeing what you find out further on gang scheduling support. Since compute nodes are getting bigger, even exploring single-node MPI (on Jetstream, using 22 cores) will help.
>
> Suresh
>
> P.S. Good to see the momentum on mailing-list discussions of such topics.
>
> On Oct 18, 2016, at 1:54 AM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
>
> Hello Devs,
>
> Here is an update on some new learnings and thoughts based on my interactions with the Mesos and Aurora devs.
>
> The MPI implementations in the Mesos repositories (like MPI Hydra) rely on obsolete MPI platforms and are no longer supported by the developer community. Hence it is not recommended that we use them for our purpose.
>
> One of the known ways of running MPI jobs over Mesos is "gang scheduling", which is basically distributing the MPI run over multiple jobs on Mesos in place of multiple nodes. The challenge here is that the jobs need to be scheduled as one task, and any job that errors should collectively error out the main program, including all the distributed jobs.
>
> One of the Mesos developers (Niklas Nielsen) pointed me to his work on gang scheduling: https://github.com/nqn. This code may not be fully tested but is certainly a good starting point for exploring gang scheduling.
> One of the Aurora developers (Stephen Erb) suggested building gang scheduling on top of Aurora. The Aurora scheduler assumes that every job is independent, so we would need to develop some external scaffolding to coordinate and schedule these jobs, which might not be trivial. One advantage of using Aurora as a backend for gang scheduling is that we would inherit the robustness of Aurora, which would otherwise be a key challenge if targeting bare Mesos.
>
> As an alternative to the options above, I think we should be able to run a single-node MPI job through Aurora. A resource offer with CPUs and memory from Mesos is abstracted as a single runtime but is mapped to multiple nodes underneath, which would eventually exploit distributed resource capabilities.
>
> I intend to try the single-node MPI job submission approach first and simultaneously explore the gang scheduling approach.
>
> Please let me know your thoughts/suggestions.
>
> Best Regards,
> Mangirish
>
> On Thu, Oct 13, 2016 at 12:39 PM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
> Hi Marlon,
> Thanks for confirming and sharing the legal link.
>
> -Mangirish
>
> On Thu, Oct 13, 2016 at 12:13 PM, Pierce, Marlon <marpi...@iu.edu> wrote:
> BSD is ok: https://www.apache.org/legal/resolved
>
> From: Mangirish Wagle <vaglomangir...@gmail.com>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Thursday, October 13, 2016 at 12:03 PM
> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Subject: Re: Running MPI jobs on Mesos based clusters
>
> Hello Devs,
>
> I needed some advice on the license of the MPI libraries.
The MPICH library that I have been trying claims to have a "BSD-like" license (http://git.mpich.org/mpich.git/blob/HEAD:/COPYRIGHT).
>
> I am aware that OpenMPI, which uses a BSD license, is currently used in our application. I chose to start investigating MPICH because it claims to be a highly portable, high-quality implementation of the latest MPI standard, suitable for cloud-based clusters.
>
> If anyone could advise on the acceptability of MPICH's "BSD-like" license for ASF, that would help.
>
> Thank you.
>
> Best Regards,
> Mangirish Wagle
>
> On Thu, Oct 6, 2016 at 1:48 AM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
> Hello Devs,
>
> The network issue mentioned above now stands resolved. The problem was that iptables had some conflicting rules which blocked the traffic; it was resolved by a simple iptables flush.
>
> Here is the test MPI program running on multiple machines:
>
> [centos@mesos-slave-1 ~]$ mpiexec -f machinefile -n 2 ./mpitest
> Hello world! I am process number: 0 on host mesos-slave-1
> Hello world! I am process number: 1 on host mesos-slave-2
>
> The next step is to try invoking this through a framework like Marathon. However, job submission still does not run through Marathon; it seems to get stuck in the 'waiting' state forever (for example http://149.165.170.245:8080/ui/#/apps/%2Fmaw-try). Further, I notice that Marathon is listed under 'inactive frameworks' in the Mesos dashboard (http://149.165.171.33:5050/#/frameworks).
>
> I am trying to get this working, though any help/clues with this would be really helpful.
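The mpitest binary shown above is not included in the thread; a minimal MPI C program consistent with that output might look like the following sketch (assumes an MPI implementation such as MPICH is installed; built with mpicc):

```c
/* Sketch of a minimal mpitest.c consistent with the output shown above.
 * Build: mpicc -o mpitest mpitest.c   (requires an MPI install, e.g. MPICH) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, name_len;
    char hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         /* this process's number */
    MPI_Get_processor_name(hostname, &name_len);  /* host running the process */

    printf("Hello world! I am process number: %d on host %s\n", rank, hostname);

    MPI_Finalize();
    return 0;
}
```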
> Thanks and Regards,
> Mangirish Wagle
>
> On Fri, Sep 30, 2016 at 9:21 PM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
> Hello Devs,
>
> I am currently running a sample MPI C program using 'mpiexec', provided by MPICH. I followed their installation guide (http://www.mpich.org/static/downloads/3.2/mpich-3.2-installguide.pdf) to install the libraries on the master and slave nodes of the Mesos cluster.
>
> The approach I am trying here is to equip the underlying nodes with MPI tooling and then use a Mesos framework like Marathon/Aurora to submit jobs that run MPI programs by invoking these tools.
>
> You can run an MPI program using mpiexec in the following manner:
>
> # mpiexec -f machinefile -n 2 ./mpitest
>
> machinefile -> file containing an inventory of machines to run the program on, and the number of processes on each machine.
> mpitest -> MPI program compiled in C using the mpicc compiler. The program prints the process number and the hostname of the machine running the process.
> -n -> the number of processes to spawn.
>
> Example of machinefile contents:
>
> # Entries in the format <hostname/IP>:<number of processes>
> mesos-slave-1:1
> mesos-slave-2:1
>
> The reason for choosing the slaves is that Mesos runs jobs on the slaves, managed by 'agents' pertaining to them.
>
> Output of the program with '-n 1':
>
> # mpiexec -f machinefile -n 1 ./mpitest
> Hello world!
> I am process number: 0 on host mesos-slave-1
>
> But when I try '-n 2', I hit the following error:
>
> # mpiexec -f machinefile -n 2 ./mpitest
> [proxy:0:1@mesos-slave-2] HYDU_sock_connect (/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172): unable to connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)
> [proxy:0:1@mesos-slave-2] main (/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189): unable to connect to server mesos-slave-1 at port 44788 (check for firewalls!)
>
> It seems the program cannot execute because network traffic is being blocked. I checked the security groups in the SciGaP OpenStack for the mesos-slave-1 and mesos-slave-2 nodes, and they are set to the 'wideopen' policy. Furthermore, I tried adding explicit rules to the policies to allow all TCP and UDP (currently I am not sure which protocol is used underneath), but it continues throwing this error.
>
> Any clues, suggestions, or comments about the error, or the approach as a whole, would be helpful.
>
> Thanks and Regards,
> Mangirish Wagle
>
> On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:
> Hello Devs,
>
> Thanks Gourav and Shameera for all the work w.r.t. setting up the Mesos-Marathon cluster on Jetstream.
>
> I am currently evaluating MPICH (http://www.mpich.org/about/overview/) for launching MPI jobs on top of Mesos. MPICH version 1.2 supports Mesos-based MPI scheduling. I have also been trying to submit jobs to the cluster through Marathon. However, in either case I am currently facing issues which I am working to resolve.
>
> I am compiling my notes into the following google doc. You may please review it and let me know your comments and suggestions.
> https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing
>
> Thanks and Regards,
> Mangirish Wagle
>
> On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh <goshe...@indiana.edu> wrote:
> Hi Mangirish,
>
> I have set up a Mesos-Marathon cluster for you on Jetstream. I will share the cluster details with you in a separate email. Kindly note that there are 3 masters and 2 slaves in this cluster.
>
> I am also working on automating this process for Jetstream (similar to Shameera's Ansible script for EC2), and when that is ready, we can create clusters or add/remove slave machines from a cluster.
>
> Thanks and Regards,
> Gourav Shenoy
>
> From: Mangirish Wagle <vaglomangir...@gmail.com>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Wednesday, September 21, 2016 at 2:36 PM
> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Subject: Running MPI jobs on Mesos based clusters
>
> Hello All,
>
> I would like to make everybody aware of the study I am undertaking this fall: evaluating the various frameworks that would facilitate MPI jobs on Mesos-based clusters for Apache Airavata.
> Some of the options that I am looking at are:
> - MPI support framework bundled with Mesos
> - Apache Aurora
> - Marathon
> - Chronos
>
> Some of the criteria on which I plan to base my evaluation are:
> - Ease of setup
> - Documentation
> - Reliability features like HA
> - Scaling and fault recovery
> - Performance
> - Community support
>
> Gourav and Shameera are working on Ansible-based automation to spin up a Mesos-based cluster, and I am planning to use it to set up a cluster for experimentation.
>
> Any suggestions or information about prior work on this would be highly appreciated.
>
> Thank you.
>
> Best Regards,
> Mangirish Wagle