Hello Devs,

I am currently running a sample MPI C program using 'mpiexec' provided by MPICH. I followed their installation guide <http://www.mpich.org/static/downloads/3.2/mpich-3.2-installguide.pdf> to install the libraries on the master and slave nodes of the Mesos cluster.

The approach I am trying out is to equip the underlying nodes with MPI tooling and then use a Mesos framework such as Marathon or Aurora to submit jobs that run MPI programs by invoking these tools.

An MPI program can be run with mpiexec in the following manner:

    # mpiexec -f machinefile -n 2 ./mpitest

- *machinefile* -> File containing an inventory of the machines to run the program on and the number of processes on each machine.
- *mpitest* -> MPI program written in C and compiled with the mpicc compiler. The program prints the process number and the hostname of the machine running the process.
- *-n* -> Number of processes to spawn.

Example of machinefile contents:

    # Entries in the format <hostname/IP>:<number of processes>
    mesos-slave-1:1
    mesos-slave-2:1

The reason for choosing the slaves is that Mesos runs jobs on the slaves, managed by the 'agents' on those slaves.

Output of the program with '-n 1':

    # mpiexec -f machinefile -n 1 ./mpitest
    Hello world! I am process number: 0 on host mesos-slave-1

But when I try '-n 2', I hit the following error:

    # mpiexec -f machinefile -n 2 ./mpitest
    [proxy:0:1@mesos-slave-2] HYDU_sock_connect (/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172): unable to connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)
    [proxy:0:1@mesos-slave-2] main (/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189): unable to connect to server mesos-slave-1 at port 44788 (check for firewalls!)

The program appears to fail because network traffic from mesos-slave-2 to mesos-slave-1 is being blocked. I checked the security groups in the SciGaP OpenStack for the mesos-slave-1 and mesos-slave-2 nodes, and they are set to the 'wideopen' policy. I also tried adding explicit rules to the policies to allow all TCP and UDP traffic (I am currently not sure which protocol is used underneath), but the error persists.
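
One way to narrow this down independently of MPICH is to read the machinefile and probe TCP reachability between the nodes by hand. The following is only a minimal sketch in Python, not part of MPICH or Hydra: the function names `parse_machinefile` and `can_connect` are hypothetical, hosts are assumed to be hostnames or IPv4 addresses, and treating a host without a count as one process is an assumption of the sketch.

```python
import socket


def parse_machinefile(text):
    """Parse lines of the form <hostname/IP>:<number of processes>.

    Blank lines and lines starting with '#' are skipped. A host listed
    without a count is treated as one process (assumption of this sketch).
    IPv6 literals are not handled, since they contain ':' themselves.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        host, sep, count = line.partition(":")
        entries.append((host.strip(), int(count) if sep else 1))
    return entries


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, name resolution
        # failures and "No route to host".
        return False


if __name__ == "__main__":
    # Sample contents copied from the machinefile shown above.
    sample = """\
# Entries in the format <hostname/IP>:<number of processes>
mesos-slave-1:1
mesos-slave-2:1
"""
    for host, nprocs in parse_machinefile(sample):
        print(host, nprocs)
```

Calling `can_connect("mesos-slave-1", port)` from mesos-slave-2 only gives a meaningful answer for a port on which something is known to be listening on mesos-slave-1. The port in the log above (44788) was presumably picked for that particular run, so it cannot be probed after the fact. If such a probe fails while the security groups are open, a firewall on the node itself would be one more place to look.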

Any clues, suggestions, or comments about the error or the approach as a whole would be helpful.

Thanks and Regards,
Mangirish Wagle

On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle <vaglomangir...@gmail.com> wrote:

> Hello Devs,
>
> Thanks Gourav and Shameera for all the work w.r.t. setting up the
> Mesos-Marathon cluster on Jetstream.
>
> I am currently evaluating MPICH (http://www.mpich.org/about/overview/) to
> be used for launching MPI jobs on top of Mesos. MPICH version 1.2 supports
> Mesos based MPI scheduling. I have also been trying to submit jobs to the
> cluster through Marathon. However, in either case I am currently facing
> issues which I am working to get resolved.
>
> I am compiling my notes into the following Google doc. Please review it
> and let me know your comments and suggestions.
>
> https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing
>
> Thanks and Regards,
> Mangirish Wagle
>
> On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh <goshe...@indiana.edu> wrote:
>
>> Hi Mangirish,
>>
>> I have set up a Mesos-Marathon cluster for you on Jetstream. I will share
>> the cluster details with you in a separate email. Kindly note that
>> there are 3 masters & 2 slaves in this cluster.
>>
>> I am also working on automating this process for Jetstream (similar to
>> Shameera’s ansible script for EC2) and when that is ready, we can create
>> clusters or add/remove slave machines from the cluster.
>>
>> Thanks and Regards,
>> Gourav Shenoy
>>
>> *From:* Mangirish Wagle <vaglomangir...@gmail.com>
>> *Reply-To:* "dev@airavata.apache.org" <dev@airavata.apache.org>
>> *Date:* Wednesday, September 21, 2016 at 2:36 PM
>> *To:* "dev@airavata.apache.org" <dev@airavata.apache.org>
>> *Subject:* Running MPI jobs on Mesos based clusters
>>
>> Hello All,
>>
>> I would like to post for everybody's awareness about the study that I am
>> undertaking this fall, i.e. to evaluate various frameworks that
>> would facilitate MPI jobs on Mesos based clusters for Apache Airavata.
>>
>> Some of the options that I am looking at are:
>>
>> 1. MPI support framework bundled with Mesos
>> 2. Apache Aurora
>> 3. Marathon
>> 4. Chronos
>>
>> Some of the evaluation criteria on which I am planning to base my
>> investigation are:
>>
>> - Ease of setup
>> - Documentation
>> - Reliability features like HA
>> - Scaling and Fault recovery
>> - Performance
>> - Community Support
>>
>> Gourav and Shameera are working on ansible based automation to spin up a
>> Mesos based cluster and I am planning to use it to set up a cluster for
>> experimentation.
>>
>> Any suggestions or information about prior work on this would be highly
>> appreciated.
>>
>> Thank you.
>>
>> Best Regards,
>> Mangirish Wagle