Thanks, *Gourav*, for sharing the information. I may need your help with a
quick ramp-up on jumping into using Aurora on the Mesos cluster and
exploring its capabilities further.

Hi *Suresh*,

Thanks for bringing that up. I did notice that repository earlier. It is
maintained by the same developer from the Mesos team with whom I have been
in touch over email. He did not specifically mention the slurm-mesos repo
in his earlier emails, but rather recommended looking at the GaSc repo. I
observed that the code is almost the same as the main SLURM repo code (
https://github.com/SchedMD/slurm), and the README instructions are not
specific to Mesos. Nonetheless, I have dropped Niklas an email asking him
whether there has been any Mesos-specific customization in this repo. It
would be interesting to know if/how he has played around with it over
Mesos. I shall keep the dev list updated with the information I get from
him.

Regards,
Mangirish

On Thu, Oct 20, 2016 at 11:19 PM, Suresh Marru <sma...@apache.org> wrote:

> Hi Gourav, Mangirish,
>
> Did you check out SLURM on Mesos - https://github.com/nqn/slurm-mesos?
>
> Note that this is GPL-licensed code and incompatible with ASL v2. That does
> not preclude us from using it, but we need to watch out when integrating
> code under incompatible licenses.
>
> Suresh
>
> On Oct 20, 2016, at 10:26 PM, Shenoy, Gourav Ganesh <goshe...@indiana.edu>
> wrote:
>
> Hi Mangirish, devs:
>
> The Aurora documentation for “Tasks” & “Processes” provides very good
> information which I felt would be helpful in implementing gang scheduling,
> as you mentioned.
>
> http://aurora.apache.org/documentation/latest/reference/configuration/
>
>
> From what I understood, there are these constraints:
> 1. If targeting single-node (multi-core) MPI, then a “JOB” will be broken
> down into multiple “PROCESSES”, each of which will run on one of these
> cores.
> 2. If *any one* of these processes fails, then the JOB should be marked
> as failed.
>
> As mentioned in my earlier email, Aurora provides a Job abstraction – “a job
> consists of multiple tasks, which in turn consist of multiple processes”.
> This abstraction comes in extremely handy if we want to run MPI jobs on a
> single node.
>
> While submitting a job to Aurora, we can control the following parameters
> for a TASK:
>
> a. “max_failures” for a TASK – the number of failed processes needed to
> mark a task as failed. Hence if we set max_failures = 1, then even if a
> single process in a task fails, Aurora will mark that task as failed.
> *Note*: Since a JOB can have multiple tasks, a JOB likewise has a
> “max_task_failures” parameter, which we can also set to 1.
>
> b. “max_concurrency” for a TASK – the number of processes to run in
> parallel. If a node has 16 cores, then we can limit the amount of
> parallelism to <=16. (A rough sketch combining these parameters follows
> below.)
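>
> To make this concrete, here is a minimal, untested .aurora sketch of these
> knobs. Aurora configs are Python; the cluster/role/job names, the
> resources, and the per-process command below are illustrative assumptions,
> not a working config:
>
> # mpi_gang.aurora -- hypothetical sketch of the parameters above
> procs = [Process(name='rank_{0}'.format(i),
>                  cmdline='./run_rank.sh {0}'.format(i))  # hypothetical wrapper
>          for i in range(16)]
>
> mpi_task = Task(
>     name='mpi_task',
>     processes=procs,
>     resources=Resources(cpu=16, ram=16*GB, disk=8*GB),
>     max_failures=1,       # a single failed process fails the task
>     max_concurrency=16)   # run at most 16 processes in parallel
>
> jobs = [Job(cluster='devcluster',
>             role='airavata',
>             environment='test',
>             name='mpi_gang',
>             task=mpi_task,
>             max_task_failures=1)]  # a single failed task fails the job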
>
> I did not get much time to experiment with these parameters for job
> submission, but found this document to be handy and worth sharing. Hope
> this helps!
>
> Thanks and Regards,
> Gourav Shenoy
>
> *From: *Mangirish Wagle <vaglomangir...@gmail.com>
> *Reply-To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Date: *Tuesday, October 18, 2016 at 11:48 AM
> *To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Subject: *Re: Running MPI jobs on Mesos based clusters
>
> Sure Suresh, will update my findings on the mailing list. Thanks!
>
> On Tue, Oct 18, 2016 at 7:59 AM, Suresh Marru <sma...@apache.org> wrote:
>
> Hi Mangirish,
>
> This is interesting. Looking forward to seeing what you find out further
> on gang scheduling support. Since the compute nodes are getting bigger,
> even if you can explore single-node MPI (on Jetstream using 22 cores), that
> will help.
>
> Suresh
>
> P.S. Good to see the momentum on mailing list discussions on such topics.
>
>
> On Oct 18, 2016, at 1:54 AM, Mangirish Wagle <vaglomangir...@gmail.com>
> wrote:
>
>
> Hello Devs,
>
> Here is an update on some new learnings and thoughts based on my
> interactions with Mesos and Aurora devs.
>
> MPI implementations in Mesos repositories (like MPI Hydra) rely on
> obsolete MPI platforms and are no longer supported by the developer
> community. Hence it is not recommended that we use them for our purpose.
>
> One of the known ways of running MPI jobs over Mesos is "gang
> scheduling", which basically distributes the MPI run over multiple jobs
> on Mesos in place of multiple nodes. The challenge here is that the jobs
> need to be scheduled as one unit, and any job that errors out should
> collectively error out the main program, including all the distributed
> jobs.
>
> One of the Mesos developers (Niklas Nielsen) pointed me to his work on
> gang scheduling: https://github.com/nqn. This code may not be fully
> tested, but it is certainly a good starting point for exploring gang
> scheduling.
>
> One of the Aurora developers (Stephen Erb) suggested building gang
> scheduling on top of Aurora. The Aurora scheduler assumes that every job
> is independent; hence, there would be a need to develop some external
> scaffolding to coordinate and schedule these jobs, which might not be
> trivial. One advantage of using Aurora as a backend for gang scheduling is
> that we would inherit the robustness of Aurora, which otherwise would be a
> key challenge if targeting bare Mesos.
>
> As an alternative to all the options above, I think we should probably be
> able to run a one-node MPI job through Aurora. A resource offer with CPUs
> and memory from Mesos is abstracted as a single runtime but is mapped to
> multiple cores underneath, which would still let the job exploit the
> node's parallel capabilities. A rough sketch of this idea follows below.
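>
> As a very rough sketch (assuming MPICH is already installed on the slave,
> and with made-up names, resources, and core count), the one-node approach
> could be as simple as wrapping mpiexec in a single Aurora Process:
>
> # mpi_one_node.aurora -- hypothetical single-node MPI sketch
> mpi_proc = Process(
>     name='mpi_local',
>     cmdline='mpiexec -n 16 ./mpitest')  # all ranks stay on one node
>
> jobs = [Job(cluster='devcluster',
>             role='airavata',
>             environment='test',
>             name='mpi_one_node',
>             task=Task(name='mpi_one_node',
>                       processes=[mpi_proc],
>                       resources=Resources(cpu=16, ram=16*GB, disk=8*GB)))]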
>
> I intend to try out the one-node MPI job submission approach first and
> simultaneously explore the gang scheduling approach.
>
> Please let me know your thoughts/suggestions.
> Best Regards,
> Mangirish
>
>
>
> On Thu, Oct 13, 2016 at 12:39 PM, Mangirish Wagle <
> vaglomangir...@gmail.com> wrote:
>
> Hi Marlon,
>
> Thanks for confirming and sharing the legal link.
> -Mangirish
>
> On Thu, Oct 13, 2016 at 12:13 PM, Pierce, Marlon <marpi...@iu.edu> wrote:
>
> BSD is ok: https://www.apache.org/legal/resolved.
>
> *From: *Mangirish Wagle <vaglomangir...@gmail.com>
> *Reply-To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Date: *Thursday, October 13, 2016 at 12:03 PM
> *To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Subject: *Re: Running MPI jobs on Mesos based clusters
>
>
> Hello Devs,
>
> I needed some advice on the licensing of the MPI libraries. The MPICH
> library that I have been trying out claims to have a "BSD Like" license (
> http://git.mpich.org/mpich.git/blob/HEAD:/COPYRIGHT).
>
> I am aware that OpenMPI, which uses a BSD license, is currently used in our
> application. I chose to start investigating MPICH because it claims to be
> a highly portable, high-quality implementation of the latest MPI standard,
> suitable for cloud-based clusters.
>
> If anyone could please advise on the acceptability of the MPICH library's
> "BSD Like" license for ASF, that would help.
>
> Thank you.
> Best Regards,
> Mangirish Wagle
>
> On Thu, Oct 6, 2016 at 1:48 AM, Mangirish Wagle <vaglomangir...@gmail.com>
> wrote:
>
> Hello Devs,
>
> The network issue mentioned above now stands resolved. The problem was
> that iptables had some conflicting rules which blocked the traffic. It
> was resolved by a simple iptables flush.
>
> Here is the test MPI program running on multiple machines:-
>
> [centos@mesos-slave-1 ~]$ mpiexec -f machinefile -n 2 ./mpitest
> Hello world!  I am process number: 0 on host mesos-slave-1
> Hello world!  I am process number: 1 on host mesos-slave-2
>
> The next step is to try invoking this through a framework like Marathon.
> However, the job submission still does not run through Marathon. It seems
> to get stuck in the 'waiting' state forever (for example,
> http://149.165.170.245:8080/ui/#/apps/%2Fmaw-try). Further, I notice that
> Marathon is listed under 'inactive frameworks' in the Mesos dashboard (
> http://149.165.171.33:5050/#/frameworks).
>
> I am trying to get this working, though any help/clues with this would be
> really helpful.
>
> Thanks and Regards,
>
> Mangirish Wagle
>
>
> On Fri, Sep 30, 2016 at 9:21 PM, Mangirish Wagle <vaglomangir...@gmail.com>
> wrote:
>
> Hello Devs,
>
> I am currently running a sample MPI C program using 'mpiexec' provided by
> MPICH. I followed their installation guide
> <http://www.mpich.org/static/downloads/3.2/mpich-3.2-installguide.pdf> to
> install the libraries on the master and slave nodes of the mesos cluster.
>
> The approach that I am trying out here is to equip the underlying nodes
> with MPI handling tools and then use a Mesos framework like
> Marathon/Aurora to submit jobs that run MPI programs by invoking these
> tools.
>
> You can potentially run an MPI program using mpiexec in the following
> manner:-
>
> # *mpiexec -f machinefile -n 2 ./mpitest*
>
>    - *machinefile* -> file containing an inventory of machines to run
>    the program on, with the number of processes for each machine.
>    - *mpitest* -> MPI program compiled in C using the mpicc compiler. The
>    program prints the process number and the hostname of the machine
>    running the process (a rough equivalent is sketched below).
>    - *-n* -> option indicating the number of processes to spawn.
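>
> For reference, mpitest itself is a C program built with mpicc. Purely as
> an illustration of what it does (a hypothetical sketch, not the actual
> code), a rough Python equivalent using mpi4py might look like:
>
> # mpitest.py -- hypothetical mpi4py equivalent of the C mpitest
> from mpi4py import MPI
>
> comm = MPI.COMM_WORLD
> # Print this rank's number and the host it landed on
> print("Hello world!  I am process number: %d on host %s"
>       % (comm.Get_rank(), MPI.Get_processor_name()))
>
> # Run it the same way: mpiexec -f machinefile -n 2 python mpitest.py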
>
> Example of machinefile contents:-
>
> # Entries in the format <hostname/IP>:<number of processes>
> mesos-slave-1:1
> mesos-slave-2:1
>
> The reason for choosing the slaves is that Mesos runs jobs on the slave
> nodes, managed by the 'agents' pertaining to those slaves.
>
> Output of the program with '-n 1':-
>
> # mpiexec -f machinefile -n 1 ./mpitest
> Hello world!  I am process number: 0 on host mesos-slave-1
>
> But when I try '-n 2', I hit the following error:-
>
> # mpiexec -f machinefile -n 2 ./mpitest
> [proxy:0:1@mesos-slave-2] HYDU_sock_connect
> (/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172): unable to
> connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)
> [proxy:0:1@mesos-slave-2] main
> (/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189): *unable to
> connect to server mesos-slave-1 at port 44788* (check for firewalls!)
>
> It seems the program execution is not allowed because network traffic is
> being blocked. I checked the security groups in the SciGaP OpenStack for
> the mesos-slave-1 and mesos-slave-2 nodes, and they are set to the
> 'wideopen' policy. Furthermore, I tried adding explicit rules to the
> policies to allow all TCP and UDP traffic (currently I am not sure what
> protocol is used underneath), yet it continues throwing this error.
>
> Any clues, suggestions, or comments about the error or the approach as a
> whole would be helpful.
>
> Thanks and Regards,
> Mangirish Wagle
>
>
> On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle <
> vaglomangir...@gmail.com> wrote:
>
> Hello Devs,
>
> Thanks Gourav and Shameera for all the work w.r.t. setting up the
> Mesos-Marathon cluster on Jetstream.
>
> I am currently evaluating MPICH (http://www.mpich.org/about/overview/) to
> be used for launching MPI jobs on top of Mesos. MPICH version 1.2 supports
> Mesos-based MPI scheduling. I have also been trying to submit jobs to the
> cluster through Marathon. However, in both cases I am currently facing
> issues which I am working to resolve.
>
> I am compiling my notes into the following Google doc. Please review and
> let me know your comments and suggestions.
>
> https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing
>
> Thanks and Regards,
> Mangirish Wagle
>
>
>
> On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh <
> goshe...@indiana.edu> wrote:
>
> Hi Mangirish,
>
> I have set up a Mesos-Marathon cluster for you on Jetstream. I will share
> the cluster details with you in a separate email. Kindly note that there
> are 3 masters & 2 slaves in this cluster.
>
> I am also working on automating this process for Jetstream (similar to
> Shameera’s ansible script for EC2) and when that is ready, we can create
> clusters or add/remove slave machines from the cluster.
>
> Thanks and Regards,
> Gourav Shenoy
>
> *From: *Mangirish Wagle <vaglomangir...@gmail.com>
> *Reply-To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Date: *Wednesday, September 21, 2016 at 2:36 PM
> *To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
> *Subject: *Running MPI jobs on Mesos based clusters
>
> Hello All,
>
> For everybody's awareness, I would like to post about the study that I am
> undertaking this fall, i.e. evaluating various frameworks that would
> facilitate MPI jobs on Mesos-based clusters for Apache Airavata.
>
> Some of the options that I am looking at are:-
>
>    1. MPI support framework bundled with Mesos
>    2. Apache Aurora
>    3. Marathon
>    4. Chronos
>
> Some of the evaluation criteria on which I am planning to base my
> investigation are:-
>
>    - Ease of setup
>    - Documentation
>    - Reliability features like HA
>    - Scaling and Fault recovery
>    - Performance
>    - Community Support
>
> Gourav and Shameera are working on Ansible-based automation to spin up a
> Mesos-based cluster, and I am planning to use it to set up a cluster for
> experimentation.
>
> Any suggestions or information about prior work on this would be highly
> appreciated.
>
> Thank you.
>
> Best Regards,
> Mangirish Wagle
>