Hello, Levi.
2014-11-04 4:18 GMT+06:00 Levi Morrison <[email protected]>:

> Gene,
>
> You answered the "how" I integrate, but I'm after what "integration"
> means. How do I know if it "works" or not if I'm not even sure what it
> does?
>
> For instance, does it integrate with Slurm's requeue and checkpoint
> options so I can automatically checkpoint and restart jobs?
>

As Gene already told you, the current implementation is done solely on the
DMTCP side. The advantage of this approach is that a user can run DMTCP on
any SLURM cluster without admin privileges. A SLURM plugin, by contrast,
might compromise the security of the system, since DMTCP may misbehave when
the MPI library changes. The disadvantage of the current approach is the
lack of SLURM scheduler integration, as you pointed out. So there are
definitely pros and cons to each way of supporting SLURM, and thus we
should have both.
What we do now is detect SLURM artifacts (like srun launches) and wrap
them to keep DMTCP in control of the spawned applications. We also use SLURM
tools to discover the resource allocation and launch the restarted app
(similar to how we handle distributed apps based on SSH). You wouldn't need
any of this inside SLURM itself!
You can check DMTCP on GitHub. There is a PR that has been hanging there for
11 days already (by the way, Kapil, why didn't you merge it?) that reflects
the latest state of the plugin.

We also have contacts with the SLURM team, and I have committed code there.
Right now I am working on the new mpi plugin for SLURM, and it takes all of
my time. So for us, creating the SLURM plugin is mainly a question of
free time. You are welcome to take on this task if you want. The SLURM
plugin would be something completely different from what we have now as
SLURM integration.
SLURM's checkpoint framework has a set of hooks that (at first sight)
can be effectively implemented for DMTCP too:
1. slurm_ckpt_stepd_prefork - setting LD_PRELOAD
2. slurm_ckpt_op - performing one of the following operations: CHECK_ABLE,
CHECK_DISABLE, CHECK_ENABLE, CHECK_CREATE, CHECK_VACATE, CHECK_RESTART,
CHECK_ERROR, CHECK_REQUEUE (I think/hope they can be effectively mapped onto
the commands we have in DMTCP),

and a few others.
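
To make the idea of mapping the slurm_ckpt_op operations onto DMTCP
commands a bit more concrete, here is a minimal Python sketch. The table is
only an assumption for illustration and is not taken from the SLURM or DMTCP
sources: it supposes that dmtcp_command (--status, --bcheckpoint, --quit)
and the dmtcp_restart_script.sh that DMTCP writes at checkpoint time are the
relevant pieces, and it leaves operations without an obvious DMTCP
counterpart unmapped. OP_TO_DMTCP and plan_for_op are hypothetical names.

```python
from typing import Dict, List, Optional

# Hypothetical mapping, for illustration only (not from SLURM or DMTCP sources).
# Each value is a list of command lines to run in order;
# None means "no obvious DMTCP counterpart in this sketch".
OP_TO_DMTCP: Dict[str, Optional[List[List[str]]]] = {
    "CHECK_ABLE": [["dmtcp_command", "--status"]],
    "CHECK_CREATE": [["dmtcp_command", "--bcheckpoint"]],
    "CHECK_VACATE": [["dmtcp_command", "--bcheckpoint"],
                     ["dmtcp_command", "--quit"]],
    "CHECK_RESTART": [["./dmtcp_restart_script.sh"]],
    "CHECK_DISABLE": None,
    "CHECK_ENABLE": None,
    "CHECK_ERROR": None,
    "CHECK_REQUEUE": None,
}


def plan_for_op(op: str) -> List[List[str]]:
    """Return the command lines this sketch would run for one operation."""
    if op not in OP_TO_DMTCP:
        raise ValueError(f"unknown checkpoint operation: {op}")
    commands = OP_TO_DMTCP[op]
    if commands is None:
        raise NotImplementedError(f"{op} has no mapping in this sketch")
    return commands
```

A real plugin would implement this in C inside SLURM; the sketch only shows
which operations look straightforward and which would need more thought.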

In this sense a DMTCP plugin shouldn't differ much from the BLCR one.
However, you'll need to launch dmtcp_coordinator on the launching node and
broadcast its contact information through an environment variable passed to
srun. Or you can port the dmtcp_coordinator code into SLURM and make it run
as a separate thread inside srun or one of the stepds, as is done in the
PMI2 mpi plugin. There may well be additional complications.
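
The first option (a coordinator on the launching node, with its contact
information passed to srun through the environment) could be sketched
roughly as follows. This is only an illustration under stated assumptions:
the variable names DMTCP_COORD_HOST and DMTCP_COORD_PORT are assumed and
should be checked against the DMTCP version in use (they are parameters
here for that reason), and coordinator_env and srun_command are
hypothetical helper names. The sketch only builds the environment and the
command line; it does not start anything.

```python
import os
from typing import Dict, List, Mapping, Optional


def coordinator_env(host: str, port: int,
                    base_env: Optional[Mapping[str, str]] = None,
                    host_var: str = "DMTCP_COORD_HOST",   # assumed name
                    port_var: str = "DMTCP_COORD_PORT"    # assumed name
                    ) -> Dict[str, str]:
    """Copy the environment and add the coordinator's contact information."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid TCP port: {port}")
    env = dict(os.environ if base_env is None else base_env)
    env[host_var] = host
    env[port_var] = str(port)
    return env


def srun_command(app_argv: List[str]) -> List[str]:
    """Build an srun command line that starts the app under dmtcp_launch."""
    if not app_argv:
        raise ValueError("application command line must not be empty")
    # --export=ALL asks srun to propagate the caller's environment,
    # including the coordinator variables set above.
    return ["srun", "--export=ALL", "dmtcp_launch", *app_argv]
```

Once dmtcp_coordinator is running on the launching node, the two results
could be handed to subprocess.run(cmd, env=env).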


>
> Levi Morrison
>
> On Mon, Nov 3, 2014 at 3:10 PM, Gene Cooperman <[email protected]> wrote:
> > Hi Levi,
> >     In order to integrate with SLURM, you will want to use the plugin:
> >   DMTCP_ROOT/plugin/batch-queue/
> > Be sure to read the README file there.  There are example scripts that
> > you can use in conjunction with SLURM.  If you have any trouble,
> > please write to the full DMTCP team, in addition to Artem.  Artem Polyakov
> > is taking primary responsibility for the SLURM integration.
> >     As for integration of DMTCP with MPI, this seems to work well
> > with most common dialects of MPI.  But there are some known bugs in using
> > DMTCP with MVAPICH2.  If you encounter bugs in the use of MVAPICH2 or
> > any MPI, please write back to us.  We also have some bug fixes and
> > workarounds that may help you.
> >
> > Best,
> > - Gene
> >
> > On Mon, Nov 03, 2014 at 04:36:07PM -0500, Kapil Arya wrote:
> >> Artem/Jiajun,
> >>
> >> Can one of you help Levi with Slurm?
> >>
> >> Kapil
> >>
> >> On Mon, Nov 3, 2014 at 4:33 PM, Levi Morrison <[email protected]> wrote:
> >>
> >> > I have been using DMTCP and BLCR for a few applications and want to
> >> > try out scheduler integration with Slurm. However, I haven't found any
> >> > documentation that says what "integration" means; any pointers on
> >> > where I could find the documentation for it?
> >> >
> >> >
> >> >
> ------------------------------------------------------------------------------
> >> > _______________________________________________
> >> > Dmtcp-forum mailing list
> >> > [email protected]
> >> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >> >
> >
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
