Hello, Levi.

2014-11-04 4:18 GMT+06:00 Levi Morrison <[email protected]>:
> Gene,
>
> You answered the "how" I integrate, but I'm after what "integration"
> means. How do I know if it "works" or not if I'm not even sure what it
> does?
>
> For instance, does it integrate with Slurm's requeue and checkpoint
> options so I can automatically checkpoint and restart jobs?
>

As Gene already told you, the current implementation is done solely on the DMTCP side. The advantage of this solution is that a user can use DMTCP on any SLURM cluster without admin privileges. Using a plugin might violate the security of the system, since DMTCP may misbehave when the MPI library changes. The disadvantage of this solution is the lack of SLURM scheduler integration, as you pointed out. So there are definitely pros and cons to each way of supporting SLURM, and thus we should have them both.

What we do now is detect SLURM artifacts (like srun launching) and wrap them to keep DMTCP in control of the spawned applications. We also use SLURM tools to discover the resource allocation and launch the restarted app (similar to how we work with distributed apps based on SSH). You won't need any of this inside SLURM at all!

You can check DMTCP on GitHub. There is a PR that has been hanging there for 11 days already (by the way, Kapil, why didn't you merge it?) that reflects the latest state of the plugin.

We also have contacts with the SLURM team, and I have committed code there. Right now I am working on the new mpi plugin for SLURM and it takes all of my time. So for us, creating the SLURM plugin is mainly a question of free time. You are welcome to take on this task if you want.

The SLURM plugin would be something completely different from what we have now as SLURM integration. SLURM's checkpoint framework has a set of hooks that (at first sight) can be effectively implemented for DMTCP too:

1. slurm_ckpt_stepd_prefork - setting LD_PRELOAD
2. slurm_ckpt_op - to perform one of the following operations: CHECK_ABLE, CHECK_DISABLE, CHECK_ENABLE, CHECK_CREATE, CHECK_VACATE, CHECK_RESTART, CHECK_ERROR, CHECK_REQUEUE (I think/hope they can be effectively mapped onto the commands we have in DMTCP).

and a few others.

In this sense the DMTCP plugin shouldn't differ much from the BLCR one. However, you'll need to launch dmtcp_coordinator on the launching node and broadcast its contact information through an environment variable, passing it to srun. Or you can port the dmtcp_coordinator code into SLURM and make it run as a separate thread inside srun or one of the stepd's, as is done with the PMI2 mpi plugin. There may certainly be additional considerations.

>
> Levi Morrison
>
> On Mon, Nov 3, 2014 at 3:10 PM, Gene Cooperman <[email protected]> wrote:
> > Hi Levi,
> >     In order to integrate with SLURM, you will want to use the plugin:
> >   DMTCP_ROOT/plugin/batch-queue/
> > Be sure to read the README file there. There are example scripts that
> > you can use in conjunction with SLURM. If you have any trouble,
> > please write to the full DMTCP team, in addition to Artem. Artem
> > Polyakov is taking primary responsibility for the SLURM integration.
> >     As for integration of DMTCP with MPI, this seems to work well
> > with most common dialects of MPI. But there are some known bugs in using
> > DMTCP with MVAPICH2. If you encounter bugs in the use of MVAPICH2 or
> > any MPI, please write back to us. We also have some bug fixes and
> > workarounds that may help you.
> >
> > Best,
> > - Gene
> >
> > On Mon, Nov 03, 2014 at 04:36:07PM -0500, Kapil Arya wrote:
> >> Artem/Jiajun,
> >>
> >> Can one of you help Levi with Slurm?
> >>
> >> Kapil
> >>
> >> On Mon, Nov 3, 2014 at 4:33 PM, Levi Morrison <[email protected]> wrote:
> >>
> >> > I have been using DMTCP and BLCR for a few applications and want to
> >> > try out scheduler integration with Slurm. However, I haven't found any
> >> > documentation that says what "integration" means; any pointers on
> >> > where I could find the documentation for it?

--
Best regards, Artem Y. Polyakov
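For anyone who wants to experiment with the plugin route, here is a minimal Python sketch of the step described in the reply: dmtcp_coordinator runs on the launching node, and its contact information is handed to srun through the environment. It only builds the environment and the command line; it does not start anything. The variable names DMTCP_COORD_HOST and DMTCP_COORD_PORT are an assumption (check which names your DMTCP version reads), the helper names coordinator_env and srun_argv are made up for illustration, and the host, port, and application in the demo are placeholder values.

```python
import os
import socket


def coordinator_env(port, host=None, base_env=None):
    """Return a copy of the environment that carries the coordinator contact.

    ASSUMPTION: DMTCP_COORD_HOST / DMTCP_COORD_PORT are the variable names
    the DMTCP side reads; older or newer releases may use different names.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Default to the launching node, where the coordinator is expected to run.
    env["DMTCP_COORD_HOST"] = host or socket.gethostname()
    env["DMTCP_COORD_PORT"] = str(port)
    return env


def srun_argv(app_argv, ntasks):
    """Build a plain srun command line for the application.

    The command itself carries no DMTCP options; the coordinator contact
    travels in the environment that is passed along with the launch.
    """
    return ["srun", f"--ntasks={ntasks}", *app_argv]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    env = coordinator_env(port=7779, host="node001", base_env={})
    argv = srun_argv(["./a.out"], ntasks=4)
    print(env)
    print(" ".join(argv))
    # To actually launch: subprocess.run(argv, env=env, check=True)
```

Keeping the contact in the environment rather than on the command line matches the "broadcast through an environment variable" idea above; the alternative mentioned in the reply, running the coordinator as a thread inside srun or a stepd, would make this step unnecessary.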
------------------------------------------------------------------------------
_______________________________________________ Dmtcp-forum mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
