Hello, Julien.

Thank you for your feedback. The SLURM plugin is fairly new and is under
intensive testing now. It was implemented with OpenMPI in mind and may lack
generality. We definitely want to improve it to support more general
cases. Please check the latest revision. Here are several notes:
1. Short option parsing wasn't considered, so I have put it on the TODO list.
2. Launching jobs with srun also wasn't considered yet. We can add an
argument to the --rm option to force DMTCP to load a plugin and disable
auto-detection:
dmtcp_launch --rm slurm
But I think that "srun dmtcp_launch --rm ./app" should work well. We use
this trick to launch OpenMPI and MPICH applications with the PMI interface.
Actually, the PMI interface is one of the major improvements that you can
find in the current SVN repo. You can consider using it in your application
if you use custom (non-MPI) parallel computations.

3. Consider checking the <dmtcp-root>/plugin/batch-queue/job_examples/
directory for job script examples. They have been widely tested on different
clusters.

If you still have problems, you can create a tester account for me on your
cluster and I will try to find the best solution in place.
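
Regarding note 1, here is a minimal sketch of how the short-option case
could be handled when locating the application name in an srun command
line. It is not the plugin's actual code: the function names are
hypothetical, the option sets are illustrative and incomplete, and Python is
used only for brevity.

```python
# Hypothetical sketch, not DMTCP's parser.
# srun options that consume the NEXT token as their value (incomplete lists).
SHORT_OPTS_WITH_ARG = {"-A", "-c", "-D", "-e", "-i", "-J", "-n", "-N",
                       "-o", "-p", "-t", "-w", "-x"}
LONG_OPTS_WITH_ARG = {"--nodes", "--ntasks", "--partition", "--time"}


def find_app_index(argv):
    """Return the index of the first token that is not an srun option."""
    i = 1  # argv[0] is "srun" itself
    while i < len(argv):
        tok = argv[i]
        if tok == "--":
            # Explicit end of options: the application follows.
            return i + 1
        if tok.startswith("--"):
            # "--nodes=1" is one token; "--nodes 1" is two.
            if "=" not in tok and tok in LONG_OPTS_WITH_ARG:
                i += 2
            else:
                i += 1
        elif tok.startswith("-") and len(tok) > 1:
            # "-N1" carries its value inline; "-N 1" uses the next token.
            if len(tok) == 2 and tok in SHORT_OPTS_WITH_ARG:
                i += 2
            else:
                i += 1
        else:
            return i
    return len(argv)


def wrap_with_launcher(argv, launcher):
    """Insert the launcher tokens right before the application name."""
    idx = find_app_index(argv)
    return argv[:idx] + launcher + argv[idx:]


print(wrap_with_launcher(["srun", "-N", "1", "./a.out"], ["dmtcp_launch"]))
```

With this approach the value "1" stays attached to -N, and the launcher is
placed immediately before ./a.out.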


2014-03-06 21:57 GMT+07:00 ADAM Julien <[email protected]>:

> OK, I'll check the new fixes in the svn repo by next week and get back to
> you.
> Thanks for your fast answer!
> Julien
>
>
> On 03/06/14 15:52, Kapil Arya wrote:
>
>> Hi Julien,
>>
>> Thanks for writing to us. I am CC'ing Artem who implemented the
>> batch-queue plugin. He also made a lot of changes recently to the
>> batch-queue plugin in the svn trunk. I believe he also added a bunch
>> of helper scripts, etc. to use DMTCP with slurm. Can you check out the
>> latest svn trunk and see if some of the helper scripts are useful for
>> you?
>>
>> In any case, Artem can provide you with better answers.
>>
>> Thanks,
>> Kapil
>>
>> On Thu, Mar 6, 2014 at 9:12 AM, ADAM Julien <[email protected]> wrote:
>>
>>> Hi Team,
>>>
>>> My team and I are using your tool to checkpoint our applications, but we
>>> have encountered some issues.
>>> There are no problems when we use DMTCP to checkpoint locally running apps.
>>> However, on clusters, we have some issues and we haven't figured out a
>>> solution yet:
>>>
>>> First of all, here is our environment:
>>> - We have a sample application that loops indefinitely: it increments a
>>> counter, sleeps one second, and displays the counter at each iteration
>>> (nothing complex).
>>> - We use DMTCP 2.0 and 2.1, depending on the cluster (most of our DMTCP
>>> tests are performed with DMTCP 2.1).
>>> - We use SLURM as the job manager to submit our jobs to clusters (salloc,
>>> sbatch, srun and so on).
>>> - We have two prompts, one running dmtcp_coordinator and the other
>>> launching the command.
>>>
>>> The first, minor SLURM issue is how DMTCP parses the SRUN command when
>>> the exec() functions are overridden. Only long-format options are
>>> detected, not short-format ones. Thus, when we use "srun -N 1 ./a.out",
>>> DMTCP believes "1" is the application name and we get the command
>>> "srun -N dmtcp_launch <options> 1 ./a.out". (It's not a big deal, but it's
>>> a good thing to know before using it.)
>>> The second one is how the SLURM plugin is loaded. DMTCP checks whether
>>> some SLURM environment variables are set before loading it. The issue
>>> arises when we use DMTCP to launch SRUN without a SLURM environment: the
>>> SLURM plugin is not loaded and it is unable to checkpoint applications
>>> over the job manager.
>>> Instead of using SRUN directly, we have decided for now to use SBATCH,
>>> as described in your documentation. So, would it be possible to use
>>> SALLOC instead of SBATCH (in order to keep interactive mode)? Moreover,
>>> if we attempt to launch jobs like "salloc -N 1 dmtcp_launch <options>
>>> srun --nodes=1 ./a.out", we get the following error:
>>>
>>> [46000] ERROR at fileconnlist.cpp:363 in processFileConnection;
>>>       path = /proc/self/fd/socket:[132529151]
>>> Message: Unimplemented file type.
>>> tmp (46000): Terminating...
>>>
>>> Finally, with SLURM, we launch our job as follows:
>>>
>>> Sbatch
>>>
>>> dmtcp_launch <options>
>>>
>>> myMainScript.sh
>>>
>>> srun <options>
>>>
>>> ./a.out
>>>
>>> When we do it this way, checkpointing seems to work (even with
>>> --enable-debug there are no particular warnings), but on restart we get
>>> the following output (and the application stops):
>>>
>>> [45000] TRACE at pid.cpp:121 in openSharedFile; REASON='_real_open: '
>>>       strerror((*__errno_location ())) = File exists
>>>       fd = -1
>>> [45000] ERROR at pid.cpp:130 in openSharedFile; REASON='JASSERT(false) failed'
>>>       name = /tmp/dmtcp-login@clusterNode5/dmtcpPidMap.57d889deebbd7d0c-45000-53187ae9.53187b323
>>>       strerror((*__errno_location ())) = Bad file descriptor
>>> Message: Cannot open file
>>> bash (45000): Terminating...
>>>
>>> If nothing comes to mind, we can provide you with complete logs and
>>> backtraces.
>>>
>>> Thanks in advance for your help, and congratulations on what you have
>>> made so far :)
>>> Regards,
>>>
>>> --
>>>
>>> Julien Adam
>>> Information Systems Engineering student
>>>
>>>
>>> ------------------------------------------------------------
>>> ------------------
>>> Subversion Kills Productivity. Get off Subversion & Make the Move to
>>> Perforce.
>>> With Perforce, you get hassle-free workflows. Merge that actually works.
>>> Faster operations. Version large binaries.  Built-in WAN optimization and
>>> the
>>> freedom to use Git, Perforce or both. Make the move to Perforce.
>>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&;
>>> iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Dmtcp-forum mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>>
>>>
>


-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov