OK, I'll check the new fixes in the svn repo next week and get back to
you.
Thanks for your fast answer!
Julien

On 03/06/14 15:52, Kapil Arya wrote:
> Hi Julien,
>
> Thanks for writing to us. I am CC'ing Artem who implemented the
> batch-queue plugin. He also made a lot of changes recently to the
> batch-queue plugin in the svn trunk. I believe he also added a bunch
> of helper scripts, etc. to use DMTCP with slurm. Can you check out the
> latest svn trunk and see if some of the helper scripts are useful for
> you?
>
> In any case, Artem can provide you with better answers.
>
> Thanks,
> Kapil
>
> On Thu, Mar 6, 2014 at 9:12 AM, ADAM Julien <[email protected]> wrote:
>> Hi Team,
>>
>> My team and I are using your tool to checkpoint our applications, but we
>> have encountered some issues.
>> There are no problems when we use DMTCP to checkpoint locally running apps.
>> However, on clusters, we have some issues that we haven't been able to
>> solve yet:
>>
>> First of all, here is our environment:
>> - We have a sample application that loops indefinitely: it increments a
>> counter, sleeps one second, and displays the counter at each iteration
>> (not really complex).
>> - We use DMTCP 2.0 and 2.1, depending on the cluster (most of the DMTCP
>> tests are performed using DMTCP 2.1).
>> - We use SLURM as the job manager to submit our jobs to clusters (salloc,
>> sbatch, srun and so on).
>> - We have two shell prompts, one running dmtcp_coordinator and the other
>> one launching the command.
>>
>> The first, minor SLURM issue is how DMTCP parses the srun command when the
>> exec() functions are overridden. Only long-format options are detected,
>> not short-format ones. Thus, when we use "srun -N 1 ./a.out", DMTCP believes
>> "1" is the application name and we get the command "srun -N dmtcp_launch
>> <options> 1 ./a.out". (It's not a big deal, but it's good to know before
>> using it.)
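
To make the short-option problem concrete, here is a minimal Python sketch of how a wrapper could find the application's position in an srun command line when short options carry a separate value. It is not DMTCP's actual code: the function names are hypothetical, and the two option tables are partial, assumed lists of srun options that take a value.

```python
# Hypothetical sketch, not DMTCP's implementation.
# Assumption: partial lists of srun options that take a value.
SHORT_OPTS_WITH_VALUE = {"-N", "-n", "-c", "-p", "-t", "-J", "-o", "-e", "-w", "-x"}
LONG_OPTS_WITH_VALUE = {"--nodes", "--ntasks", "--partition"}


def find_app_index(argv):
    """Return the index of the first token that is not an srun option."""
    i = 1  # argv[0] is "srun" itself
    while i < len(argv):
        tok = argv[i]
        if tok == "--":
            return i + 1  # everything after "--" is the application
        if tok.startswith("--"):
            # "--nodes=1" is one token; "--nodes 1" consumes the next token too
            if "=" not in tok and tok in LONG_OPTS_WITH_VALUE:
                i += 2
            else:
                i += 1
        elif tok.startswith("-") and len(tok) > 1:
            # "-N 1" consumes the next token; "-N1" carries its value attached
            if tok[:2] in SHORT_OPTS_WITH_VALUE and len(tok) == 2:
                i += 2
            else:
                i += 1
        else:
            return i
    return len(argv)


def insert_launcher(argv, launcher):
    """Insert the launcher tokens just before the application."""
    idx = find_app_index(argv)
    return argv[:idx] + launcher + argv[idx:]


print(insert_launcher(["srun", "-N", "1", "./a.out"], ["dmtcp_launch"]))
```

With a table like this, "-N 1" is skipped as a unit and the launcher lands in front of ./a.out rather than in front of "1". A real implementation would need the complete option list for the installed srun version.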
>> The second one is how the SLURM plugin is loaded. DMTCP checks whether some
>> SLURM environment variables are set before loading it. The issue arises when
>> we use DMTCP to launch srun without a SLURM environment: the SLURM plugin is
>> not loaded and DMTCP is unable to checkpoint applications through the job
>> manager. Instead of using srun directly, we currently use sbatch, as
>> described in your documentation. Would it be possible to use salloc instead
>> of sbatch (in order to keep interactive mode)? Moreover, if we attempt to
>> launch jobs like "salloc -N 1 dmtcp_launch <options> srun --nodes=1
>> ./a.out", we get the following error:
>>
>> [46000] ERROR at fileconnlist.cpp:363 in processFileConnection;
>>       path = /proc/self/fd/socket:[132529151]
>> Message: Unimplemented file type.
>> tmp (46000): Terminating...
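
On the plugin-loading point above, a small Python sketch of the kind of environment check being described may help when reproducing the problem. The thread does not say which variables DMTCP tests, so the names below are an assumption (SLURM_JOB_ID and the older SLURM_JOBID are variables SLURM sets inside an allocation), and the function name is hypothetical.

```python
import os

# Assumption: these are variables SLURM sets inside a job allocation.
# Which variables DMTCP actually checks is not stated in this thread.
ASSUMED_SLURM_VARS = ("SLURM_JOB_ID", "SLURM_JOBID")


def inside_slurm_allocation(env=None):
    """Return True if any of the assumed SLURM variables is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in ASSUMED_SLURM_VARS)


if __name__ == "__main__":
    print("SLURM allocation detected:", inside_slurm_allocation())
```

Run from a login shell with no allocation, a check like this returns False, which would correspond to the case where srun is launched under dmtcp_launch and the plugin is not loaded.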
>>
>> Finally, with SLURM, we launch our job like this:
>>
>> sbatch
>>   dmtcp_launch <options>
>>     myMainScript.sh
>>       srun <options>
>>         ./a.out
>>
>> When we do it that way, checkpointing seems to work (even with
>> --enable-debug there are no particular warnings), but on restart we get the
>> following output (and the application stops):
>>
>> [45000] TRACE at pid.cpp:121 in openSharedFile; REASON='_real_open: '
>>       strerror((*__errno_location ())) = File exists
>>       fd = -1
>> [45000] ERROR at pid.cpp:130 in openSharedFile; REASON='JASSERT(false) failed'
>>       name = /tmp/dmtcp-login@clusterNode5/dmtcpPidMap.57d889deebbd7d0c-45000-53187ae9.53187b323
>>       strerror((*__errno_location ())) = Bad file descriptor
>> Message: Cannot open file
>> bash (45000): Terminating...
>>
>> If nothing comes to mind, we can provide you with complete logs and
>> backtraces.
>>
>> Thanks in advance for your help, and congratulations on what you have built
>> so far :)
>> Regards,
>>
>> --
>>
>> Julien Adam
>> Information Systems Engineering student
>>
>>
>> ------------------------------------------------------------------------------
>> Subversion Kills Productivity. Get off Subversion & Make the Move to
>> Perforce.
>> With Perforce, you get hassle-free workflows. Merge that actually works.
>> Faster operations. Version large binaries.  Built-in WAN optimization and
>> the
>> freedom to use Git, Perforce or both. Make the move to Perforce.
>> http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Dmtcp-forum mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>

