OK, I'll check the new fixes in the svn repo by next week and get back to you.
Thanks for your fast answer!
Julien
On 03/06/14 15:52, Kapil Arya wrote:
> Hi Julien,
>
> Thanks for writing to us. I am CC'ing Artem, who implemented the
> batch-queue plugin. He also made a lot of changes recently to the
> batch-queue plugin in the svn trunk. I believe he also added a bunch
> of helper scripts, etc. to use DMTCP with slurm. Can you check out the
> latest svn trunk and see if some of the helper scripts are useful for
> you?
>
> In any case, Artem can provide you with better answers.
>
> Thanks,
> Kapil
>
> On Thu, Mar 6, 2014 at 9:12 AM, ADAM Julien <[email protected]> wrote:
>> Hi Team,
>>
>> My team and I are using your tool to checkpoint our applications, but we
>> have encountered some issues.
>> There are no problems when we use DMTCP to checkpoint locally running apps.
>> However, on clusters, we have some issues and we haven't figured out a
>> solution yet:
>>
>> First of all, here is our environment:
>> - We have a sample application, which simply increments a counter
>> indefinitely, sleeps one second and displays the counter at each loop
>> step (not really complex).
>> - We use DMTCP 2.0 and 2.1, depending on the cluster (most of the DMTCP
>> tests are performed using DMTCP 2.1).
>> - We use SLURM as the job manager to submit our jobs on clusters (salloc,
>> sbatch, srun and so on).
>> - We have two prompts, one running dmtcp_coordinator and the other one
>> launching the command.
>>
>> The first small SLURM issue is how DMTCP parses the SRUN command when the
>> exec() functions are overridden. Only long-format options are detected,
>> not short-format ones. Thus, when we use "srun -N 1 ./a.out", DMTCP believes
>> "1" is the application name and we get the command "srun -N dmtcp_launch
>> <options> 1 ./a.out". (It's not a big deal, but it's a good thing to know
>> before using it.)
>> The second one is how the SLURM plugin is loaded. DMTCP checks whether some
>> SLURM environment variables are set before loading it. The issue arises when
>> we use DMTCP to launch SRUN without having a SLURM environment.
>> Thus, the SLURM plugin is not loaded and DMTCP is unable to checkpoint
>> applications through the job manager.
>> Instead of using SRUN directly, we have decided for now to use SBATCH,
>> as described in your documentation. So, would it be possible to
>> use SALLOC instead of SBATCH (in order to keep interactive mode)? Moreover,
>> if we attempt to launch jobs like: "salloc -N 1 dmtcp_launch <options>
>> srun --nodes=1 ./a.out", we get the following error:
>>
>> [46000] ERROR at fileconnlist.cpp:363 in processFileConnection;
>> path = /proc/self/fd/socket:[132529151]
>> Message: Unimplemented file type.
>> tmp (46000): Terminating...
>>
>> Finally, with SLURM, we launch our job like this:
>>
>> Sbatch
>>
>> dmtcp_launch <options>
>>
>> myMainScript.sh
>>
>> srun <options>
>>
>> ./a.out
>>
>> When we do it that way, checkpointing seems to work (even with
>> --enable-debug, there are no particular warnings), but on restart we get
>> the following output (and the application stops):
>>
>> [45000] TRACE at pid.cpp:121 in openSharedFile; REASON='_real_open: '
>> strerror((*__errno_location ())) = File exists
>> fd = -1
>> [45000] ERROR at pid.cpp:130 in openSharedFile; REASON='JASSERT(false)
>> failed'
>> name =
>> /tmp/dmtcp-login@clusterNode5/dmtcpPidMap.57d889deebbd7d0c-45000-53187ae9.53187b323
>> strerror((*__errno_location ())) = Bad file descriptor
>> Message: Cannot open file
>> bash (45000): Terminating...
>>
>> If you have no ideas, we can provide you with complete logs and backtraces.
>>
>> Thanks in advance for your help and congratulations on what you have built
>> so far :)
>> Regards,
>>
>> --
>>
>> Julien Adam
>> Information Systems Engineering student
>>
>> _______________________________________________
>> Dmtcp-forum mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
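
For anyone following the thread, the short-option problem reported in the quoted message ("srun -N 1 ./a.out" becoming "srun -N dmtcp_launch <options> 1 ./a.out") can be illustrated with a small Python sketch. This is not DMTCP's actual code: the function names and the table of short options are hypothetical, the table is deliberately partial, and long options are assumed to use the "--name=value" form.

```python
# Hypothetical, partial table of srun short options that take a separate value.
SHORT_OPTS_WITH_VALUE = {"-N", "-n", "-c", "-p", "-t", "-w", "-J", "-o", "-e"}


def naive_command_index(argv):
    """Treat the first token not starting with '-' as the application.

    This mimics the reported behaviour: the value of a short option
    such as '-N 1' is mistaken for the application name.
    """
    for i, tok in enumerate(argv[1:], start=1):
        if not tok.startswith("-"):
            return i
    return None


def command_index(argv):
    """Skip the value of known short options before looking for the application."""
    i = 1
    while i < len(argv):
        tok = argv[i]
        if tok == "--":                      # explicit end of options
            return i + 1 if i + 1 < len(argv) else None
        if tok in SHORT_OPTS_WITH_VALUE:     # '-N 1': skip option and its value
            i += 2
        elif tok.startswith("-"):            # '--nodes=1', '-N1', flags
            i += 1
        else:
            return i
    return None


def insert_launcher(argv, find_index=command_index, launcher=("dmtcp_launch",)):
    """Return a copy of argv with the launcher placed before the application."""
    idx = find_index(argv)
    if idx is None:
        return list(argv)
    return list(argv[:idx]) + list(launcher) + list(argv[idx:])


if __name__ == "__main__":
    cmd = ["srun", "-N", "1", "./a.out"]
    print(" ".join(insert_launcher(cmd, find_index=naive_command_index)))
    print(" ".join(insert_launcher(cmd)))
```

The first printed line reproduces the misplaced launcher from the report, and the second shows the placement one would expect. Until the parser handles short options, writing "--nodes=1" instead of "-N 1" sidesteps the problem, since the message states that long-format options are detected.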
