Dear Slurm developers,
I have installed Slurm version 15.08.10 and DMTCP version 2.4.4.
When I execute the job without Slurm:
export DMTCP_COORD_PORT=7779
export DMTCP_COORD_HOST=headnodeslurm
dmtcp_launch --rm mpirun.openmpi --host nodeslurm1,nodeslurm2 -np 2 \
    /home/tt/lammps-16Feb16/src/lmp_pi < /home/tt/lammps-16Feb16/bench/in.lj
dmtcp_command -s
Coordinator:
Host: localhost
Port: 7779 (default port)
Status...
NUM_PEERS=5
RUNNING=yes
CKPT_INTERVAL=0 (checkpoint manually)
Checkpoint and resume work every time.
However, when I execute a script under Slurm, the coordinator gets stuck
and only one node stops execution:
[2983] NOTE at dmtcp_coordinator.cpp:667 in updateMinimumState;
REASON='locking all nodes'
[2983] NOTE at dmtcp_coordinator.cpp:673 in updateMinimumState;
REASON='draining all nodes'
[2983] NOTE at dmtcp_coordinator.cpp:679 in updateMinimumState;
REASON='checkpointing all nodes'
dmtcp_command -s
Coordinator:
Host: localhost
Port: 7779 (default port)
Status...
NUM_PEERS=2
RUNNING=yes
CKPT_INTERVAL=0 (checkpoint manually)
The Slurm script uses the same commands as the first test. Are any
different settings needed under Slurm? I have already tried the scripts
in the DMTCP rm folder.
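For reference, the batch script is essentially the commands above wrapped in a job script; here is a minimal sketch (the #SBATCH options are illustrative, the exports and launch line are the same ones used in the direct run):

```shell
#!/bin/bash
#SBATCH --job-name=lammps-ckpt
#SBATCH --nodes=2
#SBATCH --ntasks=2

# Point the DMTCP clients at the externally started coordinator
# (same values as in the run without Slurm).
export DMTCP_COORD_PORT=7779
export DMTCP_COORD_HOST=headnodeslurm

# Launch the MPI job under DMTCP with resource-manager (--rm) support.
dmtcp_launch --rm mpirun.openmpi --host nodeslurm1,nodeslurm2 -np 2 \
    /home/tt/lammps-16Feb16/src/lmp_pi < /home/tt/lammps-16Feb16/bench/in.lj
```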
Best regards.