Dear Slurm developers,

I have installed Slurm version 15.08.10 and DMTCP version 2.4.4. When I execute the job without Slurm:

export DMTCP_COORD_PORT=7779
export DMTCP_COORD_HOST=headnodeslurm
dmtcp_launch --rm mpirun.openmpi --host nodeslurm1,nodeslurm2 -np 2 /home/tt/lammps-16Feb16/src/lmp_pi < /home/tt/lammps-16Feb16/bench/in.lj

dmtcp_command -s
Coordinator:
  Host: localhost
  Port: 7779 (default port)
Status...
  NUM_PEERS=5
  RUNNING=yes
  CKPT_INTERVAL=0 (checkpoint manually)

Checkpoint and resume work every time.

However, when I execute a script under Slurm, the coordinator gets stuck and only one node stops execution:

[2983] NOTE at dmtcp_coordinator.cpp:667 in updateMinimumState; REASON='locking all nodes'
[2983] NOTE at dmtcp_coordinator.cpp:673 in updateMinimumState; REASON='draining all nodes'
[2983] NOTE at dmtcp_coordinator.cpp:679 in updateMinimumState; REASON='checkpointing all nodes'


dmtcp_command -s
Coordinator:
  Host: localhost
  Port: 7779 (default port)
Status...
  NUM_PEERS=2
  RUNNING=yes
  CKPT_INTERVAL=0 (checkpoint manually)

The Slurm script uses the same commands as the first test. Are any different settings needed when running under Slurm? I have already tried the scripts in the rm folder of DMTCP.
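For reference, my batch script follows the same pattern as the interactive run; a minimal sketch is below (the job name, node count, and task count are illustrative; the paths and environment variables are the ones from the test above). Under Slurm I omit the explicit --host list, on the assumption that mpirun picks up the node list from the Slurm allocation:

```shell
#!/bin/bash
#SBATCH --job-name=dmtcp-lammps
#SBATCH --nodes=2          # same two compute nodes as the interactive test
#SBATCH --ntasks=2

# Point all DMTCP processes at the coordinator running on the head node
export DMTCP_COORD_PORT=7779
export DMTCP_COORD_HOST=headnodeslurm

# Launch the MPI job under DMTCP with resource-manager (--rm) support
dmtcp_launch --rm mpirun.openmpi -np 2 \
    /home/tt/lammps-16Feb16/src/lmp_pi < /home/tt/lammps-16Feb16/bench/in.lj
```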


 Best regards.
