Dear Slurm developers,

I have installed Slurm version 15.08.10 and DMTCP version 2.4.4. When I execute the job without Slurm:

export DMTCP_COORD_PORT=7779
export DMTCP_COORD_HOST=headnodeslurm
dmtcp_launch --rm mpirun.openmpi --host nodeslurm1,nodeslurm2 -np 2 /home/tt/lammps-16Feb16/src/lmp_pi < /home/tt/lammps-16Feb16/bench/in.lj

dmtcp_command -s
Coordinator:
  Host: localhost
  Port: 7779 (default port)
Status...
  NUM_PEERS=5
  RUNNING=yes
  CKPT_INTERVAL=0 (checkpoint manually)

Checkpoint and resume work every time.

However, when I execute a script under Slurm, the coordinator gets stuck and only one node stops execution:

[2983] NOTE at dmtcp_coordinator.cpp:667 in updateMinimumState; REASON='locking all nodes'
[2983] NOTE at dmtcp_coordinator.cpp:673 in updateMinimumState; REASON='draining all nodes'
[2983] NOTE at dmtcp_coordinator.cpp:679 in updateMinimumState; REASON='checkpointing all nodes'


dmtcp_command -s
Coordinator:
  Host: localhost
  Port: 7779 (default port)
Status...
  NUM_PEERS=2
  RUNNING=yes
  CKPT_INTERVAL=0 (checkpoint manually)

The Slurm script uses the same commands as the first test. Are any different settings needed when running under Slurm? I have already tried the scripts in the rm folder of DMTCP.
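For reference, my batch script follows the same pattern as the interactive run; a minimal sketch is below (the job name, node count, and task count are illustrative; the paths and environment variables are the ones from the test above). Under Slurm I omit the explicit --host list, on the assumption that mpirun picks up the node list from the Slurm allocation:

```shell
#!/bin/bash
#SBATCH --job-name=dmtcp-lammps
#SBATCH --nodes=2          # same two compute nodes as the interactive test
#SBATCH --ntasks=2

# Point all DMTCP processes at the coordinator running on the head node
export DMTCP_COORD_PORT=7779
export DMTCP_COORD_HOST=headnodeslurm

# Launch the MPI job under DMTCP with resource-manager (--rm) support
dmtcp_launch --rm mpirun.openmpi -np 2 \
    /home/tt/lammps-16Feb16/src/lmp_pi < /home/tt/lammps-16Feb16/bench/in.lj
```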


 Best regards.
