Hello!
I'm running DMTCP 2.4.4 on CentOS 6.8, kernel 2.6.32-431.20.3.el6.x86_64.
This is a cluster of ~100 compute nodes running Slurm.
Jobs are started with dmtcp_launch --rm, so that they can be checkpointed
as needed and moved between machines. That lets us pack jobs together and
make room for high-memory or specific-MPI-geometry jobs. This has worked
well, but...
Out of ~45,000 jobs that have run so far, ~100 have failed with the errors
below. I cannot find a common compute node, time, job type, user, memory
usage, or any other factor; the error appears to occur at random. It stops
the job, which is a bit of a problem. No checkpointing was attempted on
these jobs.
Any ideas where I should look for the problem? Is there anything I can do
to get more debugging info? And is this error generated by the coordinator,
or by the dmtcp library wrapped around the running program?
Thanks in advance...
[47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
_magicBits =
Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die
uncleanly?
main-PYTHIA8-lhef (47000): Terminating...
[40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork;
REASON='JASSERT(_coordinatorSocket.isValid()) failed'
bash (40000): Terminating...
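
In case it helps with triage, here is a minimal sketch of how the error headers above could be tallied across job output files, to see which assertion sites come up and how often. It is plain Python and not part of DMTCP; the names parse_errors, tally_sites and SAMPLE are made up for illustration, and the regular expression assumes every header looks like the two lines in this log.

```python
import re
from collections import Counter

# Assumed header format, taken from the two lines in the log above:
#   [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
ERROR_RE = re.compile(
    r"^\[(?P<pid>\d+)\] ERROR at (?P<file>[\w.]+):(?P<line>\d+) in (?P<func>\w+);"
)


def parse_errors(text):
    """Return one dict per ERROR header line found in text (hypothetical helper)."""
    errors = []
    for raw in text.splitlines():
        m = ERROR_RE.match(raw.strip())
        if m:
            errors.append({
                "pid": int(m.group("pid")),
                "file": m.group("file"),
                "line": int(m.group("line")),
                "func": m.group("func"),
            })
    return errors


def tally_sites(texts):
    """Count how often each (file, line, func) assertion site appears."""
    counts = Counter()
    for text in texts:
        for err in parse_errors(text):
            counts[(err["file"], err["line"], err["func"])] += 1
    return counts


# Sample input copied from the log in this message.
SAMPLE = """\
[47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
[40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork;
REASON='JASSERT(_coordinatorSocket.isValid()) failed'
"""

if __name__ == "__main__":
    for site, n in tally_sites([SAMPLE]).most_common():
        print(site, n)
```

In practice each element passed to tally_sites would be the contents of one failed job's output file; joining the counts with the node and start time from the scheduler's accounting would be a separate step.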
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum