Hello!
        I'm running DMTCP 2.4.4 on CentOS 6.8, kernel 2.6.32-431.20.3.el6.x86_64.
        This is a cluster of about 100 compute nodes running Slurm.
        Jobs are started with dmtcp_launch --rm. The idea is that jobs can be 
checkpointed as needed and moved between machines, so that they can be packed 
together to make room for high-memory jobs or jobs that need a specific MPI 
geometry. This has worked well, but...
        Out of ~45,000 jobs that have run so far, ~100 have failed with the 
errors below. I cannot find a common compute node, time, job type, user, memory 
usage, or any other factor; DMTCP seems to generate this error at random. The 
error terminates the job, which is a problem. No checkpointing was attempted on 
these jobs.
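
        For anyone who wants to look for a pattern in the same way, here is a 
minimal sketch of how the failures can be tallied by error site. It is not part 
of DMTCP. It assumes the job output has already been collected as text, and the 
function names and the job labels ("job1", "job2") are hypothetical. The regular 
expression is based only on the two error headers shown below, so other DMTCP 
messages may need a different pattern.

```python
import re
from collections import Counter

# Matches error headers of the form seen in this post, e.g.
#   [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
# Assumption: other DMTCP errors follow the same layout.
ERROR_RE = re.compile(r"\[(\d+)\] ERROR at ([\w.]+):(\d+) in (\w+);")


def find_dmtcp_errors(text):
    """Return (pid, source_file, line, function) for each error header in text."""
    return [
        (int(pid), src, int(line), func)
        for pid, src, line, func in ERROR_RE.findall(text)
    ]


def tally_by_site(logs):
    """Count error sites across a mapping of {label: log_text}.

    The label is whatever identifies a job (job ID, node name, ...);
    it is a hypothetical key chosen by the caller.
    """
    counts = Counter()
    for _label, text in logs.items():
        for _pid, src, line, func in find_dmtcp_errors(text):
            counts[(src, line, func)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "[47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid; \n"
        "[40000] ERROR at coordinatorapi.cpp:601 in "
        "createNewConnectionBeforeFork; \n"
    )
    for site, n in tally_by_site({"job1": sample, "job2": sample}).items():
        print(site, n)
```

        Grouping the same tuples by node, user, or start time instead of by 
error site is a matter of changing the key passed to the counter.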
        Does anybody have ideas about where I should look for the problem? Is 
there anything I can do to get more debugging info? Is this error generated by 
the coordinator, or by the DMTCP library wrapped around the running program?
        Thanks in advance...

[47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
     _magicBits = 
Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?
main-PYTHIA8-lhef (47000): Terminating...
[40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork; REASON='JASSERT(_coordinatorSocket.isValid()) failed'
bash (40000): Terminating...

