On 3/16/2015 10:36 PM, Eliot Moss wrote:
> The error output is:
>
> [45000] ERROR at sysvipc.cpp:775 in postRestart; REASON='JASSERT(_realId != -1) failed'
>      (strerror((*__errno_location ()))) = File exists
> java (45000): Terminating...

The previously reported issue printed:

> [42000] ERROR at sysvipc.cpp:775 in postRestart; REASON='JASSERT(_realId != -1) failed'
>      (strerror((*__errno_location ()))) = No such file or directory
> java (42000): Terminating...

So both are at the same line of code. They do not have to do with files,
per se, but with semaphores and shared memory segments.

I noticed that the protocol on restart mentions a node-wide file. That may
explain why I can avoid the 'File exists' case by running on another node,
and also why, in the 'No such file' case, I can solve the bad behavior by
running on the same node. Of course this assumes that the file in question
persists somewhere. (Where? It has to do with PROTECTED_LIFEBOAT_FD.)

Well, that is as far as I got today picking the code apart. Hope this helps.
I feel stuck now in that I get one failure or the other, though not on the
same runs. I don't feel confident enough to fire off the 10,000 or so jobs
I am waiting to execute, so I can't make progress until I resolve this ...

Regards -- Eliot
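
For reference, the two messages are the standard strerror() texts for EEXIST and ENOENT, which System V IPC calls such as shmget() return as well as file operations. The sketch below is only an illustration of that point, assuming Linux and a key that nothing else is using. It is not DMTCP's code, and probe_shmget_errnos is a made-up helper name.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Illustration only, not taken from sysvipc.cpp.
 * probe_shmget_errnos() is a hypothetical helper. It records the errno
 * left by two shmget() calls that are expected to fail:
 *   *missing_err: lookup of a key that has no segment (expected ENOENT)
 *   *exists_err:  exclusive create of a key that already has one
 *                 (expected EEXIST)
 * Returns 0 on success, -1 if the key was already in use or setup failed.
 */
int probe_shmget_errnos(key_t key, int *missing_err, int *exists_err)
{
    /* 1. Plain lookup without IPC_CREAT: fails when no segment exists. */
    errno = 0;
    if (shmget(key, 0, 0) != -1)
        return -1;              /* key belongs to someone else; leave it */
    *missing_err = errno;

    /* 2. Create the segment for real. */
    int id = shmget(key, 4096, IPC_CREAT | IPC_EXCL | 0600);
    if (id == -1)
        return -1;

    /* 3. Exclusive create again: fails because step 2 already made it. */
    errno = 0;
    int again = shmget(key, 4096, IPC_CREAT | IPC_EXCL | 0600);
    *exists_err = (again == -1) ? errno : 0;

    /* 4. Remove the segment so nothing is left behind on the node. */
    shmctl(id, IPC_RMID, NULL);
    return 0;
}
```

Called with an unused key, the helper should report ENOENT for the lookup and EEXIST for the second create, and strerror() renders those as "No such file or directory" and "File exists". Whether the failing call at sysvipc.cpp:775 is shmget(), semget(), or something else is not established here.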
