On 3/16/2015 10:36 PM, Eliot Moss wrote:

> The error output is:
>
> [45000] ERROR at sysvipc.cpp:775 in postRestart; REASON='JASSERT(_realId != -1) failed'
>        (strerror((*__errno_location ()))) = File exists
> java (45000): Terminating...

The previously reported issue printed:

 > [42000] ERROR at sysvipc.cpp:775 in postRestart; REASON='JASSERT(_realId != -1) failed'
 >       (strerror((*__errno_location ()))) = No such file or directory
 > java (42000): Terminating...

So both failures are at the same line of code.  They do not have to do with
files, per se, but with semaphores and shared memory segments.  I noticed that
the protocol on restart mentions a node-wide file.  That may explain why I can
avoid the 'File exists' case by running on another node, and also why, in the
'No such file' case, I can solve the bad behavior by running on the same node.
Of course this assumes that the file in question persists somewhere.  (Where?
It has to do with PROTECTED_LIFEBOAT_FD.)
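
To illustrate why these messages need not involve files at all, here is a
minimal sketch that calls shmget() directly through ctypes.  It is not DMTCP
code: I am assuming the call behind the JASSERT is something like shmget() or
semget() (the log only shows that _realId was -1), that the platform is Linux,
and the names try_shmget and demo are made up for the example.

```python
import ctypes
import errno
import os

# Flag values as defined in <sys/ipc.h> on Linux (assumed platform).
IPC_CREAT = 0o1000
IPC_EXCL = 0o2000
IPC_RMID = 0

# Load the C library already linked into the interpreter, keeping errno.
libc = ctypes.CDLL(None, use_errno=True)
libc.shmget.argtypes = [ctypes.c_int, ctypes.c_size_t, ctypes.c_int]
libc.shmget.restype = ctypes.c_int
libc.shmctl.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_void_p]
libc.shmctl.restype = ctypes.c_int


def try_shmget(key, flags):
    """Hypothetical helper: call shmget(key, 4096, flags | 0600).

    Returns (id, message), where message is "ok" or the strerror text.
    """
    ctypes.set_errno(0)
    shmid = libc.shmget(key, 4096, flags | 0o600)
    if shmid == -1:
        return shmid, os.strerror(ctypes.get_errno())
    return shmid, "ok"


def demo(key):
    """Look up a missing key, create it, then try to create it again."""
    # No IPC_CREAT and no segment for this key: fails with ENOENT.
    _, missing = try_shmget(key, 0)
    # Exclusive create of a fresh key: succeeds.
    shmid, created = try_shmget(key, IPC_CREAT | IPC_EXCL)
    # Exclusive create of a key that now exists: fails with EEXIST.
    _, duplicate = try_shmget(key, IPC_CREAT | IPC_EXCL)
    # Remove the segment so the demo leaves nothing behind.
    if shmid != -1:
        libc.shmctl(shmid, IPC_RMID, None)
    return missing, created, duplicate
```

Calling demo() with a key that is not already in use on the node should give
the ENOENT text ('No such file or directory') for the lookup and the EEXIST
text ('File exists') for the second exclusive create, which are the same two
strings as in the reports above.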

Well, that is as far as I got today picking the code apart.  Hope this helps.
I feel stuck now in that I get one failure or the other, though not on the
same runs.  I don't feel confident enough to fire off the 10,000 or so jobs I
am waiting to execute, so I can't make much progress until I resolve this ...

Regards -- Eliot
