Dear gridengine users,
some months ago I wrote about this problem. I have experimented a bit more
since then and at least have a workaround now.
When I submit jobs to my SLES 11.3 execd hosts, the first job runs fine. If
another job is submitted after some time has passed (anywhere from under one
second to several seconds, and regardless of whether the first job is still
running), the queue goes into the error state. The first job still finishes
correctly, but no new jobs are accepted. Just clearing the error does not
help; the execd daemon has to be restarted, too.
I have some SLES 11.1 machines where this error does not occur. I use the same
configuration for both systems, so I don't think the configuration is the
cause. The error occurs on freshly updated machines (11.1 -> 11.3) as well as
on newly installed ones (11.3, and 11.2, which I tried once). I normally use
the precompiled packages, but I also tried a self-compiled version (compiled
directly on an 11.3 machine) and the latest version of "son of gridengine",
which shows the same error. Since our sgemaster runs an older Ubuntu version
(which has worked fine for several years), I also tried a master on a new
Ubuntu machine, which didn't help either.
If someone thinks it would help, I can provide some strace output, but it is
rather long, so I won't post it here. The most important parts are probably
these:
ge-messages:
09/15/2014 13:10:10| main|host-05|E|shepherd of job 821.1 died through signal = 11
09/15/2014 13:10:10| main|host-05|E|abnormal termination of shepherd for job 821.1: no "exit_status" file
09/15/2014 13:10:10| main|host-05|E|can't open file active_jobs/821.1/error: Datei oder Verzeichnis nicht gefunden
09/15/2014 13:10:10| main|host-05|E|can't open pid file "active_jobs/821.1/pid" for job 821.1
("Datei oder Verzeichnis nicht gefunden" is German for "No such file or
directory".)
(sometimes the signal is 6)
/var/log/messages:
Sep 15 13:10:09 host-05 kernel: [18105718.966831] sge_execd[5295]: segfault at 7ffea8000000 ip 00007ffea93f44f9 sp 00007fff2ac2e2d0 error 4 in libc-2.11.3.so[7ffea937c000+16f000]
(I tried copying the old libc-2.11.3.so over and linking it for sge, but that
didn't work.)
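For anyone who wants to dig into the kernel line above: the offset of the
faulting instruction inside libc can be computed from the instruction pointer
(ip) and the mapping base address in the square brackets. This is only a
sketch. The library path /lib64/libc-2.11.3.so is an assumption (it is not
taken from the affected host), and the addr2line step only runs if binutils
is installed and that file is readable.

```shell
# Values copied from the kernel message above.
ip=0x00007ffea93f44f9      # instruction pointer at the time of the fault
base=0x7ffea937c000        # start address of the libc mapping

# Offset of the faulting instruction inside libc-2.11.3.so.
offset=$(printf '0x%x' $(( $ip - $base )))
echo "offset in libc: $offset"

# Assumed library path; adjust to where libc actually lives on the host.
libc=/lib64/libc-2.11.3.so
if command -v addr2line >/dev/null 2>&1 && [ -r "$libc" ]; then
    # Prints the enclosing function name if the symbols are available.
    addr2line -f -e "$libc" "$offset"
fi
```

The offset comes out as 0x784f9. The "error 4" in the same kernel line is the
x86 page-fault code for a user-mode read from a page that is not mapped.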
A workaround I found while trying to track down the error is to run execd
under valgrind.
(Just install valgrind and change
exec 1>/dev/null 2>&1
$bin_dir/sge_execd
to
exec 1>/dev/null 2>&1
valgrindpath/valgrind $bin_dir/sge_execd
at around line 347 in the startup-script.)
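If the valgrind report should be kept for later inspection, the change above
could be written roughly as follows. This is a sketch, not a tested patch for
the startup script: the function name start_execd, the valgrind path and the
log location are all hypothetical. --log-file is a regular valgrind option,
and %p in the file name is replaced by the process ID.

```shell
# Hypothetical settings; adjust both to the local installation.
valgrind_bin=/usr/bin/valgrind
valgrind_log=/tmp/sge_execd.valgrind.%p.log

# Start sge_execd under valgrind if valgrind is installed,
# otherwise fall back to starting it directly.
# $bin_dir is expected to be set by the startup script, as in the original.
start_execd() {
    if [ -x "$valgrind_bin" ]; then
        "$valgrind_bin" --log-file="$valgrind_log" "$bin_dir/sge_execd"
    else
        "$bin_dir/sge_execd"
    fi
}
```

In the startup script, start_execd would then be called at the place where
$bin_dir/sge_execd is invoked now, after the existing exec redirection.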
If you keep the output, valgrind reports a lot of problems; I don't know
whether all of them are connected to my error. The error itself no longer
shows up, so running under valgrind somehow seems to avoid it. I don't know
whether this is a bug in GE or in SLES; maybe I should post on a SLES board,
too.
I won't try to solve this any other way, but if someone would like to see the
valgrind output or anything else, let me know.
Best regards, Sven