Hi,
Similar issues were reported a long time ago, but I haven't seen a recent
solution.
In one of our company's SGE 6.2u5 clusters, qrsh/qlogin jobs fail on selected
hosts with messages like this:
$ qrsh -l rhel=6,login=1,hostname=casrvodc-17 -verbose
...
Your job 1756874 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 4
Your interactive job 1756874 has been successfully scheduled.
timeout (5 s) expired while waiting on socket fd 4
This goes on for some time, and the jobs can even be seen briefly via qstat.
However, the jobs never actually start; they switch to the "dr" state and are
gone after a minute or so.
The exec host's messages file has lines like this:
11/15/2016 05:59:50| main|casrvodc-17|I|SIGNAL jid: 1756876 jatask: 1 signal: KILL
The qmaster's messages file has this:
11/15/2016 05:59:50|worker|casrvodc-01|I|mselz has registered the job 1756876 for deletion
11/15/2016 05:59:51|worker|casrvodc-01|I|removing trigger to terminate job 1756876.1
11/15/2016 05:59:51|worker|casrvodc-01|W|job 1756876.1 failed on host casrvodc-17.diasemi.com assumedly after job because: job 1756876.1 died through signal KILL (9)
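In case it helps with correlating the two messages files: judging from the excerpts above, each entry is a single pipe-separated line with five fields. The sketch below pulls out all entries that mention a given job ID. The field names (timestamp, thread, host, level, text) and the function names are my own labels for illustration, not anything defined by SGE, and it assumes each entry sits on one unwrapped line in the file.

```python
import re
import sys
from typing import Iterable, Iterator, NamedTuple, Optional


class MessageLine(NamedTuple):
    # Field names are illustrative labels, not official SGE terminology.
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> Optional[MessageLine]:
    """Parse one 'date time|thread|host|level|text' line; None if malformed."""
    # Split on the first four pipes only, so a "|" inside the text survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return MessageLine(*(p.strip() for p in parts))


def lines_for_job(lines: Iterable[str], job_id: str) -> Iterator[MessageLine]:
    """Yield parsed entries whose text mentions job_id (e.g. 1756876 or 1756876.1)."""
    # Digit guards keep 1756876 from matching inside a longer number.
    pattern = re.compile(rf"(?<!\d){re.escape(job_id)}(?!\d)")
    for line in lines:
        entry = parse_message_line(line)
        if entry is not None and pattern.search(entry.text):
            yield entry


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: script.py JOB_ID MESSAGES_FILE [MESSAGES_FILE ...]
    for path in sys.argv[2:]:
        with open(path, errors="replace") as handle:
            for entry in lines_for_job(handle, sys.argv[1]):
                print(path, entry.timestamp, entry.level, entry.text, sep=" | ")
```

Run against both the exec host's and the qmaster's messages files with the job ID as the first argument, this should list the entries for that job from each file.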
Until a few days ago, qrsh worked on all hosts in the cluster. It then suddenly
stopped working on most (but not all!) of them, without any deliberate change
to the SGE config or host config (for instance, "uptime" confirms that the
hosts have not been rebooted recently). The hosts in the cluster are otherwise
all of the same type (hardware), kernel version, etc., and I have not been able
to identify any significant difference between them yet.
On the same hosts, "qsub -now y" fails as well.
I have verified proper sge_execd operation and host identification with
"qping", "gethostbyaddr", and "gethostbyname", and all of this looks fine.
Currently I am quite puzzled. I'd appreciate any input on how to debug this
further or resolve it.
Best regards,
Manfred