IME you are hitting some kind of rare bug.
Last time we had a thing like this it was because a user was specifying
many hundreds of jobids in the hold_jid parameter.
Before that, it had something to do with parallel jobs not cleaning up
quite right, and IIRC disabling the scheduling reporting parameters
fixed it.
In each case, the "easiest" fix is to delete your job spool, restart
your qmaster, and then monitor closely to figure out which user's
jobs make it crash. Then get that user to modify their job
parameters until your qmaster doesn't crash anymore :)
On 02/05/2016 07:52 AM, William Hay wrote:
On Fri, Feb 05, 2016 at 03:02:52PM +0000, Manfred Selz wrote:
Hi,
this week I have observed the (6.2u5) sgemaster crashing several times on
one of our sites.
The last message in the "messages" file was always like this:
02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other
list !!!
Automatic migration to the alternate master hosts (as defined in the shadow
host list) also failed, with the new sge_qmaster also crashing (after one
minute or less).
Only after several attempts was I able to start the master again, but not
without some queues being damaged (jobs were lost).
This has never happened before since I took over the SGE admin role in our
company more than four years ago, and the messages file does not provide
an obvious reason. Sometimes I see a line like this before crashing:
02/05/2016 14:37:12| main|mnsrvgems-02v|W|removing reference to no longer
existing job 5335536 of user ...
Is the jobid consistent? The most common cause of qmaster crashes in my
experience is a corrupted job spool. Normal procedure is to stop the
qmaster and manually delete the job from the spool (traditional spool)
before restarting.
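
One way to check whether the jobid is consistent is to count how often each
job ID shows up in the "removing reference to no longer existing job"
warnings. The Python below is only a rough sketch: the function names are
made up for illustration, and the spool path helper assumes the classic
spool layout where the job ID is zero-padded to 10 digits and split into
2/4/4 digit directories under "jobs". Verify that layout against your own
qmaster spool directory before touching anything, and only remove files
while the qmaster is stopped.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the warning quoted in the original report. \s+ tolerates a line
# wrap between "longer" and "existing", as seen in the email.
ORPHAN_RE = re.compile(r"removing reference to no longer\s+existing job (\d+)")


def count_orphaned_job_ids(log_text):
    """Return a Counter mapping job ID -> number of warnings that mention it."""
    return Counter(int(m.group(1)) for m in ORPHAN_RE.finditer(log_text))


def count_orphaned_job_ids_in_file(messages_path):
    """Same as above, reading the qmaster messages file from disk."""
    text = Path(messages_path).read_text(errors="replace")
    return count_orphaned_job_ids(text)


def classic_job_spool_path(qmaster_spool_dir, job_id):
    """Guess where a job lives in a classic (flat file) spool.

    ASSUMPTION: the 10-digit zero-padded job ID is split 2/4/4 under
    <qmaster_spool_dir>/jobs. This only builds a path for inspection;
    it does not delete anything.
    """
    padded = f"{job_id:010d}"
    return Path(qmaster_spool_dir) / "jobs" / padded[:2] / padded[2:6] / padded[6:]
```

If one job ID dominates the counts, the path returned by
classic_job_spool_path() for that ID is the first place to look once the
qmaster has been stopped.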
If anybody has a good idea what I could look into, I'd appreciate this a
lot.
Is there an efficient way to trace (strace?) the master process?
You could enable the built-in debugging (man sge_dl).
William
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
--
Alex Chekholko [email protected] 347-401-4860