IME you are hitting some kind of rare bug.

Last time we had a thing like this it was because a user was specifying many hundreds of jobids in the hold_jid parameter.

Before that, it had something to do with parallel jobs not cleaning up quite right, and IIRC disabling the scheduling reporting parameters fixed it.

In each case, the "easiest" way is to delete your job spool, restart your qmaster, and then monitor closely to figure out which user's jobs make it crash. Then get the user to modify their job parameters until your qmaster doesn't crash anymore :)
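
For the hold_jid case specifically, here is a rough Python sketch of how you might spot submissions with very long hold lists. It assumes you can get at the qsub argument list somehow (e.g. from a submit wrapper); the function name and the way you collect the arguments are made up for illustration, only the -hold_jid option with its comma-separated list is real.

```python
def count_hold_jids(qsub_args):
    """Return how many entries are passed to -hold_jid in a qsub argument list.

    qsub_args is a hypothetical list of command-line tokens,
    e.g. ["qsub", "-hold_jid", "101,102", "job.sh"].
    """
    total = 0
    # Stop one short of the end so qsub_args[i + 1] always exists.
    for i, arg in enumerate(qsub_args[:-1]):
        if arg == "-hold_jid":
            # The value is a comma-separated list of job IDs or job names.
            total += len([j for j in qsub_args[i + 1].split(",") if j])
    return total
```

What counts as "too many" is a judgment call; in our case it was many hundreds.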



On 02/05/2016 07:52 AM, William Hay wrote:
On Fri, Feb 05, 2016 at 03:02:52PM +0000, Manfred Selz wrote:
    Hi,



    this week I have observed the (6.2u5) sgemaster crashing several times on
    one of our sites.

    The last message in the "messages" file was always like this:



    02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!



    Automatic migration to the alternate master hosts (as defined in the shadow
    host list) also failed, with the new sge_qmaster also crashing (after one
    minute or less).

    Only after several attempts was I able to start the master again, but not
    without having some queues damaged (jobs being lost).



    This has never happened before since I took over the SGE admin role in our
    company more than four years ago, and the messages file does not provide
    an obvious reason. Sometimes I see a line like this before crashing:



    02/05/2016 14:37:12|  main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...
Is the jobid consistent?  The most common cause of qmaster crashes in my
experience is a corrupted job spool.  Normal procedure is to stop the qmaster
and manually delete the job from the spool (traditional spool) before
restarting.
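
One way to check whether the jobid is consistent is to tally the job IDs in those warnings. The following is only a rough Python sketch: it assumes each entry in the messages file sits on one line with '|'-separated fields as in the sample above, and the regular expression is copied from that sample line rather than from the SGE source. The function name is made up for illustration.

```python
import re
from collections import Counter

# Wording taken from the sample warning quoted above; adjust if yours differs.
JOB_WARNING = re.compile(r"removing reference to no longer existing job (\d+)")


def count_stale_job_refs(lines):
    """Count how often each job ID appears in 'no longer existing job' warnings."""
    counts = Counter()
    for line in lines:
        # Assumed field order: timestamp | thread | host | level | message.
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # not a regular messages entry, skip it
        level, message = parts[3].strip(), parts[4]
        if level != "W":
            continue  # the warnings in question are logged at level W
        match = JOB_WARNING.search(message)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Pass it an open file handle for the qmaster messages file and look at counts.most_common(); a single job ID that shows up before every crash would be the first candidate to remove from the spool.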


    If anybody has a good idea what I could look into, I'd appreciate this a
    lot.

    Is there an efficient way to trace (strace?) the master process?
You could enable the built-in debugging (man sge_dl).

William

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users


--
Alex Chekholko [email protected] 347-401-4860
