Actually, on second thought I cannot tell if this is fixed until my jobseqnum wraps back to 1.
After this crash the new jobseqnum was 9822808. Since MAX(file=9822808, bad=9636418) is 9822808, the behavior I observed could have come from either guess_job_nr() or the fscanf(). There is no way of telling until my jobid wraps and all pre-wrap jobs finish.

When I migrate to the new qmaster, I was hoping to keep all running and pending jobs intact (by db_load'ing the sge and sge_job DBs into the new spool dir). This situation makes me wary of restoring the sge_job DB. If I restore only the sge DB, do I lose just the pending jobs, or the running jobs as well? Is there another recommended migration path if I'd like to keep the pending and running jobs? I'm trying to minimize impact to the team and avoid any downtime.

Thanks,
Brad

-----Original Message-----
From: Dobbie, Brad
Sent: Monday, November 23, 2015 12:26 PM
To: 'Reuti' <[email protected]>
Cc: [email protected]
Subject: RE: [gridengine users] Possible BDB corruption?

We had another qmaster crash over the weekend, and the jobseqnum again got reset incorrectly to 9636419. I tried to delete the job with an ID of one less:

$ qdel 9636418

And this crashed the qmaster!

11/23/2015 11:58:55|worker|qmaster|W|It is impossible to move task 0 of job 9636418 to the list of finished jobs
11/23/2015 11:58:55|worker|qmaster|C|Removing element from other list !!!

Upon restart, the jobseqnum appears to have been correctly taken from the jobseqnum file, so maybe this issue is resolved.

Has anyone seen this before? Could this be caused by corruption of the sge_job DB? Should I do a db_verify/db_dump/db_load on the DB to flush out any other issues?

Thanks,
Brad

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Tuesday, November 17, 2015 5:09 AM
To: Dobbie, Brad <[email protected]>
Cc: [email protected]
Subject: Re: [gridengine users] Possible BDB corruption?

Hi,

> Am 17.11.2015 um 00:15 schrieb Dobbie, Brad <[email protected]>:
>
> Our cluster is 6.2u5 and uses local BDB spooling.
> We recently suffered a qmaster crash, and after reboot we noticed some strange behavior with the jobseqnum. When we restarted the qmaster, the JOBIDs skipped up to the high 9's (996xxxx range).
>
> From the SGE source, it appears the qmaster tries to pick a new jobseqnum based on the MAX of the jobseqnum file and the guess_highest_job_number() function:
>
>     fp = fopen(SEQ_NUM_FILE, "r");
>     fscanf(fp, sge_u32, &job_nr);
>     guess_job_nr = guess_highest_job_number();
>     job_nr = MAX(job_nr, guess_job_nr);

I wasn't aware that the BDB is also checked. For a fresh installation it should be sufficient to fill the jobseqnum file with the proper value to start from.

-- Reuti

> It appears the qmaster guesses the highest job number from the master_job_list, which I assume is stored in the spooling database.
>
>     lList *master_job_list = *(object_type_get_master_list(SGE_TYPE_JOB));
>
> When we restarted the qmaster, we attempted to keep the running jobs running. None of the previously running jobs were in the high 9's range, and the jobseqnum file did not contain a value in that range.
>
> I'm wondering how the qmaster selected this JOBID. Could the BDB spooling database be corrupted? Is there a way to debug or clean up the spooling database? I poked around a little with the db_dump utility but wasn't able to draw any conclusions. I saw some JATASKs in the high 9's range, but not JOBs.
>
> We are looking to migrate our qmaster to a new machine, so we'd like to be able to control the jobseqnum upon startup to avoid potential accounting file overlaps.
>
> Thanks,
> Brad Dobbie
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
