Re: [gridengine users] Possible BDB corruption?

Reuti Tue, 17 Nov 2015 02:10:36 -0800

Hi,

> On 17.11.2015 at 00:15, Dobbie, Brad <[email protected]> wrote:
> 
> Our cluster is 6.2u5 and uses local BDB spooling.  We recently suffered a 
> qmaster crash and after reboot we noticed some strange behavior with the 
> jobseqnum.  When we restarted the qmaster, the JOBIDs skipped up to the high 
> 9's (996xxxx range).  
> 
> From the SGE source, it appears the qmaster tries to pick a new jobseqnum 
> based on the MAX of the value in the jobseqnum file and the result of the 
> guess_highest_job_number function.
> 
>  fp = fopen(SEQ_NUM_FILE, "r");
>  fscanf(fp, sge_u32, &job_nr);
>  guess_job_nr = guess_highest_job_number();
>  job_nr = MAX(job_nr, guess_job_nr);


I wasn't aware that the BDB is also checked. For a fresh installation it should 
be sufficient to write the desired starting value into the jobseqnum file.
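
To reason about the logic quoted above, here is a minimal sketch in C. It is 
only an illustration: highest_spooled_job, read_seq_file and pick_start_job_nr 
are hypothetical names, and a plain array stands in for the master job list. 
The real qmaster code may well differ.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for guess_highest_job_number(): scan an array of
 * spooled job IDs (instead of the real master job list) and return the
 * largest one, or 0 if there are none. */
static uint32_t highest_spooled_job(const uint32_t *ids, size_t n)
{
    uint32_t max = 0;
    for (size_t i = 0; i < n; i++) {
        if (ids[i] > max) {
            max = ids[i];
        }
    }
    return max;
}

/* Read the sequence number from a file. Returns 0 if the file is missing
 * or does not start with an unsigned number. */
static uint32_t read_seq_file(const char *path)
{
    unsigned long value = 0;
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        return 0;
    }
    if (fscanf(fp, "%lu", &value) != 1) {
        value = 0;
    }
    fclose(fp);
    return (uint32_t)value;
}

/* Same idea as the quoted MAX(): the larger of the two candidates wins. */
static uint32_t pick_start_job_nr(uint32_t from_file, uint32_t guessed)
{
    return from_file > guessed ? from_file : guessed;
}
```

Under these assumptions, the value in the file only decides the starting 
number when nothing higher is found among the spooled objects, which is why 
seeding the file alone should only be enough on a fresh installation.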

-- Reuti


> It appears the qmaster guesses the highest job number from the 
> master_job_list, which I assume is stored in the spooling database.
> 
>  lList *master_job_list = *(object_type_get_master_list(SGE_TYPE_JOB));
> 
> When we restarted the qmaster, we attempted to keep the running jobs running. 
> None of the previously running jobs were in the high 9's range, and the 
> jobseqnum file did not contain a value in that range.
> 
> I'm wondering how the qmaster selected this JOBID.  Could the BDB spooling 
> database be corrupted?  Is there a way to debug or clean up the spooling 
> database?  I poked around a little with the db_dump utility but wasn't able 
> to draw any conclusions.  I saw some JATASKs in the high 9's range, but not 
> JOBs.
> 
> We are looking to migrate our qmaster to a new machine, so we'd like to be 
> able to control the jobseqnum upon startup to avoid potential accounting file 
> overlaps.
> 
> Thanks,
> Brad Dobbie
> 
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


