Our cluster runs SGE 6.2u5 with local BDB spooling.  We recently suffered a 
qmaster crash, and after the reboot we noticed some strange behavior with the 
jobseqnum: when we restarted the qmaster, the JOBIDs jumped to the high 
9's (996xxxx range).  

From the SGE source, it appears the qmaster picks the new jobseqnum as the 
MAX of the value read from the jobseqnum file and the result of the 
guess_highest_job_number function.

  fp = fopen(SEQ_NUM_FILE, "r");
  fscanf(fp, sge_u32, &job_nr);
  guess_job_nr = guess_highest_job_number();
  job_nr = MAX(job_nr, guess_job_nr);

It appears the qmaster guesses the highest job number from the master_job_list, 
which I assume is stored in the spooling database.

  lList *master_job_list = *(object_type_get_master_list(SGE_TYPE_JOB));
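
To make that reading concrete, here is a minimal sketch of the selection 
logic as I understand it.  It is not the SGE code: highest_job_id and 
pick_seqnum are hypothetical names, and a plain array of IDs stands in for 
the master job list.  It only shows that a single high ID in whatever list 
the qmaster scans is enough to override the value in the jobseqnum file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for guess_highest_job_number(): returns the
 * largest ID in an in-memory array, or 0 if the array is empty. */
static uint32_t highest_job_id(const uint32_t *ids, size_t n)
{
    uint32_t max_id = 0;

    assert(ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ids[i] > max_id) {
            max_id = ids[i];
        }
    }
    return max_id;
}

/* Mirrors job_nr = MAX(job_nr, guess_job_nr): the larger of the value
 * read from the jobseqnum file and the guess from the job list wins. */
static uint32_t pick_seqnum(uint32_t file_value, const uint32_t *ids, size_t n)
{
    uint32_t guess = highest_job_id(ids, n);

    return file_value > guess ? file_value : guess;
}
```

For example, with a file value of 1200400 and spooled IDs 1200345, 1200350 
and 9960001 (made-up numbers), pick_seqnum returns 9960001.  Whether stray 
JATASK records can feed into the real guess is exactly what I have not been 
able to confirm.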

When we restarted the qmaster, we attempted to keep the running jobs running.  
None of the previously running jobs were in the high 9's range, and the 
jobseqnum file did not contain a value in that range.

I'm wondering how the qmaster selected this JOBID.  Could the BDB spooling 
database be corrupted?  Is there a way to debug or clean up the spooling 
database?  I poked around a little with the db_dump utility but wasn't able to 
draw any conclusions.  I saw some JATASKs in the high 9's range, but not JOBs.

We are looking to migrate our qmaster to a new machine, so we'd like to be able 
to control the jobseqnum upon startup to avoid potential accounting file 
overlaps.
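
As a rough illustration of the overlap check, the sketch below pulls the job 
number out of one accounting line so that the highest ID already recorded 
could be compared against the jobseqnum value before startup.  It assumes the 
colon-separated layout described in accounting(5), with job_number as the 
sixth field and no colons inside the earlier fields.  The function name 
accounting_job_number is hypothetical, and this has not been run against a 
real accounting file.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns the job number from one accounting line, or 0 if the line is a
 * comment, has fewer than six fields, or the sixth field is not a number.
 * Assumption: fields are colon-separated and job_number is field six. */
static unsigned long accounting_job_number(const char *line)
{
    const char *p = line;
    char *end = NULL;
    unsigned long id;

    if (line == NULL || line[0] == '#') {
        return 0;
    }
    /* Skip the first five fields (qname, hostname, group, owner, job_name). */
    for (int field = 0; field < 5; field++) {
        p = strchr(p, ':');
        if (p == NULL) {
            return 0;
        }
        p++;
    }
    id = strtoul(p, &end, 10);
    if (end == p || *end != ':') {
        return 0;
    }
    return id;
}
```

Taking the maximum of this value over every line of the old accounting file 
would give a lower bound for the jobseqnum on the new machine.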

Thanks,
Brad Dobbie

