Our cluster runs SGE 6.2u5 with local BDB spooling. We recently suffered a qmaster crash, and after the reboot we noticed some strange behavior with the jobseqnum: when we restarted the qmaster, the JOBIDs skipped up to the high 9's (996xxxx range).
From the SGE source, it appears the qmaster picks a new jobseqnum based on the MAX of the value in the jobseqnum file and the result of the guess_highest_job_number function:

    fp = fopen(SEQ_NUM_FILE, "r")
    fscanf(fp, sge_u32, &job_nr)
    guess_job_nr = guess_highest_job_number();
    job_nr = MAX(job_nr, guess_job_nr);

It appears the qmaster guesses the highest job number from the master_job_list, which I assume is stored in the spooling database:

    lList *master_job_list = *(object_type_get_master_list(SGE_TYPE_JOB));

When we restarted the qmaster, we attempted to keep the running jobs running. None of the previously running jobs were in the high 9's range, and the jobseqnum file did not contain a value in that range.

I'm wondering how the qmaster selected this JOBID. Could the BDB spooling database be corrupted? Is there a way to debug or clean up the spooling database? I poked around a little with the db_dump utility but wasn't able to draw any conclusions. I saw some JATASKs in the high 9's range, but not JOBs.

We are looking to migrate our qmaster to a new machine, so we'd like to be able to control the jobseqnum upon startup to avoid potential accounting file overlaps.

Thanks,
Brad Dobbie
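To make the startup logic described above concrete, here is a minimal standalone C sketch of the "MAX of the file value and the guessed value" selection. It is not the qmaster code: guess_highest and pick_start_job_nr are hypothetical names, and the plain array of job IDs is an assumed stand-in for the real master_job_list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

/*
 * Hypothetical stand-in for guess_highest_job_number(): returns the
 * largest ID found in a plain array (assumed substitute for the
 * spooled master_job_list). Returns 0 for an empty list.
 */
static uint32_t guess_highest(const uint32_t *job_ids, size_t n)
{
    uint32_t highest = 0;

    assert(job_ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        highest = MAX(highest, job_ids[i]);
    }
    return highest;
}

/*
 * Hypothetical helper: the job number startup would continue from,
 * given the value read from the jobseqnum file and the spooled IDs.
 */
static uint32_t pick_start_job_nr(uint32_t seq_file_value,
                                  const uint32_t *job_ids, size_t n)
{
    uint32_t guess_job_nr = guess_highest(job_ids, n);

    return MAX(seq_file_value, guess_job_nr);
}
```

Under these assumptions, a single spooled entry with an ID in the 996xxxx range is enough to override a much lower value in the jobseqnum file, because the larger of the two always wins. Whether the real guess also picks up stray JATASK entries is not something this sketch can answer.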
