Hi,

> On 17.11.2015 at 00:15, Dobbie, Brad <[email protected]> wrote:
>
> Our cluster is 6.2u5 and uses local BDB spooling. We recently suffered a
> qmaster crash and after reboot we noticed some strange behavior with the
> jobseqnum. When we restarted the qmaster, the JOBIDs skipped up to the high
> 9's (996xxxx range).
>
> From the SGE source, it appears the qmaster tries to pick a new jobseqnum
> based on the MAX of the jobseqnum file and guess_highest_job_number function.
>
>   fp = fopen(SEQ_NUM_FILE, "r")
>   fscanf(fp, sge_u32, &job_nr)
>   guess_job_nr = guess_highest_job_number();
>   job_nr = MAX(job_nr, guess_job_nr);

I wasn't aware that the BDB is also checked. For a fresh installation it
should be sufficient to fill the jobseqnum file with the proper value to
start from.

-- Reuti

> It appears the qmaster guesses the highest job number from the
> master_job_list, which I assume is stored in the spooling database.
>
>   lList *master_job_list = *(object_type_get_master_list(SGE_TYPE_JOB));
>
> When we restarted the qmaster, we attempted to keep the running jobs running.
> None of the previously running jobs were in the high 9's range, and the
> jobseqnum file did not contain a value in that range.
>
> I'm wondering how the qmaster selected this JOBID. Could the BDB spooling
> database be corrupted? Is there a way to debug or cleanup the spooling
> database? I poked around a little with the db_dump utility but wasn't able
> to draw any conclusions. I saw some JATASKs in the high 9's range, but not
> JOBs.
>
> We are looking to migrate our qmaster to a new machine, so we'd like to be
> able to control the jobseqnum upon startup to avoid potential accounting file
> overlaps.
>
> Thanks,
> Brad Dobbie
>
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
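
For anyone following the thread, here is a minimal C sketch of the MAX logic described in the quoted excerpt. It is not the SGE source and only illustrates the selection rule under stated assumptions: guess_highest_job_number here takes a plain array of IDs as a hypothetical stand-in for walking master_job_list, and read_seq_num and pick_start_job_nr are names invented for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read the number stored in a jobseqnum-style file.
 * Returns 0 when the file is missing or does not start with a number. */
uint32_t read_seq_num(const char *path)
{
    uint32_t job_nr = 0;
    FILE *fp;

    assert(path != NULL);
    fp = fopen(path, "r");
    if (fp == NULL)
        return 0;
    if (fscanf(fp, "%" SCNu32, &job_nr) != 1)
        job_nr = 0;
    fclose(fp);
    return job_nr;
}

/* Stand-in for the real function: the array replaces master_job_list
 * (an assumption made to keep the sketch self-contained). */
uint32_t guess_highest_job_number(const uint32_t *job_ids, size_t n)
{
    uint32_t highest = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (job_ids[i] > highest)
            highest = job_ids[i];
    }
    return highest;
}

/* Hypothetical wrapper: the larger of the file value and the guess wins,
 * mirroring job_nr = MAX(job_nr, guess_job_nr) from the excerpt. */
uint32_t pick_start_job_nr(const char *seq_file,
                           const uint32_t *job_ids, size_t n)
{
    uint32_t job_nr = read_seq_num(seq_file);
    uint32_t guess_job_nr = guess_highest_job_number(job_ids, n);

    return job_nr > guess_job_nr ? job_nr : guess_job_nr;
}
```

If the real qmaster behaves like the quoted snippet, a single high ID among the spooled objects would outweigh whatever the jobseqnum file contains: with a file holding 1350 and an ID list containing 9960001, pick_start_job_nr returns 9960001.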
