Actually, on second thought I cannot tell if this is fixed until my jobseqnum wraps back to 1.
After this crash the new jobseqnum was 9822808. Since MAX(file=9822808, bad=9636418) is 9822808, the behavior I observed could have come from either guess_job_nr() or the fscanf(). There is no way of telling until my jobid wraps and all pre-wrap jobs finish.

When I migrate to the new qmaster, I was hoping to keep all running and pending jobs intact (by db_load'ing the sge and sge_job DBs into the new spool dir). This situation makes me wary of restoring the sge_job DB. If I restore only the sge DB, do I lose just the pending jobs, or the running jobs as well? Is there another recommended migration path if I'd like to keep the pending and running jobs? I'm trying to minimize impact to the team and avoid any downtime.

Thanks,
Brad

-----Original Message-----
From: Dobbie, Brad
Sent: Monday, November 23, 2015 12:26 PM
To: 'Reuti' <[email protected]>
Cc: [email protected]
Subject: RE: [gridengine users] Possible BDB corruption?

We had another qmaster crash over the weekend, and the jobseqnum again got reset incorrectly to 9636419. I tried to delete the job with an ID of one less:

$ qdel 9636418

And this crashed the qmaster!

11/23/2015 11:58:55|worker|qmaster|W|It is impossible to move task 0 of job 9636418 to the list of finished jobs
11/23/2015 11:58:55|worker|qmaster|C|Removing element from other list !!!

Upon restart, the jobseqnum appears to have been correctly taken from the jobseqnum file, so maybe this issue is resolved.

Has anyone seen this before? Could this be caused by corruption of the sge_job DB? Should I do a db_verify/db_dump/db_load on the DB to flush out any other issues?

Thanks,
Brad

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Tuesday, November 17, 2015 5:09 AM
To: Dobbie, Brad <[email protected]>
Cc: [email protected]
Subject: Re: [gridengine users] Possible BDB corruption?

Hi,

> Am 17.11.2015 um 00:15 schrieb Dobbie, Brad <[email protected]>:
>
> Our cluster is 6.2u5 and uses local BDB spooling.
> We recently suffered a qmaster crash, and after reboot we noticed some strange behavior with the jobseqnum. When we restarted the qmaster, the JOBIDs skipped up to the high 9's (996xxxx range).
>
> From the SGE source, it appears the qmaster tries to pick a new jobseqnum based on the MAX of the jobseqnum file and the guess_highest_job_number() function:
>
>     fp = fopen(SEQ_NUM_FILE, "r");
>     fscanf(fp, sge_u32, &job_nr);
>     guess_job_nr = guess_highest_job_number();
>     job_nr = MAX(job_nr, guess_job_nr);

I wasn't aware that the BDB is also checked. For a fresh installation it should be sufficient to fill the jobseqnum file with the proper value to start from.

-- Reuti

> It appears the qmaster guesses the highest job number from the master_job_list, which I assume is stored in the spooling database.
>
>     lList *master_job_list = *(object_type_get_master_list(SGE_TYPE_JOB));
>
> When we restarted the qmaster, we attempted to keep the running jobs running. None of the previously running jobs were in the high 9's range, and the jobseqnum file did not contain a value in that range.
>
> I'm wondering how the qmaster selected this JOBID. Could the BDB spooling database be corrupted? Is there a way to debug or clean up the spooling database? I poked around a little with the db_dump utility but wasn't able to draw any conclusions. I saw some JATASKs in the high 9's range, but not JOBs.
>
> We are looking to migrate our qmaster to a new machine, so we'd like to be able to control the jobseqnum upon startup to avoid potential accounting file overlaps.
>
> Thanks,
> Brad Dobbie
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
