Hi,

> On 16.04.2015 at 06:30, Simon Matthews <[email protected]> wrote:
>
> A couple of days ago, we had a power outage and our 6.2U5 SGE qmaster
> would not start when the qmaster machine was rebooted. Running the
> qmaster in foreground, I got a core dump.
>
> I suspected that the spooldb was corrupted (we use Berkeley DB), so I
> re-created the spooldb/sge and spooldb/sge_job files using the
> following procedure:
> 1. db_dump spooldb/sge to a file.
> 2. Create a new grid to get empty sge and sge_job dbs.
> 3. Copy the empty sge and sge_job files into my old spooldb.
> 4. db_load the new spooldb/sge from the earlier db_dump.

Are there jobs in pending state you want to keep? You can try to save SGE's
configuration, start from a fresh spooling DB, and restore the settings:

$SGE_ROOT/sge/util/upgrade_modules/save_sge_config.sh and
load_sge_config.sh, respectively

-- Reuti

> We use Berkeley DB spooling because we run a very large number of jobs
> (mostly very small jobs).
>
> With this process, the qmaster would start and my configuration was
> retained from before the crash.
>
> Now, I see occasional emails from the execd clients with the following:
> Job 4433950 caused action: none
> User = build
> Queue = (null)@(null)
> Start Time = <unknown>
> End Time = <unknown>
> failed before writing exit_status:shepherd exited with exit status 19:
> before writing exit_status
>
> As can be seen, the queue name is invalid.
>
> Any idea what might cause this? How to stop this?
>
> Simon
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
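
For reference, the four quoted db_dump/db_load steps might look roughly like the sketch below. It is only an illustration under assumptions: the function names (run, respool), the DRY_RUN switch and all paths are hypothetical and not from the original message, and it assumes the qmaster is stopped and the Berkeley DB db_dump and db_load utilities are on the PATH. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the quoted re-spooling steps. All names and paths are
# hypothetical. With DRY_RUN=1 (the default) commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

respool() {
    old_spool="$1"    # existing spooldb directory (assumed: qmaster stopped)
    empty_spool="$2"  # spooldb directory of a freshly created, empty grid
    dump_file="$3"    # flat-text dump written by db_dump

    # Step 1: dump the old sge database to a text file.
    run db_dump -f "$dump_file" "$old_spool/sge"
    # Steps 2 and 3: copy the empty sge and sge_job files over the old ones.
    run cp "$empty_spool/sge" "$empty_spool/sge_job" "$old_spool/"
    # Step 4: load the earlier dump into the now-empty sge database.
    run db_load -f "$dump_file" "$old_spool/sge"
}
```

Calling respool /path/to/old/spooldb /path/to/empty/spooldb /tmp/sge.dump prints the three commands; setting DRY_RUN=0 would execute them. Note that, as in the quoted procedure, only the sge database is dumped and reloaded, so sge_job starts out empty.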
