On 19.06.2014 at 17:25, Alex Holehouse wrote:
> Dear all,
>
> We're experiencing some odd behaviour, and I was hoping to get some insight.
> We have a ~700 CPU cluster running GE 6.2u5 on Rocks release 5.4.3 (Viper).
>
> In March this year we had a network failure, which [apparently] killed all
> our jobs (i.e. on the cluster reboot, no jobs were listed in qstat, which
> made sense given all the nodes and the headnode had been rebooted). Many of
> these jobs were MPI jobs distributed over 10-20 nodes. Several weeks later,
> we began experiencing some job crashes, and in the process of investigating
> that, I discovered that a number (though not all) of the jobs which had
> crashed during the March network failure were actually still running on the
> nodes (according to top), but were not listed by qstat.
>
> I killed these with SIGTERM (15) [and not the more aggressive SIGKILL (9)],
> and they disappeared and resources were freed up, so I thought all was well
> in the world. Alas, on my return from vacation today, I found that the
> headnode has a nearly 2 GB /opt/gridengine/default/spool/qmaster/messages
> file full of entries of the following flavour:
>
> 06/19/2014 09:24:54|worker|<CLU>|E|execd@compute-<ID>.local reports running
> job (<JOBID>/1.compute-<ID>) in queue "HARPER@compute-<ID>.local" that was
> not supposed to be there - killing
>
> and a couple which are
>
> 06/19/2014 08:16:58|worker|<CLU>|E|execd@compute-<ID>.local reports running
> job (<JOBID>/master) in queue "HARPER@compute-<ID>.local" that was not
> supposed to be there - killing
>
> Note that
>
> <CLU> is our cluster name
> HARPER is the name of one of our queues
> <JOBID> refers to the GE job ID
> <ID> refers to the compute node ID
>
> For each JOBID, these messages originate from a number of different nodes,
> which I suspect reflects the nodes each job was spread across. Choosing job
> 567641.1 as a concrete example, logging in to one of the slave nodes
> (compute-4-7) from which this error was coming, I found directories
> associated with these jobs in both /tmp and in
> /opt/gridengine/default/spool/compute-4-7/active_jobs. Even after removing
> both of these directories, the error messages from compute-4-7 persist on
> the headnode.
>
> Associated files for these processes are also found on the slave nodes in
> /opt/gridengine/default/spool/qmaster/jobs/00/00<JOBPREFIX>/<JOBSUFFIX>. In
> our example, this means a binary file which you can sort of read exists in
> /opt/gridengine/default/spool/qmaster/jobs/00/0056/7641 on slave node
> compute-4-7. Deleting this file leads to it being regenerated after 30
> seconds, which coincides with the error message on the headnode.
>
> Grepping the gridengine directories on both the headnode and the compute-4-7
> slave node only finds references to 567652.1 in log files.
>
> In summary, I'd thought that, given qstat did not report these phantom jobs,
> they were outside of GE's scope, but despite the resources being freed up
> after their deletion, the GE management is still trying to handle them. From
> this there are two questions:
>
> 1) What is the correct way to kill jobs which were submitted by qsub, do not
> appear in qstat, but are still running on slave nodes and consuming resources
> (as apparently the approach I used has left a lot of mess!)?
>
> 2) Given that gridengine is reporting on all these missing jobs, how do I go
> about removing them from the gridengine management infrastructure permanently?
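A minimal sketch of one way to approach 1), assuming the execution hosts run
one sge_shepherd-<jobid> process per job task (its children are the actual job
processes) and that qstat -u '*' reflects everything qmaster knows about; the
node name and temporary files below are only examples:

    # From the headnode: job IDs that still have a shepherd process on a node.
    ssh compute-4-7 "ps -eo args=" \
        | grep -o 'sge_shepherd-[0-9][0-9]*' \
        | sed 's/^sge_shepherd-//' | sort -u > /tmp/jobs_on_node

    # Job IDs qmaster still reports (all users).
    qstat -u '*' | awk 'NR > 2 { print $1 }' | sort -u > /tmp/jobs_in_qstat

    # IDs running on the node but missing from qstat: the phantom candidates.
    comm -23 /tmp/jobs_on_node /tmp/jobs_in_qstat

qdel (even qdel -f) can only act on jobs qmaster still has a record of, so
processes belonging to the IDs printed by the last command have to be stopped
on the node itself, which is essentially what was done here with SIGTERM; the
leftover mess looks like the execd's spooled state rather than a problem with
the kill itself.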
> I rebooted a number of nodes last week, and this stopped the generation of
> the error messages (i.e. the nodes were generating the messages up until
> they were rebooted), but the files in /active_jobs and in /jobs remained
> (obviously the /tmp directory was removed, given /tmp is tmpfs). In a test
> case, the /jobs file which keeps being regenerated on the non-rebooted nodes
> is not regenerated when removed from the nodes I rebooted.
>
> An entirely legitimate solution might just be to reboot the node and then
> delete the /active_jobs and /jobs files, but given the mess I now have on my
> hands, I'm worried there may be further issues down the line if I continue
> down this manual override approach.
Well, on the one hand this is the way to go in the case of the observed
behavior. But only as a second step: usually SGE will detect the formerly
running jobs after a reboot and at least issue an email to the admin in this
case.

-- Reuti

> Any thoughts would be greatly appreciated. I've tried to include as much
> detail as possible but can, of course, provide any other info needed.
>
> ~ alex

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
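For 2), a rough sketch of how a node's execd state can be reset without a
reboot, using the host and job ID from the example in the thread. It assumes
the execd spool for the node lives under /opt/gridengine/default/spool/<nodename>/
(as the quoted paths suggest) and that the execd init script is installed as
/etc/init.d/sgeexecd; both may differ on a Rocks install, and none of this
should be run while jobs you care about are still on the node:

    # If qmaster still has any record of the job (even though qstat no
    # longer lists it), a forced delete from the headnode may clear it:
    qdel -f 567641

    # Otherwise, shut down the execution daemon on the node; this stops
    # the daemon only and leaves running processes alone:
    qconf -ke compute-4-7

    # On compute-4-7, remove the stale per-job state the execd spooled
    # (adjust the hashed sub-path to whatever is actually present):
    rm -rf /opt/gridengine/default/spool/compute-4-7/active_jobs/567641.1
    rm -rf /opt/gridengine/default/spool/compute-4-7/jobs/00/0056/7641

    # Restart the execd on the node:
    /etc/init.d/sgeexecd start

Rebooting the node achieves much the same thing, since the execd re-registers
with qmaster on start-up and, as noted above, will report any formerly running
jobs it still finds; deleting the leftover active_jobs and jobs entries
afterwards should then be the end of it.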
