Dear all,

We're experiencing some odd behaviour, and I was hoping to get some insight. We have a ~700 CPU cluster running GE 6.2u5 on Rocks release 5.4.3 (Viper).
In March this year we had a network failure, which [apparently] killed all our jobs (i.e. on the cluster reboot, no jobs were listed in qstat, which made sense given all the nodes and the headnode had been rebooted). Many of these jobs were MPI jobs distributed over 10-20 nodes.

Several weeks later we began experiencing some job crashes, and in the process of investigating that, I discovered that a number (though not all) of the jobs which had crashed during the March network failure were actually still running on the nodes (according to top), but were not listed by qstat. I killed these with SIGTERM (15) [and not the more aggressive SIGKILL (9)], they disappeared, the resources were freed up, and I thought all was well in the world.

Alas, on my return from vacation today, I found that the headnode has nearly 2 GB of /opt/gridengine/default/spool/qmaster/messages of the following flavour:

06/19/2014 09:24:54|worker|<CLU>|E|execd@compute-<ID>.local reports running job (<JOBID>/1.compute-<ID>) in queue "HARPER@compute-<ID>.local" that was not supposed to be there - killing

and a couple which are:

06/19/2014 08:16:58|worker|<CLU>|E|execd@compute-<ID>.local reports running job (<JOBID>/master) in queue "HARPER@compute-<ID>.local" that was not supposed to be there - killing

Note that:
<CLU> is our cluster name
HARPER is the name of one of our queues
<JOBID> refers to the GE job ID
<ID> refers to the compute node ID

For each JOBID, these messages originate from a number of different nodes, which I suspect reflects the nodes across which each job was spread.

Choosing job 567641.1 as a concrete example: logging in to one of the slave nodes (compute-4-7) from which this error was coming, I found directories associated with these jobs both in /tmp and in /opt/gridengine/default/spool/compute-4-7/active_jobs. Even after removing both of these directories, the error messages from compute-4-7 persist on the headnode.

Associated files for these processes are also found on the slave nodes in /opt/gridengine/default/spool/qmaster/jobs/00/00<JOBPREFIX>/<JOBSUFFIX>. In our example, this means a binary file (which you can sort of read) exists at /opt/gridengine/default/spool/qmaster/jobs/00/0056/7641 on slave node compute-4-7. Deleting this file leads to it being regenerated after 30 seconds, which coincides with the error message on the headnode. Grepping the gridengine directories on both the headnode and the compute-4-7 slave node finds references to 567641.1 only in log files.

In summary, I'd thought that, given qstat did not report these phantom jobs, they were outside of GE's scope; but despite the resources being freed up after their deletion, the GE management is still trying to handle them. From this there are two questions:

1) What is the correct way to kill jobs which were submitted by qsub, do not appear in qstat, but are still running on slave nodes and consuming resources (as apparently the approach I used has left a lot of mess!)?

2) Given that gridengine is reporting on all these missing jobs, how do I go about removing them from the gridengine management infrastructure permanently? I rebooted a number of nodes last week, and this stopped the generation of the error messages (i.e. the nodes were generating the messages up until they were rebooted), but the files in /active_jobs and in /jobs remained (obviously the /tmp directory was removed, given /tmp is tmpfs).
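For reference, re question 1: what I actually did was spot the processes in top and kill -15 them by hand, but the sort of per-node check I have in mind for spotting these orphans is sketched below. It is only a sketch, not a vetted tool: the paths follow our install, I'm assuming the execd spool subdirectory matches the node's short hostname, that crudely parsing qstat's plain output is good enough, and that any surviving shepherd shows up in ps as sge_shepherd-<jobid>.

    #!/bin/bash
    # Sketch: run on a compute node; reports job IDs that have a local
    # active_jobs entry but are unknown to qstat, and lists any surviving
    # sge_shepherd process so it can be kill -15'd by hand.
    SPOOL=/opt/gridengine/default/spool/$(hostname -s)
    KNOWN=$(qstat -u '*' | awk 'NR > 2 {print $1}' | sort -u)

    for d in "$SPOOL"/active_jobs/*; do
        [ -d "$d" ] || continue
        jobid=${d##*/}        # e.g. 567641.1
        jobid=${jobid%%.*}    # strip the task id
        if ! echo "$KNOWN" | grep -qx "$jobid"; then
            echo "orphan on $(hostname -s): job $jobid"
            ps -ef | grep "[s]ge_shepherd-$jobid"
        fi
    done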
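And re question 2, for concreteness, the manual override I describe below (reboot the node, then clear out the leftover spool entries) would per node amount to something like the following. Again, treat it as illustrative rather than a recommendation: compute-4-7 and 567641 are just the example from above, and the 2/4/4 split of the zero-padded job number is my reading of the jobs/00/0056/7641 layout.

    #!/bin/bash
    # Sketch: clean up the stale spool entries for one phantom job on one
    # slave node, run as root *after* that node has been rebooted
    # (otherwise the jobs/ file just comes back, as described above).
    NODE=compute-4-7                      # example slave node
    JOBID=567641                          # example phantom job
    SPOOL=/opt/gridengine/default/spool   # per our install

    # stale per-job directory under the execd spool (all task ids)
    rm -rf "$SPOOL/$NODE/active_jobs/$JOBID".*

    # the regenerating job file, e.g. jobs/00/0056/7641 for job 567641;
    # assuming the path is the job number zero-padded to 10 digits and
    # split 2/4/4
    PAD=$(printf '%010d' "$JOBID")
    rm -f "$SPOOL/qmaster/jobs/${PAD:0:2}/${PAD:2:4}/${PAD:6:4}"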
In a test case, the /jobs file which is regenerated on the non-rebooted nodes is not regenerated when removed from the nodes I rebooted. An entirely legitimate solution might just be to reboot the node and then delete the /active_jobs and /jobs files (roughly as sketched above), but given the mess I now have on my hands, I'm worried there may be further issues down the line if I continue with this manual override approach.

Any thoughts would be greatly appreciated. I've tried to include as much detail as possible, but can, of course, provide any other info needed.

~ alex
