Am 19.06.2014 um 17:25 schrieb Alex Holehouse:

> Dear all,
> 
> We're experiencing some odd behaviour, and I was hoping to get some insight. 
> We have a ~700 CPU cluster running GE 6.2u5 on Rocks release 5.4.3 (Viper)
> 
> In March this year we had a network failure, which [apparently] killed all 
> our jobs (i.e. on the cluster reboot, no jobs were listed in qstat, which 
> made sense given all the nodes and the headnode had been rebooted). Many of 
> these jobs were MPI jobs distributed over 10-20 nodes. Several weeks later, 
> we began experiencing some job crashes, and in the process of investigating 
> that, I discovered that a number (though not all) of the jobs which had 
> crashed during the March network failure were actually still running on the 
> nodes (according to top), but were not listed from qstat. 
> 
> I killed these with SIGTERM (15) [and not the more aggressive SIGKILL (9)] 
> and they disappeared and resources were freed up, so I thought all was well 
> in the world. Alas, on my return from vacation today, I found that the 
> headnode has nearly 2 GBs of /opt/gridengine/default/spool/qmaster/messages 
> of the following flavour;
> 
> 
> 06/19/2014 09:24:54|worker|<CLU>|E|execd@compute-<ID>.local reports running 
> job (<JOBID>/1.compute-<ID>) in queue "HARPER@compute-<ID>.local" that was 
> not supposed to be there - killing
> 
> and a couple which are
> 
> 06/19/2014 08:16:58|worker|<CLU>|E|execd@compute-<ID>.local reports running 
> job (<JOBID>/master) in queue "HARPER@compute-<ID>.local" that was not 
> supposed to be there - killing
> 
> Note that
> 
> <CLU> is the our cluster name
> HARPER is the name of one of our queues
> <JOBID> refers the the GE job ID
> <ID> refers to the compute ID 
> 
> For each JOBID, these messages originate from a number of different nodes, 
> which I suspect reflects the nodes which each job was spread across. Choosing 
> job 567641.1 as a concrete example, logging in to one of the slave nodes 
> (compute-4-7) from which this error was coming, I found directories 
> associated with these jobs in both /tmp and in 
> /opt/gridengine/default/spool/compute-4.7/active_jobs. Even after removing 
> both of these directories the error messages persist from compute-4-7 on the 
> headnode. 
> 
> Associated files for these process are also found on the slave nodes in 
> /opt/gridengine/default/spool/qmaster/jobs/00/00<JOBPREFIX>/<JOBSUFFIX> In 
> our example, this means a binary file which you can sort of read exists in 
> /opt/gridengine/default/spool/qmaster/jobs/00/0056/7641 on slave node 
> compute-4-7. Deleting this file leads to it being regenerated after 30 
> seconds, which conincides with the error message on the head node.
> 
> Grepping in both the headnode and the compute-4.7 slave node gridengine 
> directories, only finds reference to the 567652.1 in log files.
> 
> In summary, I'd thought that given qstat did not report on these phantom 
> jobs, they were outside of GE's scope, but despite the resources being freed 
> up after their deletion, the GE management is still trying to handle them. 
> From this there are two questions
> 
> 1) What is the correct way to kill jobs which were submitted by qsub, do not 
> appear in qstat, but are still running on slave nodes and consuming resources 
> (as apparently the approach I used has left a lot of mess!)
> 
> 2) Given that gridengine is reporting on all these missing jobs, how do I go 
> about removing them from the gridengine management infrastructure permanently?
> 
> I rebooted a number of nodes last week, and this stopped the generation of 
> the error messages (i.e. the nodes were generating the messages up until they 
> was rebooted), but the files in /active_jobs and in /jobs remained (obviously 
> the /tmp directory was removed given /tmp is tmpfs). In a test case, the 
> /jobs file which is regenerated on the non-rebooted nodes is not regenerated 
> when removed from the nodes I rebooted.
> 
> An entirely legitimate solution might just be to be to reboot the node, and 
> then delete the /active_jobs and /jobs files but given the mess I now have on 
> my hands, I'm worried there may be further issues down the line if I continue 
> down this manual override approach.

Well, on the one hand this is the way to go in case of the observed behavior. 
But only as a second step, as usually SGE will detect the former running jobs 
after a reboot and issue at least an email to the admin in this case.

-- Reuti


> Any thoughts would be greatly appreciated. I've tried to include as much 
> detail as possible but can, of course, provide any other info needed.
> 
> ~ alex
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

Reply via email to