Dear all,

We're experiencing some odd behaviour, and I was hoping to get some
insight. We have a ~700-CPU cluster running GE 6.2u5 on Rocks release
5.4.3 (Viper).

In March this year we had a network failure, which [apparently] killed all
our jobs (i.e. on the cluster reboot, no jobs were listed in qstat, which
made sense given that all the nodes and the headnode had been rebooted).
Many of these jobs were MPI jobs distributed over 10-20 nodes. Several
weeks later we began experiencing some job crashes, and while investigating
those I discovered that a number (though not all) of the jobs which had
supposedly died during the March network failure were in fact still running
on the nodes (according to top), but were not listed by qstat.
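
For anyone who wants to hunt for similar orphans: GE exports JOB_ID into
the environment of every job it starts, so a rough way to list them on a
node (run as root) is something along these lines (a sketch rather than
what I literally typed):

  # Print "<pid> <GE job id>" for every process carrying a JOB_ID in its
  # environment, so the ids can be compared against what qstat still reports.
  for pid in /proc/[0-9]*; do
    jid=$(cat "$pid/environ" 2>/dev/null | tr '\0' '\n' | sed -n 's/^JOB_ID=//p')
    [ -n "$jid" ] && echo "${pid#/proc/} $jid"
  done | sort -n -k2

Anything whose job id no longer shows up in "qstat -u '*'" is a candidate
orphan.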

I killed these with SIGTERM (15) [and not the more aggressive SIGKILL (9)]
and they disappeared and resources were freed up, so I thought all was well
in the world. Alas, on my return from vacation today, I found that on the
headnode /opt/gridengine/default/spool/qmaster/messages has grown to nearly
2 GB of entries of the following flavour:


06/19/2014 09:24:54|worker|<CLU>|E|execd@compute-<ID>.local reports running
job (<JOBID>/1.compute-<ID>) in queue "HARPER@compute-<ID>.local" that was
not supposed to be there - killing

and a couple which are

06/19/2014 08:16:58|worker|<CLU>|E|execd@compute-<ID>.local reports running
job (<JOBID>/master) in queue "HARPER@compute-<ID>.local" that was not
supposed to be there - killing

Note that:

<CLU> is our cluster name
HARPER is the name of one of our queues
<JOBID> refers to the GE job ID
<ID> refers to the compute node ID

For each JOBID, these messages originate from a number of different nodes,
which I suspect reflects the nodes across which each job was spread.
Choosing job 567641.1 as a concrete example, and logging in to one of the
slave nodes (compute-4-7) from which this error was coming, I found
directories associated with these jobs both in /tmp and
in /opt/gridengine/default/spool/compute-4-7/active_jobs. Even after
removing both of these directories, the error messages from compute-4-7
persist on the headnode.

Associated files for these processes are also found on the slave nodes
in /opt/gridengine/default/spool/qmaster/jobs/00/00<JOBPREFIX>/<JOBSUFFIX>.
In our example, this means a partially human-readable binary file exists
at /opt/gridengine/default/spool/qmaster/jobs/00/0056/7641 on slave node
compute-4-7. Deleting this file leads to it being regenerated after 30
seconds, which coincides with the error message on the headnode.
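
Incidentally, the mapping from job ID to that spool path appears to be the
job ID zero-padded to 10 digits and split 2/4/4; at least that is
consistent with this example:

  # Sketch: reconstruct the qmaster job spool path for a numeric job id,
  # assuming the 2/4/4 split of the zero-padded id holds in general.
  jid=567641
  p=$(printf '%010d' "$jid")
  echo "/opt/gridengine/default/spool/qmaster/jobs/${p:0:2}/${p:2:4}/${p:6:4}"
  # -> /opt/gridengine/default/spool/qmaster/jobs/00/0056/7641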

Grepping the gridengine directories on both the headnode and the
compute-4-7 slave node finds references to 567641.1 only in log files.
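
By "grepping" I mean nothing cleverer than roughly the following, on the
respective machines:

  grep -r 567641 /opt/gridengine/default/spool/qmaster/ 2>/dev/null
  grep -r 567641 /opt/gridengine/default/spool/compute-4-7/ 2>/dev/null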

In summary, I had thought that, since qstat did not report these phantom
jobs, they were outside GE's scope; but despite the resources being freed
up after I killed them, the GE management is still trying to handle them.
From this there are two questions:

1) What is the correct way to kill jobs which were submitted by qsub, do
not appear in qstat, but are still running on slave nodes and consuming
resources? (The approach I used has apparently left quite a mess!)

2) Given that gridengine is reporting on all these missing jobs, how do I
go about removing them from the gridengine management infrastructure
permanently?

I rebooted a number of nodes last week, and this stopped the generation of
the error messages (i.e. the nodes were generating the messages up until
they were rebooted), but the files in /active_jobs and in /jobs remained
(obviously the /tmp directory was gone, given /tmp is tmpfs). In a test
case, the /jobs file, which is regenerated on the non-rebooted nodes, is
not regenerated when removed on the nodes I rebooted.
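
For what it's worth, that test was no more sophisticated than something
like the following, run per node:

  # Does the job spool file come back after being removed on this node?
  rm -f /opt/gridengine/default/spool/qmaster/jobs/00/0056/7641
  sleep 60
  ls -l /opt/gridengine/default/spool/qmaster/jobs/00/0056/7641
  # reappears within ~30 s on the non-rebooted nodes; stays gone on rebooted ones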

An entirely legitimate solution might just be to reboot the node and then
delete the /active_jobs and /jobs files (roughly the sketch below), but
given the mess I now have on my hands, I'm worried there may be further
issues down the line if I continue down this manual-override approach.
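
For clarity, the sketch I have in mind, with <ID>, <JOBID> and so on left
as placeholders rather than literal values (and assuming the active_jobs
entry is named <JOBID>.<TASKID>):

  # 1. Reboot the affected node.
  ssh compute-<ID> reboot
  # 2. Once it is back up, remove the leftover spool entries for the phantom job.
  rm -rf /opt/gridengine/default/spool/compute-<ID>/active_jobs/<JOBID>.<TASKID>
  rm -f  /opt/gridengine/default/spool/qmaster/jobs/00/00<JOBPREFIX>/<JOBSUFFIX>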

Any thoughts would be greatly appreciated. I've tried to include as much
detail as possible but can, of course, provide any other info needed.

~ alex
