On 05.06.2014 at 11:51, Esztermann, Ansgar wrote:

> Hi everyone,
> 
> we have a strange problem here where jobs die through SIGKILL (so far, I have 
> failed to find out what triggered the signal) but then some processes remain 
> on the node. We are using one of the killkids variants, but (at least) for 
> multi-node jobs, there are actually *two* gids in use on the job master: one 
> for the jobscript, mpiexec.hydra and qsh, and another one for qrsh_starter 
> and the actual executables.
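
To see which processes carry which additional group id, one option is to read
the `Groups:` line of /proc/<pid>/status on the node. The following is only a
minimal sketch, assuming a Linux node with /proc mounted; the function names
are made up for illustration and are not part of SGE or of any killkids script.

```python
import os


def supplementary_gids(status_text):
    """Return the gids listed on the 'Groups:' line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return {int(g) for g in line.split()[1:]}
    return set()


def pids_with_gid(gid, proc_root="/proc"):
    """List the pids whose supplementary groups include gid (Linux only)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                if gid in supplementary_gids(fh.read()):
                    found.append(int(entry))
        except OSError:
            # The process exited in the meantime or is not readable; skip it.
            continue
    return sorted(found)
```

Calling `pids_with_gid()` once per additional gid would show whether the
leftover executables sit under the second gid only.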

This sounds like a problem in the MPI setup. With recent MPI implementations 
there shouldn't be any local `qrsh` (if there is one, you are right: it gets 
a new addgrpid). With current MPI libraries the local processes should be 
forked directly by `mpiexec`. Is the name resolution working? I.e., is 
everything using either only the short hostname *or* only the FQDN?

- How is SGE set up regarding hostname/FQDN?
- What is recorded in the $PE_HOSTFILE for the names?
- Some Linux distributions (IIRC ROCKS) changed the default behavior of 
`hostname` to always output the FQDN, which might not match the entries in 
the $PE_HOSTFILE.
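
The comparison asked for above could be sketched in Python roughly as follows.
This is a sketch only: it assumes the usual $PE_HOSTFILE layout with the host
name in the first column, and the helper names are hypothetical.

```python
import os
import socket


def hostfile_names(hostfile_text):
    """Return the first column (the host name) of each non-empty hostfile line."""
    return [line.split()[0] for line in hostfile_text.splitlines() if line.strip()]


def classify(names):
    """Return 'fqdn', 'short', 'mixed' or 'empty', judged by dots in the names."""
    if not names:
        return "empty"
    dotted = ["." in name for name in names]
    if all(dotted):
        return "fqdn"
    if not any(dotted):
        return "short"
    return "mixed"


def local_name_matches(names, local_name=None):
    """Check whether the local host name appears literally in the hostfile names."""
    local = socket.gethostname() if local_name is None else local_name
    return local in names


if __name__ == "__main__":
    path = os.environ.get("PE_HOSTFILE")
    if path:
        with open(path) as fh:
            names = hostfile_names(fh.read())
        print("names in hostfile:", classify(names))
        print("local name listed:", local_name_matches(names))
```

Run from inside a parallel job script, a "mixed" result or a local name that
is not listed would point to the hostname/FQDN mismatch described above.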

-- Reuti


> terminate_method, however, runs only once (for the jobscript gid), so the 
> executables remain unchallenged. Unfortunately, one of them hangs in a write 
> while the others perform a busy wait, significantly slowing down the next job.
> 
> I guess the best way out of this would be to use a cgroups-capable GE 
> version, but I am somewhat reluctant to perform a major upgrade on a 
> production cluster unless absolutely necessary.
> 
> So, back to the question: is it normal to have two different gids with 
> ENABLE_ADDGRP_KILL? Is terminate_method supposed to run twice in this case?
> 
> Moreover, is it possible to find out what killed the job in the first place? 
> Login to the compute nodes is not allowed, so this must have happened without 
> manual intervention.
> 
> Thanks a lot,
> 
> A.
> 
> PS: Software is OGS/GE 2011.11
> -- 
> Ansgar Esztermann
> DV-Systemadministration
> Max-Planck-Institut für biophysikalische Chemie, Abteilung 105
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

