On 05.06.2014 at 11:51, Esztermann, Ansgar wrote:

> Hi everyone,
>
> we have a strange problem here where jobs die through SIGKILL (so far, I have
> failed to find out what triggered the signal) but then some processes remain
> on the node. We are using one of the killkids variants, but (at least) for
> multi-node jobs, there are actually *two* gids in use on the job master: one
> for the jobscript, mpiexec.hydra and qsh, and another one for qrsh_starter
> and the actual executables.

This sounds like a problem in the MPI setup. With recent MPI implementations
there shouldn't be any local `qrsh` call (if there is one, you are right: it
gets a new addgrpid). With current MPI libraries, the local processes should
be forked by `mpiexec` itself.

Is the name resolution working? I.e., are all machines using only the short
hostname *or* only the FQDN?

- How is SGE set up regarding hostname/FQDN?
- What is recorded in the $PE_HOSTFILE for the names?
- Some Linux distributions (IIRC ROCKS) changed the default behavior of
  `hostname` to always output the FQDN, which might not match the entries in
  the $PE_HOSTFILE.

-- Reuti

> terminate_method, however, runs only once (for the jobscript gid), so the
> executables remain unchallenged. Unfortunately, one of them hangs in a write
> while the others perform a busy wait, significantly slowing down the next job.
>
> I guess the best way out of this would be to use a cgroups-capable GE
> version, but I am somewhat reluctant to perform a major upgrade on a
> production cluster unless absolutely necessary.
>
> So, back to the question: is it normal to have two different gids with
> ENABLE_ADDGRP_KILL? Is terminate_method supposed to run twice in this case?
>
> Moreover, is it possible to find out what killed the job in the first place?
> Login to the compute nodes is not allowed, so this must have happened without
> manual intervention.
>
> Thanks a lot,
>
> A.
>
> PS: Software is OGS/GE 2011.11
> --
> Ansgar Esztermann
> DV-Systemadministration
> Max-Planck-Institut für biophysikalische Chemie, Abteilung 105
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
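
To make the hostname/FQDN questions from the reply above concrete, here is a minimal sketch of a check one could run from inside a job script. It assumes the usual $PE_HOSTFILE layout with the host name in the first whitespace-separated column. The helper names (`parse_pe_hostfile`, `check_names`) are hypothetical and not part of SGE, and the check only illustrates the kind of mismatch to look for; it is not a diagnosis of this particular cluster.

```python
import os
import socket


def parse_pe_hostfile(text):
    """Return the host names found in the first column of $PE_HOSTFILE-style text."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            hosts.append(fields[0].lower())
    return hosts


def check_names(hostfile_hosts, local_names):
    """Report mixed short/FQDN entries and whether a local name is listed verbatim."""
    # Assumption: a name containing a dot is treated as an FQDN.
    styles = {"fqdn" if "." in h else "short" for h in hostfile_hosts}
    local = [n.lower() for n in local_names]
    return {
        "mixed_styles": len(styles) > 1,
        "local_listed": any(n in hostfile_hosts for n in local),
    }


if __name__ == "__main__":
    path = os.environ.get("PE_HOSTFILE")  # set by SGE for parallel jobs
    if path:
        with open(path) as fh:
            hosts = parse_pe_hostfile(fh.read())
        # socket.gethostname() is compared as-is; whether it is short or
        # fully qualified depends on how the node is configured.
        print(hosts, check_names(hosts, [socket.gethostname()]))
    else:
        print("PE_HOSTFILE is not set; run this inside a parallel job")
```

If `local_listed` comes out false on the job master, the name the node reports for itself does not literally match its entry in the $PE_HOSTFILE, which is the situation described in the last bullet above.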
