On Jun 5, 2014, at 12:25, Reuti <[email protected]> wrote:

> 
> On 05.06.2014 at 11:51, Esztermann, Ansgar wrote:
> 
>> Hi everyone,
>> 
>> we have a strange problem here where jobs die through SIGKILL (so far, I 
>> have failed to find out what triggered the signal) but then some processes 
>> remain on the node. We are using one of the killkids variants, but (at 
>> least) for multi-node jobs, there are actually *two* gids in use on the job 
>> master: one for the jobscript, mpiexec.hydra and qsh, and another one for 
>> qrsh_starter and the actual executables.
> 
> This sounds like a problem in the MPI setup. There shouldn't be any local 
> `qrsh` with recent MPI implementations (if there is one, you are right: it 
> gets a new addgrpid). With current MPI libraries, the local processes should 
> be forked by `mpiexec`. Is the name resolution working? That is, are all 
> hosts consistently using either the short hostname *or* the FQDN?

After some more detailed probing, this does not seem to be the case: 
mpiexec.hydra exec()s qrsh once for each host. `hostname` is not called, so 
I've rewritten the machinefile to contain short hostnames only, but to no 
avail: qrsh is used nonetheless.
This is Intel MPI 4.2.3.048 (the latest is .049, but its release notes list 
no relevant changes).
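
To see which processes on a node ended up under which additional group id, 
something along these lines may help. It is only a minimal sketch, assuming a 
Linux node where /proc/<pid>/status carries a "Groups:" line; the function 
names parse_groups and pids_with_gid are made up for illustration and are not 
part of Grid Engine or Intel MPI.

```python
import os


def parse_groups(status_text):
    """Return the supplementary gids listed in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Groups:"):
            return {int(g) for g in line.split(":", 1)[1].split()}
    return set()


def parse_name(status_text):
    """Return the command name from the "Name:" line, or "" if it is missing."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return ""


def pids_with_gid(gid, proc_root="/proc"):
    """List (pid, command name) for every readable process that carries gid
    as a supplementary group. Assumes a Linux-style /proc layout."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited in the meantime or is not readable
        if gid in parse_groups(text):
            matches.append((int(entry), parse_name(text)))
    return matches
```

Calling pids_with_gid() once with each of the two gids seen on the job master 
would show which processes belong to the jobscript's group and which to the 
one handed to qrsh_starter.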

A.
-- 
Ansgar Esztermann
DV-Systemadministration
Max-Planck-Institut für biophysikalische Chemie, Abteilung 105

