Hi,

I wonder whether it has ever come up in discussion that SGE can behave 
similarly to Torque/PBS regarding the mangling of hostnames. It's similar to 
https://github.com/open-mpi/ompi/issues/2328 in that a node can have 
multiple network interfaces, each with a unique name. SGE's operation 
can be routed to a specific network interface by use of a file:

$SGE_ROOT/$SGE_CELL/common/host_aliases

which has the format:

<sge-name of the node> <one or more blanks> <real long or short hostname>
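
To make the format concrete, here is a minimal Python sketch that reads lines of this shape into a lookup table from real hostname to SGE name. The function name `parse_host_aliases` is made up for illustration, and the sample entries reuse the names from the PS below; it only assumes the whitespace-separated layout described above.

```python
def parse_host_aliases(text):
    """Map each real hostname to the SGE-internal name it is listed under."""
    real_to_sge = {}
    for line in text.splitlines():
        fields = line.split()          # split on one or more blanks
        if len(fields) < 2:
            continue                   # skip empty or incomplete lines
        sge_name, real_names = fields[0], fields[1:]
        for real in real_names:
            real_to_sge[real] = sge_name
    return real_to_sge


# Sample content in the layout described above (hypothetical entries).
sample = "login foo\nmaster   baz\n"
aliases = parse_host_aliases(sample)
```

With such a table one can look up which SGE name corresponds to whatever gethostname() reports on a given machine.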

Hence the generated $PE_HOSTFILE lists the name known to SGE, although 
the `hostname` command returns the real name. In this case Open MPI would 
start a `qrsh -inherit …` call instead of forking, as it thinks that these are 
different machines (assuming an allocation_rule of $PE_SLOTS, so that 
`mpiexec` is supposed to be on the same machine as the started tasks).

I tried the "old" way of providing a start_proc_args to the PE that creates a 
symbolic link to `hostname` in $TMPDIR, so that an adjusted `hostname` call 
is available inside the job script, but obviously Open MPI calls 
gethostname() directly rather than going through an external binary.

So I mangled the hostname in the created machinefile in the job script to feed 
an "adjusted" $PE_HOSTFILE to Open MPI, and then it works as intended: Open 
MPI forks.
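
The rewrite itself could look roughly like the following Python sketch. This is not the exact code from my job script, only an illustration of the idea. It assumes the usual $PE_HOSTFILE layout with the hostname in the first column, and the function name `rewrite_pe_hostfile`, the queue name `all.q` and the node name `node01` are made up for the example.

```python
def rewrite_pe_hostfile(hostfile_text, sge_to_real):
    """Replace the SGE-internal name in column one with the real hostname."""
    out = []
    for line in hostfile_text.splitlines():
        fields = line.split()
        if fields:
            # Only the first column (the hostname) is adjusted; names
            # without an alias entry are left untouched.
            fields[0] = sge_to_real.get(fields[0], fields[0])
        out.append(" ".join(fields))
    return "\n".join(out) + "\n"


# Hypothetical hostfile content: hostname, slots, queue, processor range.
sample_hostfile = (
    "login 4 all.q@login UNDEFINED\n"
    "node01 8 all.q@node01 UNDEFINED\n"
)
adjusted = rewrite_pe_hostfile(sample_hostfile, {"login": "foo", "master": "baz"})
```

The adjusted text would then be written to a file in $TMPDIR, and $PE_HOSTFILE would be pointed at that file before `mpiexec` is called.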

Does anyone else need such a patch in Open MPI and is it suitable to be 
included?

-- Reuti

PS: Only the headnodes have more than one network interface in our case, and 
hence it didn't come to my attention until now, when there was a need to 
also use some cores on the headnodes. They are known internally to SGE as 
"login" and "master", but the external names may be "foo" and "baz", which is 
what gethostname() returns.
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
