On 14.01.2015 11:05, Reuti wrote:
Hi,
On 14.01.2015 at 10:09, Roberto Nunnari wrote:
Hi.
man sge_pe states:
control_slaves
This parameter can be set to TRUE or FALSE (the default). It indicates
whether Oracle Grid Engine is the creator of the slave tasks of a parallel
application via sge_execd(8) and sge_shepherd(8) and thus has full control
over all processes in a parallel application, which enables capabilities such
as resource limitation and correct accounting. However, to gain control over
the slave tasks of a parallel application, a sophisticated PE interface is
required, which works closely together with Oracle Grid Engine facilities.
Such PE interfaces are available through your local Oracle Grid Engine support
office.
Does that mean that you need to buy some software from Oracle in order to take
advantage of 'control_slaves TRUE'?
No.
It mainly refers to the fact that it depends on the parallel application
whether any preparation might be necessary by supplying scripts for
start/stop_proc_args and set up or tuning the started application not to do
nasty things like jumping out of the process tree.
Technically, its value must be set to TRUE so that a started job script is
allowed to perform `qrsh -inherit ...` to reach the other nodes without any
`rsh`/`ssh` at all (in my clusters `ssh` is available to admin staff only).
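For reference, a tight-integration PE definition (as shown by `qconf -sp <pe_name>`) typically looks roughly like the sketch below; the PE name, slot count and allocation rule are placeholders, not values from this thread:

```
pe_name            mpi_tight
slots              128
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```

With `control_slaves TRUE`, sge_execd/sge_shepherd start the slave tasks themselves, which is what makes `qrsh -inherit` work and gives correct accounting.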
Interesting.. I once tried to do the same, but a program stopped working..
so I implemented a (half) solution where ssh is restricted to admins on the
master node and available to all users on the execution nodes.
While such scripts were mandatory for many parallel applications in the past,
current versions of MPICH and Open MPI (`./configure --with-sge` for the
latter) support SGE out of the box.
For Open MPI you can look for the value:
$ ompi_info | grep grid
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)
Yes. It's like that, thank you. :-)
whether it's set up in your version. Care must be taken with Open MPI 1.8 and
newer: by default they apply a core binding independent of SGE's and always
start at socket/core 0/0, i.e. if more than one Open MPI job is running on a
node it's necessary either to switch off Open MPI's core binding (and/or use
SGE's) or to reformat the core list granted by SGE so that it can be used by
Open MPI.
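The "reformat the core list" step could be sketched like this; an illustration only, assuming SGE's core binding was requested with `-binding env` (which exports the granted OS processor numbers in `SGE_BINDING`), and using Open MPI 1.8's `--bind-to none` / `--cpu-set` options:

```shell
# Sketch: convert the space-separated processor list SGE exports in
# SGE_BINDING (e.g. "0 2 4") into the comma-separated form that
# Open MPI's --cpu-set option expects.
SGE_BINDING="0 2 4"                     # normally set by sge_execd; example value
CPUSET=$(echo "$SGE_BINDING" | tr ' ' ',')
echo "$CPUSET"                          # prints 0,2,4
# Then, in the job script, something like:
#   mpirun --bind-to none --cpu-set "$CPUSET" ./a.out
```

Alternatively, `mpirun --bind-to none` alone simply disables Open MPI's own binding and leaves placement to SGE.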
Hmm.. I see that CentOS 6.6 introduced openmpi 1.8.1..
# ompi_info | grep grid
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.1)
while on CentOS 6.4:
# ompi_info | grep grid
MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.5.4)
..so does that mean that even though it's version 1.8.1, it doesn't use
the default core binding that breaks SGE? Let me rephrase my question: if I
upgrade my execution nodes from CentOS 6.4 (which uses openmpi 1.5.4) to
CentOS 6.6 (which uses openmpi 1.8.1), will SGE PE jobs continue to work, or
will they need some tweaks?
You talk about 'switch off Open MPI's core binding and/or use SGE's
one'.. how do you do that? At build time or at run time? What's the
command-line switch?
Thank you and best regards.
Robi
-- Reuti
In my production environment I have four PEs, two set to
'control_slaves FALSE' and two to 'control_slaves TRUE'.. and as far as I
know, all of them behave as expected.. it has been like that for about 9
years, since I inherited the SGE cluster..
Can anybody cast some light on it, please?
my present environment:
- OGE 6.2u7
- on the execution nodes: openmpi 1.5.4
- on the master node: openmpi 1.4
Thank you and best regards.
Robi
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users