On Tue, Jun 12, 2012 at 12:58 AM, Joseph A. Farran <[email protected]> wrote:
> Yes it makes sense not to introduce new options.

Hi Joseph,

Sorry I was busy with many things (apparently - as it's almost 2am
here and I am still up!), and I've asked Ron to handle some of the
mailing list questions...


> I am not familiar with cgroups, so I need to read up on it.

Cgroups are useful even as a plain Linux sysadmin tool. If you are not
going to write code that interacts with cgroups, skip the Linux kernel
cgroups docs and read the cgroups guides from Oracle, Red Hat, and
SUSE instead:

http://www.oracle.com/technetwork/articles/servers-storage-admin/resource-controllers-linux-1506602.html
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html
http://doc.opensuse.org/products/draft/SLES/SLES-tuning_sd_draft/cha.tuning.cgroups.html

Also, Linux Containers (LXC) is the best-known user of cgroups. The
real benefit of cgroups is the interface it provides for fine-grained
control of resources across groups of processes. In the future, more
kernel "controllers" will provide enhanced cgroups support...
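
For a quick look at which cgroups a process currently belongs to, here
is a minimal Python sketch. It relies only on the documented format of
/proc/<pid>/cgroup (one "hierarchy-ID:controller-list:cgroup-path" entry
per line); the function names are made up for illustration and are not
part of Grid Engine.

```python
from pathlib import Path


def parse_cgroup_lines(text):
    """Parse /proc/<pid>/cgroup content into {controller: cgroup_path}."""
    membership = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each entry looks like "hierarchy-ID:controller-list:cgroup-path".
        # Split at most twice so a ':' inside the path is preserved.
        _hierarchy_id, controllers, path = line.split(":", 2)
        # One hierarchy can carry several comma-separated controllers.
        for controller in controllers.split(","):
            membership[controller] = path
    return membership


def cgroups_of(pid="self"):
    """Return the cgroup membership of a process (Linux only)."""
    return parse_cgroup_lines(Path(f"/proc/{pid}/cgroup").read_text())
```

Calling cgroups_of() with no argument reports on the calling process;
pass a PID to inspect, say, a job's shepherd or one of its tasks.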


> On the subject of OpenMPI and OGE - does OGE correctly suspend and resume
> programs compiled with OpenMPI using the OpenMPI s/r implementation?

You will need to configure a few things... basically read this FAQ entry:

http://www.open-mpi.org/faq/?category=running#suspend-resume

Grid Engine sends SIGSTOP by default, and a process cannot catch
SIGSTOP. That's why you need to set "suspend_method SIGTSTP" in the
queue configuration: SIGTSTP can be caught, which gives orte a chance
to catch the signal and propagate it to the slave tasks.
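
To see why the choice of signal matters, here is a small Python sketch
(POSIX only). It checks whether a handler can be installed for a given
signal, and shows the general shape of a launcher that forwards a
suspend request to its children. This only illustrates the idea and is
not Open MPI's actual code; install_forwarder and its child_pids
argument are hypothetical names.

```python
import os
import signal


def can_catch(signum):
    """Return True if this process may install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # The kernel refuses handlers for SIGSTOP (and SIGKILL).
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the old handler back
    return True


def install_forwarder(child_pids):
    """Hypothetical launcher hook: on SIGTSTP, stop every child."""
    def on_tstp(signum, frame):
        for pid in child_pids:
            try:
                os.kill(pid, signal.SIGSTOP)
            except ProcessLookupError:
                pass  # the child has already exited
    # Returns the previously installed handler so callers can restore it.
    return signal.signal(signal.SIGTSTP, on_tstp)
```

can_catch(signal.SIGTSTP) succeeds while can_catch(signal.SIGSTOP) does
not, which is why a launcher only gets the chance to pass the suspend
along when the queue sends SIGTSTP.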

Rayson


>
> Joseph
>
>
> On 6/11/2012 9:21 PM, Ron Chen wrote:
>>
>> We have not implemented a flag for it, though it would not be hard to add
>> one. The catch with adding a new option is that we then need to support it
>> even if it turns out not to be needed, and we are careful not to add too
>> much extra code. That's why I will do more research first and decide
>> whether it is really needed.
>>
>> I searched Google for TCP suspend issues, and found that some developers
>> say that it is safe if the processes are suspended when they are at a
>> quiescent point.
>>
>> So if in-flight messages are processed first before suspending, which
>> should be the case for the freezer cgroup subsystem, then it should be safe
>> to handle it without adding a new flag.
>>
>> See: http://www.kernel.org/doc/Documentation/cgroups/freezer-subsystem.txt
>>
>> (Rayson added cgroups support in GE 2011.11 U1. Cgroups is Linux only,
>> but most clusters run Linux, at least those doing small to medium-scale
>> HPC.)
>>
>> IBM also planned to use Containers/Cgroups in IBM BlueWaters (before IBM
>> cancelled the project in 2011) to perform checkpointing and restart.
>>
>> https://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_smith.pdf
>>
>>  -Ron
>>
>
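
For reference, the freezer subsystem mentioned in the quoted message is
driven through a single control file, freezer.state, as described in
the kernel document linked there. Below is a minimal Python sketch of
that interface. The helper names are made up, and the cgroup directory
is an assumption: a typical cgroup v1 location would be something like
/sys/fs/cgroup/freezer/<group>, but the mount point varies by system.

```python
from pathlib import Path

# States that may be written to freezer.state. A read can also return
# "FREEZING" while the kernel is still stopping tasks.
WRITABLE_STATES = ("FROZEN", "THAWED")


def set_freezer_state(cgroup_dir, state):
    """Freeze or thaw every task in a freezer cgroup."""
    if state not in WRITABLE_STATES:
        raise ValueError(f"state must be one of {WRITABLE_STATES}")
    Path(cgroup_dir, "freezer.state").write_text(state)


def get_freezer_state(cgroup_dir):
    """Read the current state of a freezer cgroup."""
    return Path(cgroup_dir, "freezer.state").read_text().strip()
```

Writing FROZEN suspends all tasks in the group together, and writing
THAWED resumes them; on a real system this needs write permission on
the cgroup directory.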

