Hi Reuti, that's why it is SIGTSTP, not SIGSTOP. Erik Soyez.
On Wed, 13 Jun 2012, Reuti wrote:
Am 13.06.2012 um 08:39 schrieb Erik Soyez:
Rayson, yes, it kind of worked with 6.2u5, but we used it mainly with
HP-MPI which only needs a SIGTSTP for the master process in order to
suspend the entire job. Regards, Erik.
How does this work? Usually SIGSTOP can't be trapped. So, are the
other processes on the slave nodes stopping themselves because some kind
of heartbeat is missing once the master process is already stopped? Later,
on a SIGCONT, the master process will of course have to wake them up
again by distributing the signal.
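The difference between the two signals can be checked directly: a process may install a handler for SIGTSTP, but not for SIGSTOP. The Python sketch below shows that, plus one way a master process could forward a stop/continue to its workers. It is only an illustration for local child processes; `worker_pids`, `forward` and `install_master_handlers` are made-up names, and nothing here is taken from HP-MPI or Grid Engine, whose actual mechanism is not described in this thread.

```python
import os
import signal


def can_install_handler(signum):
    """Return True if this process may install its own handler for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except OSError:
        # The kernel refuses handlers for SIGSTOP (and SIGKILL).
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the old disposition back
    return True


def forward(signum, worker_pids, send=os.kill):
    """Send signum to every worker; return the pids that no longer exist."""
    gone = []
    for pid in worker_pids:
        try:
            send(pid, signum)
        except ProcessLookupError:
            gone.append(pid)
    return gone


def install_master_handlers(worker_pids):
    """Hypothetical master-side setup: relay SIGTSTP/SIGCONT to the workers."""

    def on_tstp(signum, frame):
        forward(signal.SIGSTOP, worker_pids)
        # Only now stop the master itself, with the untrappable signal.
        os.kill(os.getpid(), signal.SIGSTOP)

    def on_cont(signum, frame):
        forward(signal.SIGCONT, worker_pids)

    signal.signal(signal.SIGTSTP, on_tstp)
    signal.signal(signal.SIGCONT, on_cont)
```

With workers on remote nodes, the forwarding step would have to go through whatever channel the MPI runtime already has to its remote processes, since `os.kill` only reaches local pids.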
On Wed, 13 Jun 2012, Rayson Ho wrote:
On Wed, Jun 13, 2012 at 1:47 AM, Erik Soyez wrote:
You probably need some kind of cronjob to suspend and unsuspend your
parallel jobs correctly. Or does anyone have a patch for this?
Erik,
So was it really working when you tried it with SGE 6.2u5?
I have not looked into the code that handles parallel job suspension
in detail (we were working on "near-by" code in 2008 and Shannon was
also looking into suspending parallel jobs at that time, and thus
we just relied on him to debug the code :-D ).
However, in order to properly handle the case you mentioned, the
qmaster will need to keep track of the number of times subordination
happens to a job. And I can already think of issues if the accounting
code is not accurate enough.
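A minimal sketch of that bookkeeping, assuming the case in question is a single job being suspended by more than one subordination event: the job is signalled only on the first suspend and only resumed when the last one is lifted. `SuspendCounter` and its methods are hypothetical names for illustration, not Grid Engine code.

```python
class SuspendCounter:
    """Count outstanding suspensions per job (illustrative only)."""

    def __init__(self):
        self._counts = {}

    def suspend(self, job_id):
        """Record one suspension; return True if the job must be stopped now."""
        count = self._counts.get(job_id, 0) + 1
        self._counts[job_id] = count
        return count == 1  # only the first suspension sends a signal

    def resume(self, job_id):
        """Lift one suspension; return True if the job may continue now."""
        count = self._counts.get(job_id, 0)
        if count == 0:
            # This is the kind of accounting error that would leave a job
            # stopped forever or wake it too early.
            raise ValueError(f"job {job_id} is not suspended")
        count -= 1
        if count:
            self._counts[job_id] = count
        else:
            del self._counts[job_id]
        return count == 0
```

If a job is suspended twice, the first `resume` returns False and the job stays stopped; only the second one returns True.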
Do you know if other batch systems handle the case you mentioned correctly?
On Tue, 12 Jun 2012, Joseph Farran wrote:
Well, for our needs, we *REALLY* need Parallel Job suspension. It's
not even a choice for us.
If Torque/Maui can do it, I am sure OGE can do it without issues.
Can someone please tell me what patch I need to install to un-break /
turn-on Parallel job suspension?
If you guys are that paranoid about PE suspension, how about adding an
on/off flag for this since the code is already there and let the admin pick?
On 06/12/2012 06:52 AM, Dave Love wrote:
"Joseph A. Farran" writes:
If you guys are taking requests, *please* add suspension and ignore the
old Sun recommendation.
Support for suspension exists, it's just broken (per the issue Reuti
pointed to). The use of | is clearly wrong, but the other bit isn't
clear. It's one of the available patches I wanted to understand before
applying (and had forgotten about). Can anyone cast more light on it?