Hi,

On 13.06.2012 at 11:11, Erik Soyez wrote:

> Hi Reuti,
>
> that's why it is SIGTSTP, not SIGSTOP.
>
> Erik Soyez.

Aha, and this one can be defined in suspend_method then.

-- Reuti


> On Wed, 13 Jun 2012, Reuti wrote:
>
>> On 13.06.2012 at 08:39, Erik Soyez wrote:
>>
>>> Rayson, yes, it kind of worked with 6.2u5, but we used it mainly with
>>> HP-MPI which only needs a SIGTSTP for the master process in order to
>>> suspend the entire job.
>>>
>>> Regards, Erik.
>>
>> How does this work? Usually SIGSTOP can't be trapped. So, are the other
>> processes on the slave nodes stopping themselves because some kind of
>> heartbeat is missing once the master process is already stopped? Later on,
>> on a SIGCONT, the master process will of course have to wake them up again
>> by distributing the signal.
>>
>>
>>> On Wed, 13 Jun 2012, Rayson Ho wrote:
>>>
>>>> On Wed, Jun 13, 2012 at 1:47 AM, Erik Soyez
>>>> <[email protected]> wrote:
>>>>> You probably need some kind of cronjob to suspend and unsuspend your
>>>>> parallel jobs correctly. Or does anyone have a patch for this?
>>>>
>>>> Erik,
>>>>
>>>> So is/was it really working when you try it with SGE 6.2u5??
>>>>
>>>> I have not looked into the code that handles parallel job suspension
>>>> in detail (we were working on "near-by" code in 2008 and Shannon was
>>>> also looking into suspending parallel jobs at that time, and thus
>>>> we just relied on him to debug the code :-D ).
>>>>
>>>> However, in order to properly handle the case you mentioned, the
>>>> qmaster will need to keep track of the number of times subordination
>>>> happens to a job. And I can already think of issues if the accounting
>>>> code is not accurate enough.
>>>>
>>>> Do you know if other batch systems handle the case you mentioned correctly?
>>>>
>>>>
>>>>> On Tue, 12 Jun 2012, Joseph Farran wrote:
>>>>>
>>>>>> Well, for our needs, we *REALLY* need Parallel Job suspension. It's
>>>>>> not even a choice for us.
>>>>>>
>>>>>> If Torque/Maui can do it, I am sure OGE can do it without issues.
>>>>>>
>>>>>> Can someone please tell me what patch I need to install to un-break /
>>>>>> turn-on Parallel job suspension?
>>>>>>
>>>>>> If you guys are that paranoid about PE suspension, how about adding an
>>>>>> on/off flag for this since the code is already there and let the admin
>>>>>> pick?
>>>>>>
>>>>>>
>>>>>> On 06/12/2012 06:52 AM, Dave Love wrote:
>>>>>>>
>>>>>>> "Joseph A. Farran"<[email protected]> writes:
>>>>>>>
>>>>>>>> If you guys are taking requests, *please* add suspension and ignore old
>>>>>>>> Sun recommendation.
>>>>>>>
>>>>>>> Support for suspension exists, it's just broken (per the issue Reuti
>>>>>>> pointed to). The use of | is clearly wrong, but the other bit isn't
>>>>>>> clear. It's one of the available patches I wanted to understand before
>>>>>>> applying (and had forgotten about). Can anyone cast more light on it?
>
> --
> Vorstandsvorsitzender/Chairman of the board of management:
> Gerd-Lothar Leonhart
> Vorstand/Board of Management:
> Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz, Dr. Ingrid Zech
> Vorsitzender des Aufsichtsrats/
> Chairman of the Supervisory Board:
> Philippe Miltin
> Sitz/Registered Office: Tuebingen
> Registergericht/Registration Court: Stuttgart
> Registernummer/Commercial Register No.: HRB 382196
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
