Yes, as I said in one of my previous messages, I was wrong about this: the job only dies if you explicitly send a SIGKILL/INT/TERM to the qrsh process, so this is a non-issue.
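For the archives, here is a minimal sketch of the "use your own protocol" shutdown suggested earlier in the thread, so that no signal ever has to be sent to the qrsh processes. It assumes the master and the workers can see a shared directory. The stop file and the function names worker_loop and run_master are my own illustration and not part of SGE; the launcher is passed in as an argument so the sketch can be tried locally without a cluster.

```shell
#!/bin/sh
# Cooperative shutdown sketch: workers poll for a stop file instead of
# being signalled. All names here are hypothetical, not SGE features.

# worker_loop STOP_FILE
# What worker.sh would do: work until the stop file appears, then exit 0.
worker_loop() {
    stop_file=$1
    while [ ! -e "$stop_file" ]; do
        sleep 1    # placeholder for one unit of real work
    done
    return 0
}

# run_master STOP_FILE N_WORKERS LAUNCHER...
# Starts N_WORKERS copies of LAUNCHER in the background, asks them to
# stop via the stop file, and waits for each one to return.
# Under SGE the launcher would be something like
#   qrsh -inherit <host> /path/to/worker.sh
# (assumption: one call per host granted to the job).
run_master() {
    stop_file=$1; shift
    n_workers=$1; shift
    pids=""
    i=0
    while [ "$i" -lt "$n_workers" ]; do
        "$@" "$stop_file" &
        pids="$pids $!"
        i=$((i + 1))
    done

    # ... the master's own work would go here ...

    : > "$stop_file"    # request shutdown; no TERM sent to qrsh

    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1    # remember whether any worker failed
    done
    rm -f "$stop_file"
    return "$rc"
}
```

Locally, `run_master /tmp/stop 2 worker_loop` starts two workers and returns 0 once both have seen the stop file. On a cluster the stop file would have to live on a filesystem visible from every execution host.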
Thanks for your help.

2012/8/28 Reuti <[email protected]>

> On 28.08.2012 at 19:20, Julien Nicoulaud wrote:
>
> > Yes, exactly that.
>
> So in total you have two issues:
>
> - clean shutdown
> - proper handling in case a worker.sh crashes
>
> I don't get the problem with the second case. The `qrsh -inherit ...` will
> return and that's all.
>
> How is the task failure generated in your environment and what do you
> observe?
>
> -- Reuti
>
>
> > 2012/8/28 Reuti <[email protected]>
> > On 28.08.2012 at 11:48, Julien Nicoulaud wrote:
> >
> > > The FORBID_APPERROR parameter seems to be specific to applications
> > > returning 100.
> > >
> > > My concern was about a random slave process crashing in the middle of
> > > the run,
> >
> > Is your application fault-tolerant in such a way that the other
> > processes discover that one worker.sh crashed and compensate for this
> > failure?
> >
> > -- Reuti
> >
> >
> > > but I realized after some testing that you really have to explicitly
> > > send a signal to the qrsh process to trigger task failure detection.
> > >
> > > Thanks for your help!
> > >
> > > 2012/8/28 William Hay <[email protected]>
> > > On 27 August 2012 17:20, Julien Nicoulaud <[email protected]> wrote:
> > > > Thanks for your answer, I'll deal with the clean shutdown in my
> > > > application.
> > > >
> > > > However, do you know whether this task failure detection can be
> > > > disabled? It is acceptable for me to have one worker process
> > > > crashing, but not if it kills the whole job as a side effect...
> > >
> > > For a similar but not identical problem we found setting
> > > FORBID_APPERROR=true in qmaster_params prevented gratuitous jobkills
> > > when a subtask of a job finished.
> > >
> > > William
> > >
> > > > 2012/8/26 Reuti <[email protected]>
> > > > >
> > > > > Hi,
> > > > >
> > > > > On 26.08.2012 at 15:42, Julien Nicoulaud wrote:
> > > > >
> > > > > > I'm working on setting up a tightly integrated parallel
> > > > > > environment for my application using the "qrsh -inherit"
> > > > > > method, but I can't find the right way to terminate the qrsh
> > > > > > sub-tasks. Whatever method I try, the parent job always ends
> > > > > > with "Unable to run job N"
> > > > >
> > > > > You will get this message only if you start it with `-sync y`. It
> > > > > won't be in any logfile otherwise. But I don't face the issue that
> > > > > the workers run forever. They are killed by the exit of the
> > > > > complete job, although not in a nice way but by a `kill`.
> > > > >
> > > > > Maybe you can set in `qconf -mconf`: "execd_params
> > > > > ENABLE_ADDGRP_KILL=TRUE"
> > > > >
> > > > > ==
> > > > >
> > > > > The usual way to shut down slave tasks: use your own protocol
> > > > > which you want to implement and tell your worker.sh this way:
> > > > > "Hey, kill yourself."
> > > > >
> > > > > ==
> > > > >
> > > > > In principle it's supported to handle signals, and the sge_execd
> > > > > can tell the sge_shepherd to signal its kids. For a "normal"
> > > > > binary you can implement actions to handle it in a proper way.
> > > > > Using the tight integration by `qrsh -inherit ...` there is the
> > > > > special situation that the "qrsh_starter" will also get the
> > > > > signal, and it will just exit, forcing the job to end.
> > > > >
> > > > > -- Reuti
> > > > >
> > > > >
> > > > > > message and the qmaster log contains:
> > > > > >
> > > > > > tightly integrated parallel task 159.1 task 1.vbox-centos6-3
> > > > > > failed - killing job
> > > > > >
> > > > > > Does anyone know the right way to handle this?
> > > > > >
> > > > > > If this can help, I shared my test scripts here:
> > > > > > https://gist.github.com/3479264
> > > > > > • test.sh: submits master.sh as a N slots parallel job
> > > > > > • master.sh:
> > > > > >     • Launches N-1 worker.sh with "qrsh -inherit" in the
> > > > > >       background
> > > > > >     • Works for a while
> > > > > >     • Sends TERM to qrsh processes
> > > > > > • worker.sh: works until killed
> > > > > > By the way, I'm using SGE 6.2u5.
> > > > > >
> > > > > > Any help on this is welcome!
> > > > > >
> > > > > > Regards,
> > > > > > Julien
> > > > > > _______________________________________________
> > > > > > users mailing list
> > > > > > [email protected]
> > > > > > https://gridengine.org/mailman/listinfo/users
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
