On 27.08.2012 at 18:20, Julien Nicoulaud wrote:

> Thanks for your answer, I'll deal with the clean shutdown in my application.
> 
> However, do you know whether this task failure detection can be disabled? It 
> is acceptable for me to have one worker process crashing, but not if it kills 
> the whole job as a side effect...

But the job is over anyway, if I understand you correctly. Or do you want to 
quit the slaves and continue with the master task?

If you want to quit all slave tasks: maybe create a file <jobid>.stop and have 
the slaves detect its presence and then quit by themselves.
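
A minimal sketch of this stop-file idea, under some assumptions: the 
directory holding the file is visible to the master and to all slaves (here 
$HOME, assumed to be on a shared filesystem), and JOB_ID is available in the 
environment of the slave tasks (otherwise pass the job id to the worker as an 
argument). The names STOP_DIR, stop_file, request_stop and worker_loop are 
made up for illustration and are not part of SGE.

```shell
#!/bin/sh
# Hypothetical sketch; STOP_DIR, stop_file, request_stop and worker_loop
# are illustrative names, not SGE features.

# Assumption: this directory is shared between the master and all slave hosts.
STOP_DIR="${STOP_DIR:-$HOME}"

# Print the path of the stop file for the current job.
# JOB_ID is normally set by SGE; "demo" is only a fallback for a dry run.
stop_file() {
    printf '%s/%s.stop\n' "$STOP_DIR" "${JOB_ID:-demo}"
}

# Master side: create the (empty) stop file to ask all slaves to quit.
request_stop() {
    : > "$(stop_file)"
}

# Slave side: work in small units and check for the stop file between units.
worker_loop() {
    units=0
    while [ ! -e "$(stop_file)" ]; do
        # ... one unit of real work would go here ...
        units=$((units + 1))
        sleep 1
    done
    echo "stop file found after $units unit(s), exiting cleanly"
    return 0
}
```

In master.sh one would call request_stop instead of sending TERM to the qrsh 
processes, then `wait` for them and remove the stop file. Since worker.sh 
returns with exit status 0 on its own, the task should not be reported as 
failed.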

-- Reuti


> 
> 2012/8/26 Reuti <[email protected]>
> Hi,
> 
> On 26.08.2012 at 15:42, Julien Nicoulaud wrote:
> 
> > I'm working on setting up a tightly integrated parallel environment for my 
> > application using the "qrsh -inherit" method, but I can't find the right 
> > way to terminate the qrsh sub-tasks. Whatever method I try, the parent job 
> > always ends with "Unable to run job N"
> 
> You will get this message only if you start it with `-sync y`. It won't be in 
> any logfile otherwise. But I don't see the issue of the workers running 
> forever. They are killed when the complete job exits, although not in a 
> nice way but by a `kill`.
> 
> Maybe you can set in `qconf -mconf`: "execd_params ENABLE_ADDGRP_KILL=TRUE"
> 
> ==
> 
> The usual way to shut down slave tasks: implement your own protocol and use 
> it to tell your worker.sh: "Hey, kill yourself."
> 
> ==
> 
> In principle, handling signals is supported, and the sge_execd can tell the 
> sge_shepherd to signal its kids. For a "normal" binary you can implement 
> actions to handle it in a proper way. With the tight integration by `qrsh 
> -inherit ...` there is a special situation: the "qrsh_starter" will also 
> get the signal, and it will just exit, forcing the job to end.
> 
> -- Reuti
> 
> 
> > message and the qmaster log contains:
> >
> > tightly integrated parallel task 159.1 task 1.vbox-centos6-3 failed - 
> > killing job
> >
> > Does anyone know the right way to handle this?
> >
> > If this can help, I shared my test scripts here: 
> > https://gist.github.com/3479264
> >       • test.sh: submits master.sh as a N slots parallel job
> >       • master.sh:
> >               • Launches N-1 worker.sh with "qrsh -inherit" in the 
> > background
> >               • Works for a while
> >               • Sends TERM to qrsh processes
> >       • worker.sh: works until killed
> > By the way, I'm using SGE 6.2u5.
> >
> > Any help on this is welcome!
> >
> > Regards,
> > Julien
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
> 
> 

