On 7 April 2015 at 22:15, Brian Brazil <[email protected]> wrote:
> On 7 April 2015 at 21:28, Erb, Stephan <[email protected]>
> wrote:
>
>> Brian, do you have any particular plans regarding your shutdown
>> requirements? I have seen that you have filed another issue [1] which is
>> also concerned with graceful shutdown.
>>
>
> Given this thread, I now only wish to hit a different endpoint than
> /quitquitquit (and I may as well do /abortabortabort while I'm at it). The
> rest is changes to our internal shutdown handling.
>
>
>> Stephan
>>
>> PS: For what it's worth, I implemented the 'quick fix' version to my
>> problem stated in the beginning of this thread [2].
>>
>
> That's handy. When writing the code up today I noticed that hitting
> /quitquitquit wasn't unittested. I hope to have that up for review tomorrow
> with unittests, which you could build on to do a more end-to-end unittest
> for your code.
>

This is now up at https://reviews.apache.org/r/32973/

Brian

>
> Brian
>
>
>> [1] https://issues.apache.org/jira/browse/AURORA-1257
>> [2] https://reviews.apache.org/r/32889/
>>
>> ________________________________________
>> From: Brian Brazil <[email protected]>
>> Sent: Tuesday, March 24, 2015 10:48 PM
>> To: [email protected]
>> Subject: Re: Graceful task shutdown
>>
>> On 24 March 2015 at 21:33, George Sirois <[email protected]> wrote:
>>
>> > Unfortunately I don't think my change will be able to make it in as-is.
>> >
>> > As Brian Wickman pointed out, it could introduce serious problems because
>> > there are varying timeouts across the scheduler/executor, so if you set
>> > your wait time to be too high, the scheduler might start to consider the
>> > tasks lost because they stayed in the transient KILLING state for too long.
>> >
>>
>> Hmm, what sort of work is involved in resolving that?
>>
>> In my case I need at least 12s after the /qqq before sending the TERM.
>>
>> Brian
>>
>>
>> >
>> > I do think the lifecycle modules idea would solve Stephan's issue.
>> >
>> > On Tue, Mar 24, 2015 at 5:06 PM, Brian Brazil <[email protected]>
>> > wrote:
>> >
>> > > On 24 March 2015 at 20:57, Erb, Stephan <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi everyone,
>> > > >
>> > > > we are implementing the /health endpoint in our services but omit the
>> > > > implementation of the unauthenticated lifecycle methods /quitquitquit
>> > > > and /abortabortabort.
>> > > >
>> > > > As a consequence, stopping a service is taxed by 10 seconds waiting
>> > > > time [1]. I would like to get rid of this unnecessary delay and can
>> > > > think of two solutions:
>> > > >
>> > > > a) Only perform the escalation wait when the http_signaler reports
>> > > > that the message could be delivered to the service. This is a rather
>> > > > simple and localized fix.
>> > > >
>> > > > b) Use another port for lifecycle events. This would require a new
>> > > > addition to the task configuration and proper plumbing throughout the
>> > > > rest of the system. Backward compatibility could be achieved by using
>> > > > 'health' as the default lifecycle management port.
>> > > >
>> > > > Any thoughts? I would be happy with the simple solution, but in the
>> > > > end it's your call :-)
>> > > >
>> > >
>> > > __george mentioned on IRC working on a change that'll let the wait time
>> > > be configurable (which is something I also need), would that cover your
>> > > use case?
>> > >
>> > > There were also discussions on IRC about custom lifecycle modules.
>> > >
>> > > Brian
>> > >
>> > >
>> > > >
>> > > > Best Regards,
>> > > > Stephan
>> > > >
>> > > > [1]
>> > > > https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/executor/thermos_task_runner.py#L123
>> > >
>> >
>>
>
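
For readers following the thread, option (a) from Stephan's original message can be sketched roughly as below. This is only an illustration of the idea, not the Aurora executor code: the function name graceful_stop, the send_http and send_signal callables, and the exact escalation order (each HTTP endpoint in turn, then TERM) are assumptions made for the example. The 10 second default mirrors the wait mentioned in the thread.

```python
import time
from typing import Callable, List, Sequence, Tuple


def graceful_stop(
    send_http: Callable[[str], bool],      # hypothetical: returns True if the request was delivered
    send_signal: Callable[[str], None],    # hypothetical: delivers a process signal by name
    escalation_wait_secs: float = 10.0,    # the wait discussed in the thread
    endpoints: Sequence[str] = ("/quitquitquit", "/abortabortabort"),
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, bool]]:
    """Escalate through the HTTP lifecycle endpoints, then send TERM.

    The escalation wait is only performed when the HTTP signal was
    actually delivered, so services that do not implement the lifecycle
    endpoints are not taxed with the delay.
    """
    actions: List[Tuple[str, bool]] = []
    for endpoint in endpoints:
        delivered = send_http(endpoint)
        actions.append((endpoint, delivered))
        if delivered:
            # Give the service time to shut down on its own before escalating.
            sleep(escalation_wait_secs)
    send_signal("SIGTERM")
    actions.append(("SIGTERM", True))
    return actions
```

With a service that answers neither endpoint, the sketch goes straight to TERM without sleeping. Passing escalation_wait_secs=12.0 would correspond to the longer wait Brian mentions needing, subject to the scheduler timeout concern George raises.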
