I was on vacation but wanted to thank Biao for summarizing the current state! Thanks!
On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <mmyy1...@gmail.com> wrote:

> Hi Aaron,
>
> From my understanding, you want to shut down a Task Manager without
> restarting the job that has tasks running on that Task Manager?
>
> In the current implementation, if a Task Manager goes down, the tasks on
> it are treated as failed. The behavior on task failure is defined by the
> `FailoverStrategy`, which is `RestartAllStrategy` by default. That's why
> the whole job restarts when a Task Manager is gone. As Paul said, you
> could try the "region restart failover strategy" when 1.9 is released.
> It might help, but that depends on your job topology.
>
> The deeper reason for this issue is Flink's consistency semantics,
> AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics, so
> there is not much choice of `FailoverStrategy`.
> It might be improved in the future. There are some discussions on the
> mailing list about providing weaker consistency semantics to improve
> the `FailoverStrategy`. We are pushing this improvement forward. I hope
> it can be included in 1.10.
>
> Regarding your question, I guess the answer is no for now. More frequent
> checkpoints or a manually triggered savepoint might help by making
> recovery quicker.
>
>
> On Fri, Jul 12, 2019 at 10:25 AM, Paul Lam <paullin3...@gmail.com> wrote:
>
>> Hi,
>>
>> Maybe the region restart strategy can help. It restarts only the minimum
>> set of required tasks. Note that it’s recommended only from the 1.9
>> release onward, see [1], unless you’re running a stateless job.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10712
>>
>> Best,
>> Paul Lam
>>
>> On Jul 12, 2019, at 03:38, Aaron Levin <aaronle...@stripe.com> wrote:
>>
>> Hello,
>>
>> Is there a way to gracefully terminate a Task Manager beyond just killing
>> it (this seems to be what `./taskmanager.sh stop` does)? Specifically, I'm
>> interested in a way to replace a Task Manager that has currently-running
>> tasks. It would be great if it was possible to terminate a Task Manager
>> without restarting the job, though I'm not sure if this is possible.
>>
>> Context: at my work we regularly cycle our hosts for maintenance and
>> security. Each time we do this we stop the Task Manager running on the
>> host being cycled. This causes the entire job to restart, resulting in
>> downtime for the job. I'd love to decrease this downtime if at all
>> possible.
>>
>> Thanks! Any insight is appreciated!
>>
>> Best,
>>
>> Aaron Levin
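
For anyone finding this thread later: the region failover strategy mentioned in the thread is selected in `flink-conf.yaml`. The fragment below is a minimal sketch, assuming Flink 1.9 or later and the standard `jobmanager.execution.failover-strategy` key. Whether it actually avoids a full job restart depends on the job topology, as noted in the thread.

```yaml
# flink-conf.yaml (assumes Flink 1.9+)
# "full" restarts every task in the job when any task fails.
# "region" restarts only the pipelined region(s) affected by the failed task.
jobmanager.execution.failover-strategy: region
```

If every task in the job is connected through pipelined exchanges (for example, a streaming job with a keyBy or rebalance between all operators), the whole job forms a single region, and losing one Task Manager would still restart everything.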