It's a known issue: https://issues.apache.org/jira/browse/MESOS-3070
Putting in place a protection mechanism sounds good, but is rather complicated. See the comment in this ticket:
https://issues.apache.org/jira/browse/MESOS-6785

On Wed, Dec 20, 2017 at 8:26 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:

> Hi all,
>
> We have seen a mesos master crash loop after a leader failover. After more
> investigation, it seems that a same task ID was managed to be created onto
> multiple Mesos agents in the cluster.
>
> One possible logical sequence which can lead to such problem:
>
> 1. Task T1 was launched to master M1 on agent A1 for framework F;
> 2. Master M1 failed over to M2;
> 3. Before A1 reregistered to M2, the same T1 was launched on to agent A2:
> M2 does not know previous T1 yet so it accepted it and sent to A2;
> 4. A1 reregistered: this probably crashed M2 (because same task cannot be
> added twice);
> 5. When M3 tries to come up after M2, it further crashes because both A1
> and A2 tried to add a T1 to the framework.
>
> (I only have logs to prove the last step right now)
>
> This happened on 1.4.0 masters.
>
> Although this is probably triggered by incorrect retry logic on framework
> side, I wonder whether Mesos master should do extra protection to prevent
> such issue to cause master crash loop. Some possible ideas are to instruct
> one of the agents carrying tasks w/ duplicate ID to terminate corresponding
> tasks, or just refuse to reregister such agents and instruct them to
> shutdown.
>
> I also filed MESOS-8353 <https://issues.apache.org/jira/browse/MESOS-8353>
> to track this potential bug. Thanks!
>
> --
>
> Cheers,
>
> Zhitao Li
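
To make the sequence in the quoted message concrete, here is a toy model of steps 1 to 4 in Python. It is only a sketch, not Mesos code: the real master is written in C++, and ToyMaster, launch, reregister_agent and simulate_failover are hypothetical names made up for illustration. It assumes the master tracks tasks per (framework ID, task ID) and that a freshly elected master starts with no task state. Instead of aborting on the duplicate, the toy master reports the conflicting task, which is roughly the kind of protection proposed in the quoted message.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical in-memory master state; not the Mesos data model."""

    # Maps (framework_id, task_id) -> agent_id currently owning the task.
    tasks: dict = field(default_factory=dict)

    def launch(self, framework_id, task_id, agent_id):
        """Accept a launch only if this master does not already know the task."""
        key = (framework_id, task_id)
        if key in self.tasks:
            raise ValueError(f"task {task_id} already known on {self.tasks[key]}")
        self.tasks[key] = agent_id

    def reregister_agent(self, agent_id, running):
        """Merge an agent's running tasks into the master's state.

        `running` is an iterable of (framework_id, task_id) pairs.
        Returns the pairs already owned by a different agent, instead of
        crashing when the same task would be added twice.
        """
        conflicts = []
        for key in running:
            owner = self.tasks.get(key)
            if owner is not None and owner != agent_id:
                conflicts.append(key)  # duplicate task ID on another agent
            else:
                self.tasks[key] = agent_id
        return conflicts


def simulate_failover():
    """Replay steps 1-4 from the quoted message against the toy model."""
    m1 = ToyMaster()
    m1.launch("F", "T1", "A1")  # step 1: T1 launched on A1 via M1

    m2 = ToyMaster()            # step 2: M2 takes over with no task state
    m2.launch("F", "T1", "A2")  # step 3: M2 has not heard from A1, accepts T1

    # Step 4: A1 re-registers and reports T1, which M2 already placed on A2.
    conflicts = m2.reregister_agent("A1", [("F", "T1")])
    return m2, conflicts


if __name__ == "__main__":
    master, conflicts = simulate_failover()
    print("conflicting tasks:", conflicts)
    print("owner kept by master:", master.tasks[("F", "T1")])
```

What to do with the reported conflicts (kill the task on one agent, or refuse the re-registration) is the hard part, and the sketch leaves that decision open on purpose.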