[ 
https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046537#comment-14046537
 ] 

Benjamin Mahler commented on MESOS-1517:
----------------------------------------

There are only a few types of messages involved here. When a master fails over, 
the slaves and frameworks will try to re-register before doing anything else.

This means that queueing the messages would only reduce how often frameworks 
and slaves need to retry registration, and they are already required to retry. 
So I think this change would mostly benefit our integration tests, where the 
retries are not desirable. :)

> Maintain a queue of messages that arrive before the master recovers.
> --------------------------------------------------------------------
>
>                 Key: MESOS-1517
>                 URL: https://issues.apache.org/jira/browse/MESOS-1517
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Benjamin Mahler
>              Labels: reliability
>
> Currently when the master is recovering, we drop all incoming messages. If 
> slaves and frameworks knew about the leading master only once it has 
> recovered, then we would only expect to see messages after we've recovered.
> We previously considered enqueuing all messages through the recovery future, 
> but this has the downside of forcing all messages to go through the master's 
> queue twice:
> {code}
>   // TODO(bmahler): Consider instead re-enqueuing *all* messages
>   // through recover(). What are the performance implications of
>   // the additional queueing delay and the accumulated backlog
>   // of messages post-recovery?
>   if (!recovered.get().isReady()) {
>     VLOG(1) << "Dropping '" << event.message->name << "' message since "
>             << "not recovered yet";
>     ++metrics.dropped_messages;
>     return;
>   }
> {code}
> However, an easy solution to this problem is to maintain an explicit queue of 
> incoming messages that gets flushed once we finish recovery. This ensures 
> that all messages post-recovery are processed normally.
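
For illustration, the explicit queue described above could look roughly like 
the following C++ sketch. This is not the Mesos implementation: `Message`, 
`RecoveryQueue`, `receive()` and `recovered()` are hypothetical names, and the 
sketch assumes a single-threaded actor and an unbounded backlog.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for an incoming message; not the real Mesos type.
struct Message {
  std::string name;
};

// Holds messages that arrive before recovery completes and replays them
// in arrival order once recovery finishes. All names are illustrative.
class RecoveryQueue {
public:
  explicit RecoveryQueue(std::function<void(const Message&)> handler)
    : handler_(std::move(handler)) {}

  // Called for every incoming message.
  void receive(Message message) {
    if (!recovered_) {
      // Hold the message instead of dropping it.
      pending_.push_back(std::move(message));
      return;
    }
    // Post-recovery messages are handled directly, so they only pass
    // through the normal path once.
    handler_(message);
  }

  // Called once when recovery finishes; flushes the backlog in order.
  void recovered() {
    recovered_ = true;
    while (!pending_.empty()) {
      Message message = std::move(pending_.front());
      pending_.pop_front();
      handler_(message);
    }
  }

  // Number of messages still waiting for recovery to finish.
  std::size_t pending() const { return pending_.size(); }

private:
  std::function<void(const Message&)> handler_;
  std::deque<Message> pending_;
  bool recovered_ = false;
};
```

Only messages that arrive during recovery pay the extra queueing cost. A real 
version would probably also need a bound on the backlog, which this sketch 
leaves out.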



--
This message was sent by Atlassian JIRA
(v6.2#6252)
