Re: Implicit reconcile "pauses" offer stream in large cluster

Benjamin Mahler Wed, 13 Dec 2017 13:03:01 -0800

You can check the diff, for example:
https://github.com/apache/mesos/compare/1.3.0...1.4.0


I didn't notice any changes that look like they would cause this.

What do the master logs show during the time frame?
Have you profiled what the master and scheduler are doing during this time
frame?

On Tue, Dec 12, 2017 at 10:46 AM, Zhitao Li <[email protected]> wrote:

> Hi,
>
> We have seen some potential problems when trying to upgrade Mesos from
> 1.3 to 1.4: when an implicit reconciliation happened for a large framework
> (Aurora), the scheduler would not see any offers for several minutes.
> Strangely, this does not show up once we revert to 1.3.
>
> A couple of questions:
>
> 1) Is there any change between 1.3 and 1.4 which can make this slower?
> 2) FWICT by reading code of implicit reconcile, Mesos master sends back
> status for all active and pending tasks for the framework (which has 70k+
> in our cluster right now) in batch before yielding to any other messages.
> Has anyone thought about supporting some kind of "pagination": i.e., master
> would only send back N status updates, then delay for S seconds, then send
> back next batch of N updates, until all active tasks are handled. This is
> pretty much how Aurora triggers explicit reconcile to Mesos, and we don't
> see any issue when processing it this way.
>
> Thanks!
>
>
> --
> Cheers,
>
> Zhitao Li
>
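
The "pagination" idea in question 2 above (send N status updates, wait S seconds, repeat) could be sketched roughly as below. This is only an illustration under stated assumptions, not Mesos or Aurora code: `batches`, `reconcile_in_batches`, `send_batch`, and the injectable `sleep` parameter are all hypothetical names, and a real master would schedule the delay asynchronously rather than block.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def reconcile_in_batches(
    task_ids: Sequence[str],
    send_batch: Callable[[List[str]], None],  # hypothetical: sends one batch of updates
    batch_size: int,                           # "N" in the proposal
    delay_seconds: float,                      # "S" in the proposal
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send task_ids in batches, pausing between batches; return the batch count."""
    sent = 0
    for batch in batches(task_ids, batch_size):
        if sent > 0:
            # Pause only between batches, so other messages can be handled.
            sleep(delay_seconds)
        send_batch(batch)
        sent += 1
    return sent
```

With 70k tasks, N = 1000 and S = 1 this would spread the updates over roughly 70 seconds instead of sending them in one uninterrupted burst; the right values would have to be found by measurement.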
