Zhitao, any further updates on this? Thanks.
> On Dec 13, 2017, at 1:02 PM, Benjamin Mahler <bmah...@apache.org> wrote:
>
> You can check the diff, for example:
> https://github.com/apache/mesos/compare/1.3.0...1.4.0
>
> I didn't notice any changes that look like they would cause this.
>
> What do the master logs show during the time frame?
> Have you profiled what the master and scheduler are doing during this time
> frame?
>
>> On Tue, Dec 12, 2017 at 10:46 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>>
>> Hi,
>>
>> We have seen some potential problems when trying to upgrade Mesos from
>> 1.3 to 1.4: when an implicit reconciliation happened for a large framework
>> (Aurora), the scheduler would not see any offers for several minutes.
>> Strangely, this does not show up once we revert to 1.3.
>>
>> A couple of questions:
>>
>> 1) Is there any change between 1.3 and 1.4 which can make this slower?
>> 2) FWICT from reading the implicit reconciliation code, the Mesos master
>> sends back status for all active and pending tasks for the framework
>> (70k+ in our cluster right now) in one batch before yielding to any other
>> messages. Has anyone thought about supporting some kind of "pagination"?
>> I.e., the master would only send back N status updates, then delay for S
>> seconds, then send back the next batch of N updates, until all active
>> tasks are handled. This is pretty much how Aurora triggers explicit
>> reconciliation with Mesos, and we don't see any issue when processing it
>> this way.
>>
>> Thanks!
>>
>> --
>> Cheers,
>>
>> Zhitao Li
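
For reference, the "pagination" idea in question 2 of the quoted message (send N updates, wait S seconds, send the next N) could be sketched roughly as below. This is only an illustration of the pacing logic, not Mesos or Aurora code: `paced_reconcile`, `batches`, and the `send` callback are hypothetical names, and the batch size and delay stand in for the N and S mentioned above.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], n: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most n items."""
    if n <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), n):
        yield items[start:start + n]


def paced_reconcile(
    task_ids: Sequence[str],
    send: Callable[[List[str]], None],  # hypothetical: delivers one batch of updates
    batch_size: int,                    # "N" in the proposal
    delay_seconds: float,               # "S" in the proposal
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send task IDs in batches, pausing between batches.

    Returns the number of batches sent. The pause gives the sender a
    chance to process other messages between batches.
    """
    sent = 0
    for index, batch in enumerate(batches(task_ids, batch_size)):
        if index > 0:
            # Delay only between batches, not before the first one.
            sleep(delay_seconds)
        send(list(batch))
        sent += 1
    return sent


if __name__ == "__main__":
    # Example: 7 made-up task IDs, 3 per batch, no real delay.
    ids = [f"task-{i}" for i in range(7)]
    paced_reconcile(ids, print, batch_size=3, delay_seconds=0.0)
```

In a real master the pause would presumably be a deferred callback rather than a blocking sleep, so that other messages are actually handled in between; the blocking version here is just to keep the sketch short.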