Hi,

We have seen a potential problem when trying to upgrade Mesos from 1.3
to 1.4: when an implicit reconciliation happened for a large framework
(Aurora), the scheduler would not see any offers for several minutes.
Strangely, this does not show up once we revert to 1.3.

A couple of questions:

1) Is there any change between 1.3 and 1.4 which could make this slower?
2) FWICT from reading the implicit reconciliation code, the Mesos master
sends back the status of all active and pending tasks for the framework
(70k+ tasks in our cluster right now) in one batch before yielding to any
other messages. Has anyone thought about supporting some kind of
"pagination", i.e., the master would only send back N status updates, then
delay for S seconds, then send back the next batch of N updates, until all
active tasks are handled? This is pretty much how Aurora triggers explicit
reconciliation against Mesos, and we don't see any issues when processing
it this way.
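
To make the "pagination" idea in 2) concrete, here is a rough Python sketch.
It is not Mesos or Aurora code: `batches`, `send_in_batches`, `batch_size`
(N), and `delay_secs` (S) are hypothetical names used only for illustration,
and `send` stands in for whatever actually delivers one status update.

```python
import time
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def send_in_batches(
    statuses: Sequence[T],
    send: Callable[[T], None],
    batch_size: int,
    delay_secs: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send statuses N at a time, pausing S seconds between batches.

    The pause is the point where other work (for example, sending offers)
    would get a chance to run. Returns the number of statuses sent.
    """
    sent = 0
    for index, batch in enumerate(batches(statuses, batch_size)):
        if index > 0:
            # Hypothetical yield point between batches; no pause before the first.
            sleep(delay_secs)
        for status in batch:
            send(status)
        sent += len(batch)
    return sent
```

In a real event-driven master the pause would presumably be a delayed
message to itself rather than a blocking sleep; the sketch only shows the
batching and the gap between batches.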

Thanks!


-- 
Cheers,

Zhitao Li
