We ran into an interesting problem with resource offers today. I would like
to confirm the problem and request an enhancement. Here is a summary of the
sequence of events:

1. Resource offer O1 for slave A arrives.
2. Mesos disconnects.
3. Mesos reregisters.
4. Resource offer O2 for slave A arrives.
    (Our framework keeps unused offers for some time, so we now
    incorrectly hold both O1 and O2.)
5. The framework launches task T1 using offers O1 and O2.
6. The framework now thinks it holds no offers for slave A and waits for
    a new offer once Mesos has consumed the resources for task T1.
7. Mesos sends TASK_LOST for T1, saying it used an invalid offer.
    (Only O1 was invalid, yet O2 silently goes missing.)
8. No more offers arrive for slave A.
9. In effect, we have an offer leak.

To work around this, I am changing my framework so that when it receives
the Mesos reregistration callback (step 3 above), it discards all the
offers it is holding. This should fix the problem.
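
The workaround can be sketched roughly as follows, in Python. This is only
an illustration: OfferCache and its methods are hypothetical names, not
part of the Mesos API, and the sketch assumes the framework calls
on_reregistered() from its reregistration callback.

```python
class OfferCache:
    """Hypothetical framework-side store of unused offers, keyed by slave."""

    def __init__(self):
        self._offers = {}  # slave_id -> set of offer IDs held for that slave

    def add(self, slave_id, offer_id):
        self._offers.setdefault(slave_id, set()).add(offer_id)

    def take(self, slave_id):
        """Remove and return all offers held for a slave (used at launch)."""
        return self._offers.pop(slave_id, set())

    def on_reregistered(self):
        """Assumed to be called on reregistration: drop every held offer."""
        self._offers.clear()


cache = OfferCache()
cache.add("A", "O1")             # step 1: O1 arrives
cache.on_reregistered()          # steps 2-3: O1 is discarded as stale
cache.add("A", "O2")             # step 4: O2 arrives
launch_offers = cache.take("A")  # step 5 would now use only O2
```

With this in place, the launch in step 5 never mixes an offer from before
the reregistration with one from after it.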

However, I am wondering whether step 7 can be improved in Mesos. When a
task (or a set of tasks) is launched using multiple offers and at least one
of the offers is invalid, Mesos should treat all of the offers as given up
by the framework. Mesos would still send TASK_LOST to the framework, but it
would also make the resources from the valid offers available again through
new offers.
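
To make the request concrete, here is a toy model of the behavior I have in
mind, in Python. It is not how the Mesos master is implemented; ToyMaster
and its fields are made up for illustration, and offers are reduced to an
ID plus the slave they belong to.

```python
TASK_LOST = "TASK_LOST"
TASK_RUNNING = "TASK_RUNNING"


class ToyMaster:
    """Toy model of offer bookkeeping; not the actual Mesos master logic."""

    def __init__(self):
        self.outstanding = {}   # offer_id -> slave_id for currently valid offers
        self.available = set()  # slaves whose resources can be offered again

    def offer(self, offer_id, slave_id):
        self.outstanding[offer_id] = slave_id
        self.available.discard(slave_id)

    def launch(self, offer_ids):
        valid = [o for o in offer_ids if o in self.outstanding]
        all_valid = len(valid) == len(offer_ids)
        # Proposed behavior: every valid offer named in the launch is treated
        # as given up by the framework, whether or not the launch succeeds.
        slaves = [self.outstanding.pop(o) for o in valid]
        if all_valid:
            return TASK_RUNNING
        # At least one offer was invalid: return the resources behind the
        # valid offers to the pool so they can be offered again.
        self.available.update(slaves)
        return TASK_LOST


master = ToyMaster()
master.offer("O2", "A")               # O1 is no longer valid after step 3
status = master.launch(["O1", "O2"])  # step 5
```

In this model the framework still sees TASK_LOST for T1, but slave A ends up
in the available set, so a fresh offer for it can follow.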

I think this will become critical once Mesos starts rescinding offers,
because in that case frameworks cannot rely on a strategy like the one I
am using with reregistration.

Sharma
