Re: Review Request 61473: Do not kill non partition aware tasks.

Jiang Yan Xu Tue, 28 Nov 2017 15:09:06 -0800


> On Nov. 27, 2017, 4:25 p.m., Ilya Pronin wrote:
> > src/master/http.cpp
> > Lines 343-345 (patched)
> > <https://reviews.apache.org/r/61473/diff/20/?file=1902099#file1902099line343>
> >
> >     I may be ignorant of the discussion behind this, but since we don't 
> > treat these tasks as completed anymore, do we really need to maintain 
> > backwards compatibility here, in stats and metrics? This kind of hides the 
> > real state of things.
> 
> Megha Sharma wrote:
>     Copying @xujyan to the discussion.

It's true that this isn't perfect, and my first intuition was to export the real 
state. We chatted with Vinod as well. Ultimately we think it's worse to export 
information inconsistently. Beyond metrics, there are the v0 HTTP endpoints, the 
web UI, and the v1 operator APIs that expose the task's state (in addition to the 
scheduler API). It would be weird for the task's state to show up differently 
via different APIs.

Secondly, the mechanism Mesos uses to handle backwards compatibility between 
TASK_LOST and the new states is to decide which one to use when the task status 
is **created**, unlike how we handle old/new reservation formats and framework 
API devolve/evolve, for example. This means we lose the original state upon 
creation and have no way to export it later. We had to use a `bool unreachable` 
to *remember* the original state for unreachable tasks, but this cannot be 
generalized to all "new states", and I don't think it's worth treating 
unreachable tasks specially.
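
To make the point concrete, here is a minimal sketch of the "decide at creation" 
behavior. It is not the actual master code: `Task`, `createTask` and 
`exportedState` are hypothetical, simplified stand-ins, and only the idea of a 
`bool unreachable` flag comes from the patch.

```cpp
#include <string>

// Hypothetical, simplified stand-ins; these are not the real Mesos types.
enum class TaskState { TASK_LOST, TASK_UNREACHABLE, TASK_GONE };

struct Task {
  TaskState state;   // The only state that is stored and later exported.
  bool unreachable;  // Extra flag remembering one specific original state.
};

// The old-vs-new decision is made once, when the status is created.
// For a non-partition-aware framework the actual state is replaced by
// TASK_LOST and is not kept anywhere else.
Task createTask(TaskState actual, bool partitionAware) {
  Task task;
  task.state = partitionAware ? actual : TaskState::TASK_LOST;
  task.unreachable = (actual == TaskState::TASK_UNREACHABLE);
  return task;
}

// Every exporter (metrics, endpoints, web UI) reads the same stored state,
// so they all agree with what the scheduler was told.
std::string exportedState(const Task& task) {
  switch (task.state) {
    case TaskState::TASK_LOST:        return "TASK_LOST";
    case TaskState::TASK_UNREACHABLE: return "TASK_UNREACHABLE";
    case TaskState::TASK_GONE:        return "TASK_GONE";
  }
  return "UNKNOWN";
}
```

In this sketch a task that was actually TASK_GONE is stored as TASK_LOST with 
nothing left to recover it from, while an unreachable task is only recoverable 
because of the dedicated flag. Covering every new state would need one such 
field per state.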

Does it make sense?

- Jiang Yan

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review191674
-----------------------------------------------------------

On Nov. 28, 2017, 9:28 a.m., Megha Sharma wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> -----------------------------------------------------------
> 
> (Updated Nov. 28, 2017, 9:28 a.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
>     https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The master will no longer kill the tasks of non-partition-aware
> frameworks when an unreachable agent re-registers with the master.
> The master used to send a ShutdownFrameworkMessage to the agent
> to kill the tasks of non-partition-aware frameworks, including
> frameworks that are still registered. This was problematic because
> an offer from this agent could still go to the same framework, which
> could then launch new tasks. The agent would then receive tasks
> of the same framework and ignore them because it thinks the
> framework is shutting down. The framework is not shutting down, of
> course, so from the master's and the scheduler's perspective the task
> is stuck in STAGING until the next agent re-registration,
> which could happen much later. This commit fixes the problem by
> not shutting down the non-partition-aware frameworks on such an
> agent.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
>   include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
>   src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
>   src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/25/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>
