[ https://issues.apache.org/jira/browse/MESOS-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534444#comment-15534444 ]

Anand Mazumdar commented on MESOS-6274:
---------------------------------------

{noformat}
commit 914ab0f640377cfed9cc8a9dabfa40adec500c0e
Author: Anand Mazumdar <an...@apache.org>
Date:   Thu Sep 29 16:38:07 2016 -0700

    Disallowed HTTP executors to subscribe before containerizer recovery.

    Previously, it was possible for an HTTP-based executor to subscribe
    with the agent before containerizer recovery was done. This
    was a problem since calling `containerizer->update()` etc. would
    result in a failure.

    Review: https://reviews.apache.org/r/52408/

commit 6b99555fa808eb32e32c3624704d0971568ca795
Author: Anand Mazumdar <an...@apache.org>
Date:   Thu Sep 29 16:37:53 2016 -0700

    Added `RecoveryInfo` struct to the agent.

    This struct would contain all the recovery-related metadata
    on the agent from now on. Eventually, we would add
    component-specific recovery information to this struct, e.g.,
    whether the executors can now subscribe again with the agent.

    Review: https://reviews.apache.org/r/52407/
{noformat}
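The two commits above describe a guard: the agent keeps recovery state in a `RecoveryInfo` struct and refuses HTTP executor subscriptions until containerizer recovery has finished. The following is a minimal sketch of that idea, not the actual Mesos code. The field `reconnect`, the `Agent` class, its methods, and the choice of HTTP 503 as the rejection status are all hypothetical names and assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch only: names and status codes are illustrative,
// not taken from the Mesos source tree.
struct RecoveryInfo {
  // Set to true only after containerizer recovery has finished.
  bool reconnect = false;
};

class Agent {
 public:
  // Returns an HTTP-style status code for an executor Subscribe call.
  // 503 (Service Unavailable) is an assumed way of asking the executor
  // to retry once the agent has finished recovering.
  int subscribe(const std::string& executorId) {
    if (!recoveryInfo_.reconnect) {
      // The containerizer may not know this executor's container yet,
      // so calling update() now could fail and get the container killed.
      return 503;
    }
    subscribed_.push_back(executorId);
    return 200;
  }

  // Hypothetical hook invoked once the containerizer has recovered.
  void containerizerRecovered() { recoveryInfo_.reconnect = true; }

  const std::vector<std::string>& subscribed() const { return subscribed_; }

 private:
  RecoveryInfo recoveryInfo_;
  std::vector<std::string> subscribed_;
};
```

A Subscribe call for executor `default` is rejected with 503 before `containerizerRecovered()` runs and accepted with 200 afterwards. Keeping the flag in a dedicated struct leaves room for the per-component recovery information the second commit mentions.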

Keeping the JIRA open for the backport.

> Agent should not allow HTTP executors to re-subscribe before containerizer recovery is done.
> --------------------------------------------------------------------------------------------
>
>                 Key: MESOS-6274
>                 URL: https://issues.apache.org/jira/browse/MESOS-6274
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jie Yu
>            Assignee: Anand Mazumdar
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.1.0, 1.0.2
>
>
> In the old API, the agent sends a reconnect request to the executor, and
> the executor then re-registers with the agent.
> In the new API, however, the agent allows an executor to re-subscribe before
> containerizer recovery is done. This is problematic because the containerizer
> does not know about the containers yet: calling containerizer->update() fails,
> which causes the container to be killed.
> {noformat}
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693418 22646 containerizer.cpp:580] Recovering containerizer
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693444 22646 containerizer.cpp:636] Recovering container 568968cc-f41c-475a-bb2b-45d8babd853d for executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693445 22645 http.cpp:273] HTTP POST for /agent/api/v1/executor from 172.30.2.198:42683
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693567 22645 slave.cpp:3017] Received Subscribe request for HTTP executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP)
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693613 22645 slave.cpp:3080] Creating a marker file for HTTP based executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP) at path '/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_ROOT_CGROUPS_ReconnectDefaultExecutor_XpQvvJ/meta/slaves/7e4c8518-cb45-4b09-9fa8-c029d56289e2-S0/frameworks/7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000/executors/default/runs/568968cc-f41c-475a-bb2b-45d8babd853d/http.marker'
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693733 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: 6cc3f9a7-d020-46f0-82c1-39fbb9d43786) for task db1f9b1b-75d2-4d96-831f-48d6f28301e8 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693801 22645 slave.cpp:3609] Handling status update TASK_RUNNING (UUID: f80d217b-7844-4134-8cc8-db6998ac437e) for task 3a583cbb-8ea9-440a-864d-e68a23472368 of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:   [Step 10/10] E0929 04:04:11.694232 22648 slave.cpp:2055] Failed to update resources for container 568968cc-f41c-475a-bb2b-45d8babd853d of executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000, destroying container: Collect failed: Unknown container
> {noformat}
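The log shows the race. The executor subscribes while the containerizer is still recovering container `568968cc-...`, and the later resource update fails with "Unknown container". The following is a simplified sketch of why the order matters, assuming a containerizer that only knows containers it has recovered. The `Containerizer` class below is hypothetical and is not the Mesos `Containerizer` interface. Its `update()` returns an error string instead of a future.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>

// Hypothetical, simplified containerizer: it only knows about containers
// it has recovered from checkpointed state.
class Containerizer {
 public:
  // Registers the containers found in checkpointed state.
  void recover(const std::set<std::string>& checkpointed) {
    containers_.insert(checkpointed.begin(), checkpointed.end());
  }

  // Returns an error message on failure, or std::nullopt on success.
  // An update for a container that has not been recovered yet fails,
  // mirroring the "Unknown container" error in the log above.
  std::optional<std::string> update(const std::string& containerId) const {
    if (containers_.count(containerId) == 0) {
      return "Unknown container";
    }
    return std::nullopt;
  }

 private:
  std::set<std::string> containers_;
};
```

Calling `update()` before `recover()` returns the error, and in the log the agent reacts by destroying the container. After `recover()` has registered the container ID, the same call succeeds. This is why the fix holds back executor subscriptions until recovery completes.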



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
