[ 
https://issues.apache.org/jira/browse/MESOS-6274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15531850#comment-15531850
 ] 

Anand Mazumdar commented on MESOS-6274:
---------------------------------------

[~jieyu] Discussed this offline with Jie. This has been the behavior on the 
agent since 0.28, so I am surprised that it only showed up now (probably the 
recent containerizer changes triggered it, thanks for that!).

If this is a concern, we should only allow executors to subscribe after the 
containerizer has recovered. I guess the {{containerizer->update}} call in the 
{{statusUpdate}} handler was only introduced recently?
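
To make the gating idea concrete, here is a minimal sketch. It is not the 
agent code: {{AgentSketch}}, {{subscribe}}, {{containerizerRecovered}} and the 
{{Result}} enum are hypothetical names, and deferring early subscribers 
(rather than rejecting them) is just one possible choice I am assuming here.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the agent; none of these names are Mesos APIs.
class AgentSketch {
public:
  enum class Result { ACCEPTED, DEFERRED };

  // Called when an executor sends Subscribe. Until containerizer recovery
  // has finished, the executor is only queued, so nothing that depends on
  // the containerizer knowing the container can run yet.
  Result subscribe(const std::string& executorId) {
    if (!recovered_) {
      pending_.push_back(executorId);
      return Result::DEFERRED;
    }
    subscribed_.push_back(executorId);
    return Result::ACCEPTED;
  }

  // Called once containerizer recovery is done; admits queued executors
  // in the order they arrived.
  void containerizerRecovered() {
    recovered_ = true;
    for (const std::string& id : pending_) {
      subscribed_.push_back(id);
    }
    pending_.clear();
  }

  const std::vector<std::string>& subscribed() const { return subscribed_; }
  const std::vector<std::string>& pending() const { return pending_; }

private:
  bool recovered_ = false;
  std::vector<std::string> pending_;
  std::vector<std::string> subscribed_;
};
```

With this shape, an executor that subscribes during recovery stays in the 
pending list until {{containerizerRecovered()}} runs, so status updates from 
it would not be handled before the containerizer knows about its container.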

> Agent should not allow an executor to re-subscribe before containerizer 
> recovery is done.
> -----------------------------------------------------------------------------------------
>
>                 Key: MESOS-6274
>                 URL: https://issues.apache.org/jira/browse/MESOS-6274
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Jie Yu
>            Priority: Blocker
>
> In the old API, the agent sends a reconnect request to the executor, and the 
> executor then registers with the agent.
> In the new API, the agent allows an executor to re-subscribe before 
> containerizer recovery is done. This is problematic because the containerizer 
> does not know about the containers yet, so calling containerizer->update 
> fails, which causes the container to be killed.
> {noformat}
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693418 22646 
> containerizer.cpp:580] Recovering containerizer
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693444 22646 
> containerizer.cpp:636] Recovering container 
> 568968cc-f41c-475a-bb2b-45d8babd853d for executor 'default' of framework 
> 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693445 22645 http.cpp:273] HTTP 
> POST for /agent/api/v1/executor from 172.30.2.198:42683
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693567 22645 slave.cpp:3017] 
> Received Subscribe request for HTTP executor 'default' of framework 
> 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP)
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693613 22645 slave.cpp:3080] 
> Creating a marker file for HTTP based executor 'default' of framework 
> 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000 (via HTTP) at path 
> '/mnt/teamcity/temp/buildTmp/SlaveRecoveryTest_0_ROOT_CGROUPS_ReconnectDefaultExecutor_XpQvvJ/meta/slaves/7e4c8518-cb45-4b09-9fa8-c029d56289e2-S0/frameworks/7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000/executors/default/runs/568968cc-f41c-475a-bb2b-45d8babd853d/http.marker'
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693733 22645 slave.cpp:3609] 
> Handling status update TASK_RUNNING (UUID: 
> 6cc3f9a7-d020-46f0-82c1-39fbb9d43786) for task 
> db1f9b1b-75d2-4d96-831f-48d6f28301e8 of framework 
> 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:   [Step 10/10] I0929 04:04:11.693801 22645 slave.cpp:3609] 
> Handling status update TASK_RUNNING (UUID: 
> f80d217b-7844-4134-8cc8-db6998ac437e) for task 
> 3a583cbb-8ea9-440a-864d-e68a23472368 of framework 
> 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000
> [04:04:11]W:   [Step 10/10] E0929 04:04:11.694232 22648 slave.cpp:2055] 
> Failed to update resources for container 568968cc-f41c-475a-bb2b-45d8babd853d 
> of executor 'default' of framework 7e4c8518-cb45-4b09-9fa8-c029d56289e2-0000, 
> destroying container: Collect failed: Unknown container
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
