[ https://issues.apache.org/jira/browse/MESOS-8839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757399#comment-16757399 ]
Benjamin Bannier commented on MESOS-8839:
-----------------------------------------
Reopening as we saw this again in our internal CI with something close to
today's {{master}} {{HEAD}}.
> Resource provider manager registrar recovery can race with agent on agent
> state leading to hard failures
> --------------------------------------------------------------------------------------------------------
>
> Key: MESOS-8839
> URL: https://issues.apache.org/jira/browse/MESOS-8839
> Project: Mesos
> Issue Type: Bug
> Components: agent, storage
> Affects Versions: 1.6.0, 1.8.0
> Reporter: Benjamin Bannier
> Assignee: Benjamin Bannier
> Priority: Blocker
> Attachments: log
>
>
> When running in the agent, the resource provider manager persists its state
> into the agent's state. The agent uses a LevelDB state which protects against
> concurrent access. The way we modelled LevelDB, a {{fetch}} issued while a lock
> is present results in a failed {{Future}}. When the resource provider
> manager encounters a failed recovery, it emits a fatal error, e.g.,
> {noformat}
> 11:48:26 F0425 11:48:26.650568 26819 manager.cpp:254] Failed to recover
> resource provider manager registry: Failed: IO error: lock
> /tmp/ParentChildContainerTypeAndContentType_AgentContainerAPITest_RecoverNestedContainer_10_HXbQCK/meta/slaves/6645885c-050a-4518-b896-a20b3e72a070-S0/resource_provider_registry/LOCK:
> already held by process
> 11:48:26 *** Check failure stack trace: ***{noformat}
> We should not fail hard for such recoverable failure scenarios.
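
One possible way to avoid the hard failure is to treat lock contention as
transient and retry the recovery with backoff. The following is only a minimal,
self-contained sketch of that idea and not the actual Mesos code or fix:
{{RecoverResult}}, {{isLockContention}} and {{recoverWithRetry}} are
hypothetical names, and matching on the "already held by process" text from
the log above is an assumption made purely for illustration.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>

// Hypothetical result type (not a Mesos type): holds either a recovered
// value or an error message.
struct RecoverResult {
  std::optional<std::string> value;
  std::string error;

  bool ok() const { return value.has_value(); }
};

// Assumption: lock contention is recognized by the message seen in the log
// quoted in this issue. Real code would likely want a structured error.
bool isLockContention(const std::string& error) {
  return error.find("already held by process") != std::string::npos;
}

// Calls `recover` up to `maxAttempts` times. Lock contention is retried
// with exponential backoff; success or any other error is returned at once.
RecoverResult recoverWithRetry(
    const std::function<RecoverResult()>& recover,
    int maxAttempts,
    std::chrono::milliseconds initialBackoff) {
  std::chrono::milliseconds backoff = initialBackoff;
  RecoverResult result;

  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    result = recover();

    if (result.ok() || !isLockContention(result.error)) {
      return result;
    }

    // Only sleep if another attempt will follow.
    if (attempt < maxAttempts) {
      std::this_thread::sleep_for(backoff);
      backoff *= 2;
    }
  }

  // Still contended after all attempts: hand the last error to the caller,
  // which can then decide whether failing is really warranted.
  return result;
}
```

A caller would wrap the registry recovery in a callable and only report a
failure once {{recoverWithRetry}} has exhausted its attempts, instead of
aborting on the first failed result.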
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)