[ https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695138#comment-16695138 ]
Chun-Hung Hsiao commented on MESOS-9223:
----------------------------------------

There are two problems here:
1. How to make the agent more robust when handling SLRP failures.
2. How to surface the SLRP failures.

It seems to me that 2 can be addressed by MESOS-8380.

Thoughts on 1:
We can make the {{LocalResourceProviderDaemon}} act like systemd: retry launching the SLRP when there is a launch failure, potentially with an exponential backoff.

> Storage local provider does not sufficiently handle container launch failures or errors
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-9223
>                 URL: https://issues.apache.org/jira/browse/MESOS-9223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, storage
>            Reporter: Benjamin Bannier
>            Assignee: Chun-Hung Hsiao
>            Priority: Critical
>
> The storage local resource provider as currently implemented does not handle
> launch failures or task errors of its standalone containers well enough. If,
> e.g., an RP container fails to come up during node start, a warning is
> logged, but an operator still needs to detect the degraded functionality,
> manually check the state of containers with {{GET_CONTAINERS}}, and decide
> whether the agent needs restarting; I suspect they do not always have
> enough context for this decision. It would be better if the provider
> either enforced a restart by failing over the whole agent or retried the
> operation (optionally, up to some maximum number of retries).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
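
The retry idea from the comment above (relaunch the SLRP on failure, with an exponential backoff and an optional cap on retries) could look roughly like the following. This is only a sketch in plain standard C++, not Mesos code: {{BackoffPolicy}}, {{backoffDelay}}, {{launchWithRetry}}, and all default values are hypothetical names and numbers chosen for illustration, and the {{launch}} and {{sleep}} callbacks stand in for whatever the daemon would actually use to launch the provider and schedule a delay.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical retry policy; the names and defaults are assumptions made
// for this sketch and do not come from Mesos.
struct BackoffPolicy
{
  std::chrono::milliseconds initial{1000}; // Delay before the first retry.
  std::chrono::milliseconds max{60000};    // Upper bound on any single delay.
  int factor = 2;                          // Multiplier applied per retry.
  int maxRetries = 5;                      // Retries after the first attempt.
};

// Returns the delay before retry number `attempt` (0-based):
// initial * factor^attempt, capped at `max`.
std::chrono::milliseconds backoffDelay(const BackoffPolicy& policy, int attempt)
{
  assert(attempt >= 0 && policy.factor >= 1);

  std::chrono::milliseconds delay = policy.initial;

  // Stop multiplying once the cap is reached, which also avoids overflow.
  for (int i = 0; i < attempt && delay < policy.max; ++i) {
    delay *= policy.factor;
  }

  return std::min(delay, policy.max);
}

// Calls `launch` until it succeeds or the retries are exhausted. `sleep` is
// injected so that the caller decides how to wait (and so that the sketch
// can be exercised without real delays).
bool launchWithRetry(
    const std::function<bool()>& launch,
    const std::function<void(std::chrono::milliseconds)>& sleep,
    const BackoffPolicy& policy)
{
  for (int attempt = 0;; ++attempt) {
    if (launch()) {
      return true;
    }

    if (attempt >= policy.maxRetries) {
      return false; // Give up; the caller can surface the failure.
    }

    sleep(backoffDelay(policy, attempt));
  }
}
```

When {{launchWithRetry}} returns false, the caller still has to decide what to do next, e.g., surface the failure or fail over the agent as the issue description suggests; the sketch deliberately leaves that choice open.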