[ https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640285#comment-16640285 ]
James DeFelice commented on MESOS-9223:
---------------------------------------

Regardless of whether retries are implemented, it would be nice to have an API that exposes the reason for the error, e.g., the last log line or the Mesos error related to the container failure.

> Storage local provider does not sufficiently handle container launch failures
> or errors
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-9223
>                 URL: https://issues.apache.org/jira/browse/MESOS-9223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, storage
>            Reporter: Benjamin Bannier
>            Assignee: Chun-Hung Hsiao
>            Priority: Blocker
>
> The storage local resource provider as currently implemented does not handle
> launch failures or task errors of its standalone containers well enough. If,
> e.g., an RP container fails to come up during node start, a warning is
> logged, but an operator still needs to detect the degraded functionality,
> manually check the state of containers with {{GET_CONTAINERS}}, and decide
> whether the agent needs restarting; I suspect they do not always have
> enough context for this decision. It would be better if the provider
> either enforced a restart by failing over the whole agent or retried the
> operation (optionally up to some maximum number of retries).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)