[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738638#comment-16738638
 ] 

Benjamin Bannier commented on MESOS-9223:
-----------------------------------------

Capturing results from offline sync with [~chhsia0], [~jieyu], and [~jdef]:

The agent should expose metrics reflecting resource provider-related state
changes (e.g., a metric for resource provider (re)subscriptions, and for other
relevant events the RP manager currently exposes). If this can be implemented
in the RP manager, we could reuse the code for ERPs, where masters would hold
RP managers. We'll want to expose metrics aggregated over all agents, but
probably also metrics per RP to simplify triage.
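
To illustrate the per-RP versus aggregated split, here is a minimal sketch in
Python. It is not the Mesos metrics API; the class name, event names, and RP
IDs are all hypothetical, and it only aggregates over the RPs known to a single
manager instance.

```python
from collections import Counter


class ResourceProviderMetrics:
    """Hypothetical event counters; not part of Mesos."""

    def __init__(self):
        # Per-RP counters keyed by (provider_id, event), useful for triage.
        self.per_provider = Counter()
        # Counters keyed by event only, aggregated over all RPs.
        self.total = Counter()

    def record(self, provider_id, event):
        # Update both views on every state change.
        self.per_provider[(provider_id, event)] += 1
        self.total[event] += 1


metrics = ResourceProviderMetrics()
metrics.record("rp-1", "subscribed")
metrics.record("rp-1", "disconnected")
metrics.record("rp-1", "subscribed")  # a resubscription of the same RP
metrics.record("rp-2", "subscribed")
```

The aggregated view shows that something is happening at all, while the per-RP
view shows which provider keeps resubscribing.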

We are not yet sure how RPs can surface reasons for, e.g., disconnects, as no
message channel from the RP up to the RP manager exists. For now this could be
implemented via out-of-band transport of plugin logs (e.g., via journald).

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-9223
>                 URL: https://issues.apache.org/jira/browse/MESOS-9223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, storage
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Critical
>
> The storage local resource provider as currently implemented does not handle
> launch failures or task errors of its standalone containers well enough. If,
> e.g., an RP container fails to come up during node start, a warning is
> logged, but an operator still needs to detect degraded functionality,
> manually check the state of containers with {{GET_CONTAINERS}}, and decide
> whether the agent needs restarting; I suspect they do not always have
> enough context for this decision. It would be better if the provider
> either enforced a restart by failing over the whole agent, or retried the
> operation (optionally: up to some maximum number of retries).
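
One way to combine the two options from the description (retry, then fail
over) is sketched below in Python. This is only an illustration under stated
assumptions: `launch`, `fail_over`, and `max_retries` are hypothetical names,
not Mesos functions or flags, and a launch failure is modelled as a
`RuntimeError`.

```python
def launch_with_retries(launch, max_retries, fail_over):
    """Try `launch` once, then retry up to `max_retries` more times.

    If every attempt fails, call `fail_over` with the last error and
    return None. All names here are hypothetical.
    """
    last_error = None
    for _ in range(1 + max_retries):
        try:
            return launch()
        except RuntimeError as error:
            last_error = error
    # Retries exhausted: escalate instead of only logging a warning.
    fail_over(last_error)
    return None


# Usage: a launcher that fails twice before the container comes up.
calls = {"count": 0}


def flaky_launch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("container failed to start")
    return "container-running"


failovers = []
result = launch_with_retries(flaky_launch, max_retries=3,
                             fail_over=failovers.append)
```

Here the third attempt succeeds, so no failover is triggered; with
`max_retries=1` the same launcher would exhaust its attempts and the failover
callback would run once.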



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
