[ 
https://issues.apache.org/jira/browse/MESOS-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805529#comment-16805529
 ] 

Chun-Hung Hsiao commented on MESOS-9667:
----------------------------------------

Some more thoughts after discussing with [~vinodkone] and [~greggomann]:

1. Initialize the RP manager as early as possible.
2. Maybe we can consider changing the {{publishResources}} call here:
   
https://github.com/apache/mesos/blob/7c8a9a9218b5b3a9a2acbf8c10899355773377ef/src/slave/slave.cpp#L5027
   so that it is called only for a fresh executor launch. We 
could either check if {{queuedTasks}} is nonempty, or check if the slave is in 
recovery state.
3. Or, refactor {{onUnscheduleGCFailure}} here:
   
https://github.com/apache/mesos/blob/7c8a9a9218b5b3a9a2acbf8c10899355773377ef/src/slave/slave.cpp#L2171
   to handle task status updates for general failures, then insert 
{{publishResources}} right after {{unschedule}}, and remove 
{{publishResources}} elsewhere.

We can see which of 2 and 3 leads to cleaner code.
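
To make option 2 concrete, here is a rough, self-contained sketch of the {{queuedTasks}} check. It is not the actual agent code: {{Agent}}, {{Executor}}, {{ResourceProviderManager}}, {{onExecutorSubscribe}} and {{publishCalls}} are hypothetical stand-ins, and the sketch assumes that a re-subscribing executor has no queued tasks while a freshly launched one does.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the resource provider manager; it only counts
// how often resources were published.
struct ResourceProviderManager {
  int publishCalls = 0;
  void publishResources() { ++publishCalls; }
};

// Hypothetical stand-in for an executor as tracked by the agent.
struct Executor {
  // Tasks accepted by the agent but not yet delivered to the executor.
  std::vector<std::string> queuedTasks;
};

// Hypothetical stand-in for the agent.
struct Agent {
  // Stays null until the agent has registered with the master, which is
  // the situation described in the issue.
  std::unique_ptr<ResourceProviderManager> resourceProviderManager;

  // Called when an executor subscribes. Returns true if resources were
  // published, false if publishing was skipped.
  bool onExecutorSubscribe(const Executor& executor) {
    // Assumption: no queued tasks means this is a re-subscription, so the
    // resources were already published during the original launch.
    if (executor.queuedTasks.empty()) {
      return false;
    }

    // Assumption: a fresh launch can only happen after registration, so
    // the manager should exist here. Skip instead of crashing if not.
    if (resourceProviderManager == nullptr) {
      return false;
    }

    resourceProviderManager->publishResources();
    return true;
  }
};
```

With this shape, an executor that subscribes before the agent has registered never dereferences the missing manager, since it has no queued tasks. Checking the recovery state instead would replace the first condition only.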

> Check failure when executor for task using resource provider resources 
> subscribes before agent is registered
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9667
>                 URL: https://issues.apache.org/jira/browse/MESOS-9667
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.8.0
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Blocker
>              Labels: foundations, mesosphere, mesosphere-dss-ga
>
> When an executor for a task using resource provider resources subscribes 
> before the agent has registered with the master, we trigger a fatal assertion,
> {code:java}
> Mar 21 13:42:47 agent1 mesos-agent[17277]: F0321 13:42:46.845535 17295 
> slave.cpp:8834] Check failed: 'resourceProviderManager.get()' Must be non NULL
> Mar 21 13:42:47 agent1 mesos-agent[17277]: *** Check failure stack trace: 
> *{code}
> The reason for this failure is that we attempt to publish resources to the 
> resource provider via the resource provider manager, but the resource 
> provider manager is only created once the agent has registered with the 
> master.
> As a workaround one can terminate the executors and their tasks, and let the 
> framework relaunch the tasks (provided it supports that).
> A possible workaround could be to prevent such executors from subscribing 
> until the resource provider manager is available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
