[jira] [Commented] (MESOS-8317) Check failed when newly registered executor has launched tasks.

2017-12-09 James Peach (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284945#comment-16284945 ]

James Peach commented on MESOS-8317:


The executor failed because it was built with older protobuf definitions than the
scheduler. It was using the JSON content type, and the Go jsonpb package fails
outright if it receives a field that it doesn't know about. The field in question
was the {{protocol}} field in the {{HealthCheck}} message.

> Check failed when newly registered executor has launched tasks.
> ---
>
> Key: MESOS-8317
> URL: https://issues.apache.org/jira/browse/MESOS-8317
> Project: Mesos
> Issue Type: Bug
> Reporter: James Peach
>
> This check in {{slave/slave.cpp}} (lines 4105-4118) can fail:
> {code}
>   if (state != RECOVERING &&
>       executor->queuedTasks.empty() &&
>       executor->queuedTaskGroups.empty()) {
>     CHECK(executor->launchedTasks.empty())
>       << " Newly registered executor '" << executor->id
>       << "' has launched tasks";
>
>     LOG(WARNING) << "Shutting down the executor " << *executor
>                  << " because it has no tasks to run";
>
>     _shutdownExecutor(framework, executor);
>
>     return;
>   }
> {code}
> This happens with the following sequence of events:
> 1. The HTTP executor subscribes.
> 2. The agent sends a LAUNCH message that the executor can't decode.
> 3. The HTTP executor closes the channel and re-subscribes.
> 4. The agent hits the above check because the executor sends an empty task list
> (it never understood the LAUNCH message), but the agent thinks that a task
> should have been launched.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8317) Check failed when newly registered executor has launched tasks.

2017-12-08 Vinod Kone (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284396#comment-16284396 ]

Vinod Kone commented on MESOS-8317:
---

Was some incompatibility accidentally introduced in the `LAUNCH` message that
prevented the executor from decoding it?

Regardless, it's better for the agent to handle this gracefully than to hit a
hard CHECK failure.



[jira] [Commented] (MESOS-8317) Check failed when newly registered executor has launched tasks.

2017-12-08 James Peach (JIRA)

[ https://issues.apache.org/jira/browse/MESOS-8317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284378#comment-16284378 ]

James Peach commented on MESOS-8317:


/cc [~vinodkone]
