[ 
https://issues.apache.org/jira/browse/MESOS-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647321#comment-15647321
 ] 

Markus Jura commented on MESOS-6252:
------------------------------------

I see what you are saying. Mesos validates that the start command for a given 
executor id is unchanged, to ensure that the task can work correctly. 
However, I feel that this should be a framework concern. If the framework 
decides to build an `ExecutorInfo` with the same executor id, it should know 
what it is doing. Mesos should not perform an additional validation of the 
start command. Ideally, there would be an API in which the framework only 
needs to specify an executor id, not the whole `ExecutorInfo` object, when an 
executor is already running, but that is another topic and I am not asking for 
it here.
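
To make the difference concrete, here is a minimal Python sketch of the two 
behaviours. It is only a model of the check as described in this issue, not 
the Mesos implementation: the `ExecutorInfo` class is a hypothetical, 
trimmed-down stand-in for the real protobuf message, the function names are 
made up for illustration, and the real check may compare more fields than the 
two shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorInfo:
    """Hypothetical, trimmed-down stand-in for the Mesos ExecutorInfo message."""
    executor_id: str
    command: str


def compatible_strict(existing: ExecutorInfo, new: ExecutorInfo) -> bool:
    # Behaviour described in this issue: the id and the start command must match.
    return existing == new


def compatible_id_only(existing: ExecutorInfo, new: ExecutorInfo) -> bool:
    # Behaviour requested in this issue: a matching executor id is enough.
    return existing.executor_id == new.executor_id


# Values shortened from the log in the issue description; only --core-node differs.
existing = ExecutorInfo(
    "conductr-node-10.0.0.249-executor",
    "conductr-agent --core-node 10.0.0.246:9004",
)
new = ExecutorInfo(
    "conductr-node-10.0.0.249-executor",
    "conductr-agent --core-node 10.0.0.248:9004",
)

strict_result = compatible_strict(existing, new)
id_only_result = compatible_id_only(existing, new)
```

With these two values the strict comparison rejects the task, while the 
id-only comparison would accept it.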

Because of this, we currently need to store the `ExecutorInfo` object per 
slave as framework state in order to reuse the same executor start command for 
an existing executor. If we did not do that, our start command would change 
whenever another framework node creates the `ExecutorInfo` object, e.g. after 
a failover to another framework node. Again, we are only doing this because 
Mesos validates the start command, which I think is the responsibility of the 
framework itself (if it decides to reuse the same executor id).
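
The workaround can be sketched roughly as follows, again in Python and only 
as an illustration. `ExecutorInfoStore`, `get_or_create` and 
`build_executor_info` are hypothetical names, the dictionary stands in for 
framework state that would have to be shared between framework nodes, and the 
command string is shortened from the log in the issue description.

```python
class ExecutorInfoStore:
    """Hypothetical stand-in for framework state shared across framework nodes."""

    def __init__(self):
        self._by_slave = {}

    def get_or_create(self, slave_id, build):
        # Reuse the ExecutorInfo stored for this slave so the start command
        # stays identical; build a new one only if none has been stored yet.
        if slave_id not in self._by_slave:
            self._by_slave[slave_id] = build()
        return self._by_slave[slave_id]


def build_executor_info(slave_ip, core_node):
    # Simplified: the real ExecutorInfo also carries resources, URIs, etc.
    return {
        "executor_id": f"conductr-node-{slave_ip}-executor",
        "command": f"conductr-agent --core-node {core_node}",
    }


store = ExecutorInfoStore()
slave_id = "1154b639-c536-41d1-b9df-a57b24792acb-S4"

# Framework node 10.0.0.246 launches the executor first.
first = store.get_or_create(
    slave_id, lambda: build_executor_info("10.0.0.249", "10.0.0.246:9004"))

# After a failover, node 10.0.0.248 would build a different command,
# but the stored ExecutorInfo is returned instead.
after_failover = store.get_or_create(
    slave_id, lambda: build_executor_info("10.0.0.249", "10.0.0.248:9004"))
```

The second lookup returns the originally stored object, so the start command 
sent to Mesos does not change after the failover.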

> Do not validate start command when re-establishing connection to executor
> -------------------------------------------------------------------------
>
>                 Key: MESOS-6252
>                 URL: https://issues.apache.org/jira/browse/MESOS-6252
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.28.1
>         Environment: coreos
>            Reporter: Markus Jura
>
> When a framework re-connects to an existing executor, Mesos checks whether 
> the new start command of the {{ExecutorInfo}} equals the old start command. 
> In the case of the ConductR framework, these start commands can differ 
> because of a different value in the ConductR agent argument {{--core-node}}.
> As a result, the Mesos master sends a {{TASK_ERROR}} for each running task 
> to the framework. The reason for the error is {{REASON_TASK_INVALID}}.
> {code}
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR 
> MesosSchedulerClient 
> [sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, 
> akkaTimestamp=11:34:48.713UTC, 
> akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
>  sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state 
> TASK_ERROR received by the scheduler: task_id {
>   value: "fe65b273-61c1-4ccf-8852-bb04e2dd9380"
> }
> state: TASK_ERROR
> message: "Task has invalid ExecutorInfo (existing ExecutorInfo with same 
> ExecutorID is not 
> compatible).\n------------------------------------------------------------\nExisting
>  ExecutorInfo:\nexecutor_id {\n  value: 
> \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  
> type: SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources 
> {\n  name: \"mem\"\n  type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  
> role: \"*\"\n}\nresources {\n  name: \"disk\"\n  type: SCALAR\n  scalar {\n   
>  value: 1000\n  }\n  role: \"*\"\n}\nresources {\n  name: \"ports\"\n  type: 
> RANGES\n  ranges {\n    range {\n      begin: 2552\n      end: 2552\n    }\n  
>   range {\n      begin: 10000\n      end: 10999\n    }\n  }\n  role: 
> \"*\"\n}\ncommand {\n  uris {\n    value: 
> \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n    
> executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    
> value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n   
>  executable: false\n    extract: true\n    cache: false\n  }\n  value: 
> \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) 
> && ./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf 
> -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 
> -Dconductr-agent.run.allocated-ports.start=10000 
> -Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.246:9004 
> --core-system-name stop-all-bundles-1\"\n}\nframework_id {\n  value: 
> \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource: 
> \"conductr\"\n\n------------------------------------------------------------\nTask\'s
>  ExecutorInfo:\nexecutor_id {\n  value: 
> \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  
> type: SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources 
> {\n  name: \"mem\"\n  type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  
> role: \"*\"\n}\nresources {\n  name: \"disk\"\n  type: SCALAR\n  scalar {\n   
>  value: 1000\n  }\n  role: \"*\"\n}\nresources {\n  name: \"ports\"\n  type: 
> RANGES\n  ranges {\n    range {\n      begin: 2552\n      end: 2552\n    }\n  
>   range {\n      begin: 10000\n      end: 10999\n    }\n  }\n  role: 
> \"*\"\n}\ncommand {\n  uris {\n    value: 
> \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n    
> executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    
> value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n   
>  executable: false\n    extract: true\n    cache: false\n  }\n  value: 
> \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) 
> && ./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf 
> -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 
> -Dconductr-agent.run.allocated-ports.start=10000 
> -Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.248:9004 
> --core-system-name stop-all-bundles-1\"\n}\nframework_id {\n  value: 
> \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource: 
> \"conductr\"\n\n------------------------------------------------------------\n"
> slave_id {
>   value: "1154b639-c536-41d1-b9df-a57b24792acb-S4"
> }
> timestamp: 1.474889688506464E9
> source: SOURCE_MASTER
> reason: REASON_TASK_INVALID
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR 
> MesosSchedulerClient 
> [sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, 
> akkaTimestamp=11:34:48.714UTC, 
> akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
>  sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state 
> TASK_ERROR received by the scheduler: task_id {
>   value: "40034b01-e853-4ada-882f-9aaab67f77c2"
> }
> {code}
> Mesos should only validate the executor id. If the new id of the 
> {{ExecutorInfo}} object equals the old one then it should allow the 
> reconnection to the running executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
