[ 
https://issues.apache.org/jira/browse/MESOS-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15643792#comment-15643792
 ] 

Markus Jura commented on MESOS-6252:
------------------------------------

The executor id alone should determine whether the same executor is used. 
If an executor with id 123 exists on the slave and the framework sends an 
ExecutorInfo object with the executor id 123, then I'd just re-use this 
executor. In our case, the executor start command is created programmatically 
and differs depending on the IP address of the framework. If the executor 
already exists, then the start command of the ExecutorInfo should simply be 
ignored.
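
The rule proposed above can be sketched roughly as follows. This is only an 
illustration of the suggested behavior, not Mesos code: the ExecutorInfo class 
here is a hypothetical stand-in for the real protobuf message, and 
resolve_executor and the running dict are names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorInfo:
    """Hypothetical stand-in for the Mesos ExecutorInfo message."""
    executor_id: str
    command: str


def resolve_executor(running, requested):
    """Pick the executor for a task, matching on the executor id only.

    running maps executor id -> ExecutorInfo of executors already on the slave.
    """
    existing = running.get(requested.executor_id)
    if existing is not None:
        # Same id: re-use the running executor and ignore the start command
        # carried by the requested ExecutorInfo, even if it differs.
        return existing
    # Unknown id: register the requested executor as a new one.
    running[requested.executor_id] = requested
    return requested
```

With two ExecutorInfo objects that share the id 
conductr-node-10.0.0.249-executor but carry different --core-node values, the 
second call would return the executor registered by the first one instead of 
producing an error.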

> Do not validate start command when re-establishing connection to executor
> -------------------------------------------------------------------------
>
>                 Key: MESOS-6252
>                 URL: https://issues.apache.org/jira/browse/MESOS-6252
>             Project: Mesos
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.28.1
>         Environment: coreos
>            Reporter: Markus Jura
>
> When a framework re-connects to an existing executor, Mesos checks whether 
> the new start command of the {{ExecutorInfo}} equals the old start 
> command. 
> In the case of the ConductR framework, these start commands can differ due 
> to a different value in the ConductR agent argument {{--core-node}} (in the 
> log below, 10.0.0.246:9004 versus 10.0.0.248:9004).
> As a result, the Mesos master sends a {{TASK_ERROR}} for each running task 
> to the framework. The reason for the error is {{REASON_TASK_INVALID}}.
> {code}
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR 
> MesosSchedulerClient 
> [sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, 
> akkaTimestamp=11:34:48.713UTC, 
> akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
>  sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state 
> TASK_ERROR received by the scheduler: task_id {
>   value: "fe65b273-61c1-4ccf-8852-bb04e2dd9380"
> }
> state: TASK_ERROR
> message: "Task has invalid ExecutorInfo (existing ExecutorInfo with same 
> ExecutorID is not 
> compatible).\n------------------------------------------------------------\nExisting
>  ExecutorInfo:\nexecutor_id {\n  value: 
> \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  
> type: SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources 
> {\n  name: \"mem\"\n  type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  
> role: \"*\"\n}\nresources {\n  name: \"disk\"\n  type: SCALAR\n  scalar {\n   
>  value: 1000\n  }\n  role: \"*\"\n}\nresources {\n  name: \"ports\"\n  type: 
> RANGES\n  ranges {\n    range {\n      begin: 2552\n      end: 2552\n    }\n  
>   range {\n      begin: 10000\n      end: 10999\n    }\n  }\n  role: 
> \"*\"\n}\ncommand {\n  uris {\n    value: 
> \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n    
> executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    
> value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n   
>  executable: false\n    extract: true\n    cache: false\n  }\n  value: 
> \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) 
> && ./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf 
> -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 
> -Dconductr-agent.run.allocated-ports.start=10000 
> -Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.246:9004 
> --core-system-name stop-all-bundles-1\"\n}\nframework_id {\n  value: 
> \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource: 
> \"conductr\"\n\n------------------------------------------------------------\nTask\'s
>  ExecutorInfo:\nexecutor_id {\n  value: 
> \"conductr-node-10.0.0.249-executor\"\n}\nresources {\n  name: \"cpus\"\n  
> type: SCALAR\n  scalar {\n    value: 0.9\n  }\n  role: \"*\"\n}\nresources 
> {\n  name: \"mem\"\n  type: SCALAR\n  scalar {\n    value: 402.653184\n  }\n  
> role: \"*\"\n}\nresources {\n  name: \"disk\"\n  type: SCALAR\n  scalar {\n   
>  value: 1000\n  }\n  role: \"*\"\n}\nresources {\n  name: \"ports\"\n  type: 
> RANGES\n  ranges {\n    range {\n      begin: 2552\n      end: 2552\n    }\n  
>   range {\n      begin: 10000\n      end: 10999\n    }\n  }\n  role: 
> \"*\"\n}\ncommand {\n  uris {\n    value: 
> \"https://downloads.mesosphere.com/java/jre-8u92-linux-x64.tar.gz\"\n    
> executable: false\n    extract: true\n    cache: false\n  }\n  uris {\n    
> value: \"http://10.0.7.185/ConductR/markusjura/conductr-agent-0.1.0.tgz\"\n   
>  executable: false\n    extract: true\n    cache: false\n  }\n  value: 
> \"GLOBIGNORE=\\\'*.tar.gz:*.tgz\\\' && export JAVA_HOME=$(echo $(pwd)/jre*) 
> && ./conductr-agent-*/bin/conductr-agent -Dconfig.resource=mesos.conf 
> -Dakka.loglevel=DEBUG -Dakka.remote.netty.tcp.port=2552 
> -Dconductr-agent.run.allocated-ports.start=10000 
> -Dconductr-agent.run.allocated-ports.end=10999 --core-node 10.0.0.248:9004 
> --core-system-name stop-all-bundles-1\"\n}\nframework_id {\n  value: 
> \"stop-all-bundles-1\"\n}\nname: \"conductr-agent\"\nsource: 
> \"conductr\"\n\n------------------------------------------------------------\n"
> slave_id {
>   value: "1154b639-c536-41d1-b9df-a57b24792acb-S4"
> }
> timestamp: 1.474889688506464E9
> source: SOURCE_MASTER
> reason: REASON_TASK_INVALID
> 2016-09-26T11:34:48Z ip-10-0-0-248.us-west-2.compute.internal ERROR 
> MesosSchedulerClient 
> [sourceThread=stop-all-bundles-1-akka.actor.default-dispatcher-22, 
> akkaTimestamp=11:34:48.714UTC, 
> akkaSource=akka.tcp://stop-all-bundles-1@10.0.0.248:9004/user/reaper/mesos-client-supervisor/singleton/mesos-client,
>  sourceActorSystem=stop-all-bundles-1] - Unexpected Mesos task state 
> TASK_ERROR received by the scheduler: task_id {
>   value: "40034b01-e853-4ada-882f-9aaab67f77c2"
> }
> {code}
> Mesos should validate only the executor id. If the new id of the 
> {{ExecutorInfo}} object equals the old one, then Mesos should allow the 
> reconnection to the running executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
