[ 
https://issues.apache.org/jira/browse/MESOS-5061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243063#comment-15243063
 ] 

Dan Osborne commented on MESOS-5061:
------------------------------------

I spent some time looking at Zogg's setup and found the following to be true.

After it initializes, the first thing the executor does is register with 
the Slave. Since we're using the network isolator here, the registration 
message should have a source address from the newly initialized network 
namespace. Since Calico is handling the isolation, we should see a 
registration with a source IP from the default 192.168.0.0/16 range, as in the 
following example:

I0414 23:31:44.479730   205 slave.cpp:2642] Got registration for executor 
'star_probe-b.0bd467c0-0299-11e6-ad3b-0242ac110005' of framework 
6a1ae9aa-ad50-44c1-8809-58791c5bcbe5-0000 from executor(1)@192.168.0.4:35454

In Zogg's test, the registration message instead carries the Slave's own 
IP address, which is wrong. When the slave then tries to respond to the 
registration at that address, it fails, since there is no executor using 
that IP/port. This explains why we see tasks stuck in staging: 
mesos-slave has completely lost contact with the executor.
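
To tell the two cases apart quickly when scanning mesos-slave.INFO, something 
like the following could help. This is only a rough sketch in Python: it 
assumes Calico is using the default 192.168.0.0/16 pool and that the log line 
has been joined onto a single line, and the helper names (executor_address, 
registered_from_pool) are made up for illustration, not part of Mesos.

```python
import ipaddress
import re

# Assumption: Calico is using its default pool, as in the setup described above.
CALICO_POOL = ipaddress.ip_network("192.168.0.0/16")

# Matches the "executor(1)@IP:PORT" part of a "Got registration" log line.
EXECUTOR_ADDR = re.compile(r"executor\(\d+\)@(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def executor_address(log_line):
    """Return (ip, port) parsed from a registration log line, or None."""
    match = EXECUTOR_ADDR.search(log_line)
    if match is None:
        return None
    return ipaddress.ip_address(match.group(1)), int(match.group(2))


def registered_from_pool(log_line, pool=CALICO_POOL):
    """True if the executor registered from an address inside the pool."""
    addr = executor_address(log_line)
    return addr is not None and addr[0] in pool


# Shortened versions of the two registration lines quoted in this issue.
good = "Got registration for executor 'x' of framework y from executor(1)@192.168.0.4:35454"
bad = "Got registration for executor 'x' of framework y from executor(1)@10.0.0.10:33443"
print(registered_from_pool(good))  # True
print(registered_from_pool(bad))   # False
```

A line that fails this check is one where the executor registered with an 
address outside the Calico pool, which is the failing case described above.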

Can someone shed light on how the executor picks this IP, or whether it's just 
extracted from the source IP of the registration message?

Versioning info:
Mesos: manually built from 0.27.0
Net-modules (basically latest):
https://github.com/mesosphere/net-modules/commits/625b67992ceca535cf2c76ea980b64aa8f4b33e1

I'm going to work on reproducing this with the net-modules docker-compose 
demo. In the meantime, any thoughts?

> process.cpp:1966] Failed to shutdown socket with fd x: Transport endpoint is 
> not connected
> ------------------------------------------------------------------------------------------
>
>                 Key: MESOS-5061
>                 URL: https://issues.apache.org/jira/browse/MESOS-5061
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, modules
>    Affects Versions: 0.27.0, 0.27.1, 0.28.0, 0.27.2
>         Environment: Centos 7.1
>            Reporter: Zogg
>             Fix For: 0.29.0
>
>
> When launching a task through Marathon and asking for the task to be 
> assigned an IP (using Calico networking):
> {noformat}
> {
>     "id":"/calico-apps",
>     "apps": [
>         {
>             "id": "hello-world-1",
>             "cmd": "ip addr && sleep 30000",
>             "cpus": 0.1,
>             "mem": 64.0,
>             "ipAddress": {
>                 "groups": ["calico-k8s-network"]
>             }
>         }
>     ]
> }
> {noformat}
> the Mesos slave fails to launch the task, leaving it stuck in the STAGING 
> state forever, with this error:
> {noformat}
> [centos@rtmi-worker-001 mesos]$ tail mesos-slave.INFO
> I0325 20:35:43.420171 13495 slave.cpp:2642] Got registration for executor 
> 'calico-apps_hello-world-1.23ff72e9-f2c9-11e5-bb22-be052ff413d3' of framework 
> 23b404e4-700a-4348-a7c0-226239348981-0000 from executor(1)@10.0.0.10:33443
> I0325 20:35:43.422652 13495 slave.cpp:1862] Sending queued task 
> 'calico-apps_hello-world-1.23ff72e9-f2c9-11e5-bb22-be052ff413d3' to executor 
> 'calico-apps_hello-world-1.23ff72e9-f2c9-11e5-bb22-be052ff413d3' of framework 
> 23b404e4-700a-4348-a7c0-226239348981-0000 at executor(1)@10.0.0.10:33443
> E0325 20:35:43.423159 13502 process.cpp:1966] Failed to shutdown socket with 
> fd 22: Transport endpoint is not connected
> I0325 20:35:43.423316 13501 slave.cpp:3481] executor(1)@10.0.0.10:33443 exited
> {noformat}
> However, when deploying a task without the ipAddress field, the Mesos slave 
> launches the task successfully.
> Tested with various Mesos/Marathon/Calico versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
