[ 
https://issues.apache.org/jira/browse/MESOS-5225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245085#comment-15245085
 ] 

Avinash Sridharan edited comment on MESOS-5225 at 4/18/16 4:44 AM:
-------------------------------------------------------------------

Thanks, Qian!
This does seem like a bug.

The odd part is that we do set the rootfs on which we are going to bind mount
the network files by checking the `ContainerConfig`:
https://github.com/apache/mesos/blob/0845ec04395faeb05a518a81c89c87b726dc8711/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L566

However, my suspicion is that for command tasks the rootfs in `ContainerConfig` 
is set to the actual rootfs of the container. (Need to confirm this). 

This does raise a question, though: even if we bind mount the files onto the
corresponding files in the host file system, we still need to bind mount the
same files into the container file system as well. The reason is that after
`pivot_root` the process will treat the container rootfs as its root
filesystem, and if the network files are not bind mounted into the rootfs of
the container, we will see the same failure.

I am thinking that the fix should be to bind mount the files into both the
rootfs of the container and the rootfs of the host file system. These mount
points will get destroyed anyway when the mnt namespace is destroyed (that is,
when the container dies).
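
To make the proposal concrete, here is a minimal sketch of the set of bind
mounts it implies. It is not the isolator's code: the function name
`bind_mount_targets`, the list of network files, and the idea that
per-container copies live under the isolator's container directory are all
assumptions made for illustration. It only computes the (source, target)
pairs and does not call `mount`.

```python
import os

# Assumption: the network files the isolator would manage for a container.
NETWORK_FILES = ("/etc/hosts", "/etc/hostname", "/etc/resolv.conf")


def bind_mount_targets(container_dir, container_rootfs=None):
    """Return (source, target) pairs for the bind mounts to perform.

    container_dir is a hypothetical directory holding the per-container
    copies of the network files. container_rootfs is the provisioned
    rootfs of the container, or None when the container has no image.
    """
    mounts = []
    for path in NETWORK_FILES:
        source = os.path.join(container_dir, os.path.basename(path))
        # Host file system view: a process that has not yet called
        # pivot_root resolves names against these paths.
        mounts.append((source, path))
        if container_rootfs is not None:
            # Container rootfs view: what the process sees after pivot_root.
            target = os.path.join(container_rootfs, path.lstrip("/"))
            mounts.append((source, target))
    return mounts
```

With a rootfs, each file yields two targets (one under `/etc` and one under
`<rootfs>/etc`); without one, only the host-view targets are returned. Both
sets live in the container's mnt namespace, so they go away with it.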


> Command executor can not start when joining a CNI network
> ---------------------------------------------------------
>
>                 Key: MESOS-5225
>                 URL: https://issues.apache.org/jira/browse/MESOS-5225
>             Project: Mesos
>          Issue Type: Bug
>          Components: isolation
>            Reporter: Qian Zhang
>            Assignee: Qian Zhang
>
> Reproduce steps:
> 1. Start master
> {code}
> sudo ./bin/mesos-master.sh --work_dir=/tmp
> {code}
>  
> 2. Start agent
> {code}
> sudo ./bin/mesos-slave.sh --master=192.168.122.171:5050 \
>   --containerizers=mesos --image_providers=docker \
>   --isolation=filesystem/linux,docker/runtime,network/cni \
>   --network_cni_config_dir=/opt/cni/net_configs \
>   --network_cni_plugins_dir=/opt/cni/plugins
> {code}
>  
> 3. Launch a command task with mesos-execute, and it will join a CNI network 
> {{net1}}.
> {code}
> sudo src/mesos-execute --master=192.168.122.171:5050 --name=test \
>   --docker_image=library/busybox --networks=net1 --command="sleep 10" \
>   --shell=true
> I0418 08:25:35.746758 24923 scheduler.cpp:177] Version: 0.29.0
> Subscribed with ID '3c4796f0-eee7-4939-a036-7c6387c370eb-0000'
> Submitted task 'test' to agent 'b74535d8-276f-4e09-ab47-53e3721ab271-S0'
> Received status update TASK_FAILED for task 'test'
>   message: 'Executor terminated'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_TERMINATED
> {code}
> So the task failed with the reason "executor terminated". Here is the agent 
> log:
> {code}
> I0418 08:25:35.804873 24911 slave.cpp:1514] Got assigned task test for framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:35.807937 24911 slave.cpp:1633] Launching task test for framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:35.812503 24911 paths.cpp:528] Trying to chown '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' to user 'root'
> I0418 08:25:35.820339 24911 slave.cpp:5620] Launching executor test of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:35.822576 24914 containerizer.cpp:698] Starting container '2b29d6d6-b314-477f-b734-7771d07d41e3' for executor 'test' of framework '3c4796f0-eee7-4939-a036-7c6387c370eb-0000'
> I0418 08:25:35.825996 24911 slave.cpp:1851] Queuing task 'test' for executor 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:35.832348 24911 provisioner.cpp:285] Provisioning image rootfs '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea31-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.061249 24913 linux_launcher.cpp:281] Cloning child process with flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS
> I0418 08:25:36.071208 24915 cni.cpp:643] Bind mounted '/proc/24950/ns/net' to '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.250573 24916 cni.cpp:962] Got assigned IPv4 address '192.168.1.2/24' from CNI network 'net1' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:36.252002 24917 cni.cpp:765] Unable to find DNS nameservers for container 2b29d6d6-b314-477f-b734-7771d07d41e3. Using host '/etc/resolv.conf'
> I0418 08:25:37.663487 24916 containerizer.cpp:1696] Executor for container '2b29d6d6-b314-477f-b734-7771d07d41e3' has exited
> I0418 08:25:37.663745 24916 containerizer.cpp:1461] Destroying container '2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:37.670574 24915 cgroups.cpp:2676] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.676864 24912 cgroups.cpp:1409] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 6.061056ms
> I0418 08:25:37.680552 24913 cgroups.cpp:2694] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.683346 24913 cgroups.cpp:1438] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2b29d6d6-b314-477f-b734-7771d07d41e3 after 2.46016ms
> I0418 08:25:37.874023 24914 cni.cpp:1121] Unmounted the network namespace handle '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3/ns' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.874194 24914 cni.cpp:1132] Removed the container directory '/run/mesos/isolators/network/cni/2b29d6d6-b314-477f-b734-7771d07d41e3'
> I0418 08:25:37.877306 24912 linux.cpp:814] Ignoring unmounting sandbox/work directory for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.879295 24912 provisioner.cpp:338] Destroying container rootfs at '/tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea31-45f6-b578-a62cd02392e7' for container 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.970871 24914 slave.cpp:4113] Executor 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 exited with status 1
> I0418 08:25:37.975452 24914 slave.cpp:3201] Handling status update TASK_FAILED (UUID: a5e19b2d-b234-4adc-8791-9046af4c1395) for task test of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 from @0.0.0.0:0
> W0418 08:25:37.978974 24911 containerizer.cpp:1303] Ignoring update for unknown container: 2b29d6d6-b314-477f-b734-7771d07d41e3
> I0418 08:25:37.980370 24917 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: a5e19b2d-b234-4adc-8791-9046af4c1395) for task test of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:37.983105 24913 slave.cpp:3599] Forwarding the update TASK_FAILED (UUID: a5e19b2d-b234-4adc-8791-9046af4c1395) for task test of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 to master@192.168.122.171:5050
> I0418 08:25:38.017352 24917 slave.cpp:2232] Asked to shut down framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000 by master@192.168.122.171:5050
> I0418 08:25:38.018487 24917 slave.cpp:2257] Shutting down framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:38.019630 24917 slave.cpp:4217] Cleaning up executor 'test' of framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:38.020967 24911 gc.cpp:55] Scheduling '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3' for gc 6.99999975983704days in the future
> I0418 08:25:38.022328 24917 slave.cpp:4305] Cleaning up framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:38.022847 24915 status_update_manager.cpp:282] Closing status update streams for framework 3c4796f0-eee7-4939-a036-7c6387c370eb-0000
> I0418 08:25:38.022459 24912 gc.cpp:55] Scheduling '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test' for gc 6.99999974402963days in the future
> I0418 08:25:38.023483 24916 gc.cpp:55] Scheduling '/tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000' for gc 6.99999973582222days in the future
> ...
> {code}
> And this is the stderr of the executor:
> {code}
> cat /tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3/stderr
>  
> + /home/stack/workspace/mesos/build/src/mesos-containerizer mount --help=false --operation=make-rslave --path=/
> + grep -E /tmp/mesos/.+ /proc/self/mountinfo
> + grep -v 2b29d6d6-b314-477f-b734-7771d07d41e3
> + cut -d  -f5
> + xargs --no-run-if-empty umount -l
> + mount -n --rbind /tmp/mesos/provisioner/containers/2b29d6d6-b314-477f-b734-7771d07d41e3/backends/copy/rootfses/d219ec3a-ea31-45f6-b578-a62cd02392e7 /tmp/mesos/slaves/b74535d8-276f-4e09-ab47-53e3721ab271-S0/frameworks/3c4796f0-eee7-4939-a036-7c6387c370eb-0000/executors/test/runs/2b29d6d6-b314-477f-b734-7771d07d41e3/.rootfs
> Failed to obtain the IP address for '2b29d6d6-b314-477f-b734-7771d07d41e3'; the DNS service may not be able to resolve it: Name or service not known
> {code}
> So the executor terminated because the libprocess instance in it failed to 
> resolve its hostname {{2b29d6d6-b314-477f-b734-7771d07d41e3}}; see 
> https://github.com/apache/mesos/blob/0.28.0/3rdparty/libprocess/src/process.cpp#L929:L935
>  for details.
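
The hostname lookup failure described in the quoted report can be illustrated
with a small sketch. This is not libprocess code: `resolve_own_hostname` is a
hypothetical helper that mimics the behaviour described (look up the local
hostname, report an error if the resolver cannot map it to an address), using
Python's standard `socket` module.

```python
import socket


def resolve_own_hostname(hostname=None):
    """Return the IPv4 address for hostname, or None if the lookup fails.

    When hostname is None the local hostname is used, which inside a
    container with its own UTS namespace may be the container ID.
    """
    if hostname is None:
        hostname = socket.gethostname()
    try:
        return socket.gethostbyname(hostname)
    except OSError as error:
        # socket.gaierror is a subclass of OSError; a missing /etc/hosts
        # entry typically surfaces here as "Name or service not known".
        print(f"Failed to obtain the IP address for '{hostname}': {error}")
        return None
```

If the container's hostname has no entry in the `/etc/hosts` (or DNS) that the
process can see, the helper returns `None`, which corresponds to the point at
which the executor in the report gave up.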



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
