Re: Review Request 44514: Implemented prepare() method of "network/cni" isolator.

Jie Yu Tue, 15 Mar 2016 09:17:46 -0700


> On March 11, 2016, 8:02 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/isolators/network/cni.hpp, line 85
> > <https://reviews.apache.org/r/44514/diff/2/?file=1294397#file1294397line85>
> >
> >     I would suggest we have a `Info` for each container.
> >     
> >     ```
> >     struct Info
> >     {
> >       ...
> >     };
> >     ```
> >     
> >     Remember that for any field you put into the Info struct, you need to 
> > be able to 'recover' it in recover function. It's also possible that 
> > ExecutorInfo/TaskInfo are not available during recover (orphans due to 
> > wiped meta data). You need to think about how to recover (e.g., the name of 
> > the network) that the container has joined.
> 
> Avinash sridharan wrote:
>     We also need to recover the IP address associated with the container. 
> Since this has to be returned as part of `NetworkINfo` . Need to figure out 
> how to recover this?
> 
> Jie Yu wrote:
>     I am not sure. Why do we need to store/recover IP address? When destroy a 
> container, the CNI plugin only needs container ID (which can be the same as 
> our container Id), the namespace handle and the networkconfig file. The rest 
> can all be fixed.
>     
>     I think the trick part is: when there is an orphan container (no 
> executorinfo/taskinfo), how can we figure out its network namespace handle 
> and networconfig file so that we can destroy it (call CNI plugin DEL). We 
> need to know the name of the network to get the networkcofig file, and the 
> pid to get the network namespace handle.
> 
> Qian Zhang wrote:
>     Yes, that is the tricky part. For orphan container (e.g., the container 
> created by the framework without checkpoint enabled), when agent restarts, in 
> the recover() we only have container ID which is not enough to invoke CNI 
> plugin DEL. Totally we need:
>     1) Container ID.
>     2) Network namespace path.
>     3) Network configuration.
>     4) Name of the interface inside the container.
>     
>     I think currently we only have 1).
>     For 2), my idea is, in `isolate()`, before calling CNI plugin, we do a 
> bind mount (`/var/run/netns/[pid]` -> `proc/[pid]/ns/net`), and then a symbol 
> link (`/var/run/mesos/cni/netns/[containerId]` -> `/var/run/netns/[pid]`). 
> And then in `recover()`, since we have `containerId` for orphan containers, 
> we should be able to figure out the `pid` based on the symbol link, and then 
> get the network namespace path `/proc/[pid]/ns/net`.
>     For 3), actually when we call CNI plugin for a container in `isolate()`, 
> the CNI plugin (actually the CNI `host-local` IPAM plugin) will create a file 
> under `/var/lib/cni/networks/[neworkName]/`, the name of the file is the IP 
> assigned to the container, and the content of file is the container ID. So if 
> a container is assigned an IP from a CNI network, it must have a file under 
> `/var/lib/cni/networks/[neworkName]/`, we may leverage this information to 
> figure out the name of CNI networks the orphan container joins. But this may 
> only work for the CNI network which uses `host-local` IPAM plugin, for `dhcp` 
> IPAM plugin, I think it will not write any files for containers.
>     For 4), I do not have idea for it yet :-(
>     
>     Or another solution would be, when we create the symbol link in 
> `isolate()`, we may encode more info into the name of symbol link, e.g., 
> `/var/run/mesos/cni/netns/[containerId]-[net1]_[ifName in 
> net1]-[net2]_[ifName in net2]...`, however this seems ugly to me.
> 
> Avinash sridharan wrote:
>     For (2) why not bind mount directly to 
> /var/run/mesos/cni/netns/[continaerId] and create a simlink to 
> /var/run/netns/. Also, Jie mentioned it would be a good idea to prefix the 
> `[containerId]` with a conanical string like mesos-cni. For (3) I don't think 
> we should rely on files created by CNI plugins. There is no guarantee that 
> the IPAM plugin would maintain those files. Instead if we have already bind 
> moutned the network namespace, the namespace will persist in the kernel even 
> after the container has exited (became an orphan). At recover we just need to 
> walk to list of `containerID` stored under /var/run/mesos/cni/netns and for 
> each network namespace we can run the `ip netns comamnd` to retrieve the link 
> and IP address for that network namespace.
> 
> Qian Zhang wrote:
>     For 2), what symlink do you want to create under `/var/run/netns`? 
> `/var/run/netns/[pid] -> /var/run/mesos/cni/netns/[containerId]`? Or 
> `/var/run/netns/[containerId] -> /var/run/mesos/cni/netns/[contiainerId]`? 
> And you can take a look at 
> https://github.com/apache/mesos/blob/0.27.2/src/slave/containerizer/mesos/isolators/network/port_mapping.cpp#L2205:L2243,
>  What it does is same as what I proposed above.
>     
>     And good point for 3), I agree in `recover()`, we can run `ip netns exec 
> [namespace] ip addr` to retrieve the link and IP, but the problem is how we 
> can know which CNI networks that these links/IPs belong to? E.g., after run 
> `ip netns exec [namespace] ip addr` for a container, we get eth0/10.0.0.3 and 
> eth1/10.0.1.3, then how can we know the CNI networks they belong to? Maybe we 
> can encode CNI network name in the interface name when calling CNI plugin in 
> `isolate()`, e.g., name the interface name as `net0-eth0` rather than `eth0`, 
> but this seems a bit strange to me and not consistent with the interface 
> naming convention in Linux.
> 
> Jie Yu wrote:
>     I don't think we need the symlink. We can just use a bind mount 
> /var/run/netns/mesos_cni_<ContainerID>. You can specify the namespace handle 
> using the bind mount directly (no need to get the pid). I think the symlink 
> in port mapping isolator is mainly for legacy reason (because initially, we 
> only have the pid bind mount). But turns out that we definitely need to 
> checkpoint some data under /var/run (for ifname and network name). The layout 
> might be something like the following:
>     ```
>     /var/run/mesos/isolators/network/cni/<ContainerID>/
>       |-- ns -> /proc/<pid>/ns/net (bind mount)
>       |-- <net1>/
>             |-- ifname
>             |-- ipv4
>       |-- <net2>/
>             |-- ifname
>             |-- ipv4
>     ```
> 
> Qian Zhang wrote:
>     Thank Jie. I agree that we only need a bind mount and no need the symlink.
>     
>     And for the layout under 
> `/var/run/mesos/isolators/network/cni/<ContainerID>/`, I have a few questions:
>     1. If we create `|-- ns -> /proc/<pid>/ns/net (bind mount)`, then I think 
> we do not need the bind mount `/var/run/netns/mesos_cni_<ContainerID> -> 
> /proc/<pid>/ns/net`, right?
>     2. Do we need a sub-dir (e.g., `<net1>/` in your layout) for each CNI 
> network that the container joins? Or what about just a file with network name 
> as its file name and the line in the file is `ifname:ipv4`.
>     3. Do you think that we need to introduce an agent flag for operator to 
> specify the directory `/var/run/mesos/isolators/network/cni/` when starting 
> agent? Or we can just keep it internal?
>     
>     So basically the layout in my mind is like:
>       /var/run/mesos/isolators/network/cni/<ContainerID>/
>         |-- ns -> /proc/<pid>/ns/net (bind mount)
>         |-- <net1>: ifname1:ip1  
>                     ifname2:ip2
>         |-- <net2>: ifname3:ip3
>                     ifname4:ip4

Second thought: can we avoid the bind mount above. Maintaining bind mounts is 
hard because we need to clean them up in all sorts of failure cases. I think we 
can just save the pid there.

When the container finishes and after we call DEL on the CNI plugin, we remove 
the pid file, indicating that the CNI cleanup has been done. Then, we remove 
the whole directory for the container. The caveat here is that a pid might be 
reused, then the namespace handle based on proc might not be correct. But this 
is very unlikely (due to pid allocation in kernel). We made the same assumption 
in other places in the code as well.

For 2), I prefer a directory based layout so that we don't need to write a 
parsing logic. We followed this pattern for other checkpointed data as well 
(e.g., slave's meta data).

For 3), Let's fix that for now. We can always add an agent flag later. Remember 
that in many systems, /var/run is a symlink to /run. So you need to use 
os::realpath first to get the real path of /var/run.

- Jie

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44514/#review123200
-----------------------------------------------------------

On March 10, 2016, 2:20 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/44514/
> -----------------------------------------------------------
> 
> (Updated March 10, 2016, 2:20 p.m.)
> 
> 
> Review request for mesos, Avinash sridharan, Gilbert Song, and Jie Yu.
> 
> 
> Bugs: MESOS-4759
>     https://issues.apache.org/jira/browse/MESOS-4759
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Implemented prepare() method of "network/cni" isolator.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/mesos/isolators/network/cni.hpp PRE-CREATION 
>   src/slave/containerizer/mesos/isolators/network/cni.cpp PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/44514/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

Re: Review Request 44514: Implemented prepare() method of "network/cni" isolator.

Reply via email to