> On March 12, 2016, 4:02 a.m., Jie Yu wrote: > > src/slave/containerizer/mesos/isolators/network/cni.hpp, line 85 > > <https://reviews.apache.org/r/44514/diff/2/?file=1294397#file1294397line85> > > > > I would suggest we have a `Info` for each container. > > > > ``` > > struct Info > > { > > ... > > }; > > ``` > > > > Remember that for any field you put into the Info struct, you need to > > be able to 'recover' it in recover function. It's also possible that > > ExecutorInfo/TaskInfo are not available during recover (orphans due to > > wiped meta data). You need to think about how to recover (e.g., the name of > > the network) that the container has joined. > > Avinash sridharan wrote: > We also need to recover the IP address associated with the container. > Since this has to be returned as part of `NetworkINfo` . Need to figure out > how to recover this? > > Jie Yu wrote: > I am not sure. Why do we need to store/recover IP address? When destroy a > container, the CNI plugin only needs container ID (which can be the same as > our container Id), the namespace handle and the networkconfig file. The rest > can all be fixed. > > I think the trick part is: when there is an orphan container (no > executorinfo/taskinfo), how can we figure out its network namespace handle > and networconfig file so that we can destroy it (call CNI plugin DEL). We > need to know the name of the network to get the networkcofig file, and the > pid to get the network namespace handle. > > Qian Zhang wrote: > Yes, that is the tricky part. For orphan container (e.g., the container > created by the framework without checkpoint enabled), when agent restarts, in > the recover() we only have container ID which is not enough to invoke CNI > plugin DEL. Totally we need: > 1) Container ID. > 2) Network namespace path. > 3) Network configuration. > 4) Name of the interface inside the container. > > I think currently we only have 1). > For 2), my idea is, in `isolate()`, before calling CNI plugin, we do a > bind mount (`/var/run/netns/[pid]` -> `proc/[pid]/ns/net`), and then a symbol > link (`/var/run/mesos/cni/netns/[containerId]` -> `/var/run/netns/[pid]`). > And then in `recover()`, since we have `containerId` for orphan containers, > we should be able to figure out the `pid` based on the symbol link, and then > get the network namespace path `/proc/[pid]/ns/net`. > For 3), actually when we call CNI plugin for a container in `isolate()`, > the CNI plugin (actually the CNI `host-local` IPAM plugin) will create a file > under `/var/lib/cni/networks/[neworkName]/`, the name of the file is the IP > assigned to the container, and the content of file is the container ID. So if > a container is assigned an IP from a CNI network, it must have a file under > `/var/lib/cni/networks/[neworkName]/`, we may leverage this information to > figure out the name of CNI networks the orphan container joins. But this may > only work for the CNI network which uses `host-local` IPAM plugin, for `dhcp` > IPAM plugin, I think it will not write any files for containers. > For 4), I do not have idea for it yet :-( > > Or another solution would be, when we create the symbol link in > `isolate()`, we may encode more info into the name of symbol link, e.g., > `/var/run/mesos/cni/netns/[containerId]-[net1]_[ifName in > net1]-[net2]_[ifName in net2]...`, however this seems ugly to me. > > Avinash sridharan wrote: > For (2) why not bind mount directly to > /var/run/mesos/cni/netns/[continaerId] and create a simlink to > /var/run/netns/. Also, Jie mentioned it would be a good idea to prefix the > `[containerId]` with a conanical string like mesos-cni. For (3) I don't think > we should rely on files created by CNI plugins. There is no guarantee that > the IPAM plugin would maintain those files. Instead if we have already bind > moutned the network namespace, the namespace will persist in the kernel even > after the container has exited (became an orphan). At recover we just need to > walk to list of `containerID` stored under /var/run/mesos/cni/netns and for > each network namespace we can run the `ip netns comamnd` to retrieve the link > and IP address for that network namespace. > > Qian Zhang wrote: > For 2), what symlink do you want to create under `/var/run/netns`? > `/var/run/netns/[pid] -> /var/run/mesos/cni/netns/[containerId]`? Or > `/var/run/netns/[containerId] -> /var/run/mesos/cni/netns/[contiainerId]`? > And you can take a look at > https://github.com/apache/mesos/blob/0.27.2/src/slave/containerizer/mesos/isolators/network/port_mapping.cpp#L2205:L2243, > What it does is same as what I proposed above. > > And good point for 3), I agree in `recover()`, we can run `ip netns exec > [namespace] ip addr` to retrieve the link and IP, but the problem is how we > can know which CNI networks that these links/IPs belong to? E.g., after run > `ip netns exec [namespace] ip addr` for a container, we get eth0/10.0.0.3 and > eth1/10.0.1.3, then how can we know the CNI networks they belong to? Maybe we > can encode CNI network name in the interface name when calling CNI plugin in > `isolate()`, e.g., name the interface name as `net0-eth0` rather than `eth0`, > but this seems a bit strange to me and not consistent with the interface > naming convention in Linux. > > Jie Yu wrote: > I don't think we need the symlink. We can just use a bind mount > /var/run/netns/mesos_cni_<ContainerID>. You can specify the namespace handle > using the bind mount directly (no need to get the pid). I think the symlink > in port mapping isolator is mainly for legacy reason (because initially, we > only have the pid bind mount). But turns out that we definitely need to > checkpoint some data under /var/run (for ifname and network name). The layout > might be something like the following: > ``` > /var/run/mesos/isolators/network/cni/<ContainerID>/ > |-- ns -> /proc/<pid>/ns/net (bind mount) > |-- <net1>/ > |-- ifname > |-- ipv4 > |-- <net2>/ > |-- ifname > |-- ipv4 > ```
Thank Jie. I agree that we only need a bind mount and no need the symlink. And for the layout under `/var/run/mesos/isolators/network/cni/<ContainerID>/`, I have a few questions: 1. If we create `|-- ns -> /proc/<pid>/ns/net (bind mount)`, then I think we do not need the bind mount `/var/run/netns/mesos_cni_<ContainerID> -> /proc/<pid>/ns/net`, right? 2. Do we need a sub-dir (e.g., `<net1>/` in your layout) for each CNI network that the container joins? Or what about just a file with network name as its file name and the line in the file is `ifname:ipv4`. 3. Do you think that we need to introduce an agent flag for operator to specify the directory `/var/run/mesos/isolators/network/cni/` when starting agent? Or we can just keep it internal? So basically the layout in my mind is like: /var/run/mesos/isolators/network/cni/<ContainerID>/ |-- ns -> /proc/<pid>/ns/net (bind mount) |-- <net1>: ifname1:ip1 ifname2:ip2 |-- <net2>: ifname3:ip3 ifname4:ip4 - Qian ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/44514/#review123200 ----------------------------------------------------------- On March 10, 2016, 10:20 p.m., Qian Zhang wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/44514/ > ----------------------------------------------------------- > > (Updated March 10, 2016, 10:20 p.m.) > > > Review request for mesos, Avinash sridharan, Gilbert Song, and Jie Yu. > > > Bugs: MESOS-4759 > https://issues.apache.org/jira/browse/MESOS-4759 > > > Repository: mesos > > > Description > ------- > > Implemented prepare() method of "network/cni" isolator. > > > Diffs > ----- > > src/slave/containerizer/mesos/isolators/network/cni.hpp PRE-CREATION > src/slave/containerizer/mesos/isolators/network/cni.cpp PRE-CREATION > > Diff: https://reviews.apache.org/r/44514/diff/ > > > Testing > ------- > > make check > > > Thanks, > > Qian Zhang > >