[ https://issues.apache.org/jira/browse/AURORA-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jake Farrell updated AURORA-1790:
---------------------------------
    Assignee:     (was: Jake Farrell)

> Aurora CNI Support
> ------------------
>
>                 Key: AURORA-1790
>                 URL: https://issues.apache.org/jira/browse/AURORA-1790
>             Project: Aurora
>          Issue Type: Epic
>            Reporter: Stephan Erb
>
> The [Container Network Interface
> (CNI)|https://github.com/containernetworking/cni/blob/master/SPEC.md] is a
> plug-in based networking solution for containers. CNI is [supported by the
> Mesos Unified
> Containerizer|https://github.com/apache/mesos/blob/master/docs/cni.md].
> CNI support in Aurora would enable cluster operators to isolate tasks on the
> network level. This includes features such as IP-per-container, or security
> policies ensuring that only designated subsets of containers can communicate
> with each other. Both are important features for multi-tenant environments.
> h2. Mesos Protobufs
> In order to launch a task using CNI, Mesos requires frameworks to populate
> the
> [NetworkInfo|https://github.com/apache/mesos/blob/0f97117bac3e1382744e9a847ce11b7589fc45bd/include/mesos/mesos.proto#L1916-L1999]
> protobuf. The following shows the relevant subset of fields:
> {code}
> /**
>  * Describes a container configuration and allows extensible
>  * configurations for different container implementations.
>  *
>  * NOTE: In the Aurora case, this is set as part of ExecutorInfo
>  */
> message ContainerInfo {
>   ...
>   // A list of network requests. A framework can request multiple IP addresses
>   // for the container.
>   repeated NetworkInfo network_infos = 7;
>   ...
> }
> /**
>  * Describes a network request from a framework as well as network resolution
>  * provided by Mesos.
>  */
> message NetworkInfo {
>   ...
>   // For the CNI case, empty during task/executor launch and only used
>   // in TaskStatus messages to inform the framework scheduler about
>   // the IP addresses bound to a container
>   repeated IPAddress ip_addresses = 5;
>   // Name of the network which will be used by network isolator to determine
>   // the network that the container joins. It's up to the network isolator
>   // to decide how to interpret this field.
>   optional string name = 6;
>   // To tag certain metadata to be used by Isolator/IPAM, e.g., rack, etc.
>   // Opaque to Mesos but interpreted by the CNI plugin
>   optional Labels labels = 4;
>   ...
> }
> /**
>  * Container related information that is resolved during container
>  * setup. The information is sent back to the framework as part of the
>  * TaskStatus message.
>  */
> message ContainerStatus {
>   // This field can be reliably used to identify the container IP address.
>   repeated NetworkInfo network_infos = 1;
>   ...
> }
> {code}
> h2. Challenges
> * In contrast to ports or other resources, this is the first time an
> important detail is only discovered asynchronously after a task has been
> launched, i.e. the scheduler will only learn about the IP addresses of the
> launched task after having received its first {{TaskStatus}}.
> * A task can now live in multiple networks and can have multiple IP addresses.
> h2. Necessary Changes
> In order to implement CNI support in Aurora, several changes across the
> entire code base are needed.
> h3. Mesos
> * As of today, it seems like there is no reliable way to discover
> CNI-assigned IPs from within an executor (see MESOS-6281). This is crucial
> for us, as Thermos is responsible for announcing itself in ZooKeeper
> serversets.
> h3. Thermos
> * The Observer UI needs to be updated to handle multiple IP addresses.
> * The ZK serverset announcement needs to be adjusted to publish all
> IP addresses.
> * A replacement/addition for pystachio {{{{mesos.hostname}}}} is required so
> that user code can discover its current IP addresses. This relates to
> MESOS-6281.
> h3. Aurora Scheduler
> * Feature toggle allowing operators to enable/disable CNI support.
> * Plumbing of the NetworkInfo name and labels through the Thrift API,
> storage, and the task launch mechanism.
> * Extension of {{TaskStatusHandlerImpl}} and the
> [{{StateManager}}|https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/state/StateManager.java]
> storage layer to persist received IP addresses.
> h3. Aurora Client
> * Extension of the Pystachio configuration so that user-defined jobs can join
> operator-enabled networks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)