[ https://issues.apache.org/jira/browse/MESOS-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170261#comment-17170261 ]
Greg Mann commented on MESOS-10163:
-----------------------------------

{noformat}
commit c78dc333fc893a43d40dc33299a61987198a6ea9 (HEAD -> master, origin/master, origin/HEAD)
Author: Greg Mann <g...@mesosphere.io>
Date:   Mon Aug 3 10:11:57 2020 -0700

    Added interface for the CSI server.

    This component will hold objects associated with CSI plugins
    running on the agent.

    Review: https://reviews.apache.org/r/72707/
{noformat}

> Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls
> ------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-10163
>                 URL: https://issues.apache.org/jira/browse/MESOS-10163
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Qian Zhang
>            Assignee: Greg Mann
>            Priority: Major
>
> *Background:*
> Originally we wanted the `volume/csi` isolator to leverage the existing
> [service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]
> to launch CSI plugins as standalone containers. Currently the service manager
> needs to call the following agent HTTP APIs:
> # `GET_CONTAINERS` to get all standalone containers in its `recover` method.
> # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone
> containers in its `recover` method.
> # `LAUNCH_CONTAINER` via the existing
> [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46]
> to launch a CSI plugin as a standalone container when its `getEndpoint`
> method is called.
> The problem with the above design is that the `volume/csi` isolator may need
> to clean up orphan containers during agent recovery, which is triggered by
> the containerizer (see
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275]
> for details). To clean up an orphan container that is using a CSI volume, the
> `volume/csi` isolator needs to instantiate and recover the service manager
> and get the CSI plugin’s endpoint from it (i.e., the service manager’s
> `getEndpoint` method will be called by the `volume/csi` isolator during agent
> recovery). As mentioned above, the service manager’s `getEndpoint` may need
> to call `LAUNCH_CONTAINER` to launch the CSI plugin as a standalone
> container. Since the agent is still in the recovering state, that agent HTTP
> call will simply be rejected by the agent. So we have to instantiate and
> recover the service manager *after agent recovery is done*, but the
> `volume/csi` isolator does not have that information (i.e., the signal that
> agent recovery is done).
>
> *Solution*
> We need to implement a new component (like `CSIVolumeManager` or a better
> name?) in the Mesos agent which is responsible for launching CSI plugins as
> standalone containers (via the existing
> [service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51])
> and making CSI gRPC calls (via the existing
> [volume manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]).
> * We can instantiate this new component in the `main` method of the agent and
> pass it to both the containerizer and the agent (i.e., it will be a member of
> the `Slave` object). The containerizer will in turn pass it to the
> `volume/csi` isolator.
> * Since this new component relies on the service manager, which will call
> agent HTTP APIs, we need to pass the agent URL to it, like
> `process::http::URL(scheme, agentIP, agentPort, agentLibprocessId + "/api/v1")`,
> see
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471]
> for an example.
> * When the agent registers/reregisters with the master (`Slave::registered`
> and `Slave::reregistered`), we should call this new component’s `start`
> method (see
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742]
> and
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827]
> as examples), which will scan the directory `--csi_plugin_config_dir` and
> create the `service manager - volume manager` pair for each CSI plugin loaded
> from that directory.
> * The `volume/csi` isolator needs to call this new component’s
> `publishVolume` and `unpublishVolume` methods in its `prepare` and `cleanup`
> methods.
>
> When cleaning up orphan containers during agent recovery, the `volume/csi`
> isolator will just call this new component’s `unpublishVolume` method as
> usual, and it is this new component’s responsibility to make the actual CSI
> gRPC call only after agent recovery is done and the agent has registered with
> the master (e.g., when this new component’s `start` method is called).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
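
The deferral described in the last paragraph of the issue (the isolator calls `unpublishVolume` as usual, and the component holds the real CSI call until `start` has been called) could be sketched roughly as below. This is only an illustration under stated assumptions, not the Mesos implementation: `DeferredCsiComponent`, `pendingCount`, and `exampleUsage` are hypothetical names, the injected `std::function` stands in for the actual CSI gRPC call, and the sketch is single-threaded and synchronous, whereas the real component would presumably be built on libprocess futures.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed component; not the Mesos API.
class DeferredCsiComponent
{
public:
  // `unpublish` is an assumed callback standing in for the CSI gRPC call.
  explicit DeferredCsiComponent(
      std::function<void(const std::string&)> unpublish)
    : unpublish_(std::move(unpublish)) {}

  // Called from the isolator's cleanup path, possibly while the agent is
  // still recovering. Before `start()`, the request is only queued.
  void unpublishVolume(const std::string& volumeId)
  {
    if (started_) {
      unpublish_(volumeId);
    } else {
      pending_.push_back(volumeId);
    }
  }

  // Called once the agent has registered or reregistered with the master.
  // Flushes every request that was queued during recovery, in order.
  void start()
  {
    if (started_) {
      return; // A later reregistration has nothing left to flush.
    }
    started_ = true;

    std::vector<std::string> queued;
    queued.swap(pending_);
    for (const std::string& volumeId : queued) {
      unpublish_(volumeId);
    }
  }

  std::size_t pendingCount() const { return pending_.size(); }

private:
  std::function<void(const std::string&)> unpublish_;
  std::vector<std::string> pending_;
  bool started_ = false;
};

// Usage: a call made during recovery is held back until `start()`.
inline void exampleUsage()
{
  std::vector<std::string> calls;
  DeferredCsiComponent component(
      [&calls](const std::string& id) { calls.push_back(id); });

  component.unpublishVolume("orphan-volume"); // Agent still recovering.
  assert(calls.empty());

  component.start(); // Agent registered with the master.
  assert(calls.size() == 1);
}
```

Queuing inside the component keeps the isolator unaware of the agent's recovery state, which is the property the issue asks for; whether requests are queued, chained on a future, or retried is left open by the description.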