[ https://issues.apache.org/jira/browse/MESOS-10163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182170#comment-17182170 ]
Greg Mann commented on MESOS-10163:
-----------------------------------

{noformat}
commit 68b481085fb82b475e108b9aa39935a8d7729983
Author: Greg Mann <g...@mesosphere.io>
Date:   Thu Aug 20 19:26:48 2020 -0700

    Fixed a bug in CSI volume manager initialization.

    Previously, the volume managers would assume that they could make
    CONTROLLER_SERVICE calls during plugin initialization, regardless
    of whether or not the plugin provides that service.

    Review: https://reviews.apache.org/r/72726/
{noformat}
{noformat}
commit 5ed30db48785007e35805886a024ebb8a61a7037
Author: Greg Mann <g...@mesosphere.io>
Date:   Thu Aug 20 19:27:02 2020 -0700

    Added the CSI server to the Mesos agent.

    This patch adds a CSI server to the Mesos agent in both the
    agent binary and in tests.

    Review: https://reviews.apache.org/r/72761/
{noformat}
{noformat}
commit 4ff51041df860dbcc2247ef47a0596e5132da190
Author: Greg Mann <g...@mesosphere.io>
Date:   Thu Aug 20 19:27:23 2020 -0700

    Initialized plugins lazily in the CSI server.

    Review: https://reviews.apache.org/r/72779/
{noformat}

> Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls
> ------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-10163
>                 URL: https://issues.apache.org/jira/browse/MESOS-10163
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Qian Zhang
>            Assignee: Greg Mann
>            Priority: Major
>
> *Background:*
> Originally we wanted the `volume/csi` isolator to leverage the existing
> [service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]
> to launch CSI plugins as standalone containers. Currently, the service
> manager needs to call the following agent HTTP APIs:
> # `GET_CONTAINERS` to get all standalone containers in its `recover` method.
> # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone
> containers in its `recover` method.
> # `LAUNCH_CONTAINER` via the existing
> [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46]
> to launch the CSI plugin as a standalone container when its `getEndpoint`
> method is called.
>
> The problem with the above design is that the `volume/csi` isolator may need
> to clean up orphan containers during agent recovery, which is triggered by
> the containerizer (see
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275]
> for details). To clean up an orphan container which is using a CSI volume,
> the `volume/csi` isolator needs to instantiate and recover the service
> manager and get the CSI plugin’s endpoint from it (i.e., the service
> manager’s `getEndpoint` method will be called by the `volume/csi` isolator
> during agent recovery). As mentioned above, the service manager’s
> `getEndpoint` may need to call `LAUNCH_CONTAINER` to launch the CSI plugin as
> a standalone container. Since the agent is still in the recovering state,
> such an agent HTTP call will simply be rejected by the agent. So we have to
> instantiate and recover the service manager *after agent recovery is done*,
> but in the `volume/csi` isolator we do not have such information (i.e., the
> signal that agent recovery is done).
>
> *Solution*
> We need to implement a new component (like `CSIVolumeManager` or a better
> name?) in the Mesos agent which is responsible for launching CSI plugins as
> standalone containers (via the existing
> [service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51])
> and making CSI gRPC calls (via the existing
> [volume manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]).
> * We can instantiate this new component in the `main` method of the agent
> and pass it to both the containerizer and the agent (i.e., it will be a
> member of the `Slave` object), and the containerizer will in turn pass it to
> the `volume/csi` isolator.
> * Since this new component relies on the service manager, which will call
> agent HTTP APIs, we need to pass the agent URL to it, like
> `process::http::URL(scheme, agentIP, agentPort, agentLibprocessId + "/api/v1")`,
> see
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471]
> for an example.
> * When the agent registers/reregisters with the master (`Slave::registered`
> and `Slave::reregistered`), we should call this new component’s `start`
> method (see
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742]
> and
> [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827]
> as examples), which will scan the directory `--csi_plugin_config_dir` and
> create a `service manager - volume manager` pair for each CSI plugin loaded
> from that directory.
> * The `volume/csi` isolator needs to call this new component’s
> `publishVolume` and `unpublishVolume` methods in its `prepare` and `cleanup`
> methods.
>
> When cleaning up orphan containers during agent recovery, the `volume/csi`
> isolator will just call this new component’s `unpublishVolume` method as
> usual, and it is this new component’s responsibility to only make the actual
> CSI gRPC call after agent recovery is done and the agent has registered with
> the master (e.g., when this new component’s `start` method is called).
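The deferral described in the last paragraph of the issue can be illustrated with a minimal, single-threaded sketch. It is not the code from the commits above: the class name `DeferredCsiComponent`, the `rpc` callback that stands in for the real CSI gRPC call, and the helper `runRecoveryScenario` are all hypothetical, and the sketch leaves out libprocess, futures, the service manager, and the volume manager entirely. It only shows the ordering guarantee: calls made before `start` are queued, and they are issued once `start` runs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed agent component. None of these
// names are taken from the Mesos code base.
class DeferredCsiComponent
{
public:
  // `rpc` stands in for the actual CSI gRPC call (an assumption of this sketch).
  explicit DeferredCsiComponent(std::function<void(const std::string&)> rpc)
    : rpc_(std::move(rpc))
  {
    assert(static_cast<bool>(rpc_)); // A callable must be provided.
  }

  void publishVolume(const std::string& volumeId)
  {
    submit("publish " + volumeId);
  }

  void unpublishVolume(const std::string& volumeId)
  {
    submit("unpublish " + volumeId);
  }

  // Meant to be called once the agent has (re)registered with the master.
  // Issues every queued call in the order it was submitted.
  void start()
  {
    if (started_) {
      return; // Reregistration: nothing left to flush.
    }

    started_ = true;

    std::vector<std::string> queued;
    queued.swap(pending_);

    for (const std::string& call : queued) {
      rpc_(call);
    }
  }

  std::size_t pending() const { return pending_.size(); }

private:
  // Issues the call immediately after `start`, queues it before.
  void submit(std::string call)
  {
    if (started_) {
      rpc_(call);
    } else {
      pending_.push_back(std::move(call));
    }
  }

  std::function<void(const std::string&)> rpc_;
  std::vector<std::string> pending_;
  bool started_ = false;
};

// Replays the recovery scenario from the issue and returns the observed
// order of events.
std::vector<std::string> runRecoveryScenario()
{
  std::vector<std::string> events;

  DeferredCsiComponent component(
      [&events](const std::string& call) { events.push_back(call); });

  // Orphan cleanup during agent recovery: the call is only queued.
  component.unpublishVolume("vol-1");

  // The agent registers with the master and starts the component.
  events.push_back("registered");
  component.start();

  // Later calls go out immediately.
  component.publishVolume("vol-2");

  return events;
}
```

In this sketch the isolator never needs to know whether agent recovery has finished; that knowledge lives only in the component, which is the point of the proposed design. A real implementation would presumably return a future from `publishVolume` and `unpublishVolume` so that callers can wait for the deferred call to complete.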