[jira] [Assigned] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.
[ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Bannier reassigned MESOS-8400:
---------------------------------------

    Assignee:     (was: Benjamin Bannier)

> Handle plugin crashes gracefully in SLRP recovery.
> --------------------------------------------------
>
>                 Key: MESOS-8400
>                 URL: https://issues.apache.org/jira/browse/MESOS-8400
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Chun-Hung Hsiao
>            Priority: Blocker
>              Labels: mesosphere, mesosphere-dss-post-ga, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its
> corresponding {{csi::Client}} service future. However, if a CSI call races
> with a plugin crash, the call may be issued before the service future is
> reset, resulting in a failure for that CSI call. MESOS-9517 partly addresses
> this for {{CreateVolume}} and {{DeleteVolume}} calls, but calls in the SLRP
> recovery path, e.g., {{ListVolumes}}, {{GetCapacity}}, {{Probe}}, could make
> the SLRP unrecoverable.
> There are two main issues:
> 1. For {{Probe}}, we should investigate whether a few retry attempts are
> needed. If the attempts still fail, we should recover from them (e.g., kill
> the plugin container) and have the container daemon relaunch the plugin
> instead of failing the daemon.
> 2. For other calls in the recovery path, we should either retry the call or
> enable the local resource provider daemon to restart the SLRP after it
> fails.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Assigned] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.
[ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Bannier reassigned MESOS-8400:
---------------------------------------

      Sprint: Resource Mgmt: RI-17 Sprint 53
    Assignee: Benjamin Bannier

> Handle plugin crashes gracefully in SLRP recovery.
> --------------------------------------------------
>
>                 Key: MESOS-8400
>                 URL: https://issues.apache.org/jira/browse/MESOS-8400
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Chun-Hung Hsiao
>            Assignee: Benjamin Bannier
>            Priority: Blocker
>              Labels: mesosphere, mesosphere-dss-post-ga, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its
> corresponding {{csi::Client}} service future. However, if a CSI call races
> with a plugin crash, the call may be issued before the service future is
> reset, resulting in a failure for that CSI call. MESOS-9517 partly addresses
> this for {{CreateVolume}} and {{DeleteVolume}} calls, but calls in the SLRP
> recovery path, e.g., {{ListVolumes}}, {{GetCapacity}}, {{Probe}}, could make
> the SLRP unrecoverable.
> There are two main issues:
> 1. For {{Probe}}, we should investigate whether a few retry attempts are
> needed. If the attempts still fail, we should recover from them (e.g., kill
> the plugin container) and have the container daemon relaunch the plugin
> instead of failing the daemon.
> 2. For other calls in the recovery path, we should either retry the call or
> enable the local resource provider daemon to restart the SLRP after it
> fails.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
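
The "retry the call" option from the description can be illustrated with a minimal, self-contained C++ sketch. It is not Mesos code and does not use libprocess futures: `Endpoint`, `callWithRetry`, and `simulateCrashRace` are hypothetical names introduced only for illustration, and the "service future" is modeled as a plain callback that returns the current endpoint. The idea shown is that each attempt re-reads the endpoint, so a call that raced with a plugin crash can succeed once the daemon has reset the service.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-in for the plugin endpoint held in the service future.
// `generation` increases each time the plugin is relaunched.
struct Endpoint
{
  int generation;
};

// Hypothetical helper (not a Mesos API): issues `call` up to `maxAttempts`
// times. The endpoint is re-read before every attempt so that a reset done
// by the container daemon after a plugin crash is picked up by the retry.
std::optional<std::string> callWithRetry(
    const std::function<Endpoint()>& getService,
    const std::function<std::optional<std::string>(const Endpoint&)>& call,
    int maxAttempts)
{
  assert(maxAttempts > 0);

  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    const Endpoint endpoint = getService();
    std::optional<std::string> result = call(endpoint);
    if (result.has_value()) {
      return result;
    }
  }

  // All attempts failed; the caller decides whether to fail or restart.
  return std::nullopt;
}

// Simulates the race: the first attempt sees a stale endpoint (generation 0)
// whose plugin has crashed, after which the plugin is "relaunched".
std::optional<std::string> simulateCrashRace(int maxAttempts)
{
  int generation = 0;

  auto getService = [&generation]() { return Endpoint{generation}; };

  auto getCapacity =
    [&generation](const Endpoint& endpoint) -> std::optional<std::string> {
      if (endpoint.generation == 0) {
        // The call hit the crashed plugin. Model the daemon relaunching the
        // plugin and resetting the service to a new generation.
        generation = 1;
        return std::nullopt;
      }
      return "capacity from plugin generation " +
             std::to_string(endpoint.generation);
    };

  return callWithRetry(getService, getCapacity, maxAttempts);
}
```

With a single attempt, `simulateCrashRace(1)` returns no value, which corresponds to the failure mode described in the issue. With more attempts, the second call reaches the relaunched plugin. A real fix would presumably also bound the retries and add backoff; those details are left out of this sketch.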