[jira] [Commented] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns
[ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669327#comment-16669327 ]

Qian Zhang commented on MESOS-9334:
-----------------------------------

commit 610064942d4a75f16f045480ca9e3414d37f1ecc
Author: Benjamin Mahler
Date: Fri Oct 26 10:50:54 2018 +0800

Fixed an early fd close in the cgroups event notifier.

The cgroups event notifier was closing the eventfd while an `io::read()` operation may still be in progress. This can lead to bugs where the fd gets re-used and read from by a stale io::read.

Review: https://reviews.apache.org/r/69123/

commit 9bd3a32b68b165d9a1a45548cebe3b22069cecc0
Author: Qian Zhang
Date: Fri Oct 26 10:57:20 2018 +0800

Ensured failed / discarded cgroups OOM notification is logged.

Failed or discarded OOM notifications in the cgroups memory subsystem were not being logged, due to the continuation being accidentally set up using `onReady` rather than `onAny`.

Review: https://reviews.apache.org/r/69188

Container stuck at ISOLATING state due to libevent poll never returns
----------------------------------------------------------------------

    Key: MESOS-9334
    URL: https://issues.apache.org/jira/browse/MESOS-9334
    Project: Mesos
    Issue Type: Bug
    Components: containerization
    Reporter: Qian Zhang
    Assignee: Qian Zhang
    Priority: Critical

We found a UCR container may be stuck in the `ISOLATING` state:
{code:java}
2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 from PREPARING to ISOLATING
2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted '/proc/5244/ns/net' to '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
{code}
In the above logs, the state of container `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` was transitioned to `ISOLATING` at 09:13:23, but it did not transition to any other state until it was destroyed due to the executor registration timeout (10 mins). And the destroy can never complete since it needs to wait for the container to finish isolating.
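Regarding the second commit above: with libprocess futures, a continuation attached via `onReady` only runs when the future succeeds, so a failed or discarded OOM notification never reached the logging code. Below is a minimal sketch of logging through `onAny` instead; the `watchOOM` helper is hypothetical and stands in for the actual memory.cpp continuation, it is not the code from the fix itself.
{code:java}
#include <glog/logging.h>

#include <process/future.hpp>

#include <stout/nothing.hpp>

using process::Future;

// Hypothetical helper (not the actual memory.cpp code): `oomListening`
// stands in for the future returned by `cgroups::memory::oom::listen()`.
void watchOOM(const Future<Nothing>& oomListening)
{
  // `onReady` would only fire on success, silently dropping failures and
  // discards. `onAny` fires for every terminal state, so they can be logged.
  oomListening.onAny([](const Future<Nothing>& future) {
    if (future.isFailed()) {
      LOG(ERROR) << "Failed to listen for OOM events: " << future.failure();
    } else if (future.isDiscarded()) {
      LOG(WARNING) << "OOM event listening was discarded";
    }
  });
}
{code}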
[jira] [Commented] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns
[ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664603#comment-16664603 ]

Qian Zhang commented on MESOS-9334:
-----------------------------------

RR: https://reviews.apache.org/r/69123/
[jira] [Commented] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns
[ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661975#comment-16661975 ]

Qian Zhang commented on MESOS-9334:
-----------------------------------

After reading some libevent code and our code that calls into libevent, I think the root cause of this issue is that, after we ask libevent to poll an fd, that fd gets disabled inside libevent due to a race. Here is the flow:
# Container1 is launched, and the cgroups memory subsystem calls `cgroups::memory::oom::listen()` to listen for OOM events for this container. Internally, that function opens an fd, asks libevent to poll it, and returns a future to the cgroups memory subsystem.
# Container1 exits, and when we destroy it, the cleanup method of the cgroups memory subsystem discards the future obtained in #1. As a result, `Listener::finalize()` is called (see [this code|https://github.com/apache/mesos/blob/1.7.0/src/linux/cgroups.cpp#L1069:L1087] for details), and it will:
** Discard the future returned by the libevent poll, which causes `pollDiscard()` to be called; that in turn triggers `pollCallback` to be executed *asynchronously* (see [this code|https://github.com/apache/mesos/blob/1.7.0/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp#L66:L70] for details).
** Close the fd opened in #1 *immediately*, which means the fd can be reused right away.
# Container2 is launched, and the CNI isolator calls `io::read` to read the stdout/stderr of the CNI plugin for this container. Internally, `io::read` *reuses* the fd closed in #2 and asks libevent to poll it.
# Now the `pollCallback` scheduled for container1 is executed. It deletes the poll object, which triggers `event_free` to deallocate the event for that container (see [this code|https://github.com/apache/mesos/blob/1.7.0/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp#L50:L52] for details). Internally, `event_free` calls `event_del` -> `event_del_internal` -> `evmap_io_del` -> `evsel->del` to *disable* the fd (see [this code|https://github.com/libevent/libevent/blob/release-2.0.22-stable/event-internal.h#L78:L79] for details), but that fd is now being used to read the stdout/stderr of container2 in #3.

Since the fd is disabled inside libevent, the `io::read` we started in #3 never returns, so container2 is stuck in the `ISOLATING` state. A standalone sketch of the fd-reuse behaviour follows below.
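To make steps #2 and #3 of the flow concrete, here is a standalone sketch (plain POSIX, not Mesos code): the kernel hands out the lowest unused fd number, so an fd that is closed "immediately" is typically reused by the very next open. A stale libevent callback that still remembers the old number therefore ends up disabling the new owner's event registration.
{code:java}
#include <fcntl.h>
#include <unistd.h>

#include <iostream>

// Standalone sketch (plain POSIX, not Mesos code): the kernel returns the
// lowest unused fd number, so closing an fd and then opening a new one
// typically yields the same number. This is the reuse in steps #2 and #3.
int main()
{
  int fd1 = ::open("/dev/null", O_RDONLY);
  std::cout << "fd for container1's OOM listener: " << fd1 << std::endl;

  // Listener::finalize() closes the fd *immediately*, even though the
  // asynchronously scheduled pollCallback still remembers this number.
  ::close(fd1);

  // The next open() (think: io::read on container2's CNI plugin output)
  // typically gets the same number back.
  int fd2 = ::open("/dev/null", O_RDONLY);
  std::cout << "fd for container2's io::read:     " << fd2 << std::endl;

  if (fd1 == fd2) {
    // When the stale pollCallback later runs event_free()/event_del()
    // against this number, it disables container2's registration, so its
    // poll never fires and io::read never completes.
    std::cout << "fd number was reused" << std::endl;
  }

  ::close(fd2);
  return 0;
}
{code}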
[jira] [Commented] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns
[ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16658795#comment-16658795 ]

Qian Zhang commented on MESOS-9334:
-----------------------------------

I added some logs into `libevent_poll.cpp` (see the diff below for details) and reproduced this issue a couple of times.
{code:java}
--- a/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
+++ b/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp
@@ -32,11 +34,17 @@ struct Poll
 };
 
 
-void pollCallback(evutil_socket_t, short what, void* arg)
+void pollCallback(evutil_socket_t fd, short what, void* arg)
 {
   Poll* poll = reinterpret_cast<Poll*>(arg);
 
+  LOG(INFO) << "==pollCallback starts with fd " << fd
+            << " and with poll " << poll << "==";
+
   if (poll->promise.future().hasDiscard()) {
+    LOG(INFO) << "==pollCallback discards with fd "
+              << fd << "==";
+
     poll->promise.discard();
   } else {
     // Convert libevent specific EV_READ / EV_WRITE to io::* specific
@@ -44,17 +52,24 @@ void pollCallback(evutil_socket_t, short what, void* arg)
     short events =
       ((what & EV_READ) ? io::READ : 0) | ((what & EV_WRITE) ? io::WRITE : 0);
 
+    LOG(INFO) << "==pollCallback sets promise with fd " << fd
+              << " and with events " << events << "==";
+
     poll->promise.set(events);
   }
 
   // Deleting the `poll` also destructs `ev` and hence triggers `event_free`,
   // which makes the event non-pending.
   delete poll;
+
+  LOG(INFO) << "==pollCallback ends with fd " << fd << "==";
 }
 
 
 void pollDiscard(const std::weak_ptr<event>& ev, short events)
 {
+  LOG(INFO) << "==pollDiscard is called==";
+
   // Discarding inside the event loop prevents `pollCallback()` from being
   // called twice if the future is discarded.
   run_in_event_loop([=]() {
@@ -78,6 +93,9 @@ Future<short> poll(int_fd fd, short events)
 
   internal::Poll* poll = new internal::Poll();
 
+  LOG(INFO) << "==libevent starts polling with fd " << fd
+            << " and with poll " << poll << "==";
+
   Future<short> future = poll->promise.future();
 
   // Convert io::READ / io::WRITE to libevent specific values of these
{code}
Here is what I found in the agent log when this issue occurred (fd 48 is the stderr file descriptor of `NetworkCniIsolatorSetup`):
{code:java}
I1021 15:57:45.00 2116 libevent_poll.cpp:96] ==libevent starts polling with fd 48 and with poll 0x7f60df029eb0==
I1021 15:57:45.00 2117 libevent_poll.cpp:41] ==pollCallback starts with fd 48 and with poll 0x7f60e6e56c70==
I1021 15:57:45.00 2117 libevent_poll.cpp:45] ==pollCallback discards with fd 48==
I1021 15:57:45.00 2117 libevent_poll.cpp:65] ==pollCallback ends with fd 48==
{code}
We can see libevent started polling fd 48 with the poll object at address 0x7f60df029eb0, but when `pollCallback` was called for fd 48, the address of the poll object was different (0x7f60e6e56c70), and that poll object's future had already been discarded (see the third log line)! And when I searched for 0x7f60e6e56c70 in the agent log, I found:
{code:java}
I1021 15:57:22.00 2115 memory.cpp:478] Started listening for OOM events for container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5
I1021 15:57:22.00 2115 libevent_poll.cpp:96] ==libevent starts polling with fd 48 and with poll 0x7f60e6e56c70==
{code}
So the poll object 0x7f60e6e56c70 had been created 23 seconds earlier with the same file descriptor (fd 48), which was used to listen for OOM events for another container (4753a5ef-eccd-4373-b8f0-a4b40abf0fb5), and that container was destroyed at 15:57:44, right before the agent started waiting for the stderr of `NetworkCniIsolatorSetup` (15:57:45).
{code:java}
I1021 15:57:44.00 2114 containerizer.cpp:2459] Destroying container 4753a5ef-eccd-4373-b8f0-a4b40abf0fb5 in RUNNING state
{code}
I reproduced this issue a couple of times; the observations from the agent log were the same as the above.
[jira] [Commented] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns
[ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656592#comment-16656592 ]

Qian Zhang commented on MESOS-9334:
-----------------------------------

I added some logs into the Mesos agent and libprocess, and then found the CNI isolator may hang at two places when isolating a container:
# Waiting for a CNI plugin (see [this code|https://github.com/apache/mesos/blob/1.7.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1327] for details). When this issue occurred, the CNI plugin had actually finished its job and exited, but the CNI isolator hung at reading the CNI plugin's stdout/stderr.
# Waiting for `NetworkCniIsolatorSetup` (see [this code|https://github.com/apache/mesos/blob/1.7.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1154] for details). Similarly, when this issue occurred, `NetworkCniIsolatorSetup` had actually finished its job and exited, but the CNI isolator hung at reading its stderr.

In both of the above cases, the [libevent poll|https://github.com/apache/mesos/blob/1.7.0/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp#L75] never returned for the fd used to read stdout/stderr, and there seemed to be no fd leak when this issue occurred. A minimal sketch of the underlying poll pattern follows below.
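For reference, the read that hangs boils down to the libprocess pattern below: `io::poll(fd, io::READ)` registers the fd with libevent and completes when the fd becomes readable, and `io::read` builds on it. This is a minimal sketch with an assumed setup (a local pipe, not the actual cni.cpp code); it completes in the healthy case, while in the bug described here the fd's event gets disabled inside libevent, so the returned future stays pending forever.
{code:java}
#include <unistd.h>

#include <iostream>

#include <process/future.hpp>
#include <process/io.hpp>
#include <process/process.hpp>

#include <stout/duration.hpp>

using process::Future;

// Standalone sketch (assumed setup, not the actual cni.cpp code): the hang
// described above is a poll on the plugin's stdout/stderr fd that never
// completes because libevent's registration for that fd was disabled.
int main()
{
  process::initialize();

  int pipefd[2];
  if (::pipe(pipefd) != 0) {
    return 1;
  }

  // Register the read end with libevent and wait for it to become readable,
  // which is what io::read() does under the hood for the plugin's output.
  Future<short> readable = process::io::poll(pipefd[0], process::io::READ);

  // Writing to the other end makes the fd readable, so the poll completes in
  // the healthy case. In the buggy case, nothing can make it complete because
  // the fd's event was disabled inside libevent.
  ::write(pipefd[1], "x", 1);

  if (readable.await(Seconds(5)) && readable.isReady()) {
    std::cout << "fd became readable" << std::endl;
  }

  ::close(pipefd[0]);
  ::close(pipefd[1]);
  return 0;
}
{code}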