[ https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879280#comment-16879280 ]
Benno Evers edited comment on MESOS-9867 at 7/5/19 1:56 PM: ------------------------------------------------------------ Since this requires very specific conditions in order to happen, I just added a warning if these conditions are met. Over time, this will solve itself as people are using newer and newer libevent versions. {noformat} commit 1a6760c60dc823b088ffbcf48909cf3e371570f3 Author: Benno Evers <bev...@mesosphere.com> Date: Wed Jun 26 16:30:12 2019 +0200 Added warnings about known problems with libevent epoll backend. Some SSL options are known to cause issues in combination with older versions of libevent. Detect and warn about this situation. See MESOS-9867 for details. Review: https://reviews.apache.org/r/70993 {noformat} was (Author: bennoe): Since this requires very specific conditions in order to happen, I just added a warning if these conditions are met. Over time, this will solve itself as people are using newer and newer libevent versions. {noformat} commit 1a6760c60dc823b088ffbcf48909cf3e371570f3 (HEAD -> master, origin/master, mesosphere-private/ci/bevers/tls-hostname-validation) Author: Benno Evers <bev...@mesosphere.com> Date: Wed Jun 26 16:30:12 2019 +0200 Added warnings about known problems with libevent epoll backend. Some SSL options are known to cause issues in combination with older versions of libevent. Detect and warn about this situation. See MESOS-9867 for details. Review: https://reviews.apache.org/r/70993 {noformat} > Libevent fd cleanup failure may cause hangs in combination with client > certificate validation > --------------------------------------------------------------------------------------------- > > Key: MESOS-9867 > URL: https://issues.apache.org/jira/browse/MESOS-9867 > Project: Mesos > Issue Type: Bug > Reporter: Benno Evers > Priority: Major > Labels: libevent, libprocess, ssl, tls > > A listening LibeventSSLSocket will check cryptographic certificate validity > during the OpenSSL handshake and afterwards call the `openssl::verify()` > function to perform hostname validation and other checks on the client > certificate. If these checks fail, the bufferevent is deleted and the > connection closed: > {noformat} > // libevent_ssl_socket.cpp, accept_SSL_callback() > if (verify.isError()) { > VLOG(1) << "Failed accept, verification error: " << > verify.error(); > request->promise.fail(verify.error()); > SSL_free(ssl); > bufferevent_free(bev); > // TODO(jmlvanre): Clean up for readability. Consider RAII > // or constructing the impl earlier. > CHECK(request->socket >= 0); > Try<Nothing> close = os::close(request->socket); > if (close.isError()) { > LOG(FATAL) > << "Failed to close socket " << stringify(request->socket) > << ": " << close.error(); > } > delete request; > return; > } > {noformat} > However, when we close the socket fd in the above code, libevent had already > registered that file descriptor with epoll() to watch for read and write > events on that socket. Since the socket is closed, attempts to remove the > corresponding fd from the epoll() structs will fail: (See also: > https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/) > {noformat} > [warn] Epoll MOD(4) on fd 9 failed. Old events were 6; read change was 2 > (del); write change was 0 (none): Bad file descriptor > [warn] Epoll MOD(1) on fd 9 failed. Old events were 6; read change was 0 > (none); write change was 2 (del): Bad file descriptor > {noformat} > However, that in itself is harmless since the kernel will remove the kernel > object that was associated with fd 9 from the data structure associated with > that epoll instance in the kernel. So while we get an error attempting to > remove fd 9, there is actually nothing left to remove. However, in a case of > epoll failure, libprocess does not adjust the number of readers and writers > on that file descriptor: > {noformat} > // evmap.c, evmap_io_del() > [...] > if (evsel->del(base, ev->ev_fd, old, res, extra) == -1) > return (-1); > [...] > ctx->nread = nread; > ctx->nwrite = nwrite; > {noformat} > In the above, ctx is part of an array collecting information for each file > descriptor. That still wouldn't be so bad, however libevent also only adds > file descriptors to `epoll()` struct the *first* time we attempt to create a > read or write event on that file descriptor: > {noformat} > // evmap.c, evmap_io_add() > if (ev->ev_events & EV_READ) { > if (++nread == 1) > res |= EV_READ; > } > if (ev->ev_events & EV_WRITE) { > if (++nwrite == 1) > res |= EV_WRITE; > } > [...] > if (res) { > [...] > if (evsel->add(base, ev->ev_fd, > old, (ev->ev_events & EV_ET) | res, extra) == -1) > return (-1); > [...] > } > {noformat} > So when the same file descriptor is attempted to be used again by libevent > for epoll() polling, the process will hang because reads or writes to that > file descriptor are never noticed. > This can be reproduced for example by running a test where the > `verify()`-callback fails on the server side twice in a row: (note, the > LIBPROCESS_IP below is set in order to induce a test failure, result may vary > on your local network and ssl configuration) > {noformat} > LIBPROCESS_IP=127.0.1.1 ./libprocess-tests > --gtest_filter="*VerifyCertificate*" --gtest_repeat=2 > {noformat} > There is a chance that the issue described here is the same as the ominous > "issues" described in MESOS-3008, -- This message was sent by Atlassian JIRA (v7.6.3#76005)