[ https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873908#comment-16873908 ]
Benno Evers edited comment on MESOS-9867 at 6/27/19 11:25 AM: -------------------------------------------------------------- This was fixed upstream with libevent 2.1.8: https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a was (Author: bennoe): This was fixed upstream with libevent-2.1.10: https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a > Libevent fd cleanup failure may cause hangs in subsequent tests > --------------------------------------------------------------- > > Key: MESOS-9867 > URL: https://issues.apache.org/jira/browse/MESOS-9867 > Project: Mesos > Issue Type: Bug > Reporter: Benno Evers > Priority: Major > Labels: libevent, libprocess, ssl, tls > > A listening LibeventSSLSocket will check cryptographic certificate validity > during the OpenSSL handshake and afterwards call the `openssl::verify()` > function to perform hostname validation and other checks on the client > certificate. If these checks fail, the bufferevent is deleted and the > connection closed: > {noformat} > // libevent_ssl_socket.cpp, accept_SSL_callback() > if (verify.isError()) { > VLOG(1) << "Failed accept, verification error: " << > verify.error(); > request->promise.fail(verify.error()); > SSL_free(ssl); > bufferevent_free(bev); > // TODO(jmlvanre): Clean up for readability. Consider RAII > // or constructing the impl earlier. > CHECK(request->socket >= 0); > Try<Nothing> close = os::close(request->socket); > if (close.isError()) { > LOG(FATAL) > << "Failed to close socket " << stringify(request->socket) > << ": " << close.error(); > } > delete request; > return; > } > {noformat} > However, when we close the socket fd in the above code, libevent had already > registered that file descriptor with epoll() to watch for read and write > events on that socket. Since the socket is closed, attempts to remove the > corresponding fd from the epoll() structs will fail: (See also: > https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/) > {noformat} > [warn] Epoll MOD(4) on fd 9 failed. Old events were 6; read change was 2 > (del); write change was 0 (none): Bad file descriptor > [warn] Epoll MOD(1) on fd 9 failed. Old events were 6; read change was 0 > (none); write change was 2 (del): Bad file descriptor > {noformat} > However, that in itself is harmless since the kernel will remove the kernel > object that was associated with fd 9 from the data structure associated with > that epoll instance in the kernel. So while we get an error attempting to > remove fd 9, there is actually nothing left to remove. However, in a case of > epoll failure, libprocess does not adjust the number of readers and writers > on that file descriptor: > {noformat} > // evmap.c, evmap_io_del() > [...] > if (evsel->del(base, ev->ev_fd, old, res, extra) == -1) > return (-1); > [...] > ctx->nread = nread; > ctx->nwrite = nwrite; > {noformat} > In the above, ctx is part of an array collecting information for each file > descriptor. That still wouldn't be so bad, however libevent also only adds > file descriptors to `epoll()` struct the *first* time we attempt to create a > read or write event on that file descriptor: > {noformat} > // evmap.c, evmap_io_add() > if (ev->ev_events & EV_READ) { > if (++nread == 1) > res |= EV_READ; > } > if (ev->ev_events & EV_WRITE) { > if (++nwrite == 1) > res |= EV_WRITE; > } > [...] > if (res) { > [...] > if (evsel->add(base, ev->ev_fd, > old, (ev->ev_events & EV_ET) | res, extra) == -1) > return (-1); > [...] > } > {noformat} > So when the same file descriptor is attempted to be used again by libevent > for epoll() polling, the process will hang because reads or writes to that > file descriptor are never noticed. > This can be reproduced for example by running a test where the > `verify()`-callback fails on the server side twice in a row: (note, the > LIBPROCESS_IP below is set in order to induce a test failure, result may vary > on your local network and ssl configuration) > {noformat} > LIBPROCESS_IP=127.0.1.1 ./libprocess-tests > --gtest_filter="*VerifyCertificate*" --gtest_repeat=2 > {noformat} > There is a chance that the issue described here is the same as the ominous > "issues" described in MESOS-3008, -- This message was sent by Atlassian JIRA (v7.6.3#76005)