[
https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873908#comment-16873908
]
Benno Evers edited comment on MESOS-9867 at 6/27/19 11:25 AM:
--------------------------------------------------------------
This was fixed upstream with libevent 2.1.8:
https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a
was (Author: bennoe):
This was fixed upstream with libevent-2.1.10:
https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a
> Libevent fd cleanup failure may cause hangs in subsequent tests
> ---------------------------------------------------------------
>
> Key: MESOS-9867
> URL: https://issues.apache.org/jira/browse/MESOS-9867
> Project: Mesos
> Issue Type: Bug
> Reporter: Benno Evers
> Priority: Major
> Labels: libevent, libprocess, ssl, tls
>
> A listening LibeventSSLSocket will check cryptographic certificate validity
> during the OpenSSL handshake and afterwards call the `openssl::verify()`
> function to perform hostname validation and other checks on the client
> certificate. If these checks fail, the bufferevent is deleted and the
> connection closed:
> {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
> if (verify.isError()) {
> VLOG(1) << "Failed accept, verification error: " <<
> verify.error();
> request->promise.fail(verify.error());
> SSL_free(ssl);
> bufferevent_free(bev);
> // TODO(jmlvanre): Clean up for readability. Consider RAII
> // or constructing the impl earlier.
> CHECK(request->socket >= 0);
> Try<Nothing> close = os::close(request->socket);
> if (close.isError()) {
> LOG(FATAL)
> << "Failed to close socket " << stringify(request->socket)
> << ": " << close.error();
> }
> delete request;
> return;
> }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already
> registered that file descriptor with epoll() to watch for read and write
> events on that socket. Since the socket is closed, attempts to remove the
> corresponding fd from the epoll() structs will fail: (See also:
> https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)
> {noformat}
> [warn] Epoll MOD(4) on fd 9 failed. Old events were 6; read change was 2
> (del); write change was 0 (none): Bad file descriptor
> [warn] Epoll MOD(1) on fd 9 failed. Old events were 6; read change was 0
> (none); write change was 2 (del): Bad file descriptor
> {noformat}
> However, that in itself is harmless since the kernel will remove the kernel
> object that was associated with fd 9 from the data structure associated with
> that epoll instance in the kernel. So while we get an error attempting to
> remove fd 9, there is actually nothing left to remove. However, in a case of
> epoll failure, libprocess does not adjust the number of readers and writers
> on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_del()
> [...]
> if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
> return (-1);
> [...]
> ctx->nread = nread;
> ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file
> descriptor. That still wouldn't be so bad, however libevent also only adds
> file descriptors to `epoll()` struct the *first* time we attempt to create a
> read or write event on that file descriptor:
> {noformat}
> // evmap.c, evmap_io_add()
> if (ev->ev_events & EV_READ) {
> if (++nread == 1)
> res |= EV_READ;
> }
> if (ev->ev_events & EV_WRITE) {
> if (++nwrite == 1)
> res |= EV_WRITE;
> }
> [...]
> if (res) {
> [...]
> if (evsel->add(base, ev->ev_fd,
> old, (ev->ev_events & EV_ET) | res, extra) == -1)
> return (-1);
> [...]
> }
> {noformat}
> So when the same file descriptor is attempted to be used again by libevent
> for epoll() polling, the process will hang because reads or writes to that
> file descriptor are never noticed.
> This can be reproduced for example by running a test where the
> `verify()`-callback fails on the server side twice in a row: (note, the
> LIBPROCESS_IP below is set in order to induce a test failure, result may vary
> on your local network and ssl configuration)
> {noformat}
> LIBPROCESS_IP=127.0.1.1 ./libprocess-tests
> --gtest_filter="*VerifyCertificate*" --gtest_repeat=2
> {noformat}
> There is a chance that the issue described here is the same as the ominous
> "issues" described in MESOS-3008,
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)