Benjamin Mahler created MESOS-5988: -------------------------------------- Summary: PollSocketImpl can write to a stale fd. Key: MESOS-5988 URL: https://issues.apache.org/jira/browse/MESOS-5988 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler Priority: Blocker Fix For: 1.0.1
When tracking down MESOS-5986 with [~greggomann] and [~anandmazumdar]. We were curious why PollSocketImpl avoids the same issue. It seems that PollSocketImpl has a similar race, however in the case of PollSocketImpl we will simply write to a stale file descriptor. One example is {{PollSocketImpl::send(const char*, size_t)}}: https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/poll_socket.cpp#L241-L245 {code} Future<size_t> PollSocketImpl::send(const char* data, size_t size) { return io::poll(get(), io::WRITE) .then(lambda::bind(&internal::socket_send_data, get(), data, size)); } Future<size_t> socket_send_data(int s, const char* data, size_t size) { CHECK(size > 0); while (true) { ssize_t length = send(s, data, size, MSG_NOSIGNAL); #ifdef __WINDOWS__ int error = WSAGetLastError(); #else int error = errno; #endif // __WINDOWS__ if (length < 0 && net::is_restartable_error(error)) { // Interrupted, try again now. continue; } else if (length < 0 && net::is_retryable_error(error)) { // Might block, try again later. return io::poll(s, io::WRITE) .then(lambda::bind(&internal::socket_send_data, s, data, size)); } else if (length <= 0) { // Socket error or closed. if (length < 0) { const string error = os::strerror(errno); VLOG(1) << "Socket error while sending: " << error; } else { VLOG(1) << "Socket closed while sending"; } if (length == 0) { return length; } else { return Failure(ErrnoError("Socket send failed")); } } else { CHECK(length > 0); return length; } } } {code} If the last reference to the {{Socket}} goes away before the {{socket_send_data}} loop completes, then we will write to a stale fd! It turns out that we have avoided this issue because in libprocess we happen to keep a reference to the {{Socket}} around when sending: https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/process.cpp#L1678-L1707 {code} void send(Encoder* encoder, Socket socket) { switch (encoder->kind()) { case Encoder::DATA: { size_t size; const char* data = static_cast<DataEncoder*>(encoder)->next(&size); socket.send(data, size) .onAny(lambda::bind( &internal::_send, lambda::_1, socket, encoder, size)); break; } case Encoder::FILE: { off_t offset; size_t size; int fd = static_cast<FileEncoder*>(encoder)->next(&offset, &size); socket.sendfile(fd, offset, size) .onAny(lambda::bind( &internal::_send, lambda::_1, socket, encoder, size)); break; } } } {code} However, this may not be true in all call-sites going forward. Currently, it appears that http::Connection can trigger this bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)