Csaba Ringhofer created THRIFT-5846:
---------------------------------------
Summary: Hang when closing TBufferedTransport + TSslTransport
Key: THRIFT-5846
URL: https://issues.apache.org/jira/browse/THRIFT-5846
Project: Thrift
Issue Type: Bug
Components: C++ - Library
Reporter: Csaba Ringhofer
The issue occurred in Impala tests when switching to OpenSsl 3.2 (IMPALA-13680).
While the root cause of the regression may be some change in OpenSsl, Thrift's
behavior seems also strange as it tries to do SSL handshake when closing the
transport:
{code}
apache::thrift::transport::TSSLSocket::waitForEvent(bool) [TSSLSocket.cpp :
881 + 0xa]
apache::thrift::transport::TSSLSocket::initializeHandshake() [TSSLSocket.cpp :
683 + 0x12]
apache::thrift::transport::TSSLSocket::flush() [TSSLSocket.cpp : 613 + 0x5]
apache::thrift::transport::TBufferedTransport::close() [TBufferTransports.cpp :
133 + 0x3]
apache::thrift::server::TAcceptQueueServer::Task::run() [TAcceptQueueServer.cpp
: 111 + 0x3]
{code}
Note that Impala uses Thrift 0.16.0 with a few patches so line numbers may not
match with the original Thrift source code, but the Impala changes do not seem
relevant in this case.
The timeline of the events is:
1. TSSLSocket wrapped in TBufferedTransport is created and opened
2. the first TSSLSocket::initializeHandshake() fails and leads to throwing an
TTransportException with "SSL_accept: wrong version number (SSL_error_code = 1)"
3. Impala tries to close the transport
4. as the transport is TBufferedTransport, it calls flush on close()
https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TBufferTransports.h#L236
5. TBufferedTransport::flush() calls the underlying transport's flush()
https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TBufferTransports.cpp#L134
6. TSSlSocket::flush() calls initializeHandshake()
https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TSSLSocket.cpp#L586
The hang only happens with the "wrong version number" error, other SSL_accept
errors were handled as before (e.g. "unsupported protocol").
My expectation is that if the transport's open() was successful then calling
close() should be safe, even if a later function throws an exception. In case
of TSSLSocket open() just opens the socket and the first read()/write() will do
the actual SSL handshake. If the handshake fails then the TSSLSocket enters a
strange state where isOpen() returns true but the ssl_ object is not usable.
I think that a possible solution is to shutdown+free ssl_ if the handshake
failed:
https://github.com/apache/thrift/blob/7734c393ed0f0632c658c05e33a4d6592cf2912c/lib/cpp/src/thrift/transport/TSSLSocket.cpp#L698
This means that isOpen() would return false after the error +
TSSLSocket::flush() would not try to reestablish connection. It would also make
sense to skip flushing in TBufferedTransport::close() if the underlying
transport is not open.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)