Impala Public Jenkins has submitted this change and it was merged. Change subject: IMPALA-5221: Avoid re-use of stale SASL contexts. ......................................................................
IMPALA-5221: Avoid re-use of stale SASL contexts. The TSaslTransport is written as a thrift extension that is a wrapper around the Cyrus-SASL APIs. This transport is then used by Impala's RPC layer. On RHEL7 systems that use newer versions of the Cyrus-SASL library, we noticed that we sometimes crash inside the Cyrus-SASL thirdparty while trying to lock an internal mutex. During my investigation, I found that we needed to fix the order of negotiation that happens in an edge case. The steps to use the Cyrus-SASL APIs for SASL negitiation are the following (Replace '_client_' with '_server_' for server calls): sasl_client_new() sasl_client_start() sasl_client_step() sasl_dispose() < --- When we're done with the connection. sasl_client_new() was being called in the constructor TSaslClient() which is invoked from SaslAuthProvider::WrapClientTransport(). sasl_client_start() and sasl_client_step() were being called under TSaslTransport::open(). If for some reason we hit an error during SASL negotiation, the TSaslTransport::open() call would fail. When we fail, the ThriftClientImpl::OpenWithRetry() does a retry, which directly retries the negotiation from sasl_client_start(). This caused the use of already freed resources from the first negotiation failure, hence causing the crash. To fix this, we make sure that on a negotiation failure, we dispose of all the resources properly by calling sasl_dispose() and retry the negotiation from the start by calling sasl_client_new() first, and then the remaining steps. This is done by moving the sasl_client_new() and sasl_server_new() calls out of the TSaslClient/TSaslServer constructors and into a new call called TSasl*::setupSaslContext(), which is called under TSaslTransport::open(). The patch is fairly large for the above mentioned change, however, most of it is just plumbing. Testing: Tested on systems with older SASL versions to make sure we don't regress. Also tested on systems with newer SASL versions where we previously saw the crash and verified that we don't see them anymore. Also, tested with GSSAPI and LDAP mechanisms. Change-Id: I7d509e14dfe46fa28d0626cf84daf6de82955d80 Reviewed-on: http://gerrit.cloudera.org:8080/7116 Reviewed-by: Sailesh Mukil <sail...@cloudera.com> Tested-by: Impala Public Jenkins --- M be/src/transport/TSasl.cpp M be/src/transport/TSasl.h M be/src/transport/TSaslClientTransport.cpp M be/src/transport/TSaslClientTransport.h M be/src/transport/TSaslServerTransport.cpp M be/src/transport/TSaslServerTransport.h M be/src/transport/TSaslTransport.cpp M be/src/transport/TSaslTransport.h 8 files changed, 234 insertions(+), 82 deletions(-) Approvals: Impala Public Jenkins: Verified Sailesh Mukil: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/7116 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I7d509e14dfe46fa28d0626cf84daf6de82955d80 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Sailesh Mukil <sail...@cloudera.com> Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com> Gerrit-Reviewer: Henry Robinson <he...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Sailesh Mukil <sail...@cloudera.com>