Alexey Serbin created KUDU-3217: ----------------------------------- Summary: Limit the amount of time spent by SASL method run by WrapSaslCall Key: KUDU-3217 URL: https://issues.apache.org/jira/browse/KUDU-3217 Project: Kudu Issue Type: Improvement Components: master, security, tserver Reporter: Alexey Serbin
There were reports on connection negotiation issues in a secure Kudu cluster: Kudu clients (e.g., {{kudu cluster ksck}} tool) would fail to establish a connection to a tablet server or master. The issue happened rather rarely, but in a busy cluster that was a major nuisance because the target server (master or tablet server) would not accept any new connections for a very long time, and usually the solution was to restart the process. The stack traces collected from the diagnostic files pointed to a situation where one negotiation thread acquired the mutex in {{WrapSaslCall()}} in read mode and stuck for a long time (at least a few minutes), while the Kerberos creds renewal thread was waiting on the lock to be acquired in write mode. Consequently, all other connection negotiation threads were blocked on the same mutex after that since the mutex is of {{RWMutex::Priority::PREFER_WRITING}} priority. The stacks of the related threads looked like the following: {noformat} tids=[380992] 0x7f060e1e45e0 <unknown> 0x7f060c4afa3d __GI___poll 0x7f0605813ffa <unknown> 0x7f06058147b5 <unknown> 0x7f0605814b07 sss_pac_make_request 0x7f0605813dab <unknown> 0x7f060df31deb <unknown> 0x7f060df53661 <unknown> 0x7f060df5407a krb5_rd_req_decoded 0x7f060d3d190c <unknown> 0x7f060d3d2dea <unknown> 0x7f060d3d2f49 <unknown> 0x7f060d3c0606 gss_accept_sec_context 0x7f060765c04c <unknown> 0x7f060d610b9b sasl_server_step 0x7f060d611109 sasl_server_start tids=[170406,168067,167927,170163,168917,169251] 0x7f060e1e45e0 <unknown> 0x7f060e1e0092 __GI___pthread_rwlock_rdlock 0x1d09883 kudu::rpc::WrapSaslCall() 0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer() 0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl() 0x1d13096 kudu::rpc::ServerNegotiation::Negotiate() 0x1d2420c kudu::rpc::Negotiation::RunNegotiation() 0x1cfa905 kudu::internal::Invoker<>::Run() 0x1e9234f kudu::ThreadPool::DispatchThread() 0x1e88014 kudu::Thread::SuperviseThread() 0x7f060e1dce25 start_thread 0x7f060c4ba34d __clone tids=[380520] 0x7f060e1e45e0 <unknown> 0x7f060e1e02ac __GI___pthread_rwlock_wrlock 0x1d3b737 kudu::security::(anonymous namespace)::RenewThread() 0x1e88014 kudu::Thread::SuperviseThread() 0x7f060e1dce25 start_thread 0x7f060c4ba34d __clone {noformat} Thread {{80992}} is the thread that acquired the mutex in read mode and stuck in a SASL call (the latter went through SSSD). Thread {{380520}} is the Kerberos creds renewal thread, trying to lock the mutex in write mode. The rest are connection negotiation threads trying to acquire the lock in read mode. Further investigation revealed an [issue in SSSD|https://github.com/SSSD/sssd/issues/4544], where the stack of the stuck thread looks exactly the same as the stack of {{80992}} (modulo the latter didn't have debug symbols to show information on every function in the stack): {noformat} #0 0x00007f29342dcdfd in poll () from /lib64/libc.so.6 #1 0x00007f2901e722ba in sss_cli_make_request_nochecks () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so #2 0x00007f2901e72a75 in sss_cli_check_socket () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so #3 0x00007f2901e72e07 in sss_pac_make_request () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so #4 0x00007f2901e71feb in sssdpac_verify () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so #5 0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3 #6 0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3 #7 0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3 #8 0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2 #9 0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from /lib64/libgssapi_krb5.so.2 #10 0x00007f292d594359 in krb5_gss_accept_sec_context () from /lib64/libgssapi_krb5.so.2 #11 0x00007f292d5816d6 in gss_accept_sec_context () from /lib64/libgssapi_krb5.so.2 #12 0x00007f292d7c3edc in gssapi_server_mech_step () from /usr/lib64/sasl2/libgssapiv2.so #13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3 #14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3 {noformat} With that, given that there might be many other bugs in that path and a remote KDC might be very slow to respond if overloaded in a larger cluster, it would be great to limit the amount of time spent by the SASL call run by {{WrapSaslCall()}}, so if it's over the limit, the code would return {{Status::TimedOut()}} or {{Status::ServiceUnavailable()}} status. Also, it doesn't seem like a very good idea to lock a mutex and issue a SASL call since the latter is often turns to be a call to remote KDC. -- This message was sent by Atlassian Jira (v8.3.4#803005)