Alexey Serbin created KUDU-3217:
-----------------------------------

             Summary: Limit the amount of time spent by SASL method run by 
WrapSaslCall
                 Key: KUDU-3217
                 URL: https://issues.apache.org/jira/browse/KUDU-3217
             Project: Kudu
          Issue Type: Improvement
          Components: master, security, tserver
            Reporter: Alexey Serbin


There were reports on connection negotiation issues in a secure Kudu cluster: 
Kudu clients (e.g., {{kudu cluster ksck}} tool) would fail to establish a 
connection to a tablet server or master.  The issue happened rather rarely, but 
in a busy cluster that was a major nuisance because the target server (master 
or tablet server) would not accept any new connections for a very long time, 
and usually the solution was to restart the process.

The stack traces collected from the diagnostic files pointed to a situation 
where one negotiation thread acquired the mutex in {{WrapSaslCall()}} in read 
mode and stuck for a long time (at least a few minutes), while the Kerberos 
creds renewal thread was waiting on the lock to be acquired in write mode.  
Consequently, all other connection negotiation threads were blocked on the same 
mutex after that since the mutex is of {{RWMutex::Priority::PREFER_WRITING}} 
priority.

The stacks of the related threads looked like the following:
{noformat}
  tids=[380992]
      0x7f060e1e45e0 <unknown>
      0x7f060c4afa3d __GI___poll
      0x7f0605813ffa <unknown>
      0x7f06058147b5 <unknown>
      0x7f0605814b07 sss_pac_make_request
      0x7f0605813dab <unknown>
      0x7f060df31deb <unknown>
      0x7f060df53661 <unknown>
      0x7f060df5407a krb5_rd_req_decoded
      0x7f060d3d190c <unknown>
      0x7f060d3d2dea <unknown>
      0x7f060d3d2f49 <unknown>
      0x7f060d3c0606 gss_accept_sec_context
      0x7f060765c04c <unknown>
      0x7f060d610b9b sasl_server_step
      0x7f060d611109 sasl_server_start
  tids=[170406,168067,167927,170163,168917,169251]
      0x7f060e1e45e0 <unknown>
      0x7f060e1e0092 __GI___pthread_rwlock_rdlock
           0x1d09883 kudu::rpc::WrapSaslCall()
           0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer()
           0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl()
           0x1d13096 kudu::rpc::ServerNegotiation::Negotiate()
           0x1d2420c kudu::rpc::Negotiation::RunNegotiation()
           0x1cfa905 kudu::internal::Invoker<>::Run()
           0x1e9234f kudu::ThreadPool::DispatchThread()
           0x1e88014 kudu::Thread::SuperviseThread()
      0x7f060e1dce25 start_thread
      0x7f060c4ba34d __clone
  tids=[380520]
      0x7f060e1e45e0 <unknown>
      0x7f060e1e02ac __GI___pthread_rwlock_wrlock
           0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
           0x1e88014 kudu::Thread::SuperviseThread()
      0x7f060e1dce25 start_thread
      0x7f060c4ba34d __clone
{noformat} 

Thread {{80992}} is the thread that acquired the mutex in read mode and stuck 
in a SASL call (the latter went through SSSD).  Thread {{380520}} is the 
Kerberos creds renewal thread, trying to lock the mutex in write mode.  The 
rest are connection negotiation threads trying to acquire the lock in read mode.

Further investigation revealed an [issue in 
SSSD|https://github.com/SSSD/sssd/issues/4544], where the stack of the stuck 
thread looks exactly the same as the stack of {{80992}} (modulo the latter 
didn't have debug symbols to show information on every function in the stack):
{noformat}
#0  0x00007f29342dcdfd in poll () from /lib64/libc.so.6
#1  0x00007f2901e722ba in sss_cli_make_request_nochecks () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#2  0x00007f2901e72a75 in sss_cli_check_socket () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#3  0x00007f2901e72e07 in sss_pac_make_request () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#4  0x00007f2901e71feb in sssdpac_verify () from 
/usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#5  0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
#6  0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
#7  0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
#8  0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
#9  0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from 
/lib64/libgssapi_krb5.so.2
#10 0x00007f292d594359 in krb5_gss_accept_sec_context () from 
/lib64/libgssapi_krb5.so.2
#11 0x00007f292d5816d6 in gss_accept_sec_context () from 
/lib64/libgssapi_krb5.so.2
#12 0x00007f292d7c3edc in gssapi_server_mech_step () from 
/usr/lib64/sasl2/libgssapiv2.so
#13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
#14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
{noformat}

With that, given that there might be many other bugs in that path and a remote 
KDC might be very slow to respond if overloaded in a larger cluster, it would 
be great to limit the amount of time spent by the SASL call run by 
{{WrapSaslCall()}}, so if it's over the limit, the code would return 
{{Status::TimedOut()}} or {{Status::ServiceUnavailable()}} status.

Also, it doesn't seem like a very good idea to lock a mutex and issue a SASL 
call since the latter is often turns to be a call to remote KDC.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to