[ https://issues.apache.org/jira/browse/KUDU-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241736#comment-17241736 ]

Alexey Serbin commented on KUDU-3217:
-------------------------------------

A few more references on the fix for the related issue in SSSD:
* Bug report in Red Hat Bugzilla: 
https://bugzilla.redhat.com/show_bug.cgi?id=1461462
* Corresponding published errata: 
https://access.redhat.com/errata/RHEA-2018:0929
* SSSD 1.16 release notes (the fix is in 1.16 release): 
https://sssd.io/docs/users/relnotes/notes_1_16_0.html

> Limit the amount of time spent by SASL method run by WrapSaslCall
> -----------------------------------------------------------------
>
>                 Key: KUDU-3217
>                 URL: https://issues.apache.org/jira/browse/KUDU-3217
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master, security, tserver
>            Reporter: Alexey Serbin
>            Priority: Major
>
> There were reports of connection negotiation issues in a secure Kudu cluster: 
> Kudu clients (e.g., the {{kudu cluster ksck}} tool) would fail to establish a 
> connection to a tablet server or master.  The issue happened rather rarely, 
> but in a busy cluster it was a major nuisance because the target server 
> (master or tablet server) would not accept any new connections for a very 
> long time, and usually the only remedy was to restart the process.
> The stack traces collected from the diagnostic files pointed to a situation 
> where one negotiation thread acquired the mutex in {{WrapSaslCall()}} in read 
> mode and got stuck for a long time (at least a few minutes), while the 
> Kerberos creds renewal thread was waiting to acquire the same lock in write 
> mode.  Consequently, all other connection negotiation threads were blocked on 
> that mutex as well, since the mutex has {{RWMutex::Priority::PREFER_WRITING}} 
> priority.
> The stacks of the related threads looked like the following:
> {noformat}
>   tids=[380992]
>       0x7f060e1e45e0 <unknown>
>       0x7f060c4afa3d __GI___poll
>       0x7f0605813ffa <unknown>
>       0x7f06058147b5 <unknown>
>       0x7f0605814b07 sss_pac_make_request
>       0x7f0605813dab <unknown>
>       0x7f060df31deb <unknown>
>       0x7f060df53661 <unknown>
>       0x7f060df5407a krb5_rd_req_decoded
>       0x7f060d3d190c <unknown>
>       0x7f060d3d2dea <unknown>
>       0x7f060d3d2f49 <unknown>
>       0x7f060d3c0606 gss_accept_sec_context
>       0x7f060765c04c <unknown>
>       0x7f060d610b9b sasl_server_step
>       0x7f060d611109 sasl_server_start
>   tids=[170406,168067,167927,170163,168917,169251]
>       0x7f060e1e45e0 <unknown>
>       0x7f060e1e0092 __GI___pthread_rwlock_rdlock
>            0x1d09883 kudu::rpc::WrapSaslCall()
>            0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer()
>            0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl()
>            0x1d13096 kudu::rpc::ServerNegotiation::Negotiate()
>            0x1d2420c kudu::rpc::Negotiation::RunNegotiation()
>            0x1cfa905 kudu::internal::Invoker<>::Run()
>            0x1e9234f kudu::ThreadPool::DispatchThread()
>            0x1e88014 kudu::Thread::SuperviseThread()
>       0x7f060e1dce25 start_thread
>       0x7f060c4ba34d __clone
>   tids=[380520]
>       0x7f060e1e45e0 <unknown>
>       0x7f060e1e02ac __GI___pthread_rwlock_wrlock
>            0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
>            0x1e88014 kudu::Thread::SuperviseThread()
>       0x7f060e1dce25 start_thread
>       0x7f060c4ba34d __clone
> {noformat} 
> Thread {{380992}} is the thread that acquired the mutex in read mode and got 
> stuck in a SASL call (the latter went through SSSD).  Thread {{380520}} is 
> the Kerberos creds renewal thread, trying to lock the mutex in write mode.  
> The rest are connection negotiation threads trying to acquire the lock in 
> read mode.
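> To illustrate the blocking pattern, here is a minimal stand-alone sketch (not 
> Kudu code; it only assumes that {{RWMutex::Priority::PREFER_WRITING}} maps to 
> the writer-preferring pthread rwlock kind on Linux): one reader holds the 
> lock and blocks in a slow call, a writer queues behind it, and any new reader 
> then queues behind the waiting writer instead of sharing the lock with the 
> first reader.
> {noformat}
> // Linux-only sketch; build with: g++ -pthread rwlock_stall.cc
> #ifndef _GNU_SOURCE
> #define _GNU_SOURCE 1  // for the glibc-specific pthread_rwlockattr_setkind_np()
> #endif
> #include <pthread.h>
> #include <stdio.h>
> #include <unistd.h>
> 
> static pthread_rwlock_t kerberos_lock;
> 
> // Stands in for the negotiation thread stuck inside sss_pac_make_request()
> // while holding the lock in read mode (thread 380992 above).
> static void* stuck_reader(void*) {
>   pthread_rwlock_rdlock(&kerberos_lock);
>   puts("reader 1: got the read lock, now stuck in a slow SASL/SSSD call");
>   sleep(30);
>   pthread_rwlock_unlock(&kerberos_lock);
>   return nullptr;
> }
> 
> // Stands in for the Kerberos creds renewal thread (thread 380520 above).
> static void* waiting_writer(void*) {
>   sleep(1);
>   puts("writer: waiting for the write lock");
>   pthread_rwlock_wrlock(&kerberos_lock);
>   pthread_rwlock_unlock(&kerberos_lock);
>   return nullptr;
> }
> 
> // Stands in for any new connection negotiation thread.
> static void* new_reader(void*) {
>   sleep(2);
>   puts("reader 2: waiting for the read lock (queued behind the writer)");
>   pthread_rwlock_rdlock(&kerberos_lock);
>   puts("reader 2: finally acquired the read lock");
>   pthread_rwlock_unlock(&kerberos_lock);
>   return nullptr;
> }
> 
> int main() {
>   pthread_rwlockattr_t attr;
>   pthread_rwlockattr_init(&attr);
>   // Writer preference: new readers do not jump ahead of a waiting writer.
>   pthread_rwlockattr_setkind_np(
>       &attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
>   pthread_rwlock_init(&kerberos_lock, &attr);
> 
>   pthread_t t[3];
>   pthread_create(&t[0], nullptr, stuck_reader, nullptr);
>   pthread_create(&t[1], nullptr, waiting_writer, nullptr);
>   pthread_create(&t[2], nullptr, new_reader, nullptr);
>   for (int i = 0; i < 3; ++i) pthread_join(t[i], nullptr);
>   pthread_rwlock_destroy(&kerberos_lock);
>   return 0;
> }
> {noformat}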
> Further investigation revealed an [issue in 
> SSSD|https://github.com/SSSD/sssd/issues/4544], where the stack of the stuck 
> thread looks exactly the same as the stack of thread {{380992}} (except that 
> the latter didn't have debug symbols to show information on every function 
> in the stack):
> {noformat}
> #0  0x00007f29342dcdfd in poll () from /lib64/libc.so.6
> #1  0x00007f2901e722ba in sss_cli_make_request_nochecks () from 
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #2  0x00007f2901e72a75 in sss_cli_check_socket () from 
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #3  0x00007f2901e72e07 in sss_pac_make_request () from 
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #4  0x00007f2901e71feb in sssdpac_verify () from 
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #5  0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
> #6  0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
> #7  0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
> #8  0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
> #9  0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from 
> /lib64/libgssapi_krb5.so.2
> #10 0x00007f292d594359 in krb5_gss_accept_sec_context () from 
> /lib64/libgssapi_krb5.so.2
> #11 0x00007f292d5816d6 in gss_accept_sec_context () from 
> /lib64/libgssapi_krb5.so.2
> #12 0x00007f292d7c3edc in gssapi_server_mech_step () from 
> /usr/lib64/sasl2/libgssapiv2.so
> #13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
> #14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
> {noformat}
> With that, given that there might be many other bugs in that code path and 
> that a remote KDC might be very slow to respond if overloaded in a larger 
> cluster, it would be great to limit the amount of time spent by the SASL call 
> run by {{WrapSaslCall()}}, so that if the call exceeds the limit, the code 
> returns a {{Status::TimedOut()}} or {{Status::ServiceUnavailable()}} status.
> Also, it doesn't seem like a very good idea to hold a mutex while issuing a 
> SASL call, since the latter often turns out to be a call to a remote KDC.
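> As a rough illustration of such a limit (just a sketch, not the actual 
> {{WrapSaslCall()}} code; {{RunSaslCallWithDeadline()}} and the toy 
> {{Status}} struct below are made up for the example), the blocking SASL step 
> could be pushed to a separate thread and abandoned once a deadline passes, so 
> the negotiation thread gets a timed-out status back instead of hanging:
> {noformat}
> // Sketch only; build with: g++ -std=c++11 -pthread sasl_deadline.cc
> #include <chrono>
> #include <cstdio>
> #include <functional>
> #include <future>
> #include <memory>
> #include <string>
> #include <thread>
> 
> // Toy stand-in for kudu::Status, just for this example.
> struct Status {
>   std::string msg;
>   static Status OK() { return Status{"OK"}; }
>   static Status TimedOut(const std::string& m) {
>     return Status{"Timed out: " + m};
>   }
> };
> 
> // Run a potentially blocking SASL callback (e.g. one that ends up in
> // sasl_server_start()/GSSAPI/SSSD) on its own thread and give up after
> // 'timeout'.  The detached thread keeps running if it is stuck, so a real
> // fix would also have to account for that leaked thread and avoid holding
> // the Kerberos rwlock while the call is in flight.
> Status RunSaslCallWithDeadline(std::function<int()> sasl_call,
>                                std::chrono::milliseconds timeout) {
>   auto result = std::make_shared<std::promise<int>>();
>   std::future<int> fut = result->get_future();
>   std::thread([result, sasl_call]() {
>     result->set_value(sasl_call());
>   }).detach();
>   if (fut.wait_for(timeout) == std::future_status::timeout) {
>     return Status::TimedOut("SASL call exceeded the negotiation deadline");
>   }
>   int rc = fut.get();
>   return rc == 0 ? Status::OK() : Status{"SASL error " + std::to_string(rc)};
> }
> 
> int main() {
>   // Simulate a SASL step stuck in SSSD/KDC for a minute, capped at 3 seconds.
>   Status s = RunSaslCallWithDeadline(
>       [] { std::this_thread::sleep_for(std::chrono::seconds(60)); return 0; },
>       std::chrono::seconds(3));
>   std::printf("%s\n", s.msg.c_str());
>   return 0;
> }
> {noformat}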



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
