[ 
https://issues.apache.org/jira/browse/KUDU-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241736#comment-17241736
 ] 

Alexey Serbin edited comment on KUDU-3217 at 12/1/20, 11:17 PM:
----------------------------------------------------------------

A few more references on the issue in SSSD:
* GitHub issue report: https://github.com/SSSD/sssd/issues/4544
* SSSD 1.16 release notes (the fix is in 1.16 release): 
https://sssd.io/docs/users/relnotes/notes_1_16_0.html
* Bug report in RedHat bugzilla: 
https://bugzilla.redhat.com/show_bug.cgi?id=1461462
* Corresponding errata published by RedHat: 
https://access.redhat.com/errata/RHEA-2018:0929


was (Author: aserbin):
A few more references on the fix for the related issue in SSSD:
* Bug report in RedHat bugzilla: 
https://bugzilla.redhat.com/show_bug.cgi?id=1461462
* Corresponding published errata: 
https://access.redhat.com/errata/RHEA-2018:0929
* SSSD 1.16 release notes (the fix is in 1.16 release): 
https://sssd.io/docs/users/relnotes/notes_1_16_0.html

> Limit the amount of time spent by SASL method run by WrapSaslCall
> -----------------------------------------------------------------
>
>                 Key: KUDU-3217
>                 URL: https://issues.apache.org/jira/browse/KUDU-3217
>             Project: Kudu
>          Issue Type: Improvement
>          Components: master, security, tserver
>            Reporter: Alexey Serbin
>            Priority: Major
>
> There were reports on connection negotiation issues in a secure Kudu cluster: 
> Kudu clients (e.g., {{kudu cluster ksck}} tool) would fail to establish a 
> connection to a tablet server or master.  The issue happened rather rarely, 
> but in a busy cluster that was a major nuisance because the target server 
> would not accept any new connections for a very long time, and usually the 
> solution was to restart the server ({{kudu-tserver}} or {{kudu-master}} 
> correspondingly).
> The stack traces collected from the diagnostic files pointed to a situation 
> where one negotiation thread acquired the mutex in {{WrapSaslCall()}} in read 
> mode and stuck for a long time (several minutes), while the Kerberos creds 
> renewal thread was waiting on the lock to be acquired in write mode.  
> Consequently, all other connection negotiation threads were blocked on the 
> same mutex after that since the mutex is of 
> {{RWMutex::Priority::PREFER_WRITING}} priority.
> The stacks of the related threads looked like the following:
> {noformat}
>   tids=[380992]
>       0x7f060e1e45e0 <unknown>
>       0x7f060c4afa3d __GI___poll
>       0x7f0605813ffa <unknown>
>       0x7f06058147b5 <unknown>
>       0x7f0605814b07 sss_pac_make_request
>       0x7f0605813dab <unknown>
>       0x7f060df31deb <unknown>
>       0x7f060df53661 <unknown>
>       0x7f060df5407a krb5_rd_req_decoded
>       0x7f060d3d190c <unknown>
>       0x7f060d3d2dea <unknown>
>       0x7f060d3d2f49 <unknown>
>       0x7f060d3c0606 gss_accept_sec_context
>       0x7f060765c04c <unknown>
>       0x7f060d610b9b sasl_server_step
>       0x7f060d611109 sasl_server_start
>   tids=[170406,168067,167927,170163,168917,169251]
>       0x7f060e1e45e0 <unknown>
>       0x7f060e1e0092 __GI___pthread_rwlock_rdlock
>            0x1d09883 kudu::rpc::WrapSaslCall()
>            0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer()
>            0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl()
>            0x1d13096 kudu::rpc::ServerNegotiation::Negotiate()
>            0x1d2420c kudu::rpc::Negotiation::RunNegotiation()
>            0x1cfa905 kudu::internal::Invoker<>::Run()
>            0x1e9234f kudu::ThreadPool::DispatchThread()
>            0x1e88014 kudu::Thread::SuperviseThread()
>       0x7f060e1dce25 start_thread
>       0x7f060c4ba34d __clone
>   tids=[380520]
>       0x7f060e1e45e0 <unknown>
>       0x7f060e1e02ac __GI___pthread_rwlock_wrlock
>            0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
>            0x1e88014 kudu::Thread::SuperviseThread()
>       0x7f060e1dce25 start_thread
>       0x7f060c4ba34d __clone
> {noformat} 
> Thread {{380992}} is the thread that acquired the mutex as a reader and stuck 
> in a SASL call (the latter went through the SSSD PAC plugin).  Thread 
> {{380520}} is the Kerberos creds renewal thread, trying to acquire the mutex 
> as a writer.  The rest are connection negotiation threads trying to acquire 
> the lock as readers.
> Further investigation revealed [an issue in 
> SSSD|https://github.com/SSSD/sssd/issues/4544], where the stack of the stuck 
> thread looks exactly the same as the stack of {{380992}} (the latter didn't 
> have debug symbols to show information on every function in the stack):
> {noformat}
> #0  0x00007f29342dcdfd in poll () from /lib64/libc.so.6
> #1  0x00007f2901e722ba in sss_cli_make_request_nochecks () from 
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #2  0x00007f2901e72a75 in sss_cli_check_socket () from 
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #3  0x00007f2901e72e07 in sss_pac_make_request () from 
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #4  0x00007f2901e71feb in sssdpac_verify () from 
> /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
> #5  0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
> #6  0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
> #7  0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
> #8  0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
> #9  0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from 
> /lib64/libgssapi_krb5.so.2
> #10 0x00007f292d594359 in krb5_gss_accept_sec_context () from 
> /lib64/libgssapi_krb5.so.2
> #11 0x00007f292d5816d6 in gss_accept_sec_context () from 
> /lib64/libgssapi_krb5.so.2
> #12 0x00007f292d7c3edc in gssapi_server_mech_step () from 
> /usr/lib64/sasl2/libgssapiv2.so
> #13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
> #14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
> {noformat}
> Given that there might be many other bugs in that path and a KDC might be 
> slow to respond to a particular request, it would be great to limit the 
> amount of time spent by the SASL call run by {{WrapSaslCall()}}.  If it's 
> over the limit, the code would return {{Status::TimedOut()}} or 
> {{Status::ServiceUnavailable()}} status and the client side could handle the 
> response appropriately, but at least Kudu masters and tablet server would be 
> able to accept new connections and handle those new requests in a timely 
> manner.
> Also, it doesn't seem like a very good idea to acquire a lock and issue a 
> SASL call since the latter is often turns to be a remote call.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to