Re: [Freeipa-users] performance scaling of sssd / freeipa

Sullivan, Daniel [CRI] Fri, 20 Jan 2017 09:21:38 -0800

Sorry I didn’t realize you might want all sssd logs… Working on it.

Dan


> On Jan 20, 2017, at 10:27 AM, Sumit Bose <sb...@redhat.com> wrote:
> 
> On Fri, Jan 20, 2017 at 03:41:46PM +0000, Sullivan, Daniel [CRI] wrote:
>> Hi,
>> 
>> I have some more information on this issue.  I’m tracing it down through the 
>> slapd logs and  I am continuing to struggle; I was hoping that somebody 
>> could possibly help me provided this additional information.
>> 
>> On the domain logs (on the DC), I see an ldap_search_ext operation 155 with 
>> a timeout of 60:
>> 
>> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with 
>> [(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))][cn=Default
>>  Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu].
>> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_get_generic_ext_step] (0x2000): ldap_search_ext called, msgid = 155
>> (Fri Jan 20 08:42:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [sdap_op_add] 
>> (0x2000): New operation 155 timeout 60
>> 
>> Then, on the slapd log for the server, I see the incoming request.  Based on 
>> the result, it looks like the request completes successfully within 1 second:
>> 
>> [20/Jan/2017:08:42:35.179890746 -0600] conn=1518 op=31 SRCH base="cn=Default 
>> Trust View,cn=views,cn=accounts,dc=xxx,dc=xxx,dc=uchicago,dc=edu" scope=2 
>> filter="(&(objectClass=ipaOverrideAnchor)(ipaAnchorUUID=:SID:S-1-5-21-xxx))" 
>> attrs=ALL
>> [20/Jan/2017:08:42:35.683485674 -0600] conn=1518 op=31 RESULT err=0 tag=101 
>> nentries=0 etime=1 notes=U
>> 
>> Then, subsequently, the domain log (on the DC), I see the same operation 155 
>> timeout (expectedly almost a minute later).
> 
> So it looks like ns-ldap send the response but it never reached SSSD in
> time. Can you send what it happening in the SSSD logs between 08:42:34
> and 08:43:34 (if you prefer you can send it to me directly). Maybe there
> is an operation which blocks SSSD for too long so that it returns too
> late to the main loop to process the response?
> 
> bye,
> Sumit
> 
>> 
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_op_timeout] (0x1000): Issuing timeout for 155
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_op_destructor] (0x1000): Abandoning operation 155
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [generic_ext_search_handler] (0x0040): sdap_get_generic_ext_recv failed 
>> [110]: Connection timed out
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_get_ad_override_done] (0x0040): ipa_get_ad_override request failed.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_id_op_destroy] (0x4000): releasing operation connection
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_initgr_get_overrides_override_done] (0x0040): IPA override lookup 
>> failed: 110
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_id_get_groups_overrides_done] (0x0040): IPA resolve user groups 
>> overrides failed [110].
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [be_mark_dom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu offline
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [be_mark_subdom_offline] (0x1000): Marking subdomain bsdad.uchicago.edu as 
>> inactive
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_srv_ad_acct_lookup_done] (0x0040): ipa_get_*_acct request failed: 
>> [110]: Connection timed out.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [ipa_subdomain_account_done] (0x0040): ipa_get_*_acct request failed: [110]: 
>> Connection timed out.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [sdap_id_op_destroy] (0x4000): releasing operation connection
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_reply_std_set] (0x0080): DP Error is OK on failed request?
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [dp_req_done] 
>> (0x0400): DP Request [Account #4292]: Request handler finished [0]: Success
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] [_dp_req_recv] 
>> (0x0400): DP Request [Account #4292]: Receiving request data.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_req_reply_list_success] (0x0400): DP Request [Account #4292]: Finished. 
>> Success.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_req_reply_std] (0x1000): DP Request [Account #4292]: Returning 
>> [Success]: 0,110,Success
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_table_value_destructor] (0x0400): Removing 
>> [0:1:0x0001:1:1::bsdad.uchicago.edu:name=user...@domain.uchicago.edu] from 
>> reply table
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_req_destructor] (0x0400): DP Request [Account #4292]: Request removed.
>> (Fri Jan 20 08:43:34 2017) [sssd[be[xxx.xxx.uchicago.edu]]] 
>> [dp_req_destructor] (0x0400): Number of active DP request: 1
>> 
>> It seems like whatever problem I have is a communication issue between sssd 
>> on the domain controller, and ns-slapd.  I’m starting to venture down the 
>> road of doing trace/packet level debug logging on the LDAP server; this 
>> could be very time consuming and frustrating (I already have it enabled, I 
>> am starting to manipulate the log settings to yield potentially valuable 
>> data), I’m hoping that somebody might be able to provide some insight based 
>> on the information above; I can definitely lookup the user on both domain 
>> controllers & both IPA servers only use themselves for IPA servers.  Thank 
>> you so much for reading and for your help.
>> 
>> Dan
>> 
>> 
>>> On Jan 19, 2017, at 4:15 PM, Sullivan, Daniel [CRI] 
>>> <dsulliv...@bsd.uchicago.edu> wrote:
>>> 
>>> Hi,
>>> 
>>> I’ve received incredibly good support from this mailing list previously; I 
>>> am hoping that somebody can help me succeed in my ongoing efforts.  I have 
>>> spent a few days on this at this point and I can’t seem to figure it out 
>>> how to address this issue.  On my DCs I am seeing excessive ldap_search_ext 
>>> and sdap_get_generic_ext_recv timeouts created solely by the invocation of 
>>> the ‘id’ command on sssd clients.  This problem seems to present itself 
>>> only when I parallelize lookups for an ‘uncached’ user (i.e. I have never 
>>> performed an initial lookup).  Individual arbitrary one-off lookups for a 
>>> single uncached user on a single system almost always work fine.  This 
>>> leads me to believe this is a performance tuning issue.
>>> 
>>> We operate in an academic research computing unit (i.e. we have an HPC 
>>> cluster), and I need the ability to lookup the same user in parallel (using 
>>> the id command) across a relatively large number of systems, for example to 
>>> spawn jobs that require large amounts of CPU cores and/or memory.  Right 
>>> now I am doing about 50 parallel lookups for the same user to induce this 
>>> problem.  
>>> 
>>> Here is some background information:
>>> 
>>> 1) I have read Jakub's “Anatomy of an SSSD Lookup” as well as “Performance 
>>> Tuning of SSSD for large IPA-AD deployments”, as well as implemented 
>>> recommendations from the performance tuning doc, including moving the sssd 
>>> cache to tmpfs.
>>> 2) We are on ipa-server 4.4.0-14.el7_3.4 using a trusted AD domain; all of 
>>> our consumed users and groups are in the AD trusted domain.  We have two 
>>> domain controllers; each is a RHEL 7.3 VM with 6 GB of memory.  Almost all 
>>> (if not all) of our clients are running at least sssd 1.14, and are all 
>>> RHEL 6/7.  Neither DC is swapping, and both have 2 CPUs.
>>> 3) I have tuned SSSD clients on the DCs and all clients to include these 
>>> options (the problem persists):
>>> a) ldap_opt_timeout = 60
>>> b) ldap_search_timeout = 60
>>> 4) On both DCs, I can clear the SSSD cache, and lookup all 2000 or so users 
>>> in my environment with 40 concurrent lookups occurring locally on each DC 
>>> (using UNIX job control).  I can process all 2000 lookups in this manner 
>>> without any failures (on either DC), and have ‘pre-populated’ the SSSD 
>>> cache on both DC’s by doing this.
>>> 6) I have made no additional performance tuning changes other than what has 
>>> been described.
>>> 
>>> Would anybody be able to advise on any potential tuning that would be 
>>> required (presumably on the DCs), to facilitate 50 parallel lookups without 
>>> experiencing sdap_get_generic_ext_recv or  ldap_search_ext  timeouts?  
>>> Should I be able to do this sort of thing with relative ease?  I was hoping 
>>> this would be the sort of thing that would just work, but based on my 
>>> relatively extensive testing it doesn’t.  Any advice anybody could provide 
>>> would be greatly appreciated.
>>> 
>>> Thank you,
>>> 
>>> Dan Sullivan
>>> 
>>> 
>> 
>> 
>> -- 
>> Manage your subscription for the Freeipa-users mailing list:
>> https://www.redhat.com/mailman/listinfo/freeipa-users
>> Go to http://freeipa.org for more info on the project
> 
> -- 
> Manage your subscription for the Freeipa-users mailing list:
> https://www.redhat.com/mailman/listinfo/freeipa-users
> Go to http://freeipa.org for more info on the project


-- 
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go to http://freeipa.org for more info on the project

Re: [Freeipa-users] performance scaling of sssd / freeipa

Reply via email to