On Fri, Aug 12, 2016 at 03:41:26PM +0200, Petr Cech wrote:
> On 08/12/2016 03:07 PM, Jakub Hrozek wrote:
> > > Logs now look like:
> > > >
> > > > [root@albireo sssd]# grep 'child' sssd_ipa.cygnus.dev.log
> > > > [child_handler_setup] (0x2000): Setting up signal handler up for pid 
> > > > [18835]
> > > > [child_handler_setup] (0x2000): Signal handler set up for pid [18835]
> > > > [set_tgt_child_timeout] (0x0400): Setting 6 seconds timeout for tgt 
> > > > child
> > > >
> > > > (6 seconds later)
> > > >
> > > > [get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM to tgt 
> > > > child
> > > > [18835] reached.
> > > > [get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout for 
> > > > sending
> > > > SIGKILL to tgt child
> > > > [sdap_get_tgt_recv] (0x0020): Cannot parse child response: [22][Invalid
> > > > argument]
> > > > [sdap_kinit_done] (0x0020): child failed (22 [Invalid argument])
> > > > [child_sig_handler] (0x1000): Waiting for child [18835].
> > > > [child_sig_handler] (0x0020): child [18835] failed with status [7].
> > > > [child_callback] (0x0020): LDAP child was terminated due to timeout
> > I'm sorry, but these patches still don't fix the issue I was seeing.
> > Before the patches, when I timed out the child process, I saw:
> > (Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]] 
> > [ldap_child_get_tgt_sync] (0x0100): Principal name is: 
> > [host/client.ipa.t...@ipa.test]
> > (Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]] 
> > [ldap_child_get_tgt_sync] (0x0100): Using keytab [MEMORY:/etc/krb5.keytab]
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_tgt_timeout_handler] 
> > (0x4000): timeout for tgt child [31721] reached.
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_done] (0x0080): 
> > Communication with KDC timed out, trying the next one
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] 
> > (0x8000): Setting status: PORT_NOT_WORKING. Called from: 
> > /sssd/src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1207
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_set_port_status] 
> > (0x0100): Marking port 0 of server 'unidirect.ipa.test' as 'not working'
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_set_port_status] 
> > (0x0400): Marking port 0 of duplicate server 'unidirect.ipa.test' as 'not 
> > working'
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_next_kdc] 
> > (0x1000): Resolving next KDC for service KERBEROS
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_resolve_service_send] 
> > (0x0100): Trying to resolve service 'KERBEROS'
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_server_status] 
> > (0x1000): Status of server 'unidirect.ipa.test' is 'name resolved'
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_port_status] (0x1000): 
> > Port status of port 0 for server 'unidirect.ipa.test' is 'not working'
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_resolve_service_send] 
> > (0x0020): No available servers for service 'KERBEROS'
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [be_resolve_server_done] 
> > (0x1000): Server resolution failed: [5]: Input/output error
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_cli_kinit_done] 
> > (0x0400): Cannot get a TGT: ret [1432158228](Network I/O Error)
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_cli_connect_recv] 
> > (0x0040): Unable to establish connection [13]: Permission denied
> > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] 
> > (0x8000): Setting status: PORT_NOT_WORKING. Called from: 
> > /sssd/src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_recv: 
> > 2048
> > 
> > After the patch, I see:
> > (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [get_tgt_timeout_handler] 
> > (0x4000): timeout for sending SIGTERM to tgt child [17291] reached.
> > (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [get_tgt_timeout_handler] 
> > (0x0400): Setting 2 seconds timeout for sending SIGKILL to tgt child
> > (Fri Aug 12 15:01:05 2016) [[sssd[ldap_child[17291]]]] [sig_term_handler] 
> > (0x0010): Received signal [Terminated] [15], shutting down
> > (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [read_pipe_handler] 
> > (0x0400): EOF received, client finished
> > (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_get_tgt_recv] 
> > (0x0020): Cannot parse child response: [22][Invalid argument]
> > here ---^
> > 
> > This is the part I don't like, we try to read the response from the child's
> > pipe and fail with a bad error message. I thought this was because with
> > the preivous patch, we exit the child with zero, but I guess this was not
> > the case.. Anyway, we still should fix this, the message would be really
> > confusing to admins.
> 
> Hi Jakub,
> 
> I know I had the same
> 'Cannot parse child response: [22][Invalid argument]'
> in my logs too, see my last mail above.
> 
> Some lines after it I have new message
> '[child_callback] (0x0020): LDAP child was terminated due to timeout'
> 
> I understand that it is not satisfying. The main purpose of this patch was
> deleting temporary files if timeout reached. And this is IMO solved I think.
> It is true that we have big effort to have understandable logs.
> 
> Yesterday I looked deeper at mechanism how signals, child handler and all
> such stuff works in ldap_child case. I wasn't able to tell to
> sdap_get_tgt_recv() that the state is changed and we shouldn't parse the
> answer.
> 
> If I understand handler stuff right I need introduce in ldap_child handler
> whole mechanism like in src/providers/be_dyndns.c
> 
> The mechanism is calling the child_handler_setup() with all parameters which
> means rewrite sdap_child_helpers in appropriate manner.

I think it would be enough to pass the tevent_req to the
child_callback() and cancel it with tevent_req_error(req, ETIMEDOUT)

The problem is that with the current patches, the whole request is
marked as 'succeeded', that's why the _recv() function tries to parse
the response. If you look at sdap_get_tgt_recv(), if the request is
marked as failure, the processing aborts in TEVENT_REQ_RETURN_ON_ERROR..

> 
> Maybe there is simpler way please tell me if. But if this ticket is urgent
> (removing of temporary files) we could push it and solve logs (I mean
> rewrite handler mechanism) in next ticket.
> 
> Regards
> 
> -- 
> Petr^4 Čech
> _______________________________________________
> sssd-devel mailing list
> sssd-devel@lists.fedorahosted.org
> https://lists.fedorahosted.org/admin/lists/sssd-devel@lists.fedorahosted.org
_______________________________________________
sssd-devel mailing list
sssd-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/admin/lists/sssd-devel@lists.fedorahosted.org

Reply via email to