On Mon, Aug 15, 2016 at 09:47:27AM +0200, Petr Cech wrote:
> On 08/12/2016 04:13 PM, Jakub Hrozek wrote:
> > On Fri, Aug 12, 2016 at 03:41:26PM +0200, Petr Cech wrote:
> > > On 08/12/2016 03:07 PM, Jakub Hrozek wrote:
> > > > > Logs now look like:
> > > > > > 
> > > > > > [root@albireo sssd]# grep 'child' sssd_ipa.cygnus.dev.log
> > > > > > [child_handler_setup] (0x2000): Setting up signal handler up for 
> > > > > > pid [18835]
> > > > > > [child_handler_setup] (0x2000): Signal handler set up for pid 
> > > > > > [18835]
> > > > > > [set_tgt_child_timeout] (0x0400): Setting 6 seconds timeout for tgt 
> > > > > > child
> > > > > > 
> > > > > > (6 seconds later)
> > > > > > 
> > > > > > [get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM to 
> > > > > > tgt child
> > > > > > [18835] reached.
> > > > > > [get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout for 
> > > > > > sending
> > > > > > SIGKILL to tgt child
> > > > > > [sdap_get_tgt_recv] (0x0020): Cannot parse child response: 
> > > > > > [22][Invalid
> > > > > > argument]
> > > > > > [sdap_kinit_done] (0x0020): child failed (22 [Invalid argument])
> > > > > > [child_sig_handler] (0x1000): Waiting for child [18835].
> > > > > > [child_sig_handler] (0x0020): child [18835] failed with status [7].
> > > > > > [child_callback] (0x0020): LDAP child was terminated due to timeout
> > > > I'm sorry, but these patches still don't fix the issue I was seeing.
> > > > Before the patches, when I timed out the child process, I saw:
> > > > (Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]] 
> > > > [ldap_child_get_tgt_sync] (0x0100): Principal name is: 
> > > > [host/client.ipa.t...@ipa.test]
> > > > (Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]] 
> > > > [ldap_child_get_tgt_sync] (0x0100): Using keytab 
> > > > [MEMORY:/etc/krb5.keytab]
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] 
> > > > [get_tgt_timeout_handler] (0x4000): timeout for tgt child [31721] 
> > > > reached.
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_done] 
> > > > (0x0080): Communication with KDC timed out, trying the next one
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] 
> > > > (0x8000): Setting status: PORT_NOT_WORKING. Called from: 
> > > > /sssd/src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1207
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_set_port_status] 
> > > > (0x0100): Marking port 0 of server 'unidirect.ipa.test' as 'not working'
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_set_port_status] 
> > > > (0x0400): Marking port 0 of duplicate server 'unidirect.ipa.test' as 
> > > > 'not working'
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_next_kdc] 
> > > > (0x1000): Resolving next KDC for service KERBEROS
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] 
> > > > [fo_resolve_service_send] (0x0100): Trying to resolve service 'KERBEROS'
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_server_status] 
> > > > (0x1000): Status of server 'unidirect.ipa.test' is 'name resolved'
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_port_status] 
> > > > (0x1000): Port status of port 0 for server 'unidirect.ipa.test' is 'not 
> > > > working'
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] 
> > > > [fo_resolve_service_send] (0x0020): No available servers for service 
> > > > 'KERBEROS'
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [be_resolve_server_done] 
> > > > (0x1000): Server resolution failed: [5]: Input/output error
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_cli_kinit_done] 
> > > > (0x0400): Cannot get a TGT: ret [1432158228](Network I/O Error)
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_cli_connect_recv] 
> > > > (0x0040): Unable to establish connection [13]: Permission denied
> > > > (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] 
> > > > (0x8000): Setting status: PORT_NOT_WORKING. Called from: 
> > > > /sssd/src/providers/ldap/sdap_async_connection.c: 
> > > > sdap_cli_connect_recv: 2048
> > > > 
> > > > After the patch, I see:
> > > > (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] 
> > > > [get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM to tgt 
> > > > child [17291] reached.
> > > > (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] 
> > > > [get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout for 
> > > > sending SIGKILL to tgt child
> > > > (Fri Aug 12 15:01:05 2016) [[sssd[ldap_child[17291]]]] 
> > > > [sig_term_handler] (0x0010): Received signal [Terminated] [15], 
> > > > shutting down
> > > > (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [read_pipe_handler] 
> > > > (0x0400): EOF received, client finished
> > > > (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_get_tgt_recv] 
> > > > (0x0020): Cannot parse child response: [22][Invalid argument]
> > > > here ---^
> > > > 
> > > > This is the part I don't like, we try to read the response from the 
> > > > child's
> > > > pipe and fail with a bad error message. I thought this was because with
> > > > the preivous patch, we exit the child with zero, but I guess this was 
> > > > not
> > > > the case.. Anyway, we still should fix this, the message would be really
> > > > confusing to admins.
> > > 
> > > Hi Jakub,
> > > 
> > > I know I had the same
> > > 'Cannot parse child response: [22][Invalid argument]'
> > > in my logs too, see my last mail above.
> > > 
> > > Some lines after it I have new message
> > > '[child_callback] (0x0020): LDAP child was terminated due to timeout'
> > > 
> > > I understand that it is not satisfying. The main purpose of this patch was
> > > deleting temporary files if timeout reached. And this is IMO solved I 
> > > think.
> > > It is true that we have big effort to have understandable logs.
> > > 
> > > Yesterday I looked deeper at mechanism how signals, child handler and all
> > > such stuff works in ldap_child case. I wasn't able to tell to
> > > sdap_get_tgt_recv() that the state is changed and we shouldn't parse the
> > > answer.
> > > 
> > > If I understand handler stuff right I need introduce in ldap_child handler
> > > whole mechanism like in src/providers/be_dyndns.c
> > > 
> > > The mechanism is calling the child_handler_setup() with all parameters 
> > > which
> > > means rewrite sdap_child_helpers in appropriate manner.
> > 
> > I think it would be enough to pass the tevent_req to the
> > child_callback() and cancel it with tevent_req_error(req, ETIMEDOUT)
> > 
> > The problem is that with the current patches, the whole request is
> > marked as 'succeeded', that's why the _recv() function tries to parse
> > the response. If you look at sdap_get_tgt_recv(), if the request is
> > marked as failure, the processing aborts in TEVENT_REQ_RETURN_ON_ERROR..
> 
> Hi Jakub,
> 
> I tried resolve it by your advice. I have still the same problem -- request
> is unreachable in child_callback() (Bad magic value).
> 
> I found one little hack. If we sent SIGTERM signal to child, child can not
> send its answer so we can say child is already time-outed. And after this we
> are able to write debug message in child_callback().

Why does it need to send any answer? The parent would see the special
exit code and use that to know the child timed out. The point is the
child should not send any answer in the failure cases, only report the
reason why it finished with an exit code.

> 
> Please see last 5 commits here:
> https://github.com/celestian/sssd/commits/t3106_v1

Interesting, is the correct pointer address being passed as pvt to:
     child_ctx->cb(child_ctx->child_status, child_ctx->sige,
     child_ctx->pvt);
?

You can also try running sssd_be with valgrind for more clues to see if
the request is maybe terminated before.

> 
> 627200c WIP: LDAP: Handling of SIGCHLD <-- There are both solution commented
> in code
> 888df5b DEBUG: Sleep <-- Adds sleep loop for timeouting the child
> 
> 9235712 LDAP: Adding SIGCHLD callback
> 625191d LDAP: Adding SIGTERM signal before SIGKILL
> 5331972 LDAP: Adding support for SIGTERM signal
> 
> Regards
> 
> -- 
> Petr^4 Čech
> _______________________________________________
> sssd-devel mailing list
> sssd-devel@lists.fedorahosted.org
> https://lists.fedorahosted.org/admin/lists/sssd-devel@lists.fedorahosted.org
_______________________________________________
sssd-devel mailing list
sssd-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/admin/lists/sssd-devel@lists.fedorahosted.org

Reply via email to