[SSSD] Re: [PATCH 1/2] LDAP: Adding support for SIGTERM signal

Pavel Březina Wed, 17 Aug 2016 04:11:20 -0700

On 08/15/2016 01:27 PM, Petr Cech wrote:

On 08/15/2016 09:59 AM, Jakub Hrozek wrote:

On Mon, Aug 15, 2016 at 09:47:27AM +0200, Petr Cech wrote:

On 08/12/2016 04:13 PM, Jakub Hrozek wrote:

On Fri, Aug 12, 2016 at 03:41:26PM +0200, Petr Cech wrote:

On 08/12/2016 03:07 PM, Jakub Hrozek wrote:

Logs now look like:


[root@albireo sssd]# grep 'child' sssd_ipa.cygnus.dev.log
[child_handler_setup] (0x2000): Setting up signal handler up for
pid [18835]
[child_handler_setup] (0x2000): Signal handler set up for pid
[18835]
[set_tgt_child_timeout] (0x0400): Setting 6 seconds timeout for
tgt child

(6 seconds later)

[get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM
to tgt child
[18835] reached.
[get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout
for sending
SIGKILL to tgt child
[sdap_get_tgt_recv] (0x0020): Cannot parse child response:
[22][Invalid
argument]
[sdap_kinit_done] (0x0020): child failed (22 [Invalid argument])
[child_sig_handler] (0x1000): Waiting for child [18835].
[child_sig_handler] (0x0020): child [18835] failed with status [7].
[child_callback] (0x0020): LDAP child was terminated due to timeout

I'm sorry, but these patches still don't fix the issue I was seeing.
Before the patches, when I timed out the child process, I saw:
(Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]]
[ldap_child_get_tgt_sync] (0x0100): Principal name is:
[host/client.ipa.t...@ipa.test]
(Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]]
[ldap_child_get_tgt_sync] (0x0100): Using keytab
[MEMORY:/etc/krb5.keytab]
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[get_tgt_timeout_handler] (0x4000): timeout for tgt child [31721]
reached.
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_done]
(0x0080): Communication with KDC timed out, trying the next one
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[_be_fo_set_port_status] (0x8000): Setting status:
PORT_NOT_WORKING. Called from:
/sssd/src/providers/ldap/sdap_async_connection.c: sdap_kinit_done:
1207
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[fo_set_port_status] (0x0100): Marking port 0 of server
'unidirect.ipa.test' as 'not working'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[fo_set_port_status] (0x0400): Marking port 0 of duplicate server
'unidirect.ipa.test' as 'not working'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[sdap_kinit_next_kdc] (0x1000): Resolving next KDC for service
KERBEROS
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[fo_resolve_service_send] (0x0100): Trying to resolve service
'KERBEROS'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_server_status]
(0x1000): Status of server 'unidirect.ipa.test' is 'name resolved'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_port_status]
(0x1000): Port status of port 0 for server 'unidirect.ipa.test' is
'not working'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[fo_resolve_service_send] (0x0020): No available servers for
service 'KERBEROS'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[be_resolve_server_done] (0x1000): Server resolution failed: [5]:
Input/output error
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret
[1432158228](Network I/O Error)
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[sdap_cli_connect_recv] (0x0040): Unable to establish connection
[13]: Permission denied
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[_be_fo_set_port_status] (0x8000): Setting status:
PORT_NOT_WORKING. Called from:
/sssd/src/providers/ldap/sdap_async_connection.c:
sdap_cli_connect_recv: 2048

After the patch, I see:
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]]
[get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM to
tgt child [17291] reached.
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]]
[get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout for
sending SIGKILL to tgt child
(Fri Aug 12 15:01:05 2016) [[sssd[ldap_child[17291]]]]
[sig_term_handler] (0x0010): Received signal [Terminated] [15],
shutting down
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [read_pipe_handler]
(0x0400): EOF received, client finished
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_get_tgt_recv]
(0x0020): Cannot parse child response: [22][Invalid argument]
here ---^

This is the part I don't like, we try to read the response from
the child's
pipe and fail with a bad error message. I thought this was because
with
the preivous patch, we exit the child with zero, but I guess this
was not
the case.. Anyway, we still should fix this, the message would be
really
confusing to admins.


Hi Jakub,

I know I had the same
'Cannot parse child response: [22][Invalid argument]'
in my logs too, see my last mail above.

Some lines after it I have new message
'[child_callback] (0x0020): LDAP child was terminated due to timeout'

I understand that it is not satisfying. The main purpose of this
patch was
deleting temporary files if timeout reached. And this is IMO solved
I think.
It is true that we have big effort to have understandable logs.

Yesterday I looked deeper at mechanism how signals, child handler
and all
such stuff works in ldap_child case. I wasn't able to tell to
sdap_get_tgt_recv() that the state is changed and we shouldn't
parse the
answer.

If I understand handler stuff right I need introduce in ldap_child
handler
whole mechanism like in src/providers/be_dyndns.c

The mechanism is calling the child_handler_setup() with all
parameters which
means rewrite sdap_child_helpers in appropriate manner.


I think it would be enough to pass the tevent_req to the
child_callback() and cancel it with tevent_req_error(req, ETIMEDOUT)

The problem is that with the current patches, the whole request is
marked as 'succeeded', that's why the _recv() function tries to parse
the response. If you look at sdap_get_tgt_recv(), if the request is
marked as failure, the processing aborts in
TEVENT_REQ_RETURN_ON_ERROR..


Hi Jakub,

I tried resolve it by your advice. I have still the same problem --
request
is unreachable in child_callback() (Bad magic value).

I found one little hack. If we sent SIGTERM signal to child, child
can not
send its answer so we can say child is already time-outed. And after
this we
are able to write debug message in child_callback().


Why does it need to send any answer? The parent would see the special
exit code and use that to know the child timed out. The point is the
child should not send any answer in the failure cases, only report the
reason why it finished with an exit code.


Please see last 5 commits here:
https://github.com/celestian/sssd/commits/t3106_v1


Interesting, is the correct pointer address being passed as pvt to:
     child_ctx->cb(child_ctx->child_status, child_ctx->sige,
     child_ctx->pvt);
?

You can also try running sssd_be with valgrind for more clues to see if
the request is maybe terminated before.


627200c WIP: LDAP: Handling of SIGCHLD <-- There are both solution
commented
in code
888df5b DEBUG: Sleep <-- Adds sleep loop for timeouting the child

9235712 LDAP: Adding SIGCHLD callback
625191d LDAP: Adding SIGTERM signal before SIGKILL
5331972 LDAP: Adding support for SIGTERM signal


Hi all,

I discussed this issue with Jakub and we found a solution (patch #3).
Please see the attached patch set.

And because Jakub is interested in it I would like to ask somebody else
for code review. Thanks.

Regards


Hi, child_callback() should be static. Besides those patches LGTM.
_______________________________________________
sssd-devel mailing list
sssd-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/admin/lists/sssd-devel@lists.fedorahosted.org

[SSSD] Re: [PATCH 1/2] LDAP: Adding support for SIGTERM signal

Reply via email to