On Fri, Aug 12, 2016 at 12:53:25PM +0200, Petr Cech wrote: > On 08/12/2016 11:06 AM, Jakub Hrozek wrote: > > On Thu, Aug 11, 2016 at 02:23:39PM +0200, Petr Cech wrote: > > > On 08/09/2016 02:49 PM, Jakub Hrozek wrote: > > > > On Tue, Aug 09, 2016 at 12:57:58PM +0200, Petr Cech wrote: > > > > > On 08/09/2016 11:26 AM, Jakub Hrozek wrote: > > > > > > On Mon, Aug 08, 2016 at 09:46:55AM +0200, Petr Cech wrote: > > > > > > > > On 08/04/2016 05:01 PM, Petr Cech wrote: > > > > > > > > > > On 08/04/2016 04:35 PM, Petr Cech wrote: > > > > > > > > > > > > Hi list, > > > > > > > > > > > > > > > > > > > > > > > > there is the first version of patch for [1]. I need > > > > > > > > > > > > to investigate if we have the same issue in other > > > > > > > > > > > > *_childs. > > > > > > > > > > > > > > > > > > > > > > > > I tested it with sleep() added into ldap_child code. > > > > > > > > > > > > > > > > > > > > > > > > [1] https://fedorahosted.org/sssd/ticket/3106 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I know, it is not perfect. For example 2 seconds between > > > > > > > > > > SIGTERM and > > > > > > > > > > SIGKILL should be constant. I will send the second version > > > > > > > > > > with other > > > > > > > > > > childs, but not today. > > > > > > > > > > > > > > > > > > > > Regards > > > > > > > > > > > > > > > > Hello all, > > > > > > > > > > > > > > > > there is fixed patch set. There is only one minor change -- > > > > > > > > constant for > > > > > > > > time between signals. > > > > > > > > > > > > > > > > Is there better name for global_ccname_file_dummy? I did not > > > > > > > > want to break > > > > > > > > the naming... but it looks really long. > > > > > > I think this name is fine, but I would prefer to not use the > > > > > > variable in > > > > > > the code, except for setting it after the temporary file is created > > > > > > and > > > > > > then set the variable back to NULL after the temporary file is > > > > > > renamed > > > > > > to the real one. > > > > > > > > > > Addressed. > > > > > > > > > > Thanks Jakub, it looks better. :-) > > > > > > > > OK, now I actually tested the patch and it regresses the way we handle > > > > responses, because the handler exits the process with success error > > > > code, the provider then tries to parse the child output, which is not > > > > there.. > > > > > > > > I think the easiest way would be to define a return code of ldap_child > > > > that would be used in exit in the signal handler and the back end code > > > > would print a nice debug message telling the admin that the child timed > > > > out. > > > > > > Thanks Jakub, > > > improved patch set is attached. > > > > > > Regards > > > > > > -- > > > Petr^4 Čech > > > > > From f84d4d2fcc7c1f2cb404a02aeea7abd12c14dd61 Mon Sep 17 00:00:00 2001 > > > From: Petr Cech <pc...@redhat.com> > > > Date: Fri, 22 Jul 2016 14:28:54 +0200 > > > Subject: [PATCH] LDAP: Fixing of removing netgroup from cache > > > > This patch is from a different set. > > Sorry, the right one is attached. > > > > From 1f40587cbb256bab98e8e35670ca76426b41e6b1 Mon Sep 17 00:00:00 2001 > > > From: Petr Cech <pc...@redhat.com> > > > Date: Thu, 4 Aug 2016 16:27:40 +0200 > > > Subject: [PATCH 2/3] LDAP: Adding SIGTERM signal before SIGKILL > > > > > > We add better termination of ldap_child. If ldap_child reaches > > > the timeout for termination parent sents SIGTERM signal. Child > > > has 2 seconds for removing temporary file and exit. > > > If it is not sufficient there is SIGKILL send to the child. > > > > > > Resolves: > > > https://fedorahosted.org/sssd/ticket/3106 > > > > [...] > > > > > +static void get_tgt_timeout_handler(struct tevent_context *ev, > > > + struct tevent_timer *te, > > > + struct timeval tv, void *pvt) > > > +{ > > > + struct tevent_req *req = talloc_get_type(pvt, struct tevent_req); > > > + struct sdap_get_tgt_state *state = tevent_req_data(req, > > > + struct sdap_get_tgt_state); > > > + int ret; > > > + > > > + DEBUG(SSSDBG_TRACE_ALL, > > > + "timeout for sending SIGTERM to tgt child [%d] reached.\n", > > > + state->child->pid); > > > + > > > + ret = kill(state->child->pid, SIGTERM); > > > + if (ret == -1) { > > > > Please use: > > ret = errno > > and then use ret instead of errno in the DEBUG statement to avoid > > overwriting errno if we ever change this branch in the future. > > Thanks, addressed. > > > > + DEBUG(SSSDBG_CRIT_FAILURE, > > > + "Sending SIGTERM failed [%d][%s].\n", errno, > > > strerror(errno)); > > > + } > > > > > From 4ca2a6b90dd7d7ee792b0e144a4e6d432f4693a4 Mon Sep 17 00:00:00 2001 > > > From: Petr Cech <pc...@redhat.com> > > > Date: Thu, 11 Aug 2016 14:18:15 +0200 > > > Subject: [PATCH 3/3] LDAP: Adding SIGCHLD callback > > > > > > This patch adds SIGCHLD callback for ldap_child. So if timeout is > > > reached and ldap_child is terminated by handler we have debug message > > > about it. > > > > > > Resolves: > > > https://fedorahosted.org/sssd/ticket/3106 > > > > This patch does not apply for me, so I'll just comment on the diff. > > I applied it on top of master. If there is some issue please tell me. > > > > --- > > > src/providers/ldap/ldap_child.c | 2 +- > > > src/providers/ldap/sdap_child_helpers.c | 10 +++++++++- > > > src/util/child_common.h | 1 + > > > 3 files changed, 11 insertions(+), 2 deletions(-) > > > > > > diff --git a/src/providers/ldap/ldap_child.c > > > b/src/providers/ldap/ldap_child.c > > > index > > > 2edfc30ec7b4a3ff03d86ae41f086a6128451a9b..b6913fdaa4c8f87f33cc8c10f9a029d63626f279 > > > 100644 > > > --- a/src/providers/ldap/ldap_child.c > > > +++ b/src/providers/ldap/ldap_child.c > > > @@ -54,7 +54,7 @@ static void sig_term_handler(int sig) > > > } > > > } > > > > > > - _exit(0); > > > + _exit(SIGNAL_EXIT_CODE); > > > > I would prefer CHILD_SIGTERM_EXIT_CODE or CHILD_TIMEOUT_EXIT_CODE > > (prefix and change the name) > > Addressed. Naming is one big issue :-) > > > > } > > > > > > static krb5_context krb5_error_ctx; > > > diff --git a/src/providers/ldap/sdap_child_helpers.c > > > b/src/providers/ldap/sdap_child_helpers.c > > > index > > > 9f9c9b57a7ca5e1217d56ab560cc941432084df8..98b0fe777fd18fad6ffdaefc48ea156a45f4e803 > > > 100644 > > > --- a/src/providers/ldap/sdap_child_helpers.c > > > +++ b/src/providers/ldap/sdap_child_helpers.c > > > @@ -69,6 +69,14 @@ static void sdap_close_fd(int *fd) > > > *fd = -1; > > > } > > > > > > +void child_callback(int child_status, struct tevent_signal *sige, void > > > *pvt) > > > +{ > > > + if (WEXITSTATUS(child_status) == SIGNAL_EXIT_CODE) { > > > + DEBUG(SSSDBG_CRIT_FAILURE, > > > + "LDAP child was terminated due to timeout\n"); > > > + } > > > +} > > > + > > > static errno_t sdap_fork_child(struct tevent_context *ev, > > > struct sdap_child *child) > > > { > > > @@ -110,7 +118,7 @@ static errno_t sdap_fork_child(struct tevent_context > > > *ev, > > > sss_fd_nonblocking(child->io->read_from_child_fd); > > > sss_fd_nonblocking(child->io->write_to_child_fd); > > > > > > - ret = child_handler_setup(ev, pid, NULL, NULL, NULL); > > > + ret = child_handler_setup(ev, pid, child_callback, NULL, NULL); > > > if (ret != EOK) { > > > goto fail; > > > } > > > diff --git a/src/util/child_common.h b/src/util/child_common.h > > > index > > > eb7623987238d5d9e9032b9bbbbef49188f5b1d2..e88661349a1568ffc981e0a3134f7ccea61b394e > > > 100644 > > > --- a/src/util/child_common.h > > > +++ b/src/util/child_common.h > > > @@ -36,6 +36,7 @@ > > > #define CHILD_MSG_CHUNK 256 > > > > > > #define SIGTERM_TO_SIGKILL_TIME 2 > > > +#define SIGNAL_EXIT_CODE EINTR > > > > I'm not sure if we want to use errno constants here, I would say just > > defining our own constant is better, at least to avoid overflowing the > > maximum exit value of 255. I don't think we even have to worry about > > clashing with constants from sysexits.h, this is our internal API anyway. > > I choose 7. > > Logs now look like: > > [root@albireo sssd]# grep 'child' sssd_ipa.cygnus.dev.log > [child_handler_setup] (0x2000): Setting up signal handler up for pid [18835] > [child_handler_setup] (0x2000): Signal handler set up for pid [18835] > [set_tgt_child_timeout] (0x0400): Setting 6 seconds timeout for tgt child > > (6 seconds later) > > [get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM to tgt child > [18835] reached. > [get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout for sending > SIGKILL to tgt child > [sdap_get_tgt_recv] (0x0020): Cannot parse child response: [22][Invalid > argument] > [sdap_kinit_done] (0x0020): child failed (22 [Invalid argument]) > [child_sig_handler] (0x1000): Waiting for child [18835]. > [child_sig_handler] (0x0020): child [18835] failed with status [7]. > [child_callback] (0x0020): LDAP child was terminated due to timeout
I'm sorry, but these patches still don't fix the issue I was seeing. Before the patches, when I timed out the child process, I saw: (Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]] [ldap_child_get_tgt_sync] (0x0100): Principal name is: [host/client.ipa.t...@ipa.test] (Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]] [ldap_child_get_tgt_sync] (0x0100): Using keytab [MEMORY:/etc/krb5.keytab] (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_tgt_timeout_handler] (0x4000): timeout for tgt child [31721] reached. (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_done] (0x0080): Communication with KDC timed out, trying the next one (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_NOT_WORKING. Called from: /sssd/src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1207 (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_set_port_status] (0x0100): Marking port 0 of server 'unidirect.ipa.test' as 'not working' (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_set_port_status] (0x0400): Marking port 0 of duplicate server 'unidirect.ipa.test' as 'not working' (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_next_kdc] (0x1000): Resolving next KDC for service KERBEROS (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_resolve_service_send] (0x0100): Trying to resolve service 'KERBEROS' (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_server_status] (0x1000): Status of server 'unidirect.ipa.test' is 'name resolved' (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_port_status] (0x1000): Port status of port 0 for server 'unidirect.ipa.test' is 'not working' (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_resolve_service_send] (0x0020): No available servers for service 'KERBEROS' (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [be_resolve_server_done] (0x1000): Server resolution failed: [5]: Input/output error (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret [1432158228](Network I/O Error) (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_cli_connect_recv] (0x0040): Unable to establish connection [13]: Permission denied (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_NOT_WORKING. Called from: /sssd/src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_recv: 2048 After the patch, I see: (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM to tgt child [17291] reached. (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout for sending SIGKILL to tgt child (Fri Aug 12 15:01:05 2016) [[sssd[ldap_child[17291]]]] [sig_term_handler] (0x0010): Received signal [Terminated] [15], shutting down (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [read_pipe_handler] (0x0400): EOF received, client finished (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_get_tgt_recv] (0x0020): Cannot parse child response: [22][Invalid argument] here ---^ This is the part I don't like, we try to read the response from the child's pipe and fail with a bad error message. I thought this was because with the preivous patch, we exit the child with zero, but I guess this was not the case.. Anyway, we still should fix this, the message would be really confusing to admins. (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_kinit_done] (0x0020): child failed (22 [Invalid argument]) (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret [22](Invalid argument) (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_cli_connect_recv] (0x0040): Unable to establish connection [13]: Permission denied (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] (0x8000): Setting status: PORT_NOT_WORKING. Called from: /sssd/src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_recv: 2048 _______________________________________________ sssd-devel mailing list sssd-devel@lists.fedorahosted.org https://lists.fedorahosted.org/admin/lists/sssd-devel@lists.fedorahosted.org