On Fri, Aug 12, 2016 at 12:53:25PM +0200, Petr Cech wrote:
> On 08/12/2016 11:06 AM, Jakub Hrozek wrote:
> > On Thu, Aug 11, 2016 at 02:23:39PM +0200, Petr Cech wrote:
> > > On 08/09/2016 02:49 PM, Jakub Hrozek wrote:
> > > > On Tue, Aug 09, 2016 at 12:57:58PM +0200, Petr Cech wrote:
> > > > > On 08/09/2016 11:26 AM, Jakub Hrozek wrote:
> > > > > > On Mon, Aug 08, 2016 at 09:46:55AM +0200, Petr Cech wrote:
> > > > > > > > On 08/04/2016 05:01 PM, Petr Cech wrote:
> > > > > > > > > > On 08/04/2016 04:35 PM, Petr Cech wrote:
> > > > > > > > > > > > Hi list,
> > > > > > > > > > > > 
> > > > > > > > > > > > there is the first version of patch for [1]. I need
> > > > > > > > > > > > to investigate if we have the same issue in other
> > > > > > > > > > > > *_childs.
> > > > > > > > > > > > 
> > > > > > > > > > > > I tested it with sleep() added into ldap_child code.
> > > > > > > > > > > > 
> > > > > > > > > > > > [1] https://fedorahosted.org/sssd/ticket/3106
> > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > I know, it is not perfect. For example 2 seconds between 
> > > > > > > > > > SIGTERM and
> > > > > > > > > > SIGKILL should be constant. I will send the second version 
> > > > > > > > > > with other
> > > > > > > > > > childs, but not today.
> > > > > > > > > > 
> > > > > > > > > > Regards
> > > > > > > > 
> > > > > > > > Hello all,
> > > > > > > > 
> > > > > > > > there is fixed patch set. There is only one minor change -- 
> > > > > > > > constant for
> > > > > > > > time between signals.
> > > > > > > > 
> > > > > > > > Is there better name for global_ccname_file_dummy? I did not 
> > > > > > > > want to break
> > > > > > > > the naming... but it looks really long.
> > > > > > I think this name is fine, but I would prefer to not use the 
> > > > > > variable in
> > > > > > the code, except for setting it after the temporary file is created 
> > > > > > and
> > > > > > then set the variable back to NULL after the temporary file is 
> > > > > > renamed
> > > > > > to the real one.
> > > > > 
> > > > > Addressed.
> > > > > 
> > > > > Thanks Jakub, it looks better. :-)
> > > > 
> > > > OK, now I actually tested the patch and it regresses the way we handle
> > > > responses, because the handler exits the process with success error
> > > > code, the provider then tries to parse the child output, which is not
> > > > there..
> > > > 
> > > > I think the easiest way would be to define a return code of ldap_child
> > > > that would be used in exit in the signal handler and the back end code
> > > > would print a nice debug message telling the admin that the child timed
> > > > out.
> > > 
> > > Thanks Jakub,
> > > improved patch set is attached.
> > > 
> > > Regards
> > > 
> > > --
> > > Petr^4 Čech
> > 
> > > From f84d4d2fcc7c1f2cb404a02aeea7abd12c14dd61 Mon Sep 17 00:00:00 2001
> > > From: Petr Cech <pc...@redhat.com>
> > > Date: Fri, 22 Jul 2016 14:28:54 +0200
> > > Subject: [PATCH] LDAP: Fixing of removing netgroup from cache
> > 
> > This patch is from a different set.
> 
> Sorry, the right one is attached.
> 
> > > From 1f40587cbb256bab98e8e35670ca76426b41e6b1 Mon Sep 17 00:00:00 2001
> > > From: Petr Cech <pc...@redhat.com>
> > > Date: Thu, 4 Aug 2016 16:27:40 +0200
> > > Subject: [PATCH 2/3] LDAP: Adding SIGTERM signal before SIGKILL
> > > 
> > > We add better termination of ldap_child. If ldap_child reaches
> > > the timeout for termination parent sents SIGTERM signal. Child
> > > has 2 seconds for removing temporary file and exit.
> > > If it is not sufficient there is SIGKILL send to the child.
> > > 
> > > Resolves:
> > > https://fedorahosted.org/sssd/ticket/3106
> > 
> > [...]
> > 
> > > +static void get_tgt_timeout_handler(struct tevent_context *ev,
> > > +                                      struct tevent_timer *te,
> > > +                                      struct timeval tv, void *pvt)
> > > +{
> > > +    struct tevent_req *req = talloc_get_type(pvt, struct tevent_req);
> > > +    struct sdap_get_tgt_state *state = tevent_req_data(req,
> > > +                                            struct sdap_get_tgt_state);
> > > +    int ret;
> > > +
> > > +    DEBUG(SSSDBG_TRACE_ALL,
> > > +          "timeout for sending SIGTERM to tgt child [%d] reached.\n",
> > > +          state->child->pid);
> > > +
> > > +    ret = kill(state->child->pid, SIGTERM);
> > > +    if (ret == -1) {
> > 
> > Please use:
> >     ret = errno
> >     and then use ret instead of errno in the DEBUG statement to avoid
> >     overwriting errno if we ever change this branch in the future.
> 
> Thanks, addressed.
> 
> > > +        DEBUG(SSSDBG_CRIT_FAILURE,
> > > +              "Sending SIGTERM failed [%d][%s].\n", errno, 
> > > strerror(errno));
> > > +    }
> > 
> > > From 4ca2a6b90dd7d7ee792b0e144a4e6d432f4693a4 Mon Sep 17 00:00:00 2001
> > > From: Petr Cech <pc...@redhat.com>
> > > Date: Thu, 11 Aug 2016 14:18:15 +0200
> > > Subject: [PATCH 3/3] LDAP: Adding SIGCHLD callback
> > > 
> > > This patch adds SIGCHLD callback for ldap_child. So if timeout is
> > > reached and ldap_child is terminated by handler we have debug message
> > > about it.
> > > 
> > > Resolves:
> > > https://fedorahosted.org/sssd/ticket/3106
> > 
> > This patch does not apply for me, so I'll just comment on the diff.
> 
> I applied it on top of master. If there is some issue please tell me.
> 
> > > ---
> > >  src/providers/ldap/ldap_child.c         |  2 +-
> > >  src/providers/ldap/sdap_child_helpers.c | 10 +++++++++-
> > >  src/util/child_common.h                 |  1 +
> > >  3 files changed, 11 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/src/providers/ldap/ldap_child.c 
> > > b/src/providers/ldap/ldap_child.c
> > > index 
> > > 2edfc30ec7b4a3ff03d86ae41f086a6128451a9b..b6913fdaa4c8f87f33cc8c10f9a029d63626f279
> > >  100644
> > > --- a/src/providers/ldap/ldap_child.c
> > > +++ b/src/providers/ldap/ldap_child.c
> > > @@ -54,7 +54,7 @@ static void sig_term_handler(int sig)
> > >          }
> > >      }
> > > 
> > > -    _exit(0);
> > > +    _exit(SIGNAL_EXIT_CODE);
> > 
> > I would prefer CHILD_SIGTERM_EXIT_CODE or CHILD_TIMEOUT_EXIT_CODE
> > (prefix and change the name)
> 
> Addressed. Naming is one big issue :-)
> 
> > >  }
> > > 
> > >  static krb5_context krb5_error_ctx;
> > > diff --git a/src/providers/ldap/sdap_child_helpers.c 
> > > b/src/providers/ldap/sdap_child_helpers.c
> > > index 
> > > 9f9c9b57a7ca5e1217d56ab560cc941432084df8..98b0fe777fd18fad6ffdaefc48ea156a45f4e803
> > >  100644
> > > --- a/src/providers/ldap/sdap_child_helpers.c
> > > +++ b/src/providers/ldap/sdap_child_helpers.c
> > > @@ -69,6 +69,14 @@ static void sdap_close_fd(int *fd)
> > >      *fd = -1;
> > >  }
> > > 
> > > +void child_callback(int child_status, struct tevent_signal *sige, void 
> > > *pvt)
> > > +{
> > > +    if (WEXITSTATUS(child_status) == SIGNAL_EXIT_CODE) {
> > > +        DEBUG(SSSDBG_CRIT_FAILURE,
> > > +              "LDAP child was terminated due to timeout\n");
> > > +    }
> > > +}
> > > +
> > >  static errno_t sdap_fork_child(struct tevent_context *ev,
> > >                                 struct sdap_child *child)
> > >  {
> > > @@ -110,7 +118,7 @@ static errno_t sdap_fork_child(struct tevent_context 
> > > *ev,
> > >          sss_fd_nonblocking(child->io->read_from_child_fd);
> > >          sss_fd_nonblocking(child->io->write_to_child_fd);
> > > 
> > > -        ret = child_handler_setup(ev, pid, NULL, NULL, NULL);
> > > +        ret = child_handler_setup(ev, pid, child_callback, NULL, NULL);
> > >          if (ret != EOK) {
> > >              goto fail;
> > >          }
> > > diff --git a/src/util/child_common.h b/src/util/child_common.h
> > > index 
> > > eb7623987238d5d9e9032b9bbbbef49188f5b1d2..e88661349a1568ffc981e0a3134f7ccea61b394e
> > >  100644
> > > --- a/src/util/child_common.h
> > > +++ b/src/util/child_common.h
> > > @@ -36,6 +36,7 @@
> > >  #define CHILD_MSG_CHUNK     256
> > > 
> > >  #define SIGTERM_TO_SIGKILL_TIME 2
> > > +#define SIGNAL_EXIT_CODE EINTR
> > 
> > I'm not sure if we want to use errno constants here, I would say just
> > defining our own constant is better, at least to avoid overflowing the
> > maximum exit value of 255. I don't think we even have to worry about
> > clashing with constants from sysexits.h, this is our internal API anyway.
> 
> I choose 7.
> 
> Logs now look like:
> 
> [root@albireo sssd]# grep 'child' sssd_ipa.cygnus.dev.log
> [child_handler_setup] (0x2000): Setting up signal handler up for pid [18835]
> [child_handler_setup] (0x2000): Signal handler set up for pid [18835]
> [set_tgt_child_timeout] (0x0400): Setting 6 seconds timeout for tgt child
> 
> (6 seconds later)
> 
> [get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM to tgt child
> [18835] reached.
> [get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout for sending
> SIGKILL to tgt child
> [sdap_get_tgt_recv] (0x0020): Cannot parse child response: [22][Invalid
> argument]
> [sdap_kinit_done] (0x0020): child failed (22 [Invalid argument])
> [child_sig_handler] (0x1000): Waiting for child [18835].
> [child_sig_handler] (0x0020): child [18835] failed with status [7].
> [child_callback] (0x0020): LDAP child was terminated due to timeout

I'm sorry, but these patches still don't fix the issue I was seeing.
Before the patches, when I timed out the child process, I saw:
(Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]] 
[ldap_child_get_tgt_sync] (0x0100): Principal name is: 
[host/client.ipa.t...@ipa.test]
(Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]] 
[ldap_child_get_tgt_sync] (0x0100): Using keytab [MEMORY:/etc/krb5.keytab]
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_tgt_timeout_handler] 
(0x4000): timeout for tgt child [31721] reached.
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_done] (0x0080): 
Communication with KDC timed out, trying the next one
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] 
(0x8000): Setting status: PORT_NOT_WORKING. Called from: 
/sssd/src/providers/ldap/sdap_async_connection.c: sdap_kinit_done: 1207
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_set_port_status] (0x0100): 
Marking port 0 of server 'unidirect.ipa.test' as 'not working'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_set_port_status] (0x0400): 
Marking port 0 of duplicate server 'unidirect.ipa.test' as 'not working'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_next_kdc] (0x1000): 
Resolving next KDC for service KERBEROS
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_resolve_service_send] 
(0x0100): Trying to resolve service 'KERBEROS'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_server_status] (0x1000): 
Status of server 'unidirect.ipa.test' is 'name resolved'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_port_status] (0x1000): Port 
status of port 0 for server 'unidirect.ipa.test' is 'not working'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [fo_resolve_service_send] 
(0x0020): No available servers for service 'KERBEROS'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [be_resolve_server_done] 
(0x1000): Server resolution failed: [5]: Input/output error
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_cli_kinit_done] (0x0400): 
Cannot get a TGT: ret [1432158228](Network I/O Error)
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_cli_connect_recv] 
(0x0040): Unable to establish connection [13]: Permission denied
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] 
(0x8000): Setting status: PORT_NOT_WORKING. Called from: 
/sssd/src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_recv: 2048

After the patch, I see:
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [get_tgt_timeout_handler] 
(0x4000): timeout for sending SIGTERM to tgt child [17291] reached.
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [get_tgt_timeout_handler] 
(0x0400): Setting 2 seconds timeout for sending SIGKILL to tgt child
(Fri Aug 12 15:01:05 2016) [[sssd[ldap_child[17291]]]] [sig_term_handler] 
(0x0010): Received signal [Terminated] [15], shutting down
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [read_pipe_handler] (0x0400): 
EOF received, client finished
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_get_tgt_recv] (0x0020): 
Cannot parse child response: [22][Invalid argument]
here ---^

This is the part I don't like, we try to read the response from the child's
pipe and fail with a bad error message. I thought this was because with
the preivous patch, we exit the child with zero, but I guess this was not
the case.. Anyway, we still should fix this, the message would be really
confusing to admins.

(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_kinit_done] (0x0020): 
child failed (22 [Invalid argument])
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_cli_kinit_done] (0x0400): 
Cannot get a TGT: ret [22](Invalid argument)
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_cli_connect_recv] 
(0x0040): Unable to establish connection [13]: Permission denied
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [_be_fo_set_port_status] 
(0x8000): Setting status: PORT_NOT_WORKING. Called from: 
/sssd/src/providers/ldap/sdap_async_connection.c: sdap_cli_connect_recv: 2048
_______________________________________________
sssd-devel mailing list
sssd-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/admin/lists/sssd-devel@lists.fedorahosted.org

Reply via email to