On 08/17/2016 01:10 PM, Pavel Březina wrote:
On 08/15/2016 01:27 PM, Petr Cech wrote:
On 08/15/2016 09:59 AM, Jakub Hrozek wrote:
On Mon, Aug 15, 2016 at 09:47:27AM +0200, Petr Cech wrote:
On 08/12/2016 04:13 PM, Jakub Hrozek wrote:
On Fri, Aug 12, 2016 at 03:41:26PM +0200, Petr Cech wrote:
On 08/12/2016 03:07 PM, Jakub Hrozek wrote:
Logs now look like:

[root@albireo sssd]# grep 'child' sssd_ipa.cygnus.dev.log
[child_handler_setup] (0x2000): Setting up signal handler up for
pid [18835]
[child_handler_setup] (0x2000): Signal handler set up for pid
[18835]
[set_tgt_child_timeout] (0x0400): Setting 6 seconds timeout for
tgt child

(6 seconds later)

[get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM
to tgt child
[18835] reached.
[get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout
for sending
SIGKILL to tgt child
[sdap_get_tgt_recv] (0x0020): Cannot parse child response:
[22][Invalid
argument]
[sdap_kinit_done] (0x0020): child failed (22 [Invalid argument])
[child_sig_handler] (0x1000): Waiting for child [18835].
[child_sig_handler] (0x0020): child [18835] failed with status
[7].
[child_callback] (0x0020): LDAP child was terminated due to
timeout
I'm sorry, but these patches still don't fix the issue I was seeing.
Before the patches, when I timed out the child process, I saw:
(Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]]
[ldap_child_get_tgt_sync] (0x0100): Principal name is:
[host/client.ipa.t...@ipa.test]
(Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]]
[ldap_child_get_tgt_sync] (0x0100): Using keytab
[MEMORY:/etc/krb5.keytab]
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[get_tgt_timeout_handler] (0x4000): timeout for tgt child [31721]
reached.
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_done]
(0x0080): Communication with KDC timed out, trying the next one
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[_be_fo_set_port_status] (0x8000): Setting status:
PORT_NOT_WORKING. Called from:
/sssd/src/providers/ldap/sdap_async_connection.c: sdap_kinit_done:
1207
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[fo_set_port_status] (0x0100): Marking port 0 of server
'unidirect.ipa.test' as 'not working'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[fo_set_port_status] (0x0400): Marking port 0 of duplicate server
'unidirect.ipa.test' as 'not working'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[sdap_kinit_next_kdc] (0x1000): Resolving next KDC for service
KERBEROS
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[fo_resolve_service_send] (0x0100): Trying to resolve service
'KERBEROS'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_server_status]
(0x1000): Status of server 'unidirect.ipa.test' is 'name resolved'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_port_status]
(0x1000): Port status of port 0 for server 'unidirect.ipa.test' is
'not working'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[fo_resolve_service_send] (0x0020): No available servers for
service 'KERBEROS'
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[be_resolve_server_done] (0x1000): Server resolution failed: [5]:
Input/output error
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret
[1432158228](Network I/O Error)
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[sdap_cli_connect_recv] (0x0040): Unable to establish connection
[13]: Permission denied
(Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
[_be_fo_set_port_status] (0x8000): Setting status:
PORT_NOT_WORKING. Called from:
/sssd/src/providers/ldap/sdap_async_connection.c:
sdap_cli_connect_recv: 2048

After the patch, I see:
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]]
[get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM to
tgt child [17291] reached.
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]]
[get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout for
sending SIGKILL to tgt child
(Fri Aug 12 15:01:05 2016) [[sssd[ldap_child[17291]]]]
[sig_term_handler] (0x0010): Received signal [Terminated] [15],
shutting down
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [read_pipe_handler]
(0x0400): EOF received, client finished
(Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_get_tgt_recv]
(0x0020): Cannot parse child response: [22][Invalid argument]
here ---^

This is the part I don't like, we try to read the response from
the child's
pipe and fail with a bad error message. I thought this was because
with
the preivous patch, we exit the child with zero, but I guess this
was not
the case.. Anyway, we still should fix this, the message would be
really
confusing to admins.

Hi Jakub,

I know I had the same
'Cannot parse child response: [22][Invalid argument]'
in my logs too, see my last mail above.

Some lines after it I have new message
'[child_callback] (0x0020): LDAP child was terminated due to timeout'

I understand that it is not satisfying. The main purpose of this
patch was
deleting temporary files if timeout reached. And this is IMO solved
I think.
It is true that we have big effort to have understandable logs.

Yesterday I looked deeper at mechanism how signals, child handler
and all
such stuff works in ldap_child case. I wasn't able to tell to
sdap_get_tgt_recv() that the state is changed and we shouldn't
parse the
answer.

If I understand handler stuff right I need introduce in ldap_child
handler
whole mechanism like in src/providers/be_dyndns.c

The mechanism is calling the child_handler_setup() with all
parameters which
means rewrite sdap_child_helpers in appropriate manner.

I think it would be enough to pass the tevent_req to the
child_callback() and cancel it with tevent_req_error(req, ETIMEDOUT)

The problem is that with the current patches, the whole request is
marked as 'succeeded', that's why the _recv() function tries to parse
the response. If you look at sdap_get_tgt_recv(), if the request is
marked as failure, the processing aborts in
TEVENT_REQ_RETURN_ON_ERROR..

Hi Jakub,

I tried resolve it by your advice. I have still the same problem --
request
is unreachable in child_callback() (Bad magic value).

I found one little hack. If we sent SIGTERM signal to child, child
can not
send its answer so we can say child is already time-outed. And after
this we
are able to write debug message in child_callback().

Why does it need to send any answer? The parent would see the special
exit code and use that to know the child timed out. The point is the
child should not send any answer in the failure cases, only report the
reason why it finished with an exit code.


Please see last 5 commits here:
https://github.com/celestian/sssd/commits/t3106_v1

Interesting, is the correct pointer address being passed as pvt to:
     child_ctx->cb(child_ctx->child_status, child_ctx->sige,
     child_ctx->pvt);
?

You can also try running sssd_be with valgrind for more clues to see if
the request is maybe terminated before.


627200c WIP: LDAP: Handling of SIGCHLD <-- There are both solution
commented
in code
888df5b DEBUG: Sleep <-- Adds sleep loop for timeouting the child

9235712 LDAP: Adding SIGCHLD callback
625191d LDAP: Adding SIGTERM signal before SIGKILL
5331972 LDAP: Adding support for SIGTERM signal

Hi all,

I discussed this issue with Jakub and we found a solution (patch #3).
Please see the attached patch set.

And because Jakub is interested in it I would like to ask somebody else
for code review. Thanks.

Regards

Hi, child_callback() should be static. Besides those patches LGTM.

Thanks Pavel, your comment is addressed. New version attached.

Regards

Petr

--
Petr^4 Čech
>From ecef80dca4234432e5e17739aed52555b1bbeb11 Mon Sep 17 00:00:00 2001
From: Petr Cech <pc...@redhat.com>
Date: Thu, 4 Aug 2016 16:25:28 +0200
Subject: [PATCH 1/3] LDAP: Adding support for SIGTERM signal

We add support for handling SIGTERM signal. If ldap_child receives
SIGTERM signal it removes temporary file.

Resolves:
https://fedorahosted.org/sssd/ticket/3106
---
 src/providers/ldap/ldap_child.c | 29 +++++++++++++++++++++++++++++
 src/util/child_common.h         |  2 ++
 2 files changed, 31 insertions(+)

diff --git a/src/providers/ldap/ldap_child.c b/src/providers/ldap/ldap_child.c
index 52c271f36cb57b774e6091e7322e4b1394c78969..ffcbc3985691b965c76a06805068118628adc198 100644
--- a/src/providers/ldap/ldap_child.c
+++ b/src/providers/ldap/ldap_child.c
@@ -33,6 +33,30 @@
 #include "providers/backend.h"
 #include "providers/krb5/krb5_common.h"
 
+char *global_ccname_file_dummy = NULL;
+
+static void sig_term_handler(int sig)
+{
+    int ret;
+
+    DEBUG(SSSDBG_FATAL_FAILURE, "Received signal [%s] [%i], shutting down\n",
+                                strsignal(sig), sig);
+
+    if (global_ccname_file_dummy != NULL) {
+        ret = unlink(global_ccname_file_dummy);
+        if (ret != 0) {
+            DEBUG(SSSDBG_FATAL_FAILURE, "Unlink file [%s] failed [%i][%s]\n",
+                                        global_ccname_file_dummy,
+                                        errno, strerror(errno));
+        } else {
+            DEBUG(SSSDBG_FATAL_FAILURE, "Unlink file [%s]\n",
+                                        global_ccname_file_dummy);
+        }
+    }
+
+    _exit(CHILD_TIMEOUT_EXIT_CODE);
+}
+
 static krb5_context krb5_error_ctx;
 #define LDAP_CHILD_DEBUG(level, error) KRB5_DEBUG(level, krb5_error_ctx, error)
 
@@ -405,6 +429,7 @@ static krb5_error_code ldap_child_get_tgt_sync(TALLOC_CTX *memctx,
               strerror(krberr), krberr);
         goto done;
     }
+    global_ccname_file_dummy = ccname_file_dummy;
 
     ret = sss_unique_filename(tmp_ctx, ccname_file_dummy);
     if (ret != EOK) {
@@ -490,6 +515,7 @@ static krb5_error_code ldap_child_get_tgt_sync(TALLOC_CTX *memctx,
               "rename failed [%d][%s].\n", ret, strerror(ret));
         goto done;
     }
+    global_ccname_file_dummy = NULL;
 
     krberr = 0;
     *ccname_out = talloc_steal(memctx, ccname);
@@ -631,6 +657,9 @@ int main(int argc, const char *argv[])
         }
     }
 
+    BlockSignals(false, SIGTERM);
+    CatchSignal(SIGTERM, sig_term_handler);
+
     DEBUG(SSSDBG_TRACE_FUNC, "ldap_child started.\n");
 
     main_ctx = talloc_new(NULL);
diff --git a/src/util/child_common.h b/src/util/child_common.h
index 2a62869034a466b465a481286950678af73667ab..d843cc36d20f9b0214ea1d2b1fa26251080303a7 100644
--- a/src/util/child_common.h
+++ b/src/util/child_common.h
@@ -35,6 +35,8 @@
 #define IN_BUF_SIZE         512
 #define CHILD_MSG_CHUNK     256
 
+#define CHILD_TIMEOUT_EXIT_CODE 7
+
 struct response {
     uint8_t *buf;
     size_t size;
-- 
2.7.4

>From 876b0918f135e5c35bf730de1c2efa2ad7dc7dd9 Mon Sep 17 00:00:00 2001
From: Petr Cech <pc...@redhat.com>
Date: Thu, 4 Aug 2016 16:27:40 +0200
Subject: [PATCH 2/3] LDAP: Adding SIGTERM signal before SIGKILL

We add better termination of ldap_child. If ldap_child reaches
the timeout for termination parent sents SIGTERM signal. Child
has 2 seconds for removing temporary file and exit.
If it is not sufficient there is SIGKILL send to the child.

Resolves:
https://fedorahosted.org/sssd/ticket/3106
---
 src/providers/ldap/sdap_child_helpers.c | 40 +++++++++++++++++++++++++++++----
 src/util/child_common.h                 |  1 +
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/src/providers/ldap/sdap_child_helpers.c b/src/providers/ldap/sdap_child_helpers.c
index 92642e8e47e530e5aef5f78bf1c7a2427979275e..2c0757d9a8c3f82779175da4a540f019ae1575d6 100644
--- a/src/providers/ldap/sdap_child_helpers.c
+++ b/src/providers/ldap/sdap_child_helpers.c
@@ -416,9 +416,7 @@ int sdap_get_tgt_recv(struct tevent_req *req,
     return EOK;
 }
 
-
-
-static void get_tgt_timeout_handler(struct tevent_context *ev,
+static void get_tgt_sigkill_handler(struct tevent_context *ev,
                                     struct tevent_timer *te,
                                     struct timeval tv, void *pvt)
 {
@@ -428,7 +426,8 @@ static void get_tgt_timeout_handler(struct tevent_context *ev,
     int ret;
 
     DEBUG(SSSDBG_TRACE_ALL,
-          "timeout for tgt child [%d] reached.\n", state->child->pid);
+          "timeout for sending SIGKILL to tgt child [%d] reached.\n",
+          state->child->pid);
 
     ret = kill(state->child->pid, SIGKILL);
     if (ret == -1) {
@@ -439,6 +438,39 @@ static void get_tgt_timeout_handler(struct tevent_context *ev,
     tevent_req_error(req, ETIMEDOUT);
 }
 
+static void get_tgt_timeout_handler(struct tevent_context *ev,
+                                      struct tevent_timer *te,
+                                      struct timeval tv, void *pvt)
+{
+    struct tevent_req *req = talloc_get_type(pvt, struct tevent_req);
+    struct sdap_get_tgt_state *state = tevent_req_data(req,
+                                            struct sdap_get_tgt_state);
+    int ret;
+
+    DEBUG(SSSDBG_TRACE_ALL,
+          "timeout for sending SIGTERM to tgt child [%d] reached.\n",
+          state->child->pid);
+
+    ret = kill(state->child->pid, SIGTERM);
+    if (ret == -1) {
+        ret = errno;
+        DEBUG(SSSDBG_CRIT_FAILURE,
+              "Sending SIGTERM failed [%d][%s].\n", ret, strerror(ret));
+    }
+
+    DEBUG(SSSDBG_TRACE_FUNC,
+          "Setting %d seconds timeout for sending SIGKILL to tgt child\n",
+          SIGTERM_TO_SIGKILL_TIME);
+
+    tv = tevent_timeval_current_ofs(SIGTERM_TO_SIGKILL_TIME, 0);
+
+    te = tevent_add_timer(ev, req, tv, get_tgt_sigkill_handler, req);
+    if (te == NULL) {
+        DEBUG(SSSDBG_CRIT_FAILURE, "tevent_add_timer failed.\n");
+        tevent_req_error(req, ECANCELED);
+    }
+}
+
 static errno_t set_tgt_child_timeout(struct tevent_req *req,
                                      struct tevent_context *ev,
                                      int timeout)
diff --git a/src/util/child_common.h b/src/util/child_common.h
index d843cc36d20f9b0214ea1d2b1fa26251080303a7..37116e2a73ede765f68b3dba8ab69aea432303bb 100644
--- a/src/util/child_common.h
+++ b/src/util/child_common.h
@@ -35,6 +35,7 @@
 #define IN_BUF_SIZE         512
 #define CHILD_MSG_CHUNK     256
 
+#define SIGTERM_TO_SIGKILL_TIME 2
 #define CHILD_TIMEOUT_EXIT_CODE 7
 
 struct response {
-- 
2.7.4

>From 78594ba6464011aeaee9a21c21debe7f87305471 Mon Sep 17 00:00:00 2001
From: Petr Cech <pc...@redhat.com>
Date: Thu, 11 Aug 2016 14:18:15 +0200
Subject: [PATCH 3/3] LDAP: Adding SIGCHLD callback

This patch adds SIGCHLD callback for ldap_child. So if timeout is
reached and ldap_child is terminated by handler we have debug message
about it.

Resolves:
https://fedorahosted.org/sssd/ticket/3106
---
 src/providers/ldap/sdap_child_helpers.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/src/providers/ldap/sdap_child_helpers.c b/src/providers/ldap/sdap_child_helpers.c
index 2c0757d9a8c3f82779175da4a540f019ae1575d6..ab733175215a41ec67270f7c0cd7ca4aa796be92 100644
--- a/src/providers/ldap/sdap_child_helpers.c
+++ b/src/providers/ldap/sdap_child_helpers.c
@@ -69,8 +69,21 @@ static void sdap_close_fd(int *fd)
     *fd = -1;
 }
 
+static void child_callback(int child_status,
+                           struct tevent_signal *sige,
+                           void *pvt)
+{
+    if (WEXITSTATUS(child_status) == CHILD_TIMEOUT_EXIT_CODE) {
+        DEBUG(SSSDBG_CRIT_FAILURE,
+              "LDAP child was terminated due to timeout\n");
+
+        struct tevent_req *req = talloc_get_type(pvt, struct tevent_req);
+        tevent_req_error(req, ETIMEDOUT);
+    }
+}
+
 static errno_t sdap_fork_child(struct tevent_context *ev,
-                               struct sdap_child *child)
+                               struct sdap_child *child, struct tevent_req *req)
 {
     int pipefd_to_child[2] = PIPE_INIT;
     int pipefd_from_child[2] = PIPE_INIT;
@@ -110,7 +123,7 @@ static errno_t sdap_fork_child(struct tevent_context *ev,
         sss_fd_nonblocking(child->io->read_from_child_fd);
         sss_fd_nonblocking(child->io->write_to_child_fd);
 
-        ret = child_handler_setup(ev, pid, NULL, NULL, NULL);
+        ret = child_handler_setup(ev, pid, child_callback, req, NULL);
         if (ret != EOK) {
             goto fail;
         }
@@ -256,6 +269,8 @@ struct sdap_get_tgt_state {
     struct sdap_child *child;
     ssize_t len;
     uint8_t *buf;
+
+    struct tevent_timer *kill_te;
 };
 
 static errno_t set_tgt_child_timeout(struct tevent_req *req,
@@ -308,7 +323,7 @@ struct tevent_req *sdap_get_tgt_send(TALLOC_CTX *mem_ctx,
         goto fail;
     }
 
-    ret = sdap_fork_child(state->ev, state->child);
+    ret = sdap_fork_child(state->ev, state->child, req);
     if (ret != EOK) {
         DEBUG(SSSDBG_CRIT_FAILURE, "sdap_fork_child failed.\n");
         goto fail;
@@ -379,7 +394,12 @@ static void sdap_get_tgt_done(struct tevent_req *subreq)
 
     sdap_close_fd(&state->child->io->read_from_child_fd);
 
-    tevent_req_done(req);
+    if (state->kill_te == NULL) {
+        tevent_req_done(req);
+        return;
+    }
+
+    // wait for child callback to terminate the request
 }
 
 int sdap_get_tgt_recv(struct tevent_req *req,
@@ -464,8 +484,8 @@ static void get_tgt_timeout_handler(struct tevent_context *ev,
 
     tv = tevent_timeval_current_ofs(SIGTERM_TO_SIGKILL_TIME, 0);
 
-    te = tevent_add_timer(ev, req, tv, get_tgt_sigkill_handler, req);
-    if (te == NULL) {
+    state->kill_te = tevent_add_timer(ev, req, tv, get_tgt_sigkill_handler, req);
+    if (state->kill_te == NULL) {
         DEBUG(SSSDBG_CRIT_FAILURE, "tevent_add_timer failed.\n");
         tevent_req_error(req, ECANCELED);
     }
-- 
2.7.4

_______________________________________________
sssd-devel mailing list
sssd-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/admin/lists/sssd-devel@lists.fedorahosted.org

Reply via email to