https://fedorahosted.org/sssd/ticket/1947

A fast explanation how _srv_ expansion works. _srv_ is inserted into server list as so called meta server. Let us consider following configuration:

*Setup*
ipa_server = _srv_, ipa.pb
server list will contain: meta -> ipa.pb

*Expansion*
meta -> ipa.pb:389 -> ipa.pb ; remove meta
ipa.pb:389 -> ipa.pb ; meta

*Collapse*
remove ipa.pb:389 ; insert meta
meta -> ipa.pb

The main problem is that expanded SRV servers are marked as NEUTRAL during online check, but they don't collapse back into a meta server.

This will trigger another SRV expansion, leaving the old server in the list and trying to add the servers again. This is present in both master and 1.9 (and probably older versions), although the result is slightly different.

In master, we don't insert a server into server list if it is already present. Because state->meta is orphaned from the previous SRV expansion, state->meta->next is NULL and SSSD crashes later.

In 1.9, we simply insert duplicate servers. Those servers are inserted after orphaned state->meta, state->meta is orphaned again, leaving those servers globally unreachable. However, it seemingly does not affect the fail over. You just run into d25e7c65.

Here are four patches for master, and two patches for 1.9.
From 3969b2651c91cd1cb40bcb0789f7c6027369fffd Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pavel=20B=C5=99ezina?= <pbrez...@redhat.com>
Date: Tue, 18 Jun 2013 13:27:00 +0200
Subject: [PATCH 1/2] collapse_srv_lookup may free the server, make it clear
 from the API

https://fedorahosted.org/sssd/ticket/1947
---
 src/providers/fail_over.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/providers/fail_over.c b/src/providers/fail_over.c
index ee63479f8eebb1f0fa54d316d5d7bfc5c63da121..47bbd820d4d408f36e37f951e1d0e3a6a7200798 100644
--- a/src/providers/fail_over.c
+++ b/src/providers/fail_over.c
@@ -227,10 +227,11 @@ get_srv_query(TALLOC_CTX *mem_ctx, struct fo_server *server)
 }
 
 static struct fo_server *
-collapse_srv_lookup(struct fo_server *server)
+collapse_srv_lookup(struct fo_server **_server)
 {
-    struct fo_server *tmp, *meta;
+    struct fo_server *tmp, *meta, *server;
 
+    server = *_server;
     meta = server->srv_data->meta;
     DEBUG(4, ("Need to refresh SRV lookup for domain %s\n",
               meta->srv_data->dns_domain));
@@ -263,6 +264,8 @@ collapse_srv_lookup(struct fo_server *server)
     meta->srv_data->srv_lookup_status = SRV_NEUTRAL;
     meta->srv_data->last_status_change.tv_sec = 0;
 
+    *_server = NULL;
+
     return meta;
 }
 
@@ -1102,7 +1105,7 @@ resolve_srv_send(TALLOC_CTX *mem_ctx, struct tevent_context *ev,
           str_srv_data_status(status)));
     switch(status) {
     case SRV_EXPIRED: /* Need a refresh */
-        state->meta = collapse_srv_lookup(server);
+        state->meta = collapse_srv_lookup(&server);
         /* FALLTHROUGH.
          * "server" might be invalid now if the SRV
          * query collapsed
@@ -1113,7 +1116,7 @@ resolve_srv_send(TALLOC_CTX *mem_ctx, struct tevent_context *ev,
             DEBUG(SSSDBG_TRACE_FUNC,
                   ("SRV resolution of service '%s'. "
                    "dns_discovery_domain not specified. Need to look it up.\n",
-                   server->service->name));
+                   state->meta->service->name));
             subreq = resolve_get_domain_send(state, ev, ctx, resolv);
             if (subreq == NULL) {
                 ret = ENOMEM;
@@ -1126,7 +1129,7 @@ resolve_srv_send(TALLOC_CTX *mem_ctx, struct tevent_context *ev,
         DEBUG(SSSDBG_TRACE_FUNC,
               ("SRV resolution of service '%s'. "
                "Will use DNS discovery domain '%s'\n",
-               server->service->name, state->meta->srv_data->dns_domain));
+               state->meta->service->name, state->meta->srv_data->dns_domain));
         resolve_srv_cont(req);
         break;
     case SRV_RESOLVE_ERROR: /* query could not be resolved but don't retry yet */
-- 
1.7.11.7

From 438354fd9d2b41fc2e2a79f2820cef6d2f21c910 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pavel=20B=C5=99ezina?= <pbrez...@redhat.com>
Date: Tue, 18 Jun 2013 13:28:22 +0200
Subject: [PATCH 2/2] failover: if expanded server is marked as neutral,
 invoke srv collapse

https://fedorahosted.org/sssd/ticket/1947

Otherwise we will do the SRV expansion once again:
1. leaving the old servers in server list
2. meta server is not inserted back in the list, the newly found
   servers are inserted behind meta server, meta server is orphaned
   and the new servers are forgotten
---
 src/providers/fail_over.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/providers/fail_over.c b/src/providers/fail_over.c
index 47bbd820d4d408f36e37f951e1d0e3a6a7200798..17ebbf4ca09870c9a1fc6d62537b3f662de71334 100644
--- a/src/providers/fail_over.c
+++ b/src/providers/fail_over.c
@@ -1111,6 +1111,13 @@ resolve_srv_send(TALLOC_CTX *mem_ctx, struct tevent_context *ev,
          * query collapsed
          * */
     case SRV_NEUTRAL: /* Request SRV lookup */
+        if (server != NULL && server != state->meta) {
+            /* A server created by expansion of meta server was marked as
+             * neutral. We have to collapse the servers and issue new
+             * SRV resolution. */
+            state->meta = collapse_srv_lookup(&server);
+        }
+
         if (state->meta->srv_data->dns_domain == NULL) {
             /* we need to look up our DNS domain first */
             DEBUG(SSSDBG_TRACE_FUNC,
-- 
1.7.11.7

From 415736b51e62949bd82a3ca6aa1d3f711bfb1191 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pavel=20B=C5=99ezina?= <pbrez...@redhat.com>
Date: Mon, 10 Jun 2013 10:34:38 +0200
Subject: [PATCH 1/4] failover: do not return invalid pointer when server is
 already present

https://fedorahosted.org/sssd/ticket/1947
---
 src/providers/fail_over.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/providers/fail_over.c b/src/providers/fail_over.c
index 1d2813589495ebb2ff56e93cddaed9d5172e128e..745d7684c96de8d93476784a7048ba384f5b624d 100644
--- a/src/providers/fail_over.c
+++ b/src/providers/fail_over.c
@@ -733,6 +733,7 @@ static errno_t fo_add_server_list(struct fo_service *service,
                                   struct fo_server **_last_server)
 {
     struct fo_server *server = NULL;
+    struct fo_server *last_server = NULL;
     struct fo_server *srv_list = NULL;
     size_t i;
     errno_t ret;
@@ -750,8 +751,11 @@ static errno_t fo_add_server_list(struct fo_service *service,
         ret = fo_add_server_to_list(&srv_list, service->server_list,
                                     server, service->name);
         if (ret != EOK) {
-            talloc_free(server);
+            talloc_zfree(server);
+            continue;
         }
+
+        last_server = server;
     }
 
     if (srv_list != NULL) {
@@ -760,7 +764,7 @@ static errno_t fo_add_server_list(struct fo_service *service,
     }
 
     if (_last_server != NULL) {
-        *_last_server = server;
+        *_last_server = last_server == NULL ? after_server : last_server;
     }
 
     return EOK;
-- 
1.7.11.7

From 3181429e2493f2fccea064b6435429e8df853637 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pavel=20B=C5=99ezina?= <pbrez...@redhat.com>
Date: Tue, 18 Jun 2013 12:28:36 +0200
Subject: [PATCH 2/4] failover: return error when SRV lookup returned only
 duplicates

https://fedorahosted.org/sssd/ticket/1947

Otherwise we risk that the meta server is removed from the server list,
but without a chance to return, because there may be no fo_server with
srv_data = meta.

Also if state->meta->next is NULL (it is still orphaned because we try
to errornously expand it without invoking collapse first), state->out
will be NULL and SSSD will crash.

New error code: ERR_SRV_DUPLICATES
---
 src/providers/fail_over.c | 23 +++++++++++++++++++++--
 src/util/util_errors.c    |  1 +
 src/util/util_errors.h    |  1 +
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/src/providers/fail_over.c b/src/providers/fail_over.c
index 745d7684c96de8d93476784a7048ba384f5b624d..072ad075757a30e0963f0df0608f063f3f01b563 100644
--- a/src/providers/fail_over.c
+++ b/src/providers/fail_over.c
@@ -1284,21 +1284,40 @@ resolve_srv_done(struct tevent_req *subreq)
                                      backup_servers, num_backup_servers,
                                      state->meta->srv_data,
                                      state->meta->user_data,
-                                     false, NULL);
+                                     false, &last_server);
             if (ret != EOK) {
                 goto done;
             }
         }
 
+        if (last_server == state->meta) {
+            /* SRV lookup returned only those servers
+             * that are already present. */
+            DEBUG(SSSDBG_TRACE_FUNC, ("SRV lookup did not return "
+                                      "any new server.\n"));
+            ret = ERR_SRV_DUPLICATES;
+            goto done;
+        }
+
+        /* At least one new server was inserted.
+         * We will return the first new server. */
+        if (state->meta->next == NULL) {
+            DEBUG(SSSDBG_CRIT_FAILURE,
+                 ("BUG: state->meta->next is NULL\n"));
+            ret = ERR_INTERNAL;
+            goto done;
+        }
+
         state->out = state->meta->next;
 
+        /* And remove meta server from the server list. It will be
+         * inserted again during srv collapse. */
         DLIST_REMOVE(state->service->server_list, state->meta);
         if (state->service->last_tried_server == state->meta) {
             state->service->last_tried_server = state->out;
         }
 
         set_srv_data_status(state->meta->srv_data, SRV_RESOLVED);
-
         ret = EOK;
         break;
     case ERR_SRV_NOT_FOUND:
diff --git a/src/util/util_errors.c b/src/util/util_errors.c
index 22a3045a6f9656d9ab8fe66673301a508e444771..015c5acaf1b9189d7852041907d7fa256910a744 100644
--- a/src/util/util_errors.c
+++ b/src/util/util_errors.c
@@ -44,6 +44,7 @@ struct err_string error_to_str[] = {
     { "Host Access Denied" }, /* ERR_ACCESS_DENIED */
     { "SRV record not found" }, /* ERR_SRV_NOT_FOUND */
     { "SRV lookup error" }, /* ERR_SRV_LOOKUP_ERROR */
+    { "SRV lookup did not return any new server "}, /* ERR_SRV_DUPLICATES */
     { "Dynamic DNS update failed" }, /* ERR_DYNDNS_FAILED */
     { "Dynamic DNS update timed out" }, /* ERR_DYNDNS_TIMEOUT */
     { "Dynamic DNS update not possible while offline" }, /* ERR_DYNDNS_OFFLINE */
diff --git a/src/util/util_errors.h b/src/util/util_errors.h
index 65d37aedb544bb303d7540fc59e1a802aee11898..ca44728234a24a1b234224811713d9b7d1f38ce6 100644
--- a/src/util/util_errors.h
+++ b/src/util/util_errors.h
@@ -66,6 +66,7 @@ enum sssd_errors {
     ERR_ACCESS_DENIED,
     ERR_SRV_NOT_FOUND,
     ERR_SRV_LOOKUP_ERROR,
+    ERR_SRV_DUPLICATES,
     ERR_DYNDNS_FAILED,
     ERR_DYNDNS_TIMEOUT,
     ERR_DYNDNS_OFFLINE,
-- 
1.7.11.7

From 4fcbd97eae057f57e4cb658a08ace5ed02363105 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pavel=20B=C5=99ezina?= <pbrez...@redhat.com>
Date: Tue, 18 Jun 2013 12:59:45 +0200
Subject: [PATCH 3/4] collapse_srv_lookup may free the server, make it clear
 from the API

https://fedorahosted.org/sssd/ticket/1947
---
 src/providers/fail_over.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/src/providers/fail_over.c b/src/providers/fail_over.c
index 072ad075757a30e0963f0df0608f063f3f01b563..2669f227d17bf72ab0e7a7741f8b9e403fc2e1db 100644
--- a/src/providers/fail_over.c
+++ b/src/providers/fail_over.c
@@ -216,10 +216,11 @@ int fo_is_srv_lookup(struct fo_server *s)
 }
 
 static struct fo_server *
-collapse_srv_lookup(struct fo_server *server)
+collapse_srv_lookup(struct fo_server **_server)
 {
-    struct fo_server *tmp, *meta;
+    struct fo_server *tmp, *meta, *server;
 
+    server = *_server;
     meta = server->srv_data->meta;
     DEBUG(4, ("Need to refresh SRV lookup for domain %s\n",
               meta->srv_data->dns_domain));
@@ -252,6 +253,8 @@ collapse_srv_lookup(struct fo_server *server)
     meta->srv_data->srv_lookup_status = SRV_NEUTRAL;
     meta->srv_data->last_status_change.tv_sec = 0;
 
+    *_server = NULL;
+
     return meta;
 }
 
@@ -1189,7 +1192,7 @@ resolve_srv_send(TALLOC_CTX *mem_ctx, struct tevent_context *ev,
           str_srv_data_status(status)));
     switch(status) {
     case SRV_EXPIRED: /* Need a refresh */
-        state->meta = collapse_srv_lookup(server);
+        state->meta = collapse_srv_lookup(&server);
         /* FALLTHROUGH.
          * "server" might be invalid now if the SRV
          * query collapsed
@@ -1202,9 +1205,9 @@ resolve_srv_send(TALLOC_CTX *mem_ctx, struct tevent_context *ev,
         }
 
         subreq = ctx->srv_send_fn(state, ev,
-                                  server->srv_data->srv,
-                                  server->srv_data->proto,
-                                  server->srv_data->discovery_domain,
+                                  state->meta->srv_data->srv,
+                                  state->meta->srv_data->proto,
+                                  state->meta->srv_data->discovery_domain,
                                   ctx->srv_pvt);
 
         tevent_req_set_callback(subreq, resolve_srv_done, req);
-- 
1.7.11.7

From 4c5daa081e693a45846f5395fe66ab67c9773995 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Pavel=20B=C5=99ezina?= <pbrez...@redhat.com>
Date: Tue, 18 Jun 2013 13:01:27 +0200
Subject: [PATCH 4/4] failover: if expanded server is marked as neutral,
 invoke srv collapse

https://fedorahosted.org/sssd/ticket/1947

Otherwise we will do the SRV expansion once again:
1. leaving the old servers in server list
2. meta server is not inserted back in the list, the newly found
   servers are inserted behind meta server, meta server is orphaned
   and the new servers are forgotten
---
 src/providers/fail_over.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/providers/fail_over.c b/src/providers/fail_over.c
index 2669f227d17bf72ab0e7a7741f8b9e403fc2e1db..ef4cc22661947fbe725ca2680a390ad5258a0bb7 100644
--- a/src/providers/fail_over.c
+++ b/src/providers/fail_over.c
@@ -1198,6 +1198,13 @@ resolve_srv_send(TALLOC_CTX *mem_ctx, struct tevent_context *ev,
          * query collapsed
          * */
     case SRV_NEUTRAL: /* Request SRV lookup */
+        if (server != NULL && server != state->meta) {
+            /* A server created by expansion of meta server was marked as
+             * neutral. We have to collapse the servers and issue new
+             * SRV resolution. */
+            state->meta = collapse_srv_lookup(&server);
+        }
+
         if (ctx->srv_send_fn == NULL || ctx->srv_recv_fn == NULL) {
             DEBUG(SSSDBG_OP_FAILURE, ("No SRV lookup plugin is set\n"));
             ret = ENOTSUP;
-- 
1.7.11.7

_______________________________________________
sssd-devel mailing list
sssd-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/sssd-devel

Reply via email to