On Mon, 2009-11-16 at 17:08 +0100, Martin Nagy wrote:
> Simo Sorce wrote:
> > On Mon, 2009-11-16 at 08:46 +0100, Martin Nagy wrote:
> > > Simo Sorce wrote:
> > > > While working on a patch to use failover in the ldap driver I found a
> > > > few bugs and a few things I felt missing.
> > > > 
> > > > Attached a patch to fix bugs and add a function to get back a server
> > > > name from a fo_server structure.
> > > > 
> > > > Simo.
> > > > 
> > > > P.S: if you are interested in the failover patch I have a working but
> > > > unfinished patch here:
> > > > http://fedorapeople.org/gitweb?p=simo/public_git/sssd.git;a=commit;h=29830482f5e44b5425fb91f82fd5a4ee692f3ae2
> > > 
> > > Just one thing. I was counting on the fact that the user of the fail
> > > over will make sure that he doesn't try to create the same service
> > > twice and doing so will mean a bug in the code. Looking at your patch
> > > and having done a very quick look at your branch, it seems like you
> > > count for this behavior (which I think is fairly reasonable). In that
> > > case, could you also please change this debugging statement?
> > > 
> > > DEBUG(1, ("Service %s already exists\n", name));
> > > 
> > > Please either completely remove it, or (IMHO better) change the debug
> > > level.
> > 
> > I will probably raise the debug level, but it's not a big deal for now.
> > 
> > > Regarding the ares error code, I'm not 100% sure what's the right
> > > thing to do. Perhaps we should use a constant, like EIO and count on
> > > the logging statement few lines up to be sufficient for the user. Any
> > > other ideas?
> > 
> > Not sure yet, what we need from callers are 3 type of errors.
> > 1. an error that tells us that no server is resolvable/available (to
> > avoid looping forever)
> > 2. an error that tells us the current server is not resolvable/available
> > 3. any other error should mean something really bad happened and we
> > should abort.
> > 
> > We miss n.2 for now so I am treating any error except ENOENT to mean the
> > current server is not resolvable and we need to try the next one.
> > 
> > Simo.
> 
> OK. Nothing actually wrong with the patch, so ack.

Ok I found another bug, so I am re-spinning the patch.


After the SERVER_NOT_WORKING timeout for a server was expired, we were
setting the status to SERVER_NAME_RESOLVED

This is wrong for 2 reasons:
1. we may have marked the server as SERVER_NOT_WORKING because we could
not resolve the name, this means the name was not resolved at all, by
setting it as SERVER_NAME_RESOLVED we have no way to force a new lookup,
we will basically never be able to lookup the server again.

2. If a sever becomes unavailable because the server has been moved, we
would never have a chance to resolve it to the new address again because
setting the server to SERVER_NOT_WORKING status wouldn't work (see point
1).

Even if the server was correctly resolved and SERVER_NOT_WORKING was set
because of other reasons (a service stopped working for example), we
want to try to resolve the name again just in case.

Simo.

-- 
Simo Sorce * Red Hat, Inc * New York
>From 732cf6b816fe4fa0b23d8092470e3711cb375872 Mon Sep 17 00:00:00 2001
From: Simo Sorce <sso...@redhat.com>
Date: Sun, 15 Nov 2009 18:07:36 -0500
Subject: [PATCH 1/2] Failover fixes and additions

---
 server/providers/fail_over.c |   28 ++++++++++++++++++++++------
 server/providers/fail_over.h |    2 ++
 server/resolv/async_resolv.c |    2 +-
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/server/providers/fail_over.c b/server/providers/fail_over.c
index fa599e3..1b9e017 100644
--- a/server/providers/fail_over.c
+++ b/server/providers/fail_over.c
@@ -129,8 +129,8 @@ get_server_status(struct fo_server *server)
     if (timeout != 0 && server->common->server_status == SERVER_NOT_WORKING) {
         gettimeofday(&tv, NULL);
         if (STATUS_DIFF(server->common, tv) > timeout) {
-            server->common->server_status = SERVER_NAME_RESOLVED;
-            server->last_status_change.tv_sec = tv.tv_sec;
+            server->common->server_status = SERVER_NAME_NOT_RESOLVED;
+            server->last_status_change.tv_sec = 0;
         }
     }
 
@@ -196,6 +196,9 @@ fo_new_service(struct fo_ctx *ctx, const char *name,
     ret = fo_get_service(ctx, name, &service);
     if (ret == EOK) {
         DEBUG(1, ("Service %s already exists\n", name));
+        if (_service) {
+                *_service = service;
+        }
         return EEXIST;
     } else if (ret != ENOENT) {
         return ret;
@@ -215,7 +218,9 @@ fo_new_service(struct fo_ctx *ctx, const char *name,
     DLIST_ADD(ctx->service_list, service);
 
     talloc_set_destructor(service, service_destructor);
-    *_service = service;
+    if (_service) {
+        *_service = service;
+    }
 
     return EOK;
 }
@@ -227,7 +232,7 @@ fo_get_service(struct fo_ctx *ctx, const char *name,
     struct fo_service *service;
 
     DLIST_FOR_EACH(service, ctx->service_list) {
-        if (!strcasecmp(name, service->name)) {
+        if (!strcmp(name, service->name)) {
             *_service = service;
             return EOK;
         }
@@ -476,6 +481,10 @@ fo_resolve_service_done(struct tevent_req *subreq)
     while ((request = common->request_list) != NULL) {
         DLIST_REMOVE(common->request_list, request);
         if (resolv_status) {
+            /* FIXME FIXME: resolv_status is an ARES error.
+             * but any caller will expect classic error codes.
+             * also the send() function may return ENOENT, so this mix
+             * IS explosive (ENOENT = 2 = ARES_EFORMER) */
             tevent_req_error(request->req, resolv_status);
         } else {
             tevent_req_done(request->req);
@@ -490,11 +499,13 @@ fo_resolve_service_recv(struct tevent_req *req, struct fo_server **server)
 
     state = tevent_req_data(req, struct resolve_service_state);
 
-    TEVENT_REQ_RETURN_ON_ERROR(req);
-
+    /* always return the server if asked for, otherwise the caller
+     * cannot mark it as faulty in case we return an error */
     if (server)
         *server = state->server;
 
+    TEVENT_REQ_RETURN_ON_ERROR(req);
+
     return EOK;
 }
 
@@ -532,6 +543,11 @@ fo_get_server_port(struct fo_server *server)
     return server->port;
 }
 
+const char *fo_get_server_name(struct fo_server *server)
+{
+    return server->common->name;
+}
+
 struct hostent *
 fo_get_server_hostent(struct fo_server *server)
 {
diff --git a/server/providers/fail_over.h b/server/providers/fail_over.h
index b6e2f2f..5fa9ff0 100644
--- a/server/providers/fail_over.h
+++ b/server/providers/fail_over.h
@@ -101,6 +101,8 @@ void *fo_get_server_user_data(struct fo_server *server);
 
 int fo_get_server_port(struct fo_server *server);
 
+const char *fo_get_server_name(struct fo_server *server);
+
 struct hostent *fo_get_server_hostent(struct fo_server *server);
 
 #endif /* !__FAIL_OVER_H__ */
diff --git a/server/resolv/async_resolv.c b/server/resolv/async_resolv.c
index c05461d..6ac5e41 100644
--- a/server/resolv/async_resolv.c
+++ b/server/resolv/async_resolv.c
@@ -78,7 +78,7 @@ return_code(int ares_code)
         return ENOMEM;
     case ARES_EFILE:
     default:
-        return -1;
+        return EIO;
     }
 }
 
-- 
1.6.2.5

_______________________________________________
sssd-devel mailing list
sssd-devel@lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/sssd-devel

Reply via email to