> can you please consult with Dominik Klein, who (unaware of your effort)
> incidentally also wrote an RA for conntrackd and posted it on this list
> a few days ago? It would be nice if the two of you could consolidate
> your efforts and come up with an updated patch. Maybe you guys could get
> together in #linux-ha on freenode one of these days?

I rather strongly think that the RA posted here is broken.

Michael and Jonathan suggested simply cloning the resource so that it
runs on multiple nodes at the same time. Running it on all nodes is
certainly necessary, because otherwise connection states would not be
synchronized. But if that RA only commits and flushes the external
cache during _start_, how is the remaining node supposed to keep
connections alive that were established through the failed machine? It
has already been started, so it will never commit its external cache.

Maybe I just don't understand it, but isn't that broken?

The failback case would not work with that RA either:

On 10/15/2010 01:43 PM, Dominik Klein wrote:
> The main challenge in this RA was the failback part. Say one system goes
> down completely. Then it loses the kernel connection tracking table and
> the external cache. Once it comes back, it will receive updates for new
> connections that are initiated through the master, but it will neither
> be sent the complete tracking table of the current master, nor can it
> request this (that's how I understand and tested conntrackd works,
> please correct me if I'm wrong :)).
>
> This may be acceptable for short-lived connections and configurations
> where there is no preferred master system, but it does become a problem
> if you have either of those.
>
> So my approach is to send a so called "bulk update" in two situations:
>
> a) in the notify pre promote call, if the local machine is not the
> machine to be promoted
> This part is responsible for sending the update to a preferred master
> that had previously failed (failback).
> b) in the notify post start call, if the local machine is the master
> This part is responsible for sending the update to a previously failed
> machine that re-joins the cluster but is not to be promoted right away.
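
To make the two cases a) and b) concrete, here is a minimal sketch of
the decision only, written in Python for readability rather than as the
shell code of the RA itself. The function name and its parameters are
hypothetical. In a real agent the values would presumably come from the
notify environment that Pacemaker passes in (for example
OCF_RESKEY_CRM_meta_notify_type and OCF_RESKEY_CRM_meta_notify_operation),
and the bulk update itself would be triggered separately (I assume via
"conntrackd -B").

```python
def needs_bulk_update(notify_type, notify_operation,
                      local_node, promote_nodes, master_nodes):
    """Decide whether the local node should send a bulk update.

    Hypothetical helper: all names here are illustrative and are not
    taken from the posted resource agent.
    """
    # a) pre-promote: every node that is NOT about to be promoted sends
    #    its table, so a preferred master that failed earlier gets the
    #    full state before it takes over (failback).
    if notify_type == "pre" and notify_operation == "promote":
        return local_node not in promote_nodes

    # b) post-start: the current master sends its table to a node that
    #    has just re-joined but is not promoted right away.
    if notify_type == "post" and notify_operation == "start":
        return local_node in master_nodes

    # Any other notification does not trigger a bulk update.
    return False


if __name__ == "__main__":
    # fw2 is the current master, fw1 is the preferred master coming back.
    print(needs_bulk_update("pre", "promote", "fw2", ["fw1"], ["fw2"]))
    print(needs_bulk_update("post", "start", "fw2", [], ["fw2"]))
```

Both calls in the example evaluate to True for fw2: in the first it is
not the node being promoted, in the second it is the master while
another node starts.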

Regards
Dominik
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/