Re: Replication using totem protocol

lichtner Mon, 16 Jan 2006 15:37:35 -0800

On Mon, 16 Jan 2006, Rajith Attapattu wrote:

> This is a very educational thread, maybe Jules can incorporate some of the
> ideas into your document on clustering.


Let's hope the thread also eventually translates into working code :)

> >1. The user should configure a minimum-degree-of-replication R. This is
> >the number of replicas of a specific session which need to be available in
> >order for an HTTP request to be serviced.
>
> 1.) How do u figure out the most efficient value for R?

I am not sure what you mean by efficient. If you mean that it maximizes
availability, I have seen a derivation in this book:

"Fault Tolerance in Distributed Systems"
Pankaj Jalote, 1994
Chapter 7, Section 5, "Degree of Replication"

He shows that the availability _as_ a function of the number of replicas
goes up and then down again, basically because more replicas defend
against failures but require more housekeeping, and the resources used to
do housekeeping cannot be used for servicing transactions.
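
To make the "up and then down" shape concrete, here is a toy model in Python. It is not Jalote's derivation. It assumes each replica is lost independently with probability p_fail and that every extra replica eats a fixed fraction of capacity (overhead) for housekeeping. Both parameters and the function names are made up for illustration.

```python
def toy_availability(r: int, p_fail: float = 0.1, overhead: float = 0.01) -> float:
    """Toy availability for r replicas (illustrative assumption, not a real model)."""
    # Probability that at least one of the r replicas survives.
    survive = 1.0 - p_fail ** r
    # Fraction of capacity left after housekeeping for r replicas.
    useful = max(0.0, 1.0 - overhead * r)
    return survive * useful


def best_degree(max_r: int = 10, **params: float) -> int:
    """Return the r in 1..max_r that maximizes the toy availability."""
    return max(range(1, max_r + 1), key=lambda r: toy_availability(r, **params))


if __name__ == "__main__":
    for r in range(1, 6):
        print(r, round(toy_availability(r), 5))
    print("best R under these assumptions:", best_degree())
```

With these made-up numbers the curve peaks at a small R and then declines, which is the qualitative behavior described above. Real numbers would have to come from measurement.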

I believe it is very difficult to compute availability analytically, and
that the majority of downtime would not be due to hardware failures. It's
probably 1) power failures and 2) software failures. I think Pfister talks
about the various causes of downtime in his book.

> I assume when R increases, network chatter increases at a magnitude of X, and
> X depends on whether it's a multicast protocol or 1->1 (first of all is this
> assumption correct ???).

I think for this thread we were assuming reliable multicast. See also
the thread about infiniband, which completely changes the calculus because
of the lack of context switching - that would be closer to just using a
symmetric multiprocessor.

> And when R reduces the chances of a request hitting a server where the
> session is not replicated is high.

That doesn't matter. When the request hits a server where the session is
not replicated you send a redirect - the system is available, but perhaps
the latency for that particular request is larger than for others.
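
A minimal sketch of that redirect decision, in Python. The replica map, the node names, the URL layout and the 503 fallback when no replica is known are all assumptions for illustration, not part of any existing implementation.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class Response:
    status: int
    location: str = ""


def handle_request(node: str, session_id: str,
                   replicas: Dict[str, Set[str]]) -> Response:
    """Serve locally if this node holds a replica, otherwise redirect."""
    holders = replicas.get(session_id, set())
    if node in holders:
        # This node has a replica: service the request here.
        return Response(200)
    if holders:
        # Pick any holder deterministically and send a client redirect.
        target = sorted(holders)[0]
        return Response(302, location=f"http://{target}/")
    # Assumption: with no known replica we report "unavailable".
    return Response(503)


if __name__ == "__main__":
    table = {"49030": {"N2", "N3"}}
    print(handle_request("N2", "49030", table))
    print(handle_request("N1", "49030", table))
```

The redirected request costs one extra round trip, which is the added latency mentioned above.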

> So the sweet spot is a balance between the above two factors ??? or have I missed
> any other critical factor(s) ??

See reference above.

> 2.) When you say minimum-degree-of-replication it implies to me a floor?? is
> there like a ceiling value like maximum-degree-of-replication?? I guess we
> don't want the session to grow beyond a point.

Yes. See above. Availability goes down past a certain value of R.

> >2. When an HTTP request arrives, if the cluster which received it does not
> >have R copies then it blocks (it waits until there are.) This should work in
> >data centers because partitions are likely to be very short-lived (aka
> >virtual partitions, which are due to congestion, not to any hardware
> >issue.)
>
> 1) Can u pls elaborate a bit more on this, didn't really understand it, when
> u said wait until, does it mean
>     a) wait till there are R no of replicas in the cluster?

Any time there is a change in the composition of the cluster, it
must review its global state and if necessary arrange for new session
replicas to be installed in some nodes, for replicas to be migrated, or
for replicas to be deleted. For example, if R=3 and replica no. 2 of
session 49030 was on node N7 which just bowed out, the cluster might
decide to install a replica of session 49030 on node N3.

Rearranging replicas, aka state-transfer, takes time. While that happens
you block new http requests for the relevant sessions.
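
The repair step after a membership change could look roughly like this in Python. It only covers installing missing replicas (not migration or deletion of surplus ones), the least-loaded placement rule is my assumption, and the nodes N1 and N5 in the usage example are invented to go with the N7/N3 example above.

```python
from typing import Dict, List, Set, Tuple


def repair_replicas(placement: Dict[str, Set[str]], live_nodes: List[str],
                    r: int) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Return (new placement, sessions to block while state transfer runs)."""
    live = set(live_nodes)
    # Forget replicas that lived on nodes which left the cluster.
    new_placement = {sid: holders & live for sid, holders in placement.items()}
    # Count replicas per live node so new copies go to the least loaded one.
    load = {n: 0 for n in live_nodes}
    for holders in new_placement.values():
        for n in holders:
            load[n] += 1
    blocked: Set[str] = set()
    for sid in sorted(new_placement):
        holders = new_placement[sid]
        if not holders:
            continue  # no surviving copy to transfer state from
        while len(holders) < r:
            candidates = [n for n in live_nodes if n not in holders]
            if not candidates:
                break  # fewer than r live nodes; cannot reach r
            target = min(candidates, key=lambda n: (load[n], n))
            holders.add(target)
            load[target] += 1
            blocked.add(sid)  # requests wait until the transfer is done
    return new_placement, blocked


if __name__ == "__main__":
    before = {"49030": {"N1", "N5", "N7"}}
    after, blocked = repair_replicas(before, ["N1", "N3", "N5"], r=3)
    print(after, blocked)
```

Requests for the sessions in the returned blocked set would be held until their state transfer completes; all other sessions keep being served.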

>     b) or until a session is replicated within the server the http request
> is received?

No. See above. Although when rearranging replicas you have some freedom
and you are free to give priority to some nodes over others.

> 2) when u said virtual partition did u mean a sub set of nodes
> being isolated due to congestion.

Yes.

> By isolation I meant they are not able to
> replicate their sessions or receive replications from sessions from other
> nodes outside of the subset due to congestion. Is this correct??

It's also possible that all nodes are up to date on a given session, and
the virtual partition heals before the user tries to update the session
again.

A partition occurs when nodes 1 and 2 agree with each other that nodes 3
and 4 are no longer around and install a new group, a.k.a. "view", a.k.a.
"configuration".

But 3 and 4 may appear again soon after (e.g. 5 seconds) and so the
partition may end up having few consequences if any.

> 3) Assuming an HTTP request arrives and the cluster does not have R copies.
> How different is this situation from "an HTTP request arrives but no session
> replication in that server" ??
>
> >3. If at any time an HTTP request reaches a server which does not itself have a
> >replica of the session it sends a client redirect to a node which does.
> How can this be achieved?? Is it by having a central coordinator that handles
> a mapping or is this information replicated in all nodes on the entire
> cluster.
>
> information == "which clusters have replicas of each session"
>
> The point below gave me the impression that some of inventory has to be
> maintained centrally or cluster-wide (ideally in case controller dies).

You could use R replicas here also.

> >4. When a new cluster is formed (with nodes coming or going), it takes an
> >inventory of all the sessions and their version numbers. Sessions which do
> >not have the necessary degree of replication need to be fixed, which will
> >require some state transfer, and possibly migration of some session for
> >proper load balancing.
>
> Again how does the replication healing/shedding work. Assuming nodes die or
> come back carrying their state
> how does the cluster decide on adding or removing sessions to maintain the
> optimal R value.
> Where does the brain/logic for this sit?? Ideally distributable in case the
> controller dies.

You can design either distributed or centralized.

> General comments/questions
> -------------------------------------------
>
> 1. How much do the current impls like WADI, ACluster and ASpace address
> those above concerns?

The part about the organization of the state and state transfer has to be
coded. I think those tools are agnostic as far as application state goes.

> 2.) What aspects of the above concerns can be addressed with totem better
> than other protocols?

I don't see totem addressing state transfer. It does provide membership
and very well-behaved reliable multicast. Most importantly, since messages
are totally ordered it makes it much easier. Although in the case of
session replication there is no data sharing, so ordering is not as
critical.

> 3. Can SEDA like architecture solve the problem of deciding the value of R
> dynamically runtime from time to time based on load and network latency?? I
> guess the network latency can be measured with some metrics around token
> passing or something like that.

The value of R depends on how available you want your system to be.

I know what SEDA does but I don't see its relevance here, except to say
that if the application is based on SEDA you will get a better-behaved
application (and spend a lot of time coding it.)
