Vivek Biswas wrote:

Use of "Internal Network Traffic" for auto-partitioning will give intelligence to
load balancer to make dynamic decision which otherwise would not been possible.
This will make system really flexible and not unnecessarily replicate all
the session states to all the server but few servers for backup.

Vivek - as I pointed out in my last email - you're making some big assumptions here about the load balancer. How would you integrate this feature with mod_jk or BigIP?


Partitioning of state is tried and trusted technology - the tricky bit is doing it so that you move as little state around as possible and integrate with the major players in the load-balancing market.



Jules

~Vivek Biswas

Jules Gosnell <[EMAIL PROTECTED]> wrote:

    interesting - I'll digest it fully over the weekend...

    I spotted a couple of points on the first read though....

    - I don't think that there is very much difference between my and
    your
    suggestions for auto-partitioning a cluster - it may well be
    possible to
    make this a pluggable strategy, so that different criteria such as
    'Internal Network traffic' could be used to make decisions.
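
To make the 'pluggable strategy' idea concrete, here is a minimal sketch; none of these types exist in Geronimo, the names are purely illustrative:

// Hypothetical sketch of a pluggable auto-partitioning strategy - none of
// these types exist in Geronimo today; the names are illustrative only.
public interface PartitionStrategy {

    /**
     * Given the current node count (and whatever metrics this strategy
     * watches), decide how many backups each session should keep.
     */
    int desiredBackupCount(int nodeCount);
}

/** One possible criterion: observed internal network traffic. */
class InternalNetworkTrafficStrategy implements PartitionStrategy {

    private final long bytesPerSecondThreshold;
    private volatile long observedBytesPerSecond; // fed by some monitoring hook

    InternalNetworkTrafficStrategy(long bytesPerSecondThreshold) {
        this.bytesPerSecondThreshold = bytesPerSecondThreshold;
    }

    void reportTraffic(long bytesPerSecond) {
        this.observedBytesPerSecond = bytesPerSecond;
    }

    public int desiredBackupCount(int nodeCount) {
        // Back up to everyone while internal traffic is low; shrink towards a
        // single secondary as traffic approaches the threshold.
        if (observedBytesPerSecond < bytesPerSecondThreshold / 2) {
            return nodeCount - 1;
        }
        return Math.max(1, (nodeCount - 1) / 2);
    }
}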

    - If at all possible, the design should work with existing
    load-balancers, such as the Cisco and F5 product lines, as well as
    Open
    Source offerings such as mod_jk and mod_backhand. People may already
    have well established architectures that they wish to fit Geronimo
    into.
    A Geronimo-aware lb may be able to make optimisations that another lb
    will not see, but the cluster should still perform reasonably with
    other
    lbs.

    - Session migration will be expensive and should be avoided when
    possible. In cases where it is necessary, e.g. where [multiple] node
    failures may leave a node carrying more state than it can handle, it
    should probably progress as a low-priority background task, so that the
    cluster is not suddenly flooded with messages which simply encourage
    more failures etc..., and already stressed nodes are not further pegged
    out by the cost of [de]marshalling 1000s of sessions - this could
    quickly lead to a domino effect bringing down the whole cluster
    (perhaps we could ringfence 'super-partitions'?).
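
As a rough illustration of the 'low-priority background task' idea, something like the following might do; the SessionMigrator callback and the throttling numbers are invented:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of draining a backlog of sessions to migrate as a low-priority,
// rate-limited background task, so an already stressed node is not asked to
// [de]marshal thousands of sessions at once. All names are illustrative.
public class BackgroundMigrationTask implements Runnable {

    public interface SessionMigrator {
        void migrate(String sessionId); // does the actual marshal/transfer
    }

    private final Queue<String> pending = new ConcurrentLinkedQueue<String>();
    private final SessionMigrator migrator;
    private final int sessionsPerBatch;
    private final long pauseBetweenBatchesMillis;

    public BackgroundMigrationTask(SessionMigrator migrator,
                                   int sessionsPerBatch,
                                   long pauseBetweenBatchesMillis) {
        this.migrator = migrator;
        this.sessionsPerBatch = sessionsPerBatch;
        this.pauseBetweenBatchesMillis = pauseBetweenBatchesMillis;
    }

    public void enqueue(String sessionId) {
        pending.offer(sessionId);
    }

    public void run() {
        Thread.currentThread().setPriority(Thread.MIN_PRIORITY);
        while (!Thread.currentThread().isInterrupted()) {
            // move at most a small batch, then pause, so migration never
            // floods the network or the CPU of a stressed node
            for (int i = 0; i < sessionsPerBatch; i++) {
                String id = pending.poll();
                if (id == null) break;
                migrator.migrate(id);
            }
            try {
                Thread.sleep(pauseBetweenBatchesMillis); // throttle
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}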

    Whatever the final design, it needs to be as simple and as robust as it
    possibly can be. Hybrid solutions, which take advantage of the strengths
    of different approaches in exchange for a slight increase in complexity,
    are often useful (hence my replication AND shared store approach).

    Whatever replication medium or shared store implementation we use,
    someone will always come along and want to use something else -
    JGroups,
    JMS, DB, JavaSpaces, NFS etc..., so these all need to be pluggable.
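
A sketch of what the pluggable store contract might boil down to; the interface and method names are invented, and a JGroups-, JMS-, JDBC-, JavaSpaces- or NFS-backed implementation would then be chosen in configuration without touching the session manager itself:

import java.util.Map;

// Illustrative SPI only - the interface and method names are invented.
public interface SessionStateStore {

    void store(String sessionId, Map<String, Object> attributes);

    Map<String, Object> load(String sessionId);

    void remove(String sessionId);
}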


I will try to find the time over the weekend to write up my suggestion for auto-partitioning/healing and send it to the list. Then we can compare notes :-)

    Jules


Bhagwat, Hrishikesh wrote:

    >Attached is the PROTOCOL that can help us implement
    >clustering without a DB. The PROTOCOL uses the concept
    >of dynamic partitioning (inspired by Jules Gosnell's
    >original idea). Also, though it is not mentioned explicitly
    >in the document, many of the functions will operate using
    >JMX (thanks to Vivek Biswas :-) for enlightening me on this).
    >
    >Summary :
    >The protocol makes the system "dynamically change itself"
    >from a Jetty-like topology where ALL SERVERS BACK UP
    >EACH OTHER to a simple PRIMARY-SECONDARY topology
    >that contains only one backup per server. Thus as LOAD
    >increases, partitions of ever-diminishing sizes are created.
    >Since back-ups happen only within a partition the system
    >can scale well (also administrators can specify
    >MAX_SERVERS_IN_CLUSTER and MIN_SERVERS_IN_CLUSTER).
    >
    >The document is divided into
    >
    >1. Concept
    >2. Case of Failover
    >3. Case of Node Added
    >4. Case of excessive load on a Server
    >
    >While the attachment has topics 1 and 2, I am still writing the
    >others. I thought that topic 2 was particularly complex, so I thought
    >of putting it up on the mailing-list for all to review and
    >digest it until the relatively simple (I guess) topics 3 and 4 are ready.
    >
    >thanks
    >-hb
    >
    >
    >
    >
    >-----Original Message-----
    >From: Jules Gosnell [mailto:[EMAIL PROTECTED]
    >Sent: Thursday, October 23, 2003 11:55 PM
    >To: Bhagwat, Hrishikesh; Geronimo Developers List
    >Cc: Biswas, Vivek
    >Subject: Re: [Re] Web Clustering : Stick Sessions with Shared Store -
    >current state of play.
    >
    >
    >I agree on needing to be able to run without a db.
    >
    >If you think about it, a db is just a highly specialised node in the
    >cluster. It is much simpler, and therefore more maintainable, if all
    >nodes in a Geronimo cluster are homogeneous. We can't lose the db your
    >business data lives in, but if we can avoid adding another for session
    >replication it might be an advantage.
    >
    >To this end, my design should work without a db. You just tune it so
    >that passivation never occurs - i.e. unlimited active sessions. You
    >trade off in-vm space against db space.
    >
    >Likewise, if you were an advocate of the 'shared store' approach, you
    >should be able to constrain the web container to keep '0' active
    >sessions in memory and passivate everything immediately to the db.
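
In other words, the two extremes are just two settings of the same knobs. A hypothetical illustration (the class and field names below are invented, they are not Geronimo APIs):

// Hypothetical illustration of the two extremes as settings of the same knobs.
public class SessionManagerTuning {

    public static class SessionManagerConfig {
        final int maxActiveSessions;          // sessions kept live in the JVM
        final boolean passivateToSharedStore; // spill idle/surplus sessions?

        SessionManagerConfig(int maxActiveSessions, boolean passivateToSharedStore) {
            this.maxActiveSessions = maxActiveSessions;
            this.passivateToSharedStore = passivateToSharedStore;
        }
    }

    // "No DB" mode: unlimited active sessions, so passivation never occurs.
    static final SessionManagerConfig REPLICATION_ONLY =
            new SessionManagerConfig(Integer.MAX_VALUE, false);

    // "Shared store" mode: 0 active sessions, everything passivated at once.
    static final SessionManagerConfig SHARED_STORE_ONLY =
            new SessionManagerConfig(0, true);
}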
    >
    >So, yes, I'm sure we are on the same page.
    >
    >
    >Jules
    >
    >
    >
    >
    >Bhagwat, Hrishikesh wrote:
    >
    >
    >
    >>hi Jules,
    >>
    >>I am happy that we are converging, in that the following approach
    >>covers very well some of the main goals that I was trying to achieve
    >>when I wrote my first proposal. I have marked those points below as
    >>[hb-X] with brief comments.
    >>
    >>Thus, keeping the "Goal of the Architecture" the same, I would like to
    >>propose a new scheme for "replication". While initially I was keen on
    >>having NO DATA EXCHANGE BETWEEN SERVERS, but rather on having them all
    >>just use the shared store to persist and retrieve session data, you
    >>were interested in the contrary. Your approach (mentioned in section 2
    >>of http://wiki.codehaus.org/geronimo/Architecture/Clustering) is about
    >>Session object exchange between servers which are clustered and about
    >>clusters that are partitioned (statically/dynamically). I think the
    >>discussion there stops with some issues that you think may arise with
    >>that kind of scheme.
    >>
    >>Since Vivek Biswas first pointed it out to me, I have been feeling
    >>increasingly uncomfortable with the idea of Geronimo being DEPENDENT
    >>on an EXTERNAL system like a database for implementing its web
    >>clustering. Some people on the mailing list have spoken about
    >>performance issues with the DB approach, but I think that with
    >>techniques like async-write-to-DB such problems can be circumvented.
    >>Thus, though I still believe that it is a solution that can solve the
    >>problem, I am not comfortable with the use of an external system to
    >>aid in clustering. I don't know of any GOOD (hi-av) DBs that come for
    >>FREE. This means that even if Geronimo is available for FREE, it can't
    >>be used without purchasing a DB. The solution as a whole is then not
    >>truly FREE of cost.
    >>
    >>I have since then been looking at many alternatives ... from the
    >>Jetty-style solution of having all machines in one cluster to a
    >>1-primary:1-secondary solution.
    >>
    >>Presently I have something cooking. As I get closer to working on the
    >>details I find it quite similar to your original solution of dynamic
    >>partitioning. This is what again convinces me that we seem to
    >>converge. My proposal is to exchange notes and to jointly come up
    >>with a detailed design.
    >>
    >>Do let me know your thoughts on that ... I am trying to complete a
    >>document on this new scheme ASAP and will mail it to you.
    >>
    >>thanks
    >>- hb
    >>
    >>
    >>
    >>-----Original Message-----
    >>From: Jules Gosnell [mailto:[EMAIL PROTECTED]
    >>Sent: Thursday, October 23, 2003 3:44 AM
    >>To: [EMAIL PROTECTED]
    >>Subject: Re: [Re] Web Clustering : Stick Sessions with Shared
    Store -
    >>current state of play.
    >>
    >>
    >>Guys,
    >>
    >>since this topic has come up again, I thought this would be a useful
    >>point to braindump my current ideas for comment and as a common point
    >>of reference...
    >>
    >>Here goes :
    >>
    >>
    >>
    >>Each session has one 'primary' node and n 'replicant' nodes
    associated
    >>with it.
    >>
    >>Sticky load-balancing is a hard requirement.
    >>
    >>Changes to the session may only occur on the primary node.
    >>
    >>[hb-1] I always intended only the OWNER NODE (remember ... I wasn't
    >>keen on the Primary/Sec. concept, so I am using this term) to make
    >>changes to the SESSION. Also, only ON CHANGE would the session be
    >>sent for persistence.
    >>
    >>
    >>Such changes are then replicated (possibly asynchronously, depending
    >>on data integrity requirements) to the replicant nodes.
    >>
    >>[hb-2] My approach was about NOT USING REPLICATION AT ALL but using
    >>a shared data store. Though I never mentioned it explicitly (because I
    >>regarded this as a finer detail) ... I always intended to have an
    >>async system do the persistence.
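
Either way, the 'writes only on the primary/owner node, then propagated (possibly asynchronously)' idea could be sketched roughly as follows; every type and method name below is invented for illustration:

import java.io.Serializable;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: a session whose writes are accepted solely on its primary node
// and pushed asynchronously to its replicants (or to a shared store).
public class ReplicatedSession {

    public interface Replicator {
        void push(String nodeId, String sessionId, String key, Serializable value);
    }

    private final String sessionId;
    private final String primaryNodeId;
    private final String localNodeId;
    private final List<String> replicantNodeIds;
    private final Replicator replicator;
    private final ExecutorService replicationPump = Executors.newSingleThreadExecutor();

    public ReplicatedSession(String sessionId, String primaryNodeId, String localNodeId,
                             List<String> replicantNodeIds, Replicator replicator) {
        this.sessionId = sessionId;
        this.primaryNodeId = primaryNodeId;
        this.localNodeId = localNodeId;
        this.replicantNodeIds = replicantNodeIds;
        this.replicator = replicator;
    }

    public void setAttribute(final String key, final Serializable value) {
        if (!localNodeId.equals(primaryNodeId)) {
            throw new IllegalStateException("writes only allowed on primary " + primaryNodeId);
        }
        // ...apply the change locally, then propagate it asynchronously...
        for (final String replicant : replicantNodeIds) {
            replicationPump.submit(new Runnable() {
                public void run() {
                    replicator.push(replicant, sessionId, key, value);
                }
            });
        }
    }
}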
    >>
    >>If, for any reason, a session is required to 'migrate' to
    another node
    >>(fail-over or clusterwide-state-balancing), this 'target' node
    makes a
    >>request to the cluster for this session, the current 'source' node
    >>handshakes and the migration ensues, after which the target node is
    >>promoted to primary status.
    >>
    >>[hb-3] I am not sure how a target server can initiate a migration
    >>process. I would imagine that on a fail-over, on a systematic removal
    >>of a node, or any other such action that requires cluster-wide
    >>state-balancing, it is the lb/Admin Server that would sense this
    >>first and initiate actions.
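
For reference, a target-initiated pull, as described above, might look roughly like this; the Cluster and Node interfaces and their methods are invented for illustration:

// Very rough sketch of the migrate-and-promote exchange described above.
public class MigrationTarget {

    public interface Cluster {
        Node findCurrentPrimary(String sessionId);
    }

    public interface Node {
        /** Source marshals the session, stops serving it, hands the bytes over. */
        byte[] handOver(String sessionId);
    }

    private final Cluster cluster;

    public MigrationTarget(Cluster cluster) {
        this.cluster = cluster;
    }

    /** Pull a session from wherever it currently lives and become its primary. */
    public byte[] migrateHere(String sessionId) {
        Node source = cluster.findCurrentPrimary(sessionId); // ask the cluster
        byte[] state = source.handOver(sessionId);           // handshake + transfer
        // ...unmarshal 'state' locally, register this node as the new primary...
        return state;
    }
}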
    >>
    >>Any inbound request landing on a node that is not primary for the
    >>required session results in a forward/redirect of the request to
    it's
    >>current primary, or a migration of the session to the receiving node
    >>and it's promotion to primary.
    >>
    >>[hb-4] Not sure when this scenario would occur, i.e. when a
    >>"secondary" would receive an HTTP request even though the primary is
    >>functioning well.
    >>
    >>A shared store is used to passivate sessions that have been inactive
    >>for a given period, or are surplus to constraints on a node's
    session
    >>cache size.
    >>
    >>[hb-5] Yes, this is very much a point that I have been making since my
    >>first proposal. Just to quote from that doc: "With a little
    >>intelligence built in, an MS can store away less busy sessions to the
    >>DB and retrieve them when needed, thus offering something that is near
    >>to a virtually unlimited number of sessions" (section 1.1.1.1).
    >>
    >>Once in the shared store, a session is disassociated from its primary
    >>and replicant nodes. Any node in the cluster, receiving a relevant
    >>request, may load the session, become its primary and choose
    >>replicant nodes for it.
    >>
    >>[hb-6] This is a good optimization to sit on top of the scheme
    >>mentioned in [hb-5].
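
A rough sketch of that passivate/reactivate cycle, with all names invented for illustration:

import java.util.Map;

// Once a session is in the shared store it has no primary or replicant
// nodes; whichever node next receives a request for it loads it and takes
// ownership.
public class PassivatingSessionManager {

    public interface SharedStore {
        void passivate(String sessionId, Map<String, Object> state);

        /** Loads the session and removes it from the store. */
        Map<String, Object> activate(String sessionId);
    }

    private final SharedStore store;

    public PassivatingSessionManager(SharedStore store) {
        this.store = store;
    }

    /** Called when a session has been idle too long, or the cache is full. */
    public void passivate(String sessionId, Map<String, Object> state) {
        store.passivate(sessionId, state);
        // ...primary/replicant associations for this session are dropped here...
    }

    /** Called by ANY node that receives a request for a passivated session. */
    public Map<String, Object> activate(String sessionId) {
        Map<String, Object> state = store.activate(sessionId);
        // ...this node now becomes the session's primary and picks replicants...
        return state;
    }
}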
    >>
    >>Correct tuning of this feature, in a situation where frequent
    >>migration is taking place, might cut this cost dramatically.
    >>
    >>
    >>The reason for the hard node-level session affinity requirement
    is to
    >>ensure maximum cache hits in e.g. the business tier. If a web
    session
    >>is interacting with cached resources that are not explicitly tied to
    >>it (and so could be associated with the same replicant nodes), the
    >>only way to ensure that subsequent uses of this session hit
    resources
    >>in these caches is to ensure that these occur on the same node
    as the
    >>cache - i.e. the session's primary node.
    >>
    >>By only having one node that can write to a session, we remove the
    >>possibility of concurrent writes occurring on different nodes
    and the
    >>subsequent complexity of deciding how to merge them.
    >>
    >>[hb-7] I completely agree.
    >>
    >>The above strategy will work for an 'implicit-affinity' lb (e.g.
    BigIP),
    >>which remembers the last node that a session was successfully
    accessed
    >>on and rolls this value forward as and when it has to fail-over to a
    >>new node. We should be able to migrate sessions forward to the next
    >>node picked by the lb, underneath it, keeping the two in sync.
    >>
    >>With an 'explicit-affinity' lb (e.g. mod_jk), where routing info is
    >>actually encoded into the jsessionid/JSESSIONID value (or maybe an
    >>auxiliary path param or cookie), it should be possible, in the
    case of
    >>fail-over, to choose a (probably) replicant node to promote to
    primary
    >>and to stick
    >>requests falling elsewhere to this new primary by resetting this
    >>routing info on their jsessionid/JSESSIONID and
    redirecting/forwarding
    >>them to it.
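
For example, with the mod_jk convention of appending the worker/route name to the session id as "<id>.<route>", re-sticking a request could amount to something like this (illustrative only; the exact format depends on the lb configuration):

// Sketch of resetting mod_jk-style routing info on fail-over: rewrite the
// route suffix of the requested session id and redirect/forward, so the lb
// sticks the client to the newly promoted primary from then on.
public class RouteRewriter {

    /** e.g. "abc123.node1" + "node2" -> "abc123.node2" */
    public static String restick(String requestedSessionId, String newRoute) {
        int dot = requestedSessionId.lastIndexOf('.');
        String bareId = (dot >= 0) ? requestedSessionId.substring(0, dot)
                                   : requestedSessionId;
        return bareId + "." + newRoute;
    }

    public static void main(String[] args) {
        System.out.println(restick("abc123.node1", "node2")); // abc123.node2
    }
}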
    >>
    >>If, in the future, we write/enhance an lb to be Geronimo-aware,
    we can
    >>be even smarter in the case of fail-over and just ask the cluster to
    >>choose a (probably) replicant node to promote to primary and
    then direct
    >>requests
    >>directly to this node.
    >>
    >>The cluster should dynamically inform the lb about joining/leaving
    >>nodes, and sessions should likewise maintain their primary/replicant
    >>lists accordingly.
    >>
    >>[hb-8] I completely agree.
    >>
    >>LBs also need to be kept up to date with the locations and access
    >>points of the various webapps deployed around the cluster, relevant
    >>node and webapp stats (on which to base balancing decisions), etc...
    >>
    >>All of this information should be available to any member of the
    >>cluster and a Geronimo-aware lb should be a full cluster member.
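
The cluster-to-lb channel might reduce to a small listener contract along these lines (the interface and event names are invented):

// Sketch of the cluster -> lb notification channel described above.
public interface ClusterTopologyListener {

    void nodeJoined(String nodeId, String connectorUrl);

    void nodeLeft(String nodeId);

    void webappDeployed(String nodeId, String contextPath);

    void webappUndeployed(String nodeId, String contextPath);
}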
    >>
    >>On shutting down every node in the cluster all session state should
    >>end up in the shared store.
    >>
    >>
    >>These are fairly broad brushstrokes, but they have been placed after
    >>some thought and outline the sort of picture that I would like
    to see.
    >>
    >>Your thoughts ?
    >>
    >>
    >>Jules
    >>
    >>
    >>
    >>
    >>
    >>
    >>Bhagwat, Hrishikesh wrote:
    >>
    >>
    >>
    >>
    >>
    >>>>I am also not convinced it reduces the amount of net traffic.
    After each
    >>>>request the MS must write to the shared store, which is the
    same traffic as
    >>>>a unicast write to another node or a multicast write to the
    partition
    >>>>(discounting the processing power needed to receive the message).
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>I agree. However, this is based on the assumption that only one
    >>>unicast write is required. In other words, this is a primary/secondary
    >>>topology. I think that hb did not intend such a topology, hence his
    >>>statement.
    >>>
    >>>[hb] Yes, I was not assuming a Pri/Sec design but a layout where any
    >>>active server can be requested to pick up a client request which was
    >>>destined for a server that has just failed.
    >>>
    >>>-----Original Message-----
    >>>From: gianny DAMOUR [mailto:[EMAIL PROTECTED]
    >>>Sent: Sunday, October 19, 2003 7:35 AM
    >>>To: [EMAIL PROTECTED]
    >>>Subject: [Re] Web Clustering : Stick Sessions with Shared Store
    >>>
    >>>
    >>>Jeremy Boynes wrote:
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>>However, as Andy says, the cost of storing a serialized object
    in a BLOB is
    >>>>significant. Other forms of shared store are available though
    which may
    >>>>offer better performance (e.g. a hi-av NFS server).
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>Do we need a shared repository or a replicated repository?
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>>The issue I have with hb's approach is the reliance on an
    Admin Server, of
    >>>>which there would need to be at least two and they would need
    to co-operate
    >>>>between themselves and with any load-balancers. I think this
    can be handled
    >>>>by the regular servers themselves just as efficiently.
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>I agree. It seems that in such a design an Admin Server is only
    used to
    >>>route incoming requests to the relevant node.
    >>>
    >>>However, I do not believe that regular servers can do this job. I
    >>>assume that they will implement a standard peer-to-peer cluster
    >>>topology to provide redundancy; however, I do not see how they can
    >>>handle the dispatch of incoming requests.
    >>>
    >>>This feature seems to be either a client or a proxy concern: I mean
    >>>it should be done before requests reach the nodes.
    >>>
    >>>For instance, in WebLogic this feature is handled on the client side
    >>>via a stub aware of the available nodes. It seems that JBoss (correct
    >>>me if I am wrong) has also followed this design.
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>>I am also not convinced it reduces the amount of net traffic.
    After each
    >>>>request the MS must write to the shared store, which is the
    same traffic as
    >>>>a unicast write to another node or a multicast write to the
    partition
    >>>>(discounting the processing power needed to receive the message).
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>I agree. However, this is based on the assumption that only one
    >>>unicast write is required. In other words, this is a primary/secondary
    >>>topology. I think that hb did not intend such a topology, hence his
    >>>statement.
    >>>
    >>>Gianny
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>>
    >>
    >>
    >>
    >>
    >
    >
    >
    >
    >------------------------------------------------------------------------
    >
    >
    >Imagine a cluster having N managed servers providing complete
    >redundancy to each other. Imagine that they are all arranged in a
    >circle.
    >
    >They continue to be one big cluster as long as the Internal Network
    >Traffic (INT) is less than a certain threshold. At this stage any
    >server in the cluster is just as good as any other server in the
    >cluster to provide fail-over support. This scheme cannot, however,
    >scale well, largely due to 2 factors:
    >
    >1. As client requests increase, the number of server-to-server
    >exchanges for Session replication would increase, stressing the
    >network.
    >2. There would be an added overhead for each server to process
    >ALL of these Session objects.
    >
    >
    > A ----------------- B ------------------- C ----------- D
    > |      ---->                                            |
    > |                                                       |
    > |                                                       |
    > J                                                       E
    > |                                                       |
    > |                                                       |
    > |                                                       |
    > I ----------------- H ------------------- G ----------- F
    >
    >
    >Here is a solution:
    >-------------------
    >After the INT crosses a certain threshold value, "A" stops
    >sending Session updates to "the server before itself" (J). It also
    >sends a notification to J to that effect. J can now forget all the
    >sessions of A that it had stored up until now and thus get a chunk
    >of free memory. Each server does the same, thus "B cuts off A",
    >"C cuts off B" and so on. The number of exchanges now reduces from
    >n x (n-1) to n x (n-2). In case of failure "A" still has a solid
    >backup (all servers from B to I).
    >
    >After some more increase in the traffic, each server PURGES yet
    >another server. Thus this time "A" will stop sending Session
    >objects to I (and of course J). Following that it will send a
    >notification (YOU_ARE_PURGED) to I. On the other hand, "A" would
    >have received a similar notification, this time from C. However, A
    >still has a strong enough backup (all servers from B to H) to fall
    >back on. Every machine holds a list of ALL peers (A-B-C-...,
    >in the correct order) and the current "cutoff-count" (no. of servers
    >purged - this value is obviously the same for all the servers).
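
To make the arithmetic concrete, here is an illustrative sketch of how each server could derive its current backup set from the shared peer list and the common cutoff-count; with peers A-J and a cutoff-count of 2, A's backups come out as B to H, as described above:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: given the shared, ordered peer list and the common
// cutoff count, compute which peers still back up a given server.
public class PartitionMembership {

    public static List<String> backupsFor(String self, List<String> ring, int cutoffCount) {
        int n = ring.size();
        int start = ring.indexOf(self);
        List<String> backups = new ArrayList<String>();
        // walk forward round the ring, skipping the last 'cutoffCount' peers
        for (int step = 1; step <= n - 1 - cutoffCount; step++) {
            backups.add(ring.get((start + step) % n));
        }
        return backups;
    }

    public static void main(String[] args) {
        List<String> ring = Arrays.asList("A","B","C","D","E","F","G","H","I","J");
        System.out.println(backupsFor("A", ring, 1)); // [B, C, D, E, F, G, H, I]
        System.out.println(backupsFor("A", ring, 2)); // [B, C, D, E, F, G, H]
    }
}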
    >
    >
    >Note that, in a way, partitions are being created in the single
    >large cluster. "A" OWNS a partition that originally contained
    >B-J. It is also a member of 9 other partitions owned respectively
    >by B, C ... J. As traffic and client load increase, A starts pushing
    >out members of its partition, thus making it smaller. In effect it
    >gets removed from some other servers' partitions (B's for cutoff_count =
    >1, and C's when cutoff_count = 2).
    >
    >As soon as a server is PURGED (goes out of a partition), two things
    >follow (see the sketch below):
    >
    >1. It no longer needs to receive session updates from the
    >partition OWNER - saving network bandwidth.
    >
    >2. It need not remember any Session objects that were earlier
    >given to it by THAT partition owner - thus freeing memory
    >to be used for serving the increased client load.
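
The receiving side of a YOU_ARE_PURGED notification might then be as simple as this sketch (class and method names invented for illustration):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// On YOU_ARE_PURGED from a partition owner: drop that owner's replicated
// sessions to reclaim memory and ignore any further updates from it.
public class BackupRegistry {

    // ownerId -> (sessionId -> serialized state) held on behalf of owners
    private final Map<String, Map<String, byte[]>> backupsByOwner =
            new ConcurrentHashMap<String, Map<String, byte[]>>();
    private final Set<String> purgedOwners = ConcurrentHashMap.newKeySet();

    public void onSessionUpdate(String ownerId, String sessionId, byte[] state) {
        if (purgedOwners.contains(ownerId)) {
            return; // we are no longer in this owner's partition
        }
        backupsByOwner
                .computeIfAbsent(ownerId, k -> new ConcurrentHashMap<String, byte[]>())
                .put(sessionId, state);
    }

    /** Handle a YOU_ARE_PURGED notification from 'ownerId'. */
    public void onYouArePurged(String ownerId) {
        purgedOwners.add(ownerId);
        backupsByOwner.remove(ownerId); // free the memory those sessions held
    }
}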
    >
    >
    >Under this scheme, I think, the cluster will have a fairly
    >uniform turnaround time over a large range of client load.
    >
    >Even as all of this happens, the LB will continue to allocate NEW
    >CLIENTS in a simple round-robin fashion, while forwarding requests
    >coming from all other clients, who already have an established
    >session with a server, to their respective servers.
    >
    >
    >Now we shall discuss 4 special cases.
    >
    >1. Failover
    >
    >
    >Now imagine that "A" (actually any randomly selected server)
    >fails. When the LB finds out that it is not able to reach "A", it
    >simply forwards the client request (destined for A) to the next
    >available server in the list, "B". B tries to serve the request
    >but doesn't find the SessionID in its list of "NATIVE" sessions.
    >THIS IS A SIGNAL TO "B" that "A" HAS failed. Now it goes on to do
    >the following CRUCIAL ACTIVITY:

    === message truncated ===




--
/*************************************
 * Jules Gosnell
 * Partner
 * Core Developers Network (Europe)
 * http://www.coredevelopers.net
 *************************************/



