> Each time we remove an instance, those users will go to a new Sling
> instance, and experience the inconsistency. Each time we add an instance,
> we will invalidate all stickiness and users will get re-assigned to a new
> Sling instance, and experience the inconsistency.
I can understand the issue around an existing Sling server being removed
from the pool. However, adding a new instance should not cause existing
users to be reassigned.

Now to your queries
---------------------------

> 1) When a brand new Sling instance discovers an existing JCR (Mongo), does
> it automatically and immediately go to the latest head revision?

It sees the latest head revision.

> Increasing load increases the number of seconds before a "sync," however
> it's always near-exactly a second interval.

Yes, there is an "asyncDelay" setting in DocumentNodeStore which defaults to
1 sec. Currently it is not possible to modify it via OSGi config though (see
the sketch at the end of this mail for setting it programmatically).

> What event is causing it to "miss the window" and wait until the next
> 1 second sync interval?

This periodic read also involves some other work, like local cache
invalidation, computing the external changes for observation etc., which
causes this time to increase. The more changes are made, the more time is
spent on that kind of work.

Stickiness and Eventual Consistency
-------------------------------------------------

There are multiple levels of eventual consistency [1]. If we go for sticky
sessions then we are aiming for "session consistency". However, what we
require in most cases is read-your-writes consistency. We can discuss ways
to do that efficiently with the current Oak architecture. Something like
this is best discussed on oak-dev though.

One possible approach can be to use a temporarily issued sticky cookie.
Under this model:

1. The Sling cluster maintains a cluster-wide service which records the
current head revision of each cluster node and computes the minimum
revision across them.

2. A Sling client (web browser) is free to connect to any server until it
performs a state-changing operation like POST or PUT.

3. If it performs a state-changing operation then the server which handles
that operation issues a cookie which is set to be sticky, i.e. the load
balancer is configured to treat that cookie as the one used to determine
stickiness. From then on all requests from this browser would go to the
same server. This cookie, let's say, records the current head revision.

4. In addition, each Sling server would constantly get notified of the
minimum revision which is visible cluster-wide. Once that minimum revision
catches up with the revision recorded in #3, the server removes the cookie
on the next response sent to that browser (a rough sketch follows at the
end of this mail).

This state can also be used to determine whether a server is safe to be
taken out of the cluster or not. This is just a rough thought experiment
which may or may not work and would require broader discussion!

Chetan Mehrotra
[1] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
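
PS: For non-OSGi deployments, "asyncDelay" can be set programmatically when
constructing the DocumentNodeStore. A minimal sketch, assuming the older
(2.x) Mongo Java driver API and an Oak version where
DocumentMK.Builder#setAsyncDelay is available:

    import com.mongodb.DB;
    import com.mongodb.MongoClient;

    import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
    import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;

    public class AsyncDelayExample {
        public static void main(String[] args) throws Exception {
            DB db = new MongoClient("localhost", 27017).getDB("oak");

            // asyncDelay is in milliseconds; 1000 (1 sec) is the default
            // interval for the periodic background read discussed above
            DocumentNodeStore store = new DocumentMK.Builder()
                    .setMongoDB(db)
                    .setAsyncDelay(500)
                    .getNodeStore();

            // ... repository work against the store ...

            store.dispose();
        }
    }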
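
The server-side half of the cookie model could look roughly like the
following servlet filter. This is purely illustrative: "RevisionTracker" is
an assumed service (it would be backed by the cluster-wide minimum-revision
service of #1, which does not exist today), revisions are simplified to
plain longs (real Oak revisions are structured strings), and a real
implementation would need to add the step #3 cookie before the response is
committed.

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.*;

    public class StickyCookieFilter implements Filter {

        // Assumed service, not an existing Oak/Sling API
        interface RevisionTracker {
            long currentHeadRevision();    // head revision of this node
            long minimumClusterRevision(); // min revision visible on all nodes
        }

        private static final String COOKIE_NAME = "SLING_STICKY_REV";

        private RevisionTracker tracker; // injected, e.g. as an OSGi service

        @Override
        public void doFilter(ServletRequest req, ServletResponse res,
                FilterChain chain) throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            HttpServletResponse response = (HttpServletResponse) res;

            Cookie sticky = findCookie(request);
            if (sticky != null && tracker.minimumClusterRevision()
                    >= Long.parseLong(sticky.getValue())) {
                // Step #4: the recorded write is now visible cluster-wide,
                // so stickiness is no longer needed; expire the cookie
                Cookie expired = new Cookie(COOKIE_NAME, "");
                expired.setMaxAge(0);
                response.addCookie(expired);
                sticky = null;
            }

            chain.doFilter(req, res);

            // Step #3: after a state change, record the new head revision in
            // the cookie the load balancer uses for stickiness (in practice
            // this must happen before the response is committed)
            String method = request.getMethod();
            if (sticky == null
                    && ("POST".equals(method) || "PUT".equals(method))) {
                response.addCookie(new Cookie(COOKIE_NAME,
                        String.valueOf(tracker.currentHeadRevision())));
            }
        }

        private Cookie findCookie(HttpServletRequest request) {
            Cookie[] cookies = request.getCookies();
            if (cookies != null) {
                for (Cookie c : cookies) {
                    if (COOKIE_NAME.equals(c.getName())) {
                        return c;
                    }
                }
            }
            return null;
        }

        @Override
        public void init(FilterConfig config) { }

        @Override
        public void destroy() { }
    }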