Bertrand Delacretaz wrote:
> That would be a pity, as I suppose you're starting to like Sling now ;-)

Mannnn you have no idea haha! I've got almost every dev in the office all
excited about this now haha. However, it seems our hands are tied.

I wrote local consistency test scripts which POST a property to one cluster
member and then, after a configurable delay, GET it back from the other,
checking for consistency.
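
For reference, here's roughly what those scripts boil down to, as a minimal
Python sketch; the hostnames, credentials, test path and property name are
placeholders for our setup, and the delay is what I varied between runs:

    import time
    import uuid
    import requests

    WRITE_NODE = "http://sling-node-1:8080"   # instance that receives the POST (placeholder)
    READ_NODE = "http://sling-node-2:8080"    # the other cluster member (placeholder)
    AUTH = ("admin", "admin")                 # test credentials (placeholder)
    PATH = "/content/consistency-test"        # throwaway test node
    DELAY = 0.05                              # seconds between POST and GET
    RUNS = 100

    consistent = 0
    for _ in range(RUNS):
        value = str(uuid.uuid4())
        # Sling POST servlet: form fields become properties on the node at PATH
        requests.post(WRITE_NODE + PATH, data={"testprop": value}, auth=AUTH)
        time.sleep(DELAY)
        # read the node back as JSON from the other instance
        resp = requests.get(READ_NODE + PATH + ".json", auth=AUTH)
        if resp.ok and resp.json().get("testprop") == value:
            consistent += 1

    print("consistent reads: %d of %d" % (consistent, RUNS))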

Results on a 2-member Sling cluster with a localhost MongoDB:

- 0% consistency with a 50 ms delay between POST and GET
- 35% to 50% consistency with a 1 second delay between POST and GET
- 90% consistency with a 2 second delay
- 98% to 100% consistency with a 3 second delay

So yes, you are all correct. 

True, we could use sticky sessions to avoid the inconsistency... but only
until we scale our server farm up or down, which we do daily. So sticky
sessions don't really solve anything for us.

If you already understand how scaling nullifies the benefit of sticky
sessions, you can skip this paragraph and move on to the next:
Each time we scale, users lose their "stickiness." We have thousands of
write users ("authors"), hundreds of them concurrent. Compare that to typical
AEM projects, which have fewer than 10 authors and rarely more than 1
concurrently (I've got several global-scale AEM implementations under my
belt). For us, it's a requirement that we add or remove app servers multiple
times per day, optimizing between AWS costs and performance. Each time we
remove an instance, its users get routed to a different Sling instance and
experience the inconsistency. Each time we add an instance, we invalidate all
stickiness, users get re-assigned to a new Sling instance, and again they
experience the inconsistency. If we don't do this invalidation and
re-assignment on scale-up, it can potentially take hours for a scale-up to
positively impact an overloaded cluster where all users remain permanently
stuck to their current app server instance.

As you can see, we need to deal with the inconsistency problem, regardless
of whether we use sticky sessions.

I have some ideas, but none of them are appealing, and I would benefit
greatly from your collective knowledge:

1) Race condition
If the delay to "catch up" to the latest revision is mostly predictable (it
doesn't grow as the repo grows in size or change due to other variables), we
can measure it and then account for it reliably with user feedback (a loading
screen or whatever). This *might* be a race condition we can live with.

My results above show as much as 3 or 4 seconds to "catch up." I need to know
what determines the duration of this revision catch-up time. Is it a function
of repo size? Does the delay grow as the repo grows? Does the delay grow as
usage increases? Does the delay grow as the number of Sling instances in the
cluster grows? Does the delay grow as network latency grows? (I'm testing
everything on one machine, with practically no latency compared to a
distributed production deployment.) Is there any Sling dev familiar with the
algorithm Sling uses to select a "newer" revision who could answer this for
me? ... perhaps it's just polling on a predictable time period! :)
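
As a first step toward answering the "is it predictable?" question myself,
I plan to extend the earlier sketch to measure the catch-up time directly,
by polling the second instance until the new value shows up, rather than
sampling fixed delays. Something like this (same placeholder hosts and paths
as before):

    import time
    import uuid
    import requests

    WRITE_NODE = "http://sling-node-1:8080"   # placeholders, as in the earlier sketch
    READ_NODE = "http://sling-node-2:8080"
    AUTH = ("admin", "admin")
    PATH = "/content/consistency-test"
    POLL_INTERVAL = 0.05                      # seconds between polls
    TIMEOUT = 10.0                            # give up after this many seconds

    def measure_catch_up():
        """POST a fresh value, then poll the other instance until it shows up."""
        value = str(uuid.uuid4())
        requests.post(WRITE_NODE + PATH, data={"testprop": value}, auth=AUTH)
        start = time.time()
        while time.time() - start < TIMEOUT:
            resp = requests.get(READ_NODE + PATH + ".json", auth=AUTH)
            if resp.ok and resp.json().get("testprop") == value:
                return time.time() - start
            time.sleep(POLL_INTERVAL)
        return None                           # never became consistent within TIMEOUT

    samples = [measure_catch_up() for _ in range(50)]
    seen = [s for s in samples if s is not None]
    if seen:
        print("catch-up seconds: min=%.2f max=%.2f avg=%.2f (timeouts: %d)"
              % (min(seen), max(seen), sum(seen) / len(seen), len(samples) - len(seen)))
    else:
        print("no run became consistent within %.1fs" % TIMEOUT)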

2) Browser knows what revision it's on.
The browser could know which JCR revision it's on, learning that revision
after every POST or PUT, perhaps from a response header. When a later request
lands on a Sling instance that is on an older revision, it could wait until
that instance "catches up." This sounds like a horrible example of client
code operating on knowledge of underlying implementation details, and we're
not at all excited about the chaos of implementing it. That being said, can
we programmatically check the revision that the current Sling instance is
reading from?
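
Just to make the idea concrete (and to show why it feels so hacky): assuming
some hypothetical revision header existed on responses (nothing like it
exists in Sling today as far as I know; that's exactly what I'm asking
about), the client-side logic would look roughly like this, sketched in
Python rather than browser JavaScript:

    import time
    import requests

    # "X-Repo-Revision" is a made-up header name; it just makes the
    # browser-side logic concrete.
    REV_HEADER = "X-Repo-Revision"

    class RevisionAwareClient:
        """Conceptual stand-in for the browser: remembers the revision of its
        last write and distrusts reads from an instance still behind it."""

        def __init__(self, base_url, auth):
            self.base_url = base_url
            self.auth = auth
            self.last_write_revision = None

        def post(self, path, data):
            resp = requests.post(self.base_url + path, data=data, auth=self.auth)
            # remember which revision our write landed on (hypothetical header)
            self.last_write_revision = resp.headers.get(REV_HEADER)
            return resp

        def get(self, path, retries=20, interval=0.25):
            for _ in range(retries):
                resp = requests.get(self.base_url + path, auth=self.auth)
                serving = resp.headers.get(REV_HEADER)
                # assumes revision tokens are ordered and comparable across
                # instances, which is exactly the kind of implementation
                # detail the client shouldn't have to know about
                if self.last_write_revision is None or (
                        serving is not None
                        and serving >= self.last_write_revision):
                    return resp
                time.sleep(interval)  # instance hasn't caught up; wait and retry
            return resp               # give up, return the possibly-stale response

Even written out like that, it assumes the revision tokens are ordered and
comparable across cluster members, which I doubt; hence the question above.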

3) "Pause" during scale-up or scale-down.
Each time we add or remove a Sling instance, all users experience a "pause"
screen while their new Sling instance "catches up." This is essentially the
same as the race condition in #1, except we'd constrain users to only
experience it when we scale up or down. However, we are *extremely* unhappy
to impact our users just because we're scaling up or down, especially when
we must do so frequently.

Anybody have any other ideas?

Other questions:

1) When a brand new Sling instance discovers an existing JCR repository (in
Mongo), does it automatically and immediately go to the latest head revision?
Or is there some progression through the revisions, such that it takes time
for the Sling instance to catch up to the latest?

2) Is there any reason, BESIDES JCR CONSISTENCY, why a Sling cluster must be
deployed with sticky sessions? What other problems would we introduce by not
having sticky sessions?

I seem to have used this email to track my own thoughts more than anything;
my sincere thanks if you've taken the time to read the whole thing.



