Hi Stefan,

Thanks again for your input. More inline below.

2015-03-18 12:59 GMT+01:00 Stefan Egli <[email protected]>:

> Hi Timothee,
>
> On 3/17/15 5:37 PM, "Timothée Maret" <[email protected]> wrote:
>
> >>Due to an edge-case in job distribution (job is started and executed on
> >> CRX master, master crashes before slave is updated, slave becomes
> >>master,
> >> slave executes job a second time) the suggestion is to make anything
> >>*but*
> >> the CRX master become the discovery leader.
> >
> >
> >Ok, but isn't it still prone to double execution even when leader !=
> >master
> >?
> >Assuming one master, two slaves and the following scenario: master
> >receives
> >a job, the job is replicated to slaves, one slave executes the job and
> >commits
> >its changes, slave crashes before the changes are replicated, the other
> >slave picks the job and executes it again.
>
> The time window between committing the changes and finishing the job is
> much smaller though. There is no absolute guarantee, right, but it is less
> likely.
>
>
I agree.


> >Can we offer guarantees against double execution, unordered or missing
> >execution without some sort of distributed locking, a way to make sure the
> >content is replicated or some sort of centralised job dispatcher ?
>
> Not an absolute guarantee, no. And I don't think we aim to magically make this
> work with this 'slave be the leader' default. But it reduces the
> likelihood a lot.
>
>
Thanks for shedding some light here regarding the guarantees offered.


> Re unordered/missing execution: if there is network partitioning (real or
> pseudo) then the ordering would no longer be guaranteed, agreed. Not
> sure if you could really miss a job execution though! In any case, network
> partitioning is not currently supported.
>
> >Anyway, AFAIU enforcing 'leader != master' would be against an
> >active/passive setup.
> >Indeed, if enabled, an application could either process on exactly one crx
> >slave
>
> Right. Why would it be 'against' such a setup though?


Because it leads to writing on the slave and having those changes replicated
to master, whereas in an active/passive setup we would expect the slave to
only receive replicated changes from master.


> The application
> should not depend on the underlying cluster technology nor deployment.
> Ideally it would just make use of the fact that one instance in the
> cluster is nominated 'leader' and if it has something to execute only
> once, then it should choose that leader to do it.
>
> >>
> >> I fear there is no explicit way atm to force the behavior you want.
> >>About
> >> the closest one I can think of is: the leader is defined to be stable,
> >>ie
> >> once an instance is leader, it stays leader until it leaves/crashes. Or
> >>in
> >> other words: the first instance started on a fresh setup becomes leader.
> >>
> >
> >IIUC, currently we can have either I. strong guarantees that 'leader !=
> >master' or II. best effort to enforce 'leader == master'.
> >Assuming avoiding quirks in job processing requires a broader solution
> >than what was introduced in SLING-3253, wouldn't it make sense to allow
> >guaranteeing II. ?
>
> What you can always do is make your implementation also check on the
> underlying repository descriptor yourself - and take that one if it is
> set, otherwise use the Sling discovery.
>

Yes, we could still implement it at the application level. However, at the
application level we can't influence code in Sling itself that relies on the
leader, which IIRC is the case for Sling jobs.
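
For our own service, the application-level check could look roughly like the
sketch below. It is only an illustration, not what Sling does: the class name,
the method name and the descriptor key "crx.cluster.master" are assumptions on
my side (the key in particular would need to be verified against the CRX
version in use). The descriptor lookup and the discovery leader flag are passed
in as plain functions so the decision logic stays independent of JCR and Sling.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Function;

/**
 * Hypothetical helper deciding whether this instance should run
 * "execute only once" work. It prefers the repository's cluster-master
 * descriptor and falls back to the discovery leader flag when the
 * descriptor is not set.
 */
final class SingleRunnerCheck {

    // Assumed descriptor key; verify against the CRX version in use.
    static final String MASTER_DESCRIPTOR = "crx.cluster.master";

    private SingleRunnerCheck() {}

    /**
     * @param descriptorLookup returns the descriptor value for a key, or null if unset
     * @param discoveryLeader  returns true if discovery marks this instance as leader
     */
    static boolean shouldRun(Function<String, String> descriptorLookup,
                             BooleanSupplier discoveryLeader) {
        String value = descriptorLookup.apply(MASTER_DESCRIPTOR);
        if (value != null) {
            // Descriptor is set: the repository decides.
            return Boolean.parseBoolean(value);
        }
        // Descriptor is not set: fall back to the discovery leader.
        return discoveryLeader.getAsBoolean();
    }
}
```

In a real bundle, the lookup could be wired to javax.jcr.Repository#getDescriptor
and the fallback to isLeader() on the local InstanceDescription of the current
TopologyView. As said, this would not change what Sling itself does internally.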


>
> >IMO the leader would still be relatively stable (not impacted by addition
> >of new instances in the topology) and would allow to guarantee an
> >active/passive cluster setup.
>
> Both I and II have the negative side-effect that in case the master
> crashes, the leader might change. So in that sense, they both break the
> 'strong leader' argument - so it would not introduce anything more
> negative there.
>

+1
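
To spell out the scenario for both options, here is a toy sketch. It is not how
discovery elects the leader; the class, the method names and the "first
eligible instance in view order" rule are made up purely for illustration.

```java
import java.util.List;
import java.util.Optional;

/** Toy model of the two options discussed above (hypothetical, not Sling code). */
final class LeaderPolicySketch {

    private LeaderPolicySketch() {}

    /** Option I (leader != master): first instance in view order that is not the master. */
    static Optional<String> leaderNotMaster(List<String> view, String masterId) {
        return view.stream().filter(id -> !id.equals(masterId)).findFirst();
    }

    /** Option II (leader == master): the master itself, if it is part of the view. */
    static Optional<String> leaderIsMaster(List<String> view, String masterId) {
        return view.stream().filter(id -> id.equals(masterId)).findFirst();
    }
}
```

With a view [A, B, C] and A as master, option I yields B and option II yields A.
If A crashes and B takes over as master, the view becomes [B, C]: option I now
yields C and option II yields B, so the leader changes in both cases.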


> So yes, discovery could support II - but you could also read the
> descriptor explicitly as an alternative.
>
> Depends on which way you'd like to go - if you'd like to have this though,
> could you pls create a ticket?
>

For the time being, our service could go with a custom implementation. I
have opened SLING-4516 to track adding the possibility to configure CRX
master == Sling leader as it might be beneficial to other deployments
running on CRX. This addition only makes sense for this clustering
technology though.

Regards,

Timothee


>
> Cheers,
> Stefan
>
>
>
