Alexey,

>> Can you describe the complete protocol changes first
The current goal is to find a better way.
I've already had at least 5 scenarios discarded because corner cases were found
in them (thanks to Alexey Scherbakov, Aleksei Plekhanov and Nikita Amelchev).
That's why I explained what we are able to improve and why I think it works.

>> we need to remove this property, not add new logic that relies on it.
Agree.

>> How are you going to synchronize this?
Thanks for providing this case; it seems to rule out the #1 + #2.2 combination,
while #2.1 is still possible with some optimizations.

"Zombie eating transactions" case can be theoretically solved, I think.
As I said before we may perform "Local switch" in case affinity was not
changed (except failed mode miss) other cases require regular PME.
In this case, new-primary is an ex-backup and we expect that old-primary
will try to update new-primary as a backup.
New primary will handle operations as a backup until it notified it's a
primary now.
Operations from ex-primary will be discarded or remapped once new-primary
notified it became the primary.
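
Roughly, I mean something like the following minimal sketch (hypothetical
names, not actual Ignite internals):

import java.util.concurrent.atomic.AtomicBoolean;

/** Gates updates arriving from the ex-primary around the switch notification. */
public class LocalSwitchGate {
    /** Set once this node is notified that it became the primary. */
    private final AtomicBoolean primary = new AtomicBoolean(false);

    enum Action { APPLY_AS_BACKUP, DISCARD_OR_REMAP }

    /** Invoked by the hypothetical "Local switch" notification. */
    public void onBecamePrimary() {
        primary.set(true);
    }

    /** Decides what to do with an update routed by the ex-primary. */
    public Action onUpdateFromExPrimary() {
        // Until notified, keep acting as a backup; after the notification the
        // ex-primary's updates are stale and must be discarded or remapped.
        return primary.get() ? Action.DISCARD_OR_REMAP : Action.APPLY_AS_BACKUP;
    }
}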

Discarding would be a big improvement and remapping a huge one, but there is
no 100% guarantee that the ex-primary will try to update the new-primary as a
backup.

There are a lot of corner cases here, so a minimized sync seems to be the
better solution.

Finally, according to your and Alexey Scherbakov's comments, the better
approach is just to improve the PME to wait for less, at least for now.
It seems we have to wait for (or cancel; I vote for cancelling, any
objections?) the operations related to the failed primaries and then wait for
the recovery to finish (which is fast).
In case the affinity was changed and a backup-primary switch (not related to
the failed primaries) is required between the owners, or even a rebalance is
required, we should just ignore this and perform only the "Recovery PME".
A regular PME should happen later (if necessary), and it can even be delayed
(if necessary).
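
To illustrate, here is a rough sketch of the flow I have in mind (hypothetical
names, not a ready implementation):

import java.util.Set;

/** Sketch of the proposed "Recovery PME" on a primary failure. */
public class RecoveryPmeFlow {
    public void onPrimaryFailed(Set<Integer> failedPrimaryParts) {
        // 1. Cancel (or wait for) only the operations that touch the failed
        //    primaries; everything else keeps running.
        cancelOperations(failedPrimaryParts);

        // 2. Wait for the tx recovery on the affected partitions (fast).
        awaitRecovery(failedPrimaryParts);

        // 3. Deliberately skip backup-primary switches and rebalance here;
        //    a regular PME may run later, possibly even delayed.
    }

    private void cancelOperations(Set<Integer> parts) { /* cancel or await */ }

    private void awaitRecovery(Set<Integer> parts) { /* block until recovered */ }
}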

Sounds good?

On Wed, Sep 18, 2019 at 11:46 AM Alexey Goncharuk <
alexey.goncha...@gmail.com> wrote:

> Anton,
>
> I looked through the presentation and the ticket but did not find any new
> protocol description that you are going to implement. Can you describe the
> complete protocol changes first (to save time both for you and for the later
> review)?
>
> Some questions that I have in mind:
>  * It looks like that for "Local Switch" optimization you assume that node
> failure happens immediately for the whole cluster. This is not true - some
> nodes may "see" a node A failure, while others still consider it alive.
> Moreover, node A may not know yet that it is about to be failed and process
> requests correctly. This may happen, for example, due to a long GC pause on
> node A. In this case, node A resumes its execution and proceeds to work as
> a primary (it has not received a segmentation event yet), node B also did
> not receive the A FAILED event yet. Node C received the event, ran the
> "Local switch" and assumed a primary role, node D also received the A
> FAILED event and switched to the new primary. Transactions from nodes B and
> D will be processed simultaneously on different primaries. How are you
> going to synchronize this?
>  * IGNITE_MAX_COMPLETED_TX_COUNT is fragile and we need to remove this
> property, not add new logic that relies on it. There is no way a user can
> calculate this property or adjust it in runtime. If a user decreases this
> property below a safe value, we will get inconsistent update counters and
> cluster desync. Besides, IGNITE_MAX_COMPLETED_TX_COUNT is quite a large
> value and will push HWM forward very quickly, much faster than during
> regular updates (you will have to increment it for each partition).
>
> Wed, Sep 18, 2019 at 10:53, Anton Vinogradov <a...@apache.org>:
>
> > Igniters,
> >
> > Recently we had a discussion devoted to the non-blocking PME.
> > We agreed that the most important case is blocking on node failure, and it
> > can be split into:
> >
> > 1) The latency of operations on the affected partitions will be increased
> > by the node-failure detection duration.
> > So, some operations may be frozen for 10+ seconds on real clusters, just
> > waiting for a response from the failed primary.
> > In other words, some operations will be blocked even before the blocking
> > PME starts.
> >
> > The good news here is that a bigger cluster decreases the percentage of
> > blocked operations.
> >
> > The bad news is that these operations may block non-affected operations in:
> > - customer code (single-thread/striped pool usage)
> > - multikey operations (tx1 locked key A and waits for the failed B, while a
> > non-affected tx2 waits for A)
> > - striped pools inside AI (when some task waits for tx.op() in a sync way
> > and the striped thread is busy)
> > - etc.
> >
> > It seems that, thanks to StopNodeFailureHandler (if configured), we already
> > always send the node-left event before the node stops, to minimize the
> > waiting period.
> > So, only the cases where a node hangs without stopping are a problem now.
> >
> > Anyway, some additional research is required here, and it would be nice if
> > someone is willing to help.
> >
> > 2) Some optimizations may speed up the node-left case (eliminate the
> > blocking of upcoming operations).
> > A full list can be found in the presentation [1].
> > The list contains 8 optimizations, but I propose to implement some in phase
> > one and the rest in phase two.
> > Assuming that a real production deployment has the Baseline enabled, we are
> > able to gain a speed-up by implementing the following:
> >
> > #1 Switch on a node_fail/node_left event locally instead of starting a real
> > PME ("Local switch").
> > Since the BLT is enabled, we are always able to switch to the new-affinity
> > primaries (no need to preload partitions).
> > In case we're not able to switch to the new-affinity primaries (all of them
> > are missing, or the BLT is disabled), we'll just start a regular PME.
> > The new-primary calculation can be performed locally or by the coordinator
> > (e.g. attached to the node_fail message).
> >
> > #2 We should not wait for the completion of any already started operations
> > (since they are not related to the failed primary's partitions).
> > The only problem is the recovery, which may cause update-counter
> > duplication in case of an unsynced HWM.
> >
> > #2.1 We may wait only for the recovery completion ("Micro-blocking
> > switch"); see the sketch below.
> > We just block (all of them, at this phase) the upcoming operations during
> > the recovery by incrementing the topology version.
> > In other words, it will be some kind of PME with waiting, but it will wait
> > for the recovery (fast) instead of for the current operations to finish
> > (long).
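> >
> > A minimal sketch of the idea (illustrative names, not existing Ignite code):
> >
> > import java.util.concurrent.CountDownLatch;
> > import java.util.concurrent.atomic.AtomicLong;
> >
> > /** New operations wait for the recovery, not for the current operations. */
> > public class MicroBlockingSwitch {
> >     private final AtomicLong topVer = new AtomicLong();
> >
> >     private volatile CountDownLatch recoveryLatch = new CountDownLatch(0);
> >
> >     public void onPrimaryFailed() {
> >         recoveryLatch = new CountDownLatch(1);
> >         topVer.incrementAndGet(); // upcoming operations map to the new version
> >     }
> >
> >     public void onRecoveryFinished() {
> >         recoveryLatch.countDown(); // release the blocked operations
> >     }
> >
> >     /** Called before each upcoming operation. */
> >     public void awaitSwitch() throws InterruptedException {
> >         recoveryLatch.await(); // waits for the recovery (fast), not for txs
> >     }
> > }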
> >
> > #2.2 The recovery, theoretically, can be async.
> > We have to solve the unsynced-HWM issue (to avoid the concurrent usage of
> > the same counters) to make it happen.
> > We may just increment the HWM by IGNITE_MAX_COMPLETED_TX_COUNT on the
> > new-primary and continue the recovery in an async way, as sketched below.
> > Currently, IGNITE_MAX_COMPLETED_TX_COUNT specifies the number of committed
> > transactions we expect between "the first backup committed tx1" and "the
> > last backup committed the same tx1".
> > I propose to also use it to specify the number of prepared transactions we
> > expect between "the first backup prepared tx1" and "the last backup
> > prepared the same tx1".
> > Both cases look pretty similar.
> > In this case, we are able to make the switch fully non-blocking, with async
> > recovery.
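> >
> > The HWM reservation on the new-primary could then look roughly like this
> > (illustrative names; the default value shown is only an assumption):
> >
> > import java.util.concurrent.atomic.AtomicLong;
> >
> > public class HwmReservation {
> >     /** Assumed default; the real value comes from the system property. */
> >     private static final long MAX_COMPLETED_TX_CNT =
> >         Long.getLong("IGNITE_MAX_COMPLETED_TX_COUNT", 262_144L);
> >
> >     /** Partition update-counter HWM. */
> >     private final AtomicLong hwm = new AtomicLong();
> >
> >     /** On becoming the new-primary: jump the HWM forward so that updates
> >      *  applied during the async recovery cannot reuse counters that may
> >      *  still belong to in-flight (prepared) transactions. */
> >     public long reserveGap() {
> >         return hwm.addAndGet(MAX_COMPLETED_TX_CNT);
> >     }
> > }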
> > Thoughts?
> >
> > So, I'm going to implement both improvements under the "Lightweight version
> > of partitions map exchange" issue [2] if no one minds.
> >
> > [1]
> > https://docs.google.com/presentation/d/1Ay7OZk_iiJwBCcA8KFOlw6CRmKPXkkyxCXy_JNg4b0Q/edit?usp=sharing
> > [2] https://issues.apache.org/jira/browse/IGNITE-9913
> >
>
