[ 
https://issues.apache.org/jira/browse/IGNITE-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107034#comment-17107034
 ] 

Anton Vinogradov commented on IGNITE-12617:
-------------------------------------------

[~ascherbakov],
Thank you for the review!

> Double latch waiting if replicated caches are in topology.
Healthy cells wait on a single latch.
Only the broken cell will wait for a partitioned-recovery latch.

> 2. It degrades to be a no-op if backups are spread by grid nodes (this is a 
> default behavior with rendezvous affinity).
Sure, but this fix targets real production deployments, where the baseline 
should be set as well.
So it will not fix every case, but it allows us to speed up production.
A regular deployment may still have ... PME on node left.

> I would like to propose an algorithm, which should provide the same latency 
> decrease ...
In addition to counters, we should also wait for recovery to finish, so that 
partitions are consistent before any operations are allowed on them.
As we discussed privately, it seems possible to perform a 
recovery-await-free switch by just acquiring locks on prepared keys before 
finishing the exchange future, but this case requires additional research.

> PME-free switch should wait for recovery only at affected nodes.
> ----------------------------------------------------------------
>
>                 Key: IGNITE-12617
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12617
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Anton Vinogradov
>            Assignee: Anton Vinogradov
>            Priority: Major
>              Labels: iep-45
>             Fix For: 2.9
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since IGNITE-9913, new-topology operations are allowed immediately after 
> cluster-wide recovery has finished.
> But is there any reason to wait for a cluster-wide recovery if only one node 
> failed?
> In this case, we should recover only the failed node's backups.
> Unfortunately, {{RendezvousAffinityFunction}} tends to spread the node's 
> backup partitions to the whole cluster. In this case, we, obviously, have to 
> wait for cluster-wide recovery on switch.
> But what if only some nodes are the backups for every primary?
> If nodes are combined into virtual cells where, for each partition, the 
> backups are located in the same cell as the primaries, it's possible to 
> finish the switch outside the affected cell before tx recovery finishes.
> This optimization will allow us to start and even finish new operations 
> outside the failed cell without waiting for the cluster-wide switch to 
> finish (broken cell recovery).
> In other words, the switch (when left/fail + baseline + rebalanced) will have 
> little effect on the latency of operations not related to the failed cell.
> That is:
> - We should wait for tx recovery before finishing the switch only on a broken 
> cell.
> - We should wait for replicated caches tx recovery everywhere since every 
> node is a backup of a failed one.
> - Upcoming operations related to the broken cell (including all replicated 
> caches operations) will require a cluster-wide switch finish to be processed.
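
The cell rule described above can be illustrated with a minimal sketch. This is not Ignite's implementation and does not use the Ignite API: the class `CellRecoverySketch`, the method `nodesAwaitingRecovery`, and the node and partition ids are all hypothetical. It assumes the affinity assignment is available as a map from partition id to the list of owning nodes, and it only computes which surviving nodes share a partition with the failed node, i.e. which nodes would have to wait for tx recovery before finishing the switch.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not part of Apache Ignite. */
public class CellRecoverySketch {
    /**
     * @param assignment Partition id -> owners (primary first, then backups). Hypothetical input shape.
     * @param failed Id of the node that left or failed.
     * @return Surviving nodes that own at least one partition also owned by the failed node.
     */
    public static Set<String> nodesAwaitingRecovery(Map<Integer, List<String>> assignment, String failed) {
        Set<String> affected = new TreeSet<>();

        for (List<String> owners : assignment.values()) {
            if (!owners.contains(failed))
                continue; // The failed node held no copy of this partition.

            for (String owner : owners) {
                if (!owner.equals(failed))
                    affected.add(owner);
            }
        }

        return affected;
    }

    public static void main(String[] args) {
        // Two cells, {A, B} and {C, D}: backups stay in the primary's cell.
        Map<Integer, List<String>> cells = Map.of(
            0, List.of("A", "B"),
            1, List.of("B", "A"),
            2, List.of("C", "D"),
            3, List.of("D", "C"));

        // Backups spread over the whole grid: A shares partitions with everyone.
        Map<Integer, List<String>> spread = Map.of(
            0, List.of("A", "B"),
            1, List.of("B", "C"),
            2, List.of("C", "A"),
            3, List.of("D", "A"));

        System.out.println(nodesAwaitingRecovery(cells, "A"));  // prints [B]
        System.out.println(nodesAwaitingRecovery(spread, "A")); // prints [B, C, D]
    }
}
```

With the cell layout only node B is affected, so C and D could finish the switch without waiting. With spread backups, or with a replicated cache where every node owns every partition, the result is the whole surviving cluster and the optimization degrades to a cluster-wide wait.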



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
