[jira] [Comment Edited] (IGNITE-13171) Proper handling of a rebalancing with disabled WAL

Alexey Scherbakov (Jira) Fri, 17 Jul 2020 06:08:16 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159270#comment-17159270
 ]


Alexey Scherbakov edited comment on IGNITE-13171 at 7/17/20, 1:07 PM:
----------------------------------------------------------------------

It turns out the change became more complicated than I expected.

The list of changes:

# Fixed a race between delayed partition owning (due to disabled group 
durability during rebalancing) and a new topology event, which caused 
partitions to be owned while clearing (a source of partition inconsistency).
# Cache version calculation has been improved: the calculation has been 
optimized, and the order is now correctly synced between primary and backups. 
# Fixed rebalance future compatibility issues for delayed partition owning. 
The rebalance future is now not completed until the checkpoint is done, and it 
is correctly checked for compatibility with newer topology versions. 
# Removed start and end versions for transactions.
# A checkpoint is triggered for each group that has WAL disabled during 
rebalancing, so durability for a group is enabled as soon as the group is 
rebalanced. 
# The cache group sync future is completed only if rebalancing finished without 
cancellation, meaning either the data was loaded or an unrecoverable error 
occurred. 
# Fixed the partition state transition from RENTING to OWNING/LOST when the 
last supplier has left. 
# Removed synchronous partition destroying to avoid deadlocks. All partitions 
are now destroyed by the partition eviction manager.
# Removed the delay equal to DFLT_PRELOAD_RESEND_TIMEOUT before the owning 
state is acked for a group after rebalancing. As a side effect, ideal 
distribution is achieved faster in unit tests. 
# Fixed multiple flaky tests running on unstable topology, whose behavior 
changed after introducing items 8 and 9.
# Fixed an attempt to rebalance an OWNING partition if a rebalance has been 
restarted after a cancellation.
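
Items 1, 3 and 5 above can be illustrated with a minimal Java sketch. It is not the Ignite implementation: every class, method and field name below is hypothetical, and the topology version is simplified to a plain counter. It only shows the idea that the rebalance future completes after the checkpoint, and that the compatibility check and the MOVING to OWNING transition happen under one lock, so a concurrent topology event cannot interleave between them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; none of these names are Ignite APIs. */
class DelayedOwningSketch {
    enum State { MOVING, OWNING }

    private final Map<Integer, State> parts = new HashMap<>();

    /** Current topology version, simplified to a counter (assumption). */
    private long topVer;

    /** Oldest topology version whose assignment is still valid for this node. */
    private long compatibleSince;

    DelayedOwningSketch(long topVer, int... partIds) {
        this.topVer = topVer;
        this.compatibleSince = topVer;
        for (int p : partIds)
            parts.put(p, State.MOVING);
    }

    /** Simulates a topology event; "compatible" means the assignment is unchanged. */
    synchronized void onTopologyChange(boolean compatible) {
        topVer++;
        if (!compatible)
            compatibleSince = topVer;
    }

    /**
     * Returns a future that completes only after the checkpoint future is done.
     * The result is true if partitions were owned, false if the rebalance
     * became obsolete because of an incompatible topology change.
     */
    CompletableFuture<Boolean> finishRebalance(long startedAtVer, CompletableFuture<Void> checkpoint) {
        return checkpoint.thenApply(ignored -> ownIfCompatible(startedAtVer));
    }

    /** Check and state change share one lock with onTopologyChange. */
    private synchronized boolean ownIfCompatible(long startedAtVer) {
        if (startedAtVer < compatibleSince)
            return false; // Partitions stay MOVING; a new rebalance is expected.
        parts.replaceAll((id, st) -> State.OWNING);
        return true;
    }

    synchronized State state(int partId) {
        return parts.get(partId);
    }
}
```

In this sketch, a rebalance started on version 1 still owns its partitions if a compatible topology change arrives before the checkpoint completes, while an incompatible change leaves them in MOVING.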


was (Author: ascherbakov):
It turns out the change became more complicated than I expected.

The list of changes:

# Fixed a race between delayed partition owning (due to disabled group 
durability during rebalancing) and a new topology event, which caused 
partitions to be owned while clearing (a source of partition inconsistency).
# Cache version calculation has been improved: the calculation has been 
optimized, and the order is now correctly synced between primary and backups. 
# Fixed rebalance future compatibility issues for delayed partition owning. 
The rebalance future is now not completed until the checkpoint is done, and it 
is correctly checked for compatibility with newer topology versions. 
# Removed start and end versions for transactions.
# A checkpoint is triggered for each group that has WAL disabled during 
rebalancing, so durability for a group is enabled as soon as the group is 
rebalanced. 
# The cache group sync future is completed only if rebalancing finished without 
cancellation, meaning either the data was loaded or an unrecoverable error 
occurred. 
# Fixed the partition state transition from RENTING to OWNING/LOST when the 
last supplier has left. 
# Removed synchronous partition destroying to avoid deadlocks. All partitions 
are now destroyed by the partition eviction manager.
# Removed the delay equal to DFLT_PRELOAD_RESEND_TIMEOUT before the owning 
state is acked for a group after rebalancing. As a side effect, ideal 
distribution is achieved faster in unit tests. 
# Fixed multiple flaky tests running on unstable topology, whose behavior 
changed after introducing items 8 and 9.
# Fixed an attempt to rebalance an OWNING partition after a rebalance restarted 
after a cancellation.

> Proper handling of a rebalancing with disabled WAL
> --------------------------------------------------
>
>                 Key: IGNITE-13171
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13171
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.8
>            Reporter: Alexey Scherbakov
>            Assignee: Alexey Scherbakov
>            Priority: Major
>             Fix For: 2.10
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current implementation of optimized rebalancing with disabled WAL, used 
> when all local partitions are MOVING and persistence is enabled, has multiple 
> flaws:
>  # There are races between a concurrent topology change and partition owning 
> after a checkpoint, causing consistency issues.
>  # Partitions will not be owned after the topology has changed, even if the 
> new topology version is compatible with the previous one.
> This is the reason for flaky tests [1] [2].
> [1] 
> [https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=1140020093875959306&branch=%3Cdefault%3E&tab=testDetails]
> [2] 
> https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=7421637930905964922&branch=%3Cdefault%3E&tab=testDetails



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
