[ 
https://issues.apache.org/jira/browse/IGNITE-7832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693108#comment-16693108
 ] 

Andrew Mashenkov edited comment on IGNITE-7832 at 11/20/18 11:45 AM:
---------------------------------------------------------------------

[~VitaliyB],
 # A single event covering all lost partitions per topology change would be 
more convenient for the user.
 # Resetting partition state should be consistent with the topology version, 
because a partition can be lost again during the recovery process. 
 So the simplest way is to pass the topology version to *resetLostPartitions()* 
and fail the call if the topology has changed between the moment the user 
detects the LOST event and the moment the user calls *resetLostPartitions()*. 
 # Another possible way is to introduce a 'partition recovery handler' (e.g. a 
user closure) that is triggered asynchronously during PME, and to add a new 
partition state for tracking 'partition loss during recovery'. This approach 
looks a bit tricky, though, and requires a deep understanding of the PME process.
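
The topology-version check from item 2 can be sketched roughly as follows. This is a toy model, not Ignite code: the class VersionedLostPartitions, its methods, and the plain long used as a topology version are all hypothetical and only illustrate failing a reset when the topology has moved on since the user observed the loss.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy stand-in for a partition topology. All names here are hypothetical. */
class VersionedLostPartitions {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<Integer> lost = new TreeSet<>();
    private long topVer = 1;

    /** Exchange side: a topology change bumps the version and may mark partitions LOST. */
    long onTopologyChange(Set<Integer> newlyLost) {
        lock.writeLock().lock();
        try {
            topVer++;
            lost.addAll(newlyLost);
            return topVer;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** User side: the reset is applied only if the topology is still at the observed version. */
    boolean resetLostPartitions(long expectedTopVer) {
        lock.writeLock().lock();
        try {
            if (topVer != expectedTopVer)
                return false; // Topology changed since the LOST event: refuse the stale reset.
            lost.clear();
            return true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Returns a snapshot of the partitions currently marked LOST. */
    Set<Integer> lostPartitions() {
        lock.readLock().lock();
        try {
            return new TreeSet<>(lost);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

In this sketch the version comparison happens under the same write lock that the exchange side takes, so a user thread that was blocked on the lock while a new loss was being recorded sees the bumped version and its reset is rejected instead of clearing the newly lost partition.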

Feel free to split the ticket into multiple ones if you find it too complicated 
to implement in a single step.


> Ignite.resetLostPartitions() resets state under race.
> -----------------------------------------------------
>
>                 Key: IGNITE-7832
>                 URL: https://issues.apache.org/jira/browse/IGNITE-7832
>             Project: Ignite
>          Issue Type: Task
>          Components: cache
>            Reporter: Andrew Mashenkov
>            Assignee: Vitaliy Biryukov
>            Priority: Critical
>             Fix For: 2.8
>
>
> Assume we have an event listener that detects partition loss events and 
> applies some actions to recover the lost data.
> After the recovery process has finished, Ignite.resetLostPartitions() should 
> be called to mark all lost cache partitions as healthy.
> It is possible that Ignite.resetLostPartitions() is called during an exchange, 
> right before a new partition loss event is fired.
> E.g. the exchange thread owns the GridDhtPartitionTopologyImpl write lock in 
> the detectLostPartitions() method, while the user thread waits for the lock 
> inside Ignite.resetLostPartitions().
> So, after a new partition loss is detected, it will not be possible to 
> abort the user action, and the state of the just-lost partition will be reset.
> For that case, we should either abort resetLostPartitions() or somehow reset 
> partition state with respect to a topology version provided by the user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
