[ 
https://issues.apache.org/jira/browse/ARTEMIS-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francesco Nigro updated ARTEMIS-3430:
-------------------------------------
    Description: 
This can be seen either as a bug or as an improvement over the existing 
self-heal behaviour of the activation sequence introduced by 
https://issues.apache.org/jira/browse/ARTEMIS-3340.

In short, the existing protocol to increase the activation sequence while 
un-replicated is:
# remote i -> -(i + 1), i.e. remote CLAIM
# local i -> (i + 1), i.e. local commit
# remote -(i + 1) -> (i + 1), i.e. remote COMMIT

This protocol has been designed to let witness brokers recognize when their 
data is no longer up to date, and to save them from throwing it away while it 
is still valuable, during a partial failure while increasing the activation 
sequence.
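
The three steps above can be sketched as follows. This is only an 
illustrative model, not the Artemis implementation: the names Broker, 
CoordinatedSequence, compare_and_set and fail_after_step are hypothetical, and 
the coordinated value is assumed to be updated with a compare-and-set.

```python
from dataclasses import dataclass


@dataclass
class Broker:
    """Hypothetical stand-in for a broker's locally persisted sequence."""
    local_sequence: int = 0


class CoordinatedSequence:
    """Hypothetical stand-in for the remotely coordinated sequence."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def compare_and_set(self, expected: int, new: int) -> bool:
        # Assumption: the coordinated value only changes if it still holds
        # the value the caller expects.
        if self.value != expected:
            return False
        self.value = new
        return True


def increase_activation_sequence(broker, remote, fail_after_step=None):
    """Run the claim / local commit / remote commit steps.

    fail_after_step (hypothetical) simulates a crash after step 1 or 2.
    """
    i = broker.local_sequence
    # Step 1: remote CLAIM, i -> -(i + 1)
    if not remote.compare_and_set(i, -(i + 1)):
        raise RuntimeError("coordinated sequence is not at the expected value")
    if fail_after_step == 1:
        return
    # Step 2: local commit, i -> (i + 1)
    broker.local_sequence = i + 1
    if fail_after_step == 2:
        return
    # Step 3: remote COMMIT, -(i + 1) -> (i + 1)
    if not remote.compare_and_set(-(i + 1), i + 1):
        raise RuntimeError("claimed value changed before commit")
```

A negative coordinated value marks a claim that has not been committed yet, 
which is the state a partial failure after step 1 or step 2 leaves behind.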

In the current version, self-repair is allowed only if the live broker has 
performed step 2 but not step 3, i.e. the local activation sequence is updated 
but the coordinated one isn't committed yet.
If the failing broker is restarted, it can "fix" the coordinated sequence and 
move on to become live again, but if step 2 fails (or just never happens), the 
coordinated activation sequence cannot be fixed without admin intervention, 
after inspecting *all* local activation sequences.
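
As a minimal sketch of that condition (the function name and the plain 
integer arguments are assumptions made for illustration, not the actual 
Artemis code), the current rule amounts to checking that the restarting 
broker's local sequence matches the value that is claimed but not committed:

```python
def can_self_repair(local_sequence: int, coordinated_sequence: int) -> bool:
    """Return True if a restarting broker may commit the pending claim.

    A negative coordinated value -(i + 1) is a claim awaiting commit. Under
    the current rule only a broker whose local sequence is already (i + 1),
    i.e. one that completed the local commit, may fix it.
    """
    is_claimed = coordinated_sequence < 0
    return is_claimed and local_sequence == -coordinated_sequence
```

For example, with a coordinated value of -4, a broker whose local sequence is 
4 can repair, while one still at 3 cannot.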

The reason why other brokers cannot "fix" the sequence is that the local 
sequence of the failed broker is unknown, and just rolling back the claimed one 
(to the previous value or to the right committed value) can make the failed 
broker believe that it has up-to-date data too, causing journal misalignments.

The solution can be to fix the claimed sequence by moving it to the right 
commit value while forbidding other brokers from running un-replicated using it.
This is achieved by increasing it further *after* the repair: this prematurely 
ages the other in-sync brokers (including the failed one), but allows 
auto-repair without admin intervention.
The sole drawback of this strategy is that a further failure of the repairing 
broker while increasing the sequence gives it exclusive responsibility for 
auto-repair (again, on restart), because no other broker can have a 
high-enough local sequence.
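
One possible reading of this strategy, traced on the coordinated value alone, 
is sketched below. The function name is hypothetical, and it is an assumption 
that the further increase goes through the usual claim and commit steps.

```python
def repair_trace(coordinated_sequence: int) -> list:
    """Return the successive coordinated values during an auto-repair.

    Assumes the coordinated value is a pending claim -(i + 1).
    """
    if coordinated_sequence >= 0:
        raise ValueError("nothing to repair: the sequence is not claimed")
    committed = -coordinated_sequence      # the right commit value, (i + 1)
    further = committed + 1                # the further increase, (i + 2)
    return [
        coordinated_sequence,  # pending claim left by the failed broker
        committed,             # repair: claim moved to its commit value
        -further,              # repairing broker claims the next value
        further,               # repairing broker commits the next value
    ]
```

Starting from a claim of -4, the trace ends at 5. The failed broker's local 
sequence can only be 3 or 4, so it cannot match the final value, which is how 
it (and every other in-sync broker) gets aged. If the repairing broker fails 
while the value is -5, only that broker can hold a local sequence of 5, which 
is the drawback described above.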

  was:
This can be seen both as a bug or an improvement over the existing self-heal 
behaviour of activation sequence introduced by 
https://issues.apache.org/jira/browse/ARTEMIS-3340.

In short, the existing protocol to increase activation sequence while 
un-replicated is:
# remote i -> -(i + 1) ie remote CLAIM 
# local i -> (i + 1) ie local commit
# remote -(i + 1) -> (i + 1) ie remote COMMIT

This protocol has been designed to allow witness brokers to acknowledge if 
their data is no longer up-to-date and to save them to throw it away if still 
valuable, during a partial failure while increasing activation sequence.

In the current version, self-repairing is allowed just if live broker has 
performed 2. but not 3. ie local activation sequence is updated, but 
coordinated one isn't committed yet.
If the failing broker is restarted it can "fix" the coordinated sequence and 
move on to become live again, but if 2. fail (or just never happen), the 
coordinated activation sequence cannot be fixed if not with some admin 
intervention, after inspecting *all* local activation sequences.

The reason why other brokers cannot "fix" the sequence is because the local 
sequence of the failed broker is unknown and just roll-backing the claimed one 
(to the previous or to the right committed value) can makes the failed broker 
to believe to have up-to-date data too, causing journal misalignments.

The solution to this can be to fix the claimed sequence moving it to the right 
commit value while forbidding any broker to run un-replicated using it.
This is achieved by further increasing it *after* repaired: it would 
prematurely age other in-sync brokers (including the failed one), but allowing 
auto-repair without admin intervention.
The sole drawback of this strategy is that a further fail of the repairing 
broker while increasing sequence will give to it an exclusive responsibility to 
auto-repair (again, on restart) because no other brokers can have an 
high-enough local sequence.


> Activation Sequence Auto-Repair
> -------------------------------
>
>                 Key: ARTEMIS-3430
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3430
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Major
>             Fix For: 2.19.0
>
>
> This can be seen either as a bug or as an improvement over the existing 
> self-heal behaviour of the activation sequence introduced by 
> https://issues.apache.org/jira/browse/ARTEMIS-3340.
> In short, the existing protocol to increase the activation sequence while 
> un-replicated is:
> # remote i -> -(i + 1), i.e. remote CLAIM
> # local i -> (i + 1), i.e. local commit
> # remote -(i + 1) -> (i + 1), i.e. remote COMMIT
> This protocol has been designed to let witness brokers recognize when their 
> data is no longer up to date, and to save them from throwing it away while 
> it is still valuable, during a partial failure while increasing the 
> activation sequence.
> In the current version, self-repair is allowed only if the live broker has 
> performed step 2 but not step 3, i.e. the local activation sequence is 
> updated but the coordinated one isn't committed yet.
> If the failing broker is restarted, it can "fix" the coordinated sequence 
> and move on to become live again, but if step 2 fails (or just never 
> happens), the coordinated activation sequence cannot be fixed without admin 
> intervention, after inspecting *all* local activation sequences.
> The reason why other brokers cannot "fix" the sequence is that the local 
> sequence of the failed broker is unknown, and just rolling back the claimed 
> one (to the previous value or to the right committed value) can make the 
> failed broker believe that it has up-to-date data too, causing journal 
> misalignments.
> The solution can be to fix the claimed sequence by moving it to the right 
> commit value while forbidding other brokers from running un-replicated using 
> it.
> This is achieved by increasing it further *after* the repair: this 
> prematurely ages the other in-sync brokers (including the failed one), but 
> allows auto-repair without admin intervention.
> The sole drawback of this strategy is that a further failure of the 
> repairing broker while increasing the sequence gives it exclusive 
> responsibility for auto-repair (again, on restart), because no other broker 
> can have a high-enough local sequence.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
