[ 
https://issues.apache.org/jira/browse/IGNITE-25501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Tkalenko updated IGNITE-25501:
-------------------------------------
    Description: 
While analyzing IGNITE-24802, it was discovered that if a snapshot is taken 
before the partition leader is stopped, thereby disabling raft log suffix 
truncations, then when the node returns the logs show the message "FATAL 
ERROR: Can't truncate logs before appliedId=LogId [index=26, term=2], 
lastIndexKept=0". Despite this error, the partition is reported as healthy, 
and records that should not be there can be read from it. This needs to be 
fixed.

One simple option is to put the partition into an erroneous state so that the 
user can then fix the situation using the disaster recovery mechanism.

To reproduce this, take a raft snapshot in 
*org.apache.ignite.internal.ItTruncateRaftLogAndRestartNodesTest#enterNodeWithIndexGreaterThanCurrentMajority*
before stopping the node with index "2", for example using 
*org.apache.ignite.internal.replicator.Replica#createSnapshotOn*.

It is suggested to add a new test for this behavior, since the current test 
has another problem (IGNITE-25502). The partition state can be obtained 
through 
*org.apache.ignite.internal.table.distributed.disaster.DisasterRecoveryManager#localTablePartitionStates*.

  was:
While analyzing IGNITE-24802, it was discovered that if a snapshot is taken 
before the partition leader is stopped, thereby disabling raft log suffix 
truncations, then when the node returns the logs show the message "FATAL 
ERROR: Can't truncate logs before appliedId=LogId [index=26, term=2], 
lastIndexKept=0". Despite this error, the partition is reported as healthy, 
and records that should not be there can be read from it. This needs to be 
fixed.

One simple option is to put the partition into an erroneous state so that the 
user can then fix the situation using the disaster recovery mechanism.

To reproduce this, take a raft snapshot in 
*org.apache.ignite.internal.ItTruncateRaftLogAndRestartNodesTest#enterNodeWithIndexGreaterThanCurrentMajority*
before stopping the node with index "2", for example using 
*org.apache.ignite.internal.replicator.Replica#createSnapshotOn*.

It is suggested to add a new test for this behavior, since the current test 
has another problem. The partition state can be obtained through 
*org.apache.ignite.internal.table.distributed.disaster.DisasterRecoveryManager#localTablePartitionStates*.


> Incorrect partition state when entering node with index greater than current 
> majority after snapshot
> ----------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-25501
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25501
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Kirill Tkalenko
>            Priority: Major
>              Labels: ignite-3
>
> While analyzing IGNITE-24802, it was discovered that if a snapshot is taken 
> before the partition leader is stopped, thereby disabling raft log suffix 
> truncations, then when the node returns the logs show the message 
> "FATAL ERROR: Can't truncate logs before appliedId=LogId [index=26, term=2], 
> lastIndexKept=0". Despite this error, the partition is reported as healthy, 
> and records that should not be there can be read from it. This needs to be 
> fixed.
> One simple option is to put the partition into an erroneous state so that the 
> user can then fix the situation using the disaster recovery mechanism.
> To reproduce this, take a raft snapshot in 
> *org.apache.ignite.internal.ItTruncateRaftLogAndRestartNodesTest#enterNodeWithIndexGreaterThanCurrentMajority*
> before stopping the node with index "2", for example using 
> *org.apache.ignite.internal.replicator.Replica#createSnapshotOn*.
> It is suggested to add a new test for this behavior, since the current test 
> has another problem (IGNITE-25502). The partition state can be obtained 
> through 
> *org.apache.ignite.internal.table.distributed.disaster.DisasterRecoveryManager#localTablePartitionStates*.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
