[ https://issues.apache.org/jira/browse/IGNITE-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Puchkovskiy resolved IGNITE-25079.
----------------------------------------
    Fix Version/s: 3.2
       Resolution: Fixed

Fixed by IGNITE-25665

> Partial data loss after cluster restart
> ---------------------------------------
>
>                 Key: IGNITE-25079
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25079
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Bessonov
>            Priority: Blocker
>              Labels: ignite-3
>             Fix For: 3.2
>
>
> h3. Scenario
>  # Begin a "long" explicit transaction that spans several partitions.
>  # Insert a few entries.
>  # For some reason, nodes in the cluster initiate a flush (checkpoint) on the table. This might happen in a real environment.
>  # Insert more entries in the same transaction.
>  # Commit.
>  # Wait for the flush to complete, then stop the cluster.
>  # Start the cluster.
>  # Wait for about a minute. In my test, I waited long enough for two TX state vacuum cycles to complete.
>  # Read the data.
> h3. Expected result
> You see the data of the entire transaction.
> h3. Actual result
> Data inserted before the checkpoint suddenly disappeared.
> h3. Test
> The following test should be inserted into {{{}ItInternalTableTest{}}}. It 
> should be greatly improved before committing, because it is long (1m+) and 
> ugly.
> {code:java}
> @Test
> public void testIgnite25079() throws Exception {
>     IgniteImpl node = node();
>     KeyValueView<Tuple, Tuple> keyValueView = table.keyValueView();
>
>     node.transactions().runInTransaction(tx -> {
>         // First half of the transaction, written before the checkpoint.
>         for (int i = 0; i < 15; i++) {
>             putValue(keyValueView, i, tx);
>         }
>
>         // Force a flush (checkpoint) of partition 0 mid-transaction.
>         CompletableFuture<Void> flushFuture = unwrapTableViewInternal(table).internalTable()
>                 .storage().getMvPartition(0).flush(true);
>         assertThat(flushFuture, willCompleteSuccessfully());
>
>         // Second half of the transaction, written after the checkpoint.
>         for (int i = 15; i < 30; i++) {
>             putValue(keyValueView, i, tx);
>         }
>     });
>
>     // Restart the node so that local recovery replays the RAFT log
>     // starting from the checkpoint, i.e. from the middle of the transaction.
>     CLUSTER.stopNode(0);
>     node = unwrapIgniteImpl(CLUSTER.startNode(0));
>     table = node.tables().table(TABLE_NAME);
>
>     // Wait long enough for the TX state vacuum to delete the transaction state.
>     Thread.sleep(61_000);
>
>     InternalTable internalTable = unwrapTableViewInternal(table).internalTable();
>
>     CompletableFuture<List<BinaryRow>> getAllFuture = internalTable.getAll(
>             LongStream.range(0, 30).mapToObj(ItInternalTableTest::createKeyRow).collect(Collectors.toList()),
>             node.clock().now(),
>             node.node()
>     );
>     assertThat(getAllFuture, willCompleteSuccessfully());
>
>     // All 30 rows must be present and non-null.
>     List<BinaryRow> res = getAllFuture.get();
>     assertEquals(30, res.size());
>     assertEquals(30, res.stream().filter(Objects::nonNull).count());
> } {code}
> h3. Why this happens
> {{StorageUpdateHandler#pendingRows}} is to blame. When we run a cleanup 
> process on a transaction, we read a list of row IDs from this field, assuming 
> it has everything we need. In the provided test, local RAFT log reapplication 
> starts from the middle of the transaction. During that reapplication, we 
> encounter 15 inserted records, put them into {{{}pendingRows{}}}, and then 
> execute a cleanup request with that information on hand.
> In other words, the cleanup command only resolves 15 write intents out of 30.
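> To make the mechanism concrete, here is a minimal sketch of that bookkeeping, 
> with simplified, hypothetical signatures (the real {{StorageUpdateHandler}} 
> is more involved):
> {code:java}
> import java.util.HashSet;
> import java.util.Map;
> import java.util.Set;
> import java.util.UUID;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Illustrative only: tracks which rows a transaction wrote, in memory.
> class PendingRowsSketch {
>     // txId -> row IDs observed while applying commands on this node.
>     // A plain Long stands in for Ignite's actual row ID type.
>     private final Map<UUID, Set<Long>> pendingRows = new ConcurrentHashMap<>();
>
>     // Called while applying an update command from the RAFT log.
>     void handleUpdate(UUID txId, long rowId) {
>         pendingRows.computeIfAbsent(txId, id -> new HashSet<>()).add(rowId);
>     }
>
>     // Called while applying the cleanup command for the transaction.
>     Set<Long> rowsToCleanup(UUID txId) {
>         // After a restart, log replay starts at the last checkpoint. Rows
>         // written before the checkpoint are never re-observed, so their
>         // write intents are missing here and stay unresolved.
>         Set<Long> rows = pendingRows.remove(txId);
>         return rows == null ? Set.of() : rows;
>     }
> }
> {code}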
> If we wait long enough, the TX state storage will delete the state of our 
> transaction: all cleanup commands have been replicated, so there is no reason 
> not to. But 15 write intents remain unresolved, and from that point on their 
> state will be determined as ABORTED.
> ABORTED write intents are rolled back upon being encountered, which is why in 
> the test we can only read 15 out of 30 records. This is clearly a data loss.
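> The fallback itself can be sketched like this (names are illustrative, not 
> the real resolution API):
> {code:java}
> import java.util.Map;
> import java.util.UUID;
>
> enum TxState { COMMITTED, ABORTED }
>
> // Illustrative only: resolving a write intent after the transaction's state
> // may already have been vacuumed from the TX state storage.
> class IntentResolutionSketch {
>     static TxState resolve(UUID txId, Map<UUID, TxState> txStateStorage) {
>         TxState state = txStateStorage.get(txId);
>         if (state == null) {
>             // The state was vacuumed. Since every cleanup command is assumed
>             // to have been replicated, an intent with no recorded state is
>             // treated as belonging to an aborted transaction. This is what
>             // rolls back the 15 orphaned write intents in the test.
>             return TxState.ABORTED;
>         }
>         return state;
>     }
> }
> {code}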
> But it also might introduce {*}data inconsistency on different replicas{*}. 
> The reason is that checkpoints happen at different moments in time on 
> different nodes, so the {{pendingRows}} field will have different contents on 
> different nodes during their local recovery phase. As a result, some RO 
> queries will yield different results depending on where they are run.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
