[
https://issues.apache.org/jira/browse/IGNITE-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy resolved IGNITE-25079.
----------------------------------------
Fix Version/s: 3.2
Resolution: Fixed
Fixed by IGNITE-25665
> Partial data loss after cluster restart
> ---------------------------------------
>
> Key: IGNITE-25079
> URL: https://issues.apache.org/jira/browse/IGNITE-25079
> Project: Ignite
> Issue Type: Bug
> Reporter: Ivan Bessonov
> Priority: Blocker
> Labels: ignite-3
> Fix For: 3.2
>
>
> h3. Scenario
> # Begin a "long" explicit transaction that spans several partitions.
> # Insert a few entries.
> # For some reason, the nodes of the cluster initiate a flush (checkpoint) on the table. This can happen in a real environment.
> # Insert more entries in the same transaction.
> # Commit.
> # Wait for the flush to complete, then stop the cluster.
> # Start the cluster.
> # Wait for about a minute. In my test, I waited long enough for two TX state vacuum cycles to complete.
> # Read the data.
> h3. Expected result
> You see the data of the entire transaction.
> h3. Actual result
> The data inserted before the checkpoint has disappeared.
> h3. Test
> The following test should be inserted into {{{}ItInternalTableTest{}}}. It
> should be greatly improved before committing, because it is long (1m+) and
> ugly.
> {code:java}
> @Test
> public void testIgnite25079() throws Exception {
>     IgniteImpl node = node();
>
>     KeyValueView<Tuple, Tuple> keyValueView = table.keyValueView();
>
>     node.transactions().runInTransaction(tx -> {
>         for (int i = 0; i < 15; i++) {
>             putValue(keyValueView, i, tx);
>         }
>
>         // Force a checkpoint in the middle of the transaction.
>         CompletableFuture<Void> flushFuture = unwrapTableViewInternal(table).internalTable()
>                 .storage().getMvPartition(0).flush(true);
>         assertThat(flushFuture, willCompleteSuccessfully());
>
>         for (int i = 15; i < 30; i++) {
>             putValue(keyValueView, i, tx);
>         }
>     });
>
>     CLUSTER.stopNode(0);
>     node = unwrapIgniteImpl(CLUSTER.startNode(0));
>
>     table = node.tables().table(TABLE_NAME);
>
>     // Wait long enough for two TX state vacuum cycles to complete.
>     Thread.sleep(61_000);
>
>     InternalTable internalTable = unwrapTableViewInternal(table).internalTable();
>
>     CompletableFuture<List<BinaryRow>> getAllFuture = internalTable.getAll(
>             LongStream.range(0, 30).mapToObj(ItInternalTableTest::createKeyRow).collect(Collectors.toList()),
>             node.clock().now(),
>             node.node()
>     );
>     assertThat(getAllFuture, willCompleteSuccessfully());
>
>     List<BinaryRow> res = getAllFuture.get();
>
>     // All 30 entries of the committed transaction must be readable.
>     assertEquals(30, res.size());
>     assertEquals(30, res.stream().filter(Objects::nonNull).count());
> } {code}
> h3. Why this happens
> {{StorageUpdateHandler#pendingRows}} is to blame. When we run the cleanup
> process for a transaction, we read the list of row IDs from this field,
> assuming it contains every row the transaction has written. In the provided
> test, local RAFT log reapplication starts from the middle of the transaction.
> During that reapplication we meet the 15 records inserted after the
> checkpoint, put them into {{{}pendingRows{}}}, and then execute a cleanup
> request with only that information on hand.
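> A minimal sketch of this bookkeeping (the class and method names here are
> illustrative, not the actual Ignite API): {{pendingRows}} is an in-memory map
> populated only by the updates a node actually applies or replays, so a
> recovery that starts mid-transaction registers only the tail of the writes.
> {code:java}
> import java.util.Map;
> import java.util.Set;
> import java.util.UUID;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical, simplified model of StorageUpdateHandler#pendingRows.
> class PendingRowsSketch {
>     private final Map<UUID, Set<Long>> pendingRows = new ConcurrentHashMap<>();
>
>     // Called for every update command the node applies (or replays).
>     void handleUpdate(UUID txId, long rowId) {
>         pendingRows.computeIfAbsent(txId, id -> ConcurrentHashMap.newKeySet()).add(rowId);
>     }
>
>     // Cleanup resolves only the rows it finds in the map.
>     Set<Long> takeRowsToResolve(UUID txId) {
>         Set<Long> rows = pendingRows.remove(txId);
>         return rows == null ? Set.of() : rows;
>     }
>
>     public static void main(String[] args) {
>         PendingRowsSketch handler = new PendingRowsSketch();
>         UUID txId = UUID.randomUUID();
>
>         // Rows 0..14 were persisted by the checkpoint, so after restart the
>         // RAFT log replay starts past them and handleUpdate never sees them.
>         for (long rowId = 15; rowId < 30; rowId++) {
>             handler.handleUpdate(txId, rowId);
>         }
>
>         // Cleanup believes the transaction touched only 15 rows.
>         System.out.println("resolved " + handler.takeRowsToResolve(txId).size() + " of 30");
>     }
> } {code}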
> In other words, the cleanup command resolves only 15 of the 30 write intents.
> If we wait long enough, the TX state storage will delete the state of our
> transaction: all cleanup commands have been replicated, so there is no reason
> not to. However, 15 write intents remain unresolved, and from that point on
> their state will be determined as ABORTED.
> ABORTED write intents are rolled back when they are encountered, which is why
> the test can only read 15 of the 30 records. This is clearly data loss.
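> A sketch of that resolution fallback (again with illustrative names, not the
> real API): once the commit record has been vacuumed from the TX state
> storage, a reader that encounters a leftover write intent finds no
> transaction state and has to assume ABORTED, so the intent is rolled back.
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Optional;
> import java.util.UUID;
>
> // Hypothetical, simplified model of write intent resolution.
> class IntentResolutionSketch {
>     enum TxState { COMMITTED, ABORTED }
>
>     private final Map<UUID, TxState> txStateStorage = new HashMap<>();
>
>     // Resolve a write intent left by txId. A missing state entry is
>     // indistinguishable from ABORTED, so the pending value is discarded.
>     Optional<String> resolveIntent(UUID txId, String pendingValue) {
>         return txStateStorage.get(txId) == TxState.COMMITTED
>                 ? Optional.of(pendingValue)
>                 : Optional.empty();
>     }
>
>     public static void main(String[] args) {
>         IntentResolutionSketch sketch = new IntentResolutionSketch();
>         UUID txId = UUID.randomUUID();
>         // The COMMITTED record was vacuumed, so there is no entry for txId.
>         System.out.println("intent kept: " + sketch.resolveIntent(txId, "value").isPresent());
>     }
> } {code}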
> It might also introduce {*}data inconsistency between replicas{*}. Checkpoints
> happen at different moments on different nodes, so the {{pendingRows}} field
> will have different contents on different nodes during their local recovery
> phase. As a result, some RO queries will yield different results depending on
> where they are run.
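> To illustrate the divergence (a purely hypothetical model): each node replays
> the RAFT log from its own checkpointed index, so each node cleans up a
> different suffix of the transaction's writes and rolls back the rest.
> {code:java}
> import java.util.Set;
> import java.util.stream.Collectors;
> import java.util.stream.IntStream;
>
> // Hypothetical model: a node whose checkpoint covered the first
> // `checkpointedRows` writes replays (and therefore cleans up) only the
> // remaining suffix; the checkpointed prefix is later rolled back as ABORTED.
> class ReplicaDivergenceSketch {
>     static Set<Integer> survivingRows(int checkpointedRows, int totalRows) {
>         return IntStream.range(checkpointedRows, totalRows).boxed().collect(Collectors.toSet());
>     }
>
>     public static void main(String[] args) {
>         // Node A checkpointed after 15 writes, node B after 20: an RO query
>         // reads 15 rows on A but only 10 rows on B.
>         System.out.println("node A reads " + survivingRows(15, 30).size() + " rows");
>         System.out.println("node B reads " + survivingRows(20, 30).size() + " rows");
>     }
> } {code}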
--
This message was sent by Atlassian Jira
(v8.20.10#820010)