[
https://issues.apache.org/jira/browse/IGNITE-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Puchkovskiy reassigned IGNITE-25665:
------------------------------------------
Assignee: Roman Puchkovskiy (was: Kirill Tkalenko)
> Persist pending entries list in "aipersist" engine
> --------------------------------------------------
>
> Key: IGNITE-25665
> URL: https://issues.apache.org/jira/browse/IGNITE-25665
> Project: Ignite
> Issue Type: Bug
> Reporter: Vladislav Pyatkov
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
>
> h3. Motivation
> We need to persistently track pending rows to ensure they are preserved after
> a cluster restart. Otherwise, we risk losing them and inadvertently marking
> transaction statuses as aborted (as described in the root issue). This could
> lead to resolving write intents as aborted, resulting in permanent client
> data loss.
> h3. Definition of done
> Pending rows are persisted and fully recovered upon cluster restart.
> h3. Design
> The idea is to have a persistent doubly linked list, constructed on the subset
> of row versions that represent write intents.
> Currently, each version chain has the following structure:
> {code:java}
> Chain 1 = [timestamp, row] -> ... -> []
> Chain 2 = [timestamp, row] -> ... -> []{code}
> What we want to do is to connect all the chains that have write intents as
> their heads (i.e. {{timestamp == 0L}}), and enrich them with the
> information needed to restore the pending transactions:
> {code:java}
>              ...
>             ^  |
>             |  v
> Chain 1 = [rowId, timestamp, row] -> ... -> []
>             ^  |
>             |  v
> Chain 2 = [rowId, timestamp, row] -> ... -> []
>             ^  |
>             |  v
>              ...{code}
> This means enriching the {{RowVersion}} class with:
> * {{RowId}} (16 bytes).
> * Link to the previous list node, "nullable", 6 bytes.
> * Link to the next list node, "nullable", 6 bytes.
> 28 bytes in total. That's a lot already. The commit replication group ID and
> transaction ID will be stored in a tree as metadata, because storing them
> here as well would add another 22 bytes of constantly duplicated data.
> Since version chains don't have a transaction ID, we will get it from the
> version chain tree when starting the replica.
> {{// TODO it is possible to introduce a *getAll* operation on the B+Tree,
> which should make this reading faster.}}
> A new partition storage API will be required to read this list.
> Obviously, the change must be backwards-compatible.
> We should probably disable it for {{aimem}}, because there it is just memory
> overhead and doesn't provide anything useful.
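
The list described in the Design section could be sketched roughly as follows. This is an in-memory illustration only, not the actual aipersist code: the class name {{PendingRowsListSketch}}, the {{WriteIntentNode}} type, the use of {{UUID}} in place of {{RowId}}, the {{long}} values standing in for 6-byte page links, and the choice to insert new nodes at the head are all assumptions made for the example. It shows why both links are useful: a write intent can be unlinked in constant time when it is committed or aborted, and the whole list can be walked on restart.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** In-memory sketch of the pending rows list; page links are simulated with long values. */
final class PendingRowsListSketch {
    /** Hypothetical value that plays the role of a "null" link. */
    static final long NULL_LINK = 0L;

    // Sizes taken from the issue description.
    static final int ROW_ID_BYTES = 16;
    static final int LINK_BYTES = 6;
    static final int EXTRA_BYTES_PER_WRITE_INTENT = ROW_ID_BYTES + 2 * LINK_BYTES;

    /** Head row version of a chain that holds a write intent (timestamp == 0L). */
    static final class WriteIntentNode {
        final UUID rowId; // stands in for the 16-byte RowId
        long prevLink = NULL_LINK;
        long nextLink = NULL_LINK;

        WriteIntentNode(UUID rowId) {
            this.rowId = rowId;
        }
    }

    /** Simulated storage: link -> node. A real engine would read nodes from pages. */
    private final Map<Long, WriteIntentNode> nodesByLink = new HashMap<>();
    private long headLink = NULL_LINK;
    private long nextFreeLink = 1L;

    /** Adds a write intent at the head of the list and returns its link. */
    long addWriteIntent(UUID rowId) {
        long link = nextFreeLink++;
        WriteIntentNode node = new WriteIntentNode(rowId);
        node.nextLink = headLink;
        if (headLink != NULL_LINK) {
            nodesByLink.get(headLink).prevLink = link;
        }
        nodesByLink.put(link, node);
        headLink = link;
        return link;
    }

    /** Unlinks a node without traversing the list, e.g. on commit or abort. */
    void removeWriteIntent(long link) {
        WriteIntentNode node = nodesByLink.remove(link);
        if (node == null) {
            return;
        }
        if (node.prevLink != NULL_LINK) {
            nodesByLink.get(node.prevLink).nextLink = node.nextLink;
        } else {
            headLink = node.nextLink;
        }
        if (node.nextLink != NULL_LINK) {
            nodesByLink.get(node.nextLink).prevLink = node.prevLink;
        }
    }

    /** Walks the list from the head, as a recovery step after restart might do. */
    List<UUID> pendingRowIds() {
        List<UUID> result = new ArrayList<>();
        long link = headLink;
        while (link != NULL_LINK) {
            WriteIntentNode node = nodesByLink.get(link);
            result.add(node.rowId);
            link = node.nextLink;
        }
        return result;
    }
}
```

In this sketch the transaction ID and commit replication group ID are deliberately absent from the node, matching the description's plan to keep them in a tree as metadata rather than duplicating them in every row version.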