[ 
https://issues.apache.org/jira/browse/IGNITE-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-25665:
-----------------------------------
    Description: 
h3. Motivation

We need to persistently track pending rows to ensure they are preserved after a 
cluster restart. Otherwise, we risk losing them and inadvertently marking 
transaction statuses as aborted (as described in the root issue). This could 
lead to resolving write intents as aborted, resulting in permanent client data 
loss.
h3. Definition of done

Pending rows are persisted and fully recovered upon cluster restart.
h3. Design

The idea is to have a persistent doubly linked list, built over the subset of 
row versions that represent write intents.

Currently, each version chain has the following structure:

{code:java}
Chain 1 = [timestamp, row] -> ... -> []

Chain 2 = [timestamp, row] -> ... -> []{code}
What we want to do is connect all the chains that have write intents as 
their heads (i.e. {{timestamp == 0L}}), and enrich them with the data needed 
to restore information about pending transactions:

{code:java}
...
          ^                     | 
          |                     v
Chain 1 = [rowId, timestamp, row] -> ... -> []
          ^                     | 
          |                     v
Chain 2 = [rowId, timestamp, row] -> ... -> []
          ^                     | 
          |                     v
...{code}
This means enriching the {{RowVersion}} class with:

 * {{RowId}} (16 bytes).
 * Link to the previous list node, "nullable", 6 bytes.
 * Link to the next list node, "nullable", 6 bytes.

That adds up to 28 bytes, which is already a lot. The commit replication group 
ID and the transaction ID will instead be stored in a tree as metadata, because 
keeping them in every row version would add another 22 bytes of constantly 
duplicated data.
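
The size arithmetic above can be illustrated with a small sketch. It is not the actual page format: the class name {{PendingRowLayout}}, the field order, and the use of 0 as the encoding of a "null" link are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper; names and encoding are illustrative, not Ignite's real layout. */
final class PendingRowLayout {
    /** RowId size, as stated in the description. */
    static final int ROW_ID_SIZE = 16;

    /** Size of one "nullable" link to a neighbouring list node. */
    static final int LINK_SIZE = 6;

    /** Extra bytes per write intent: RowId + previous link + next link = 28. */
    static final int EXTRA_SIZE = ROW_ID_SIZE + 2 * LINK_SIZE;

    /** Assumption: an all-zero link means "no neighbour". */
    static final long NULL_LINK = 0L;

    /** Writes the lower 48 bits of {@code link} as 6 bytes at {@code offset}. */
    static void writeLink(ByteBuffer buf, int offset, long link) {
        if ((link >>> 48) != 0) {
            throw new IllegalArgumentException("Link does not fit into 6 bytes: " + link);
        }
        buf.putShort(offset, (short) (link >>> 32)); // upper 2 bytes
        buf.putInt(offset + 2, (int) link);          // lower 4 bytes
    }

    /** Reads a 6-byte link written by {@link #writeLink}. */
    static long readLink(ByteBuffer buf, int offset) {
        long high = buf.getShort(offset) & 0xFFFFL;
        long low = buf.getInt(offset + 2) & 0xFFFFFFFFL;
        return (high << 32) | low;
    }
}
```

In this sketch a freshly allocated buffer reads back {{NULL_LINK}} for both links, which is why 0 was picked as the "null" marker; a real implementation may use a different convention.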

Since version chains don't store the transaction ID, we will read it from the 
version chain tree when starting the replica.

{{// TODO it is possible to introduce a *getAll* operation on the B+Tree, which 
should make this reading faster.}}

A new partition storage API will be required to read this list.
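
As a rough in-memory model of the design (not the persistent implementation), the list maintenance and the read on replica start could look like the sketch below. Every name in it is hypothetical: {{PendingRowsList}}, its methods, the use of {{UUID}} as a stand-in for {{RowId}}, and the {{txIdLookup}} function that stands in for reading the transaction ID from the version chain tree.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.Function;

/** In-memory model of the pending rows list; all names are hypothetical. */
final class PendingRowsList {
    /** A write-intent head of a version chain, with the two extra links. */
    private static final class Node {
        final UUID rowId; // stand-in for RowId
        Node prev;
        Node next;

        Node(UUID rowId) {
            this.rowId = rowId;
        }
    }

    private Node head;

    /** Stand-in for locating a chain head by row ID. */
    private final Map<UUID, Node> byRowId = new HashMap<>();

    /** Links a new write intent at the head of the list; no-op if already linked. */
    void addWriteIntent(UUID rowId) {
        if (byRowId.containsKey(rowId)) {
            return;
        }
        Node node = new Node(rowId);
        node.next = head;
        if (head != null) {
            head.prev = node;
        }
        head = node;
        byRowId.put(rowId, node);
    }

    /** Unlinks a write intent once it is committed or aborted. */
    void removeWriteIntent(UUID rowId) {
        Node node = byRowId.remove(rowId);
        if (node == null) {
            return;
        }
        if (node.prev != null) {
            node.prev.next = node.next;
        } else {
            head = node.next;
        }
        if (node.next != null) {
            node.next.prev = node.prev;
        }
    }

    /** Walks the list from the head and returns the row IDs of all write intents. */
    List<UUID> pendingRowIds() {
        List<UUID> result = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) {
            result.add(n.rowId);
        }
        return result;
    }

    /** Groups pending row IDs by transaction ID, resolved through the given lookup. */
    Map<UUID, List<UUID>> recoverPendingTransactions(Function<UUID, UUID> txIdLookup) {
        Map<UUID, List<UUID>> byTx = new LinkedHashMap<>();
        for (UUID rowId : pendingRowIds()) {
            byTx.computeIfAbsent(txIdLookup.apply(rowId), k -> new ArrayList<>()).add(rowId);
        }
        return byTx;
    }
}
```

Both links are needed so that a write intent can be unlinked in constant time on commit or abort without scanning the list. In this model, each call to {{txIdLookup}} corresponds to one lookup in the version chain tree.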

Obviously, the change must be backwards-compatible.

We should probably disable it for {{aimem}}: there it is pure memory overhead 
and provides nothing useful.

  was:
h3. Motivation

We need to persistently track pending rows to ensure they are preserved after a 
cluster restart. Otherwise, we risk losing them and inadvertently marking 
transaction statuses as aborted (as described in the root issue). This could 
lead to resolving write intents as aborted, resulting in permanent client data 
loss.
h3. Definition of done

Pending rows are persisted and fully recovered upon cluster restart.


> Persist pending entries list in "aipersist" engine
> --------------------------------------------------
>
>                 Key: IGNITE-25665
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25665
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Vladislav Pyatkov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3


