[ 
https://issues.apache.org/jira/browse/IGNITE-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Rakov updated IGNITE-8529:
-------------------------------
    Description: 
We use sharp checkpointing of page memory in persistent mode. This means that
we write two types of records to the write-ahead log (WAL): logical (e.g. data
records) and physical (page snapshots + binary delta records). Physical records
are applied only when a node crashes or stops during an ongoing checkpoint. We
rely on the following invariant: checkpoint #(n-1) + all physical records =
checkpoint #n.
If the correctness of physical records is broken, an Ignite node may recover
with an incorrect page memory state, which in turn can cause unexpected delayed
errors. However, the consistency of physical records is poorly tested: only a
small part of our autotests perform node restarts, and even fewer of them stop
a node while a checkpoint is in progress.
We should implement an abstract test that:
1. Forces a checkpoint and freezes the memory state at the moment of the
checkpoint.
2. Performs the necessary test load.
3. Forces a checkpoint again, replays the WAL and checks that the page store as
of the previous checkpoint, with all physical records applied, exactly equals
the current checkpoint state.
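The three steps above can be sketched roughly as follows. This is only an
illustration of the invariant check, not Ignite code: the PhysicalRecord,
PageSnapshot and PageDelta types, the replay() and findMismatches() helpers,
and the representation of the page store as a map from page ID to bytes are
all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class WalConsistencySketch {
    /** Hypothetical physical WAL record that can be applied to a page store. */
    interface PhysicalRecord {
        void applyTo(Map<Long, byte[]> pages);
    }

    /** Hypothetical full page image: replaces the whole page. */
    static final class PageSnapshot implements PhysicalRecord {
        private final long pageId;
        private final byte[] image;

        PageSnapshot(long pageId, byte[] image) {
            this.pageId = pageId;
            this.image = image.clone();
        }

        @Override public void applyTo(Map<Long, byte[]> pages) {
            pages.put(pageId, image.clone());
        }
    }

    /** Hypothetical binary delta: overwrites a byte range of an existing page. */
    static final class PageDelta implements PhysicalRecord {
        private final long pageId;
        private final int offset;
        private final byte[] bytes;

        PageDelta(long pageId, int offset, byte[] bytes) {
            this.pageId = pageId;
            this.offset = offset;
            this.bytes = bytes.clone();
        }

        @Override public void applyTo(Map<Long, byte[]> pages) {
            byte[] page = pages.get(pageId);
            if (page == null)
                throw new IllegalStateException("Delta for unknown page " + pageId);
            System.arraycopy(bytes, 0, page, offset, bytes.length);
        }
    }

    /** Applies physical records, in order, to a deep copy of the frozen state. */
    static Map<Long, byte[]> replay(Map<Long, byte[]> frozen, List<PhysicalRecord> wal) {
        Map<Long, byte[]> pages = new HashMap<>();
        frozen.forEach((id, bytes) -> pages.put(id, bytes.clone()));
        for (PhysicalRecord rec : wal)
            rec.applyTo(pages);
        return pages;
    }

    /** Returns the sorted IDs of pages whose replayed and current contents differ. */
    static List<Long> findMismatches(Map<Long, byte[]> replayed, Map<Long, byte[]> current) {
        Set<Long> ids = new TreeSet<>(replayed.keySet());
        ids.addAll(current.keySet());
        List<Long> mismatched = new ArrayList<>();
        for (Long id : ids) {
            if (!Arrays.equals(replayed.get(id), current.get(id)))
                mismatched.add(id);
        }
        return mismatched;
    }
}
```

A test would pass when findMismatches(replay(frozen, wal), current) returns an
empty list; any returned page ID points at a physical record that broke the
invariant.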
In addition to checking correctness, the test framework should do the following:
1. Gather statistics (such as a histogram) on the types of written physical
records. This will help us know which types of physical records are covered by
the test.
2. Visualize the expected and actual page state (with all physical records
applied) if an incorrect page state is detected.
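Both reporting features could look something like the sketch below. The class
and method names are hypothetical, record types are represented as plain
string labels rather than Ignite's own record type enum, and the byte-level
diff is just one possible way to visualize a mismatch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PhysicalRecordReport {
    /** Counts how many records of each type label were seen, sorted by label. */
    static Map<String, Integer> histogram(List<String> recordTypes) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String type : recordTypes)
            counts.merge(type, 1, Integer::sum);
        return counts;
    }

    /** Renders one line per byte offset where expected and actual pages differ. */
    static String diff(byte[] expected, byte[] actual) {
        StringBuilder sb = new StringBuilder();
        int len = Math.max(expected.length, actual.length);
        for (int i = 0; i < len; i++) {
            // "--" marks an offset that lies past the end of the shorter page.
            String e = i < expected.length ? String.format("%02x", expected[i] & 0xff) : "--";
            String a = i < actual.length ? String.format("%02x", actual[i] & 0xff) : "--";
            if (!e.equals(a)) {
                sb.append("offset ").append(i)
                  .append(": expected ").append(e)
                  .append(", actual ").append(a).append('\n');
            }
        }
        return sb.toString();
    }
}
```

An empty string from diff() means the two page images are identical, so the
output can be attached to a test failure message only when it is non-empty.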
Regarding implementation, I suppose we can use the checkpoint listener
mechanism to freeze the page memory state at the moment of a checkpoint.
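As a rough illustration of the listener idea, the sketch below copies every
page when a checkpoint begins. The CheckpointListener interface and its
onCheckpointBegin() signature are assumptions made for this example; the
actual listener interface in Ignite may look different.

```java
import java.util.HashMap;
import java.util.Map;

class FreezeOnCheckpointSketch {
    /** Hypothetical callback invoked when a checkpoint begins. */
    interface CheckpointListener {
        void onCheckpointBegin(Map<Long, byte[]> livePages);
    }

    /** Keeps a deep copy of page memory as it was at the latest checkpoint. */
    static final class FreezingListener implements CheckpointListener {
        private Map<Long, byte[]> frozen = new HashMap<>();

        @Override public void onCheckpointBegin(Map<Long, byte[]> livePages) {
            Map<Long, byte[]> copy = new HashMap<>();
            // Clone each page so that later writes do not leak into the frozen state.
            livePages.forEach((id, bytes) -> copy.put(id, bytes.clone()));
            frozen = copy;
        }

        Map<Long, byte[]> frozen() {
            return frozen;
        }
    }
}
```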


> Implement testing framework for checking WAL delta records consistency
> ----------------------------------------------------------------------
>
>                 Key: IGNITE-8529
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8529
>             Project: Ignite
>          Issue Type: New Feature
>          Components: persistence
>            Reporter: Ivan Rakov
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
