Igor,

Could you please elaborate: what is the whole set of information we are going to save at checkpoint time? From what I understand, this should be:

1) List of active transactions with the WAL pointers of their first writes
2) List of prepared transactions with their update counters
3) Partition counter low watermark (LWM): the smallest partition counter before which there are no prepared transactions.
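
To make the question concrete, here is a rough sketch of the record I have in mind. It is only an illustration of the three items listed above, not existing Ignite code: the class name, the field names, and the choice to model a WAL pointer as a plain integer are all my assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class CheckpointTxState:
    """Hypothetical per-partition state saved at checkpoint time."""

    # 1) active tx id -> WAL pointer of its first write
    #    (assumption: a WAL pointer is modelled as a comparable integer)
    active_txs: Dict[str, int] = field(default_factory=dict)
    # 2) prepared tx id -> update counter
    prepared_txs: Dict[str, int] = field(default_factory=dict)
    # 3) partition counter low watermark (LWM)
    update_counter_lwm: int = 0

    def earliest_wal_ptr(self) -> Optional[int]:
        """Smallest first-write pointer among active txs, or None if there are none."""
        return min(self.active_txs.values(), default=None)


state = CheckpointTxState(
    active_txs={"tx1": 120, "tx2": 80},
    prepared_txs={"tx3": 7},
    update_counter_lwm=5,
)
print(state.earliest_wal_ptr())
```
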

And then we send the supplier node a message: "Give me all updates starting from that LWM plus data for those transactions which were active when I failed".

Am I right?

On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor <gvvinbl...@gmail.com> wrote:

> Hi Igniters,
>
> Currently I’m working on possible approaches how to implement historical
> rebalance (delta rebalance using WAL iterator) over MVCC caches.
>
> The main difficulty is that MVCC writes changes on tx active phase while
> partition update version, aka update counter, is being applied on tx
> finish. This means we cannot start iteration over WAL right from the
> pointer where the update counter was updated, but should include updates
> which the transaction that updated the counter did.
>
> These updates may be much earlier than the point where the update counter
> was updated, so we have to be able to identify the point where the first
> update happened.
>
> The proposed approach includes:
>
> 1) preserve list of active txs, sorted by the time of their first update
> (using WAL ptr of first WAL record in tx)
>
> 2) persist this list on each checkpoint (together with TxLog for example)
>
> 3) send whole active tx list (transactions which were in active state at
> the time the node crashed, empty list in case of graceful node stop) as
> a part of partition demand message.
>
> 4) find a checkpoint where the earliest tx exists in persisted txs and use
> saved WAL ptr as a start point or apply current approach in case the active
> tx list (sent on previous step) is empty
>
> 5) start iteration.
>
> Your thoughts?
>
> Regards,
> Igor
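
For reference, this is how I read the step of the proposal that picks the WAL start point on the supplier side. It is a minimal sketch under my own assumptions: `pick_wal_start` and its parameters are hypothetical names, checkpoints are modelled as plain dicts (tx id -> first WAL pointer) ordered oldest to newest, and returning `None` when a demanded tx is not found in any checkpoint is my guess, since the proposal does not say what should happen in that case.

```python
from typing import Dict, Iterable, List, Optional


def pick_wal_start(
    checkpoints: List[Dict[str, int]],
    active_txs: Iterable[str],
    counter_based_ptr: int,
) -> Optional[int]:
    """Choose the WAL pointer from which the supplier starts iterating.

    checkpoints: persisted {tx id: first WAL ptr} maps, oldest first.
    active_txs: tx ids the demander reported as active when it failed.
    counter_based_ptr: start point found by the current (non-MVCC) approach.
    """
    wanted = set(active_txs)
    if not wanted:
        # Graceful stop: empty list, so fall back to the current approach.
        return counter_based_ptr

    first_writes: Dict[str, int] = {}
    for saved in checkpoints:
        for tx_id, ptr in saved.items():
            if tx_id in wanted:
                # Keep the pointer from the oldest checkpoint that knows this tx.
                first_writes.setdefault(tx_id, ptr)

    if wanted - first_writes.keys():
        # Assumption: an unresolved tx means history cannot be used.
        return None

    # The earliest first write among the demanded txs is the start point.
    return min(first_writes.values())


cps = [{"tx1": 40}, {"tx1": 40, "tx2": 95}]
print(pick_wal_start(cps, ["tx2", "tx1"], counter_based_ptr=200))
```

If this matches the intent, the remaining question is whether the start point should also be bounded by the LWM-based pointer, which the sketch deliberately leaves out.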