[ 
https://issues.apache.org/jira/browse/IGNITE-17081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-17081:
--------------------------------------

    Assignee: Ivan Bessonov

> Implement checkpointIndex for RocksDB
> -------------------------------------
>
>                 Key: IGNITE-17081
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17081
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> Please refer to https://issues.apache.org/jira/browse/IGNITE-16907 for 
> prerequisites.
> Please also familiarize yourself with 
> https://issues.apache.org/jira/browse/IGNITE-17077 for better understanding; 
> the description is continued from there.
> For RocksDB-based storage the recovery process is trivial, because RocksDB 
> has its own WAL. So, for testing purposes, it would be enough to just store 
> the update index in the meta column family.
> However, this immediately gives us a write amplification issue, on top of 
> possible performance degradation. The obvious solution is inherently bad and 
> needs to be improved.
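> For reference, that naive variant boils down to something like the sketch 
> below ({{metaCf}} and {{updateIndexKey}} are placeholder names, not the 
> actual storage API):
> {code:java}
> // Naive approach: persist the update index on every write, straight into the
> // meta column family. Every user update is paired with this extra write,
> // which is where the write amplification comes from.
> void writeUpdateIndex(RocksDB db, ColumnFamilyHandle metaCf, byte[] updateIndexKey, long updateIndex)
>         throws RocksDBException {
>     byte[] value = ByteBuffer.allocate(Long.BYTES).putLong(updateIndex).array();
>
>     db.put(metaCf, updateIndexKey, value);
> }
> {code}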
> h2. General idea & implementation
> Obviously, the WAL needs to be disabled (WriteOptions#setDisableWAL). This 
> effectively breaks the RocksDB recovery procedure, so we need to take 
> measures to compensate for that.
> The only feasible way to do so is to use DBOptions#setAtomicFlush in 
> conjunction with org.rocksdb.WriteBatchWithIndex. This allows RocksDB to 
> persist all column families consistently, as long as writes that span 
> several CFs go through a single batch. Basically, {{acquireConsistencyLock()}} 
> would create a thread-local write batch that is applied on lock release 
> (sketched below, after the note). Most of RocksDbMvPartitionStorage will be 
> affected by this change.
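> For illustration, setting these options might look roughly like this (a 
> sketch, not the actual engine configuration code):
> {code:java}
> // Atomic flush makes RocksDB flush all column families as a single unit,
> // which is what keeps them mutually consistent without a WAL.
> DBOptions dbOptions = new DBOptions()
>     .setCreateIfMissing(true)
>     .setAtomicFlush(true);
>
> // The WAL is disabled per write, via the WriteOptions passed to every write call.
> WriteOptions writeOptions = new WriteOptions().setDisableWAL(true);
> {code}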
> NOTE: I believe that scans with unapplied batches should be prohibited for 
> now (luckily, there's WriteBatchInterface#count() to check). I don't see any 
> practical value in supporting them, nor a proper way of implementing them, 
> considering how spread out in time the scan process is.
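> A rough sketch of that thread-local write batch pattern, assuming that 
> {{db}}, {{writeOptions}}, {{metaCf}}, {{updateIndexKey}}, {{updateIndex}} and 
> the {{longToBytes}} helper are defined elsewhere (all illustrative names, not 
> the actual RocksDbMvPartitionStorage members):
> {code:java}
> // Thread-local batch that accumulates all writes made under the consistency lock.
> private final ThreadLocal<WriteBatchWithIndex> threadLocalBatch = new ThreadLocal<>();
>
> public void acquireConsistencyLock() {
>     threadLocalBatch.set(new WriteBatchWithIndex());
> }
>
> public void releaseConsistencyLock() throws RocksDBException {
>     WriteBatchWithIndex batch = threadLocalBatch.get();
>
>     try {
>         // The update index goes into the same batch as the data, so it is applied atomically.
>         batch.put(metaCf, updateIndexKey, longToBytes(updateIndex));
>
>         db.write(writeOptions, batch);
>     } finally {
>         batch.close();
>
>         threadLocalBatch.remove();
>     }
> }
>
> private void checkNoPendingBatch() {
>     WriteBatchWithIndex batch = threadLocalBatch.get();
>
>     // Scans with unapplied batches are prohibited for now.
>     if (batch != null && batch.count() > 0) {
>         throw new IllegalStateException("Scan is prohibited while a write batch is pending");
>     }
> }
> {code}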
> h2. Callbacks and RAFT snapshots
> Simply storing and reading the update index is easy. Reading the committed 
> index is more challenging: I propose caching it and updating it only from a 
> closure, which can also be used by RAFT to truncate the log.
> For that closure, there are several things to account for during the 
> implementation:
>  * DBOptions#setListeners. We need two events - ON_FLUSH_BEGIN and 
> ON_FLUSH_COMPLETED. In atomic flush mode, all "completed" events go after 
> all "begin" events, and once you see the first "completed" event, you have a 
> guarantee that *all* memtables are already persisted.
> This makes tracking RocksDB flushes easy: monitoring the alternation of 
> these two events is all that's needed.
>  * Unlike in the PDS implementation, here we will be writing the updateIndex 
> value into a memtable on every update. This makes it harder to find 
> persistedIndex values for partitions. Luckily, given the events above, 
> during the time between the first "completed" event and the very next 
> "begin" event, the state on disk is fully consistent. And there's a way to 
> read data from the storage while bypassing the memtable completely - 
> ReadOptions#setReadTier(PERSISTED_TIER).
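> For example, a persisted-tier read of the update index could look roughly 
> like this (the key and helper names are illustrative):
> {code:java}
> // Reads the update index while skipping the memtable entirely, so only the
> // flushed (persisted) state is visible. The value is reliable in the window
> // between a "completed" event and the very next "begin" event.
> long readPersistedUpdateIndex(RocksDB db, ColumnFamilyHandle metaCf, byte[] updateIndexKey)
>         throws RocksDBException {
>     try (ReadOptions readOptions = new ReadOptions().setReadTier(ReadTier.PERSISTED_TIER)) {
>         byte[] value = db.get(metaCf, readOptions, updateIndexKey);
>
>         return value == null ? 0L : ByteBuffer.wrap(value).getLong();
>     }
> }
> {code}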
> Summarizing everything above, we should implement the following protocol:
>  
> {code:java}
> During table start: read latest values of update indexes. Store them in an 
> in-memory structure.
> Set "lastEventType = ON_FLUSH_COMPLETED;".
> onFlushBegin:
>   if (lastEventType == ON_FLUSH_BEGIN)
>     return;
>   waitForLastAsyncUpdateIndexesRead();
>   lastEventType = ON_FLUSH_BEGIN;
> onFlushCompleted:
>   if (lastEventType == ON_FLUSH_COMPLETED)
>     return;
>   asyncReadUpdateIndexesFromDisk();
>   lastEventType = ON_FLUSH_COMPLETED;{code}
> Reading values from disk must be performed asynchronously so as not to stall 
> the flushing process: we don't control the locks that RocksDB holds while 
> calling the listener's methods.
> That asynchronous process would invoke the closures that provide persisted 
> updateIndex values to other components.
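> A sketch of such a listener, assuming it is registered via 
> DBOptions#setListeners before the DB is opened; the executor and 
> {{readUpdateIndexesFromDisk()}} are placeholders for the actual async read 
> and closure invocation:
> {code:java}
> // org.rocksdb.* and java.util.concurrent.* imports are omitted for brevity.
> class FlushListener extends AbstractEventListener {
>     private final ExecutorService asyncReadPool = Executors.newSingleThreadExecutor();
>
>     private volatile CompletableFuture<Void> lastAsyncRead = CompletableFuture.completedFuture(null);
>
>     /** Tracks the alternation of "begin"/"completed" events, see the protocol above. */
>     private boolean lastEventWasBegin = false;
>
>     @Override
>     public void onFlushBegin(RocksDB db, FlushJobInfo flushJobInfo) {
>         if (lastEventWasBegin) {
>             return;
>         }
>
>         // waitForLastAsyncUpdateIndexesRead(): don't start a new flush round
>         // until the indexes from the previous one have been read.
>         lastAsyncRead.join();
>
>         lastEventWasBegin = true;
>     }
>
>     @Override
>     public void onFlushCompleted(RocksDB db, FlushJobInfo flushJobInfo) {
>         if (!lastEventWasBegin) {
>             return;
>         }
>
>         // Read persisted indexes off the listener thread, then notify the closures
>         // (for example, so that RAFT can truncate its log).
>         lastAsyncRead = CompletableFuture.runAsync(this::readUpdateIndexesFromDisk, asyncReadPool);
>
>         lastEventWasBegin = false;
>     }
>
>     private void readUpdateIndexesFromDisk() {
>         // Placeholder: uses the PERSISTED_TIER read shown earlier and invokes the closures.
>     }
> }
> {code}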
> NOTE: One might say that we should call 
> "waitForLastAsyncUpdateIndexesRead();" as late as possible, just in case. 
> But my implementation calls it during the first event, and this is fine. I 
> noticed that column families are flushed in the order of their internal ids. 
> These ids correspond to the sequence numbers of the CFs, and the "default" 
> CF is always created first. This is the exact CF that we use to store meta. 
> Maybe we'll change this later and create a separate meta CF; only then could 
> we start optimizing this part, and only if we have actual proof that there's 
> a stall in this exact place.
> h3. Types of storages
> RocksDB is used for:
>  * tables
>  * cluster management
>  * meta-storage
> All these types should use the same recovery procedure, but the code is 
> located in different places. I hope that it won't be a big problem and that 
> we can do everything at once.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
