[ 
https://issues.apache.org/jira/browse/IGNITE-15818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Tkalenko updated IGNITE-15818:
-------------------------------------
    Fix Version/s: 3.0.0-alpha6

> [Native Persistence 3.0] Checkpoint, lifecycle and file store refactoring and 
> re-implementation
> -----------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-15818
>                 URL: https://issues.apache.org/jira/browse/IGNITE-15818
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Sergey Chugunov
>            Assignee: Kirill Tkalenko
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-alpha6
>
>
> h2. Goal
> Port and refactor the core classes implementing the page-based persistent store in 
> Ignite 2.x: GridCacheOffheapManager, GridCacheDatabaseSharedManager, 
> PageMemoryImpl, Checkpointer, FileWriteAheadLogManager.
> Provide a new checkpoint implementation that avoids excessive logging.
> Clarify the store lifecycle to avoid the complicated and invasive code of the 
> custom lifecycle managed mostly by DatabaseSharedManager.
> h2. Items to pay attention to
> New checkpoint implementation based on split-file storage, with a new page index 
> structure to maintain the disk-memory page mapping.
> The file page store implementation should be extracted from 
> GridCacheOffheapManager into a separate entity; the target implementation should 
> support the new version of the checkpoint (a split-file store, to enable an 
> always-consistent store and to eliminate the binary recovery phase).
> Support for big pages (256+ kB).
> Support for throttling algorithms.
> h2. References
> An overview of the new checkpoint design is available 
> [here|https://github.com/apache/ignite-3/blob/ignite-14647/modules/vault/README.md].
> h2. Thoughts
> Although there is a technical opportunity to have independent checkpoints for 
> different data regions, managing them could be a nightmare and it's 
> definitely in the realm of optimizations and out of scope right now.
> So, let's assume that there's one good old checkpoint process. There's still a 
> requirement to have checkpoint markers, but they will not reference the WAL, 
> because there's no WAL. Instead, we will have to store the RAFT log revision per 
> partition. Or not; I'm not that familiar with the recovery procedure that's 
> currently in development.
> Unlike checkpoints in Ignite 2.x, which had DO and REDO operations, the new 
> version will have DO and UNDO. This drastically simplifies both the checkpoint 
> itself and node recovery, but it complicates data access.
> There will be two processes sharing the storage resource: the "checkpointer" and 
> the "compactor". Let's examine what the compactor should or shouldn't do:
>  * it should not work in parallel with the checkpointer, except for cases when 
> there are too many layers (more on that later)
>  * it should merge checkpoint delta files into the main partition files
>  * it should delete a checkpoint's marker once all merges for that checkpoint are 
> completed, thus decoupling markers from the RAFT log
> About "cases when there are too many layers": too many layers could compromise 
> read speed, so their number must not grow uncontrollably. When a threshold is 
> exceeded, the compactor should start working no matter what (a sketch follows). 
> If anything, the write load can be throttled; reads matter more.
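> A minimal sketch of the threshold rule and the merge step, assuming hypothetical 
> DeltaLayer/PartitionFile abstractions; the threshold value is arbitrary and none of 
> this is actual Ignite 3 API:
> {code:java}
> import java.util.Deque;
> 
> // Illustrative only: DeltaLayer and PartitionFile are hypothetical interfaces.
> interface DeltaLayer {
>     Iterable<Long> pageIds();
>     byte[] readPage(long pageId);
>     void delete();
> }
> 
> interface PartitionFile {
>     void writePage(long pageId, byte[] page);
>     void fsync();
> }
> 
> final class Compactor {
>     private static final int MAX_LAYERS = 8; // made-up threshold, to be tuned
> 
>     /** Above the threshold the compactor runs even alongside the checkpointer;
>      *  the write load is throttled instead, because reads matter more. */
>     static boolean mustCompactNow(int layerCount) {
>         return layerCount > MAX_LAYERS;
>     }
> 
>     /** Merges the oldest delta layer into the main partition file and drops it. */
>     static void mergeOldest(Deque<DeltaLayer> layers, PartitionFile main) {
>         DeltaLayer oldest = layers.peekFirst();
>         if (oldest == null)
>             return;
> 
>         for (long pageId : oldest.pageIds())
>             main.writePage(pageId, oldest.readPage(pageId));
> 
>         main.fsync();                // make the merged pages durable first,
>         layers.pollFirst().delete(); // only then remove the delta file
>     }
> }
> {code}
> Deleting the checkpoint marker once every partition has merged that checkpoint's 
> delta files would sit on top of this loop, which is what decouples markers from 
> the RAFT log.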
> Recovery procedure (roughly sketched below):
>  * read the list of checkpoint markers on engine start
>  * remove all data belonging to an unfinished checkpoint, if there is one
>  * trim main partition files to their proper size (we should check whether this is 
> actually beneficial)
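> A very rough sketch of these steps, assuming a made-up file layout where each 
> checkpoint leaves a "cp-<id>.start" marker plus a "cp-<id>.end" marker once 
> finished, and its delta files match "*-cp-<id>.delta"; none of these names are real:
> {code:java}
> import java.io.IOException;
> import java.io.UncheckedIOException;
> import java.nio.file.DirectoryStream;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.stream.Stream;
> 
> final class Recovery {
>     static void recover(Path storageDir) throws IOException {
>         try (Stream<Path> files = Files.list(storageDir)) {
>             files.filter(p -> p.getFileName().toString().matches("cp-\\d+\\.start"))
>                 .forEach(start -> {
>                     String name = start.getFileName().toString();
>                     String id = name.substring("cp-".length(), name.length() - ".start".length());
> 
>                     // No END marker means the checkpoint never finished: throw away its
>                     // delta files and the marker itself (UNDO, no binary recovery phase).
>                     if (!Files.exists(storageDir.resolve("cp-" + id + ".end"))) {
>                         deleteMatching(storageDir, "*-cp-" + id + ".delta");
>                         deleteQuietly(start);
>                     }
>                 });
>         }
>         // Trimming main partition files to their recorded size would go here,
>         // if that turns out to be beneficial at all.
>     }
> 
>     private static void deleteMatching(Path dir, String glob) {
>         try (DirectoryStream<Path> matches = Files.newDirectoryStream(dir, glob)) {
>             for (Path f : matches)
>                 Files.deleteIfExists(f);
>         } catch (IOException e) {
>             throw new UncheckedIOException(e);
>         }
>     }
> 
>     private static void deleteQuietly(Path p) {
>         try {
>             Files.deleteIfExists(p);
>         } catch (IOException e) {
>             throw new UncheckedIOException(e);
>         }
>     }
> }
> {code}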
> Table start procedure (see the sketch below):
>  * read all layer file headers according to the list of checkpoints
>  * construct a list of hash tables (pageId -> pageIndex) for all layers, making it 
> as efficient as possible
>  * everything else is just like before
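> A sketch of the resulting lookup structure, assuming each layer header is simply a 
> list of (pageId, pageIndex) pairs; a plain HashMap is used here for clarity even 
> though something more compact is suggested below:
> {code:java}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> 
> // One pageId -> pageIndex map per delta layer, consulted from the newest layer down
> // to the main partition file. Not the actual Ignite 3 structure, just the idea.
> final class LayeredPageIndex {
>     /** Where a page lives: which layer and at which index inside that layer's file. */
>     static final class PageLocation {
>         final int layer;     // 0 = newest delta layer
>         final int pageIndex; // index of the page within that layer's delta file
> 
>         PageLocation(int layer, int pageIndex) {
>             this.layer = layer;
>             this.pageIndex = pageIndex;
>         }
>     }
> 
>     /** Per-layer maps, ordered from the newest delta layer to the oldest. */
>     private final List<Map<Long, Integer>> layers = new ArrayList<>();
> 
>     /** Builds one layer's map from its header: parallel arrays of pageIds and pageIndexes. */
>     void addLayer(long[] pageIds, int[] pageIndexes) {
>         Map<Long, Integer> map = new HashMap<>(pageIds.length);
>         for (int i = 0; i < pageIds.length; i++)
>             map.put(pageIds[i], pageIndexes[i]);
>         layers.add(map);
>     }
> 
>     /** Returns where to read the page from, or null to fall back to the main partition file. */
>     PageLocation resolve(long pageId) {
>         for (int layer = 0; layer < layers.size(); layer++) {
>             Integer idx = layers.get(layer).get(pageId);
>             if (idx != null)
>                 return new PageLocation(layer, idx);
>         }
>         return null;
>     }
> }
> {code}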
> Partition removal might be tricky, but we'll see; it's tricky in Ignite 2.x after 
> all. The "restore partition states" procedure could be revisited, I don't know how 
> this will work yet.
> How to store hashmaps:
> Regular maps might be too much; we should consider a roaring-map implementation or 
> something similar that occupies less space (a compact-layout sketch follows this 
> paragraph). This is only a concern for in-memory structures; files on disk may 
> simply store a list of pairs, that's fine. Generally speaking, checkpoints with a 
> size of 100 thousand pages are close to the top limit for most users. Splitting 
> that across 500 partitions, for example, gives us 200 pages per partition, so the 
> entire map should fit into a single page.
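> As one illustration of the "something that occupies less space" option, a layer's 
> map could be held as two sorted parallel arrays with binary-search lookup instead 
> of a boxed HashMap; purely a sketch, not a decision:
> {code:java}
> import java.util.Arrays;
> 
> // Compact in-memory pageId -> pageIndex map for a single layer: two parallel arrays
> // sorted by pageId, looked up with binary search. Roughly 12 bytes per entry, so the
> // ~200 entries per partition estimated above cost about 2.4 kB on heap.
> final class CompactPageMap {
>     private final long[] pageIds;    // sorted ascending
>     private final int[] pageIndexes; // pageIndexes[i] corresponds to pageIds[i]
> 
>     /** Both arrays must be of equal length and already sorted by pageId. */
>     CompactPageMap(long[] pageIds, int[] pageIndexes) {
>         this.pageIds = pageIds;
>         this.pageIndexes = pageIndexes;
>     }
> 
>     /** Returns the page index within the delta file, or -1 if this layer has no such page. */
>     int get(long pageId) {
>         int pos = Arrays.binarySearch(pageIds, pageId);
>         return pos >= 0 ? pageIndexes[pos] : -1;
>     }
> }
> {code}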
> The only exception to these calculations is index.bin. The number of pages per 
> checkpoint can be orders of magnitude higher there, so we should keep an eye on 
> it; it'll be the main target for testing/benchmarking. Anyway, 4 kilobytes is 
> enough to fit 512 integer pairs, scaling to 2048 for regular 16-kilobyte pages. 
> The map won't be too big, IMO.
> Another important point: we should enable direct IO; it's supported natively by 
> the JDK since version 10 (ExtendedOpenOption.DIRECT). There's a chance that not 
> only will regular disk operations become somewhat faster, but fsync will become 
> drastically faster as a result. Which is good: fsync can easily take half the time 
> of a checkpoint, which is just unacceptable.
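> A sketch of what that looks like with the JDK-provided option 
> (com.sun.nio.file.ExtendedOpenOption.DIRECT, available since JDK 10); the page size 
> and file layout here are arbitrary, but the block-alignment requirement is real:
> {code:java}
> import com.sun.nio.file.ExtendedOpenOption;
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.nio.channels.FileChannel;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.StandardOpenOption;
> 
> final class DirectIoSketch {
>     /** Writes one page at the given offset with O_DIRECT. The page size is assumed to be
>      *  a multiple of the file system block size (e.g. 16 kB pages on a 4 kB block size). */
>     static void writePage(Path file, long offset, byte[] page) throws IOException {
>         int blockSize = (int) Files.getFileStore(file.getParent()).getBlockSize();
> 
>         try (FileChannel ch = FileChannel.open(file,
>                 StandardOpenOption.CREATE, StandardOpenOption.WRITE, ExtendedOpenOption.DIRECT)) {
> 
>             // Direct IO requires the buffer address, transfer size and file offset
>             // to be aligned to the block size.
>             ByteBuffer buf = ByteBuffer.allocateDirect(page.length + blockSize)
>                 .alignedSlice(blockSize);
>             buf.put(page).flip();
> 
>             ch.write(buf, offset);
>             ch.force(true); // fsync; with O_DIRECT there is far less dirty page cache to flush
>         }
>     }
> }
> {code}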
> h2. Thoughts 2.0
> With high likelihood, we'll get rid of index.bin. This will remove the requirement 
> of having checkpoint markers.
> All we need is a consistently growing local counter that will be used to mark 
> partition delta files. It doesn't even need to be global at the level of the local 
> node; it can be a per-partition counter persisted in the meta page (a naming 
> sketch follows). This should be further discussed during the implementation.
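> For illustration only, the per-partition counter could look roughly like this; the 
> file name pattern and the meta-page interaction are made up:
> {code:java}
> import java.nio.file.Path;
> import java.util.concurrent.atomic.AtomicLong;
> 
> // A monotonically growing counter, local to one partition, used only to order that
> // partition's delta files. The last value is persisted in the partition meta page.
> final class DeltaFileNaming {
>     private final AtomicLong counter;
> 
>     /** Restored from the value last written to the partition meta page. */
>     DeltaFileNaming(long lastPersistedValue) {
>         this.counter = new AtomicLong(lastPersistedValue);
>     }
> 
>     /** Next delta file for the partition, e.g. "part-12-delta-00000042.bin". */
>     Path nextDeltaFile(Path partitionDir, int partitionId) {
>         long next = counter.incrementAndGet();
>         return partitionDir.resolve(String.format("part-%d-delta-%08d.bin", partitionId, next));
>     }
> 
>     /** Value to be written back to the meta page as part of the checkpoint. */
>     long valueToPersist() {
>         return counter.get();
>     }
> }
> {code}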



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
