Alexey Goncharuk created IGNITE-12263:
-----------------------------------------

             Summary: Introduce native persistence compaction operation
                 Key: IGNITE-12263
                 URL: https://issues.apache.org/jira/browse/IGNITE-12263
             Project: Ignite
          Issue Type: Improvement
            Reporter: Alexey Goncharuk


Currently, Ignite native persistence does not shrink storage files after 
key-value pairs are removed.
The causes of this behavior are:
 * The absence of a mechanism that allows Ignite to track the highest non-empty 
page position in a partition file
 * The absence of a mechanism that allows Ignite to select the page closest to 
the beginning of the file for a write
 * The absence of a mechanism that allows Ignite to move a key-value pair from 
page to page during defragmentation
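
To make the first two missing mechanisms concrete, here is a minimal sketch of 
the bookkeeping they imply. It is not Ignite code: the class PageOccupancy and 
its methods are hypothetical names chosen for illustration, and a partition 
file is modelled as nothing more than a bit per page.

```java
import java.util.BitSet;

/** Hypothetical helper, not part of Ignite: one bit per page, set when the page holds data. */
final class PageOccupancy {
    private final BitSet used = new BitSet();

    /** Marks a page as holding at least one entry. */
    void markUsed(int pageIdx) {
        used.set(pageIdx);
    }

    /** Marks a page as empty after its last entry is removed. */
    void markEmpty(int pageIdx) {
        used.clear(pageIdx);
    }

    /** Index of the highest non-empty page, or -1 if no page holds data. */
    int highestNonEmptyPage() {
        return used.length() - 1;
    }

    /** Index of the empty page closest to the beginning of the file. */
    int lowestEmptyPage() {
        return used.nextClearBit(0);
    }

    public static void main(String[] args) {
        PageOccupancy occ = new PageOccupancy();
        for (int i = 0; i < 6; i++)
            occ.markUsed(i);
        occ.markEmpty(1);
        occ.markEmpty(5);
        System.out.println(occ.highestNonEmptyPage()); // prints 4
        System.out.println(occ.lowestEmptyPage());     // prints 1
    }
}
```

A real implementation would have to persist this information and account for 
partially filled pages; the sketch only shows the two queries involved.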

As an initial change, I suggest introducing a new node startup mode that runs 
a defragmentation procedure, allowing the node to shrink its storage files. The 
procedure will not mutate the logical state of a partition, so a subsequent 
historical rebalance can quickly bring the node up to date. Since the procedure 
runs during node startup (in the final stages of recovery), there is no 
concurrent load, so entries can be moved freely from page to page without 
tricky synchronization.

If the procedure is applied during a whole-cluster restart, all nodes will be 
defragmented simultaneously, giving a quicker parallel defragmentation at the 
cost of downtime.

The procedure should accept an optional list of cache groups, so that 
defragmentation can be limited to an arbitrary selection of cache groups.

An outline of the actions taken during the run for each partition selected for 
defragmentation:
 * Partition pages are preloaded into memory, if possible, to avoid excessive 
page replacement. During the scan, the high-water mark (HWM) of the written 
data is detected (empty pages are skipped)
 * Page references in the free list are sorted so that the pages closest to the 
file start are picked first
 * The partition is scanned in reverse order and key-value pairs are moved 
closer to the file start; the HWM is updated accordingly. This step is 
particularly open to optimization, because different strategies will work well 
for different fragmentation patterns.
 * After the scan iteration is completed, the file size can be updated 
according to the HWM
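
The steps above can be sketched on a toy in-memory model. This is only an 
illustration under stated assumptions, not the proposed Ignite implementation: 
a partition is a list of pages, each page is a small queue of entries with a 
fixed capacity, and the names PartitionCompactor, PAGE_CAPACITY, highWaterMark 
and compact are all hypothetical. The greedy "lowest free page first" strategy 
is just one of the possible strategies mentioned in the third step.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.TreeSet;

/** Toy model of partition defragmentation; every name here is hypothetical. */
final class PartitionCompactor {
    /** Maximum number of entries per page in this toy model. */
    static final int PAGE_CAPACITY = 2;

    /** Index of the highest non-empty page (the HWM), or -1 if every page is empty. */
    static int highWaterMark(List<Deque<String>> pages) {
        for (int i = pages.size() - 1; i >= 0; i--)
            if (!pages.get(i).isEmpty())
                return i;
        return -1;
    }

    /** Moves entries towards the file start, truncates the page list, returns the new HWM. */
    static int compact(List<Deque<String>> pages) {
        int hwm = highWaterMark(pages);

        // "Free list": indexes of pages with spare room, sorted so the lowest index comes first.
        TreeSet<Integer> free = new TreeSet<>();
        for (int i = 0; i <= hwm; i++)
            if (pages.get(i).size() < PAGE_CAPACITY)
                free.add(i);

        // Reverse scan: drain tail pages into the free pages closest to the file start.
        for (int src = hwm; src > 0; src--) {
            Deque<String> from = pages.get(src);
            free.remove(src); // only pages below src remain as move targets

            while (!from.isEmpty() && !free.isEmpty()) {
                int dst = free.first();
                Deque<String> to = pages.get(dst);
                to.addLast(from.pollLast());
                if (to.size() == PAGE_CAPACITY)
                    free.remove(dst);
            }

            if (!from.isEmpty())
                break; // no page below src has room left, so the scan can stop
        }

        // Update the HWM and "shrink the file" by dropping the empty tail pages.
        hwm = highWaterMark(pages);
        pages.subList(hwm + 1, pages.size()).clear();
        return hwm;
    }

    /** Builds one page holding the given entries. */
    static Deque<String> page(String... entries) {
        return new ArrayDeque<>(List.of(entries));
    }

    public static void main(String[] args) {
        List<Deque<String>> pages = new ArrayList<>(List.of(
            page("a"), page(), page("b", "c"), page(), page("d"), page()));

        int hwm = compact(pages);
        System.out.println(hwm + " " + pages.size()); // prints "1 2"
    }
}
```

The set of entries is the same before and after the run; only their placement 
changes, which mirrors the requirement that the logical state of the partition 
is not mutated.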

As a further improvement, this partition defragmentation procedure could later 
be run in online mode, once the necessary cache update protocol changes are 
designed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
