Ivan Bessonov created IGNITE-17087:
--------------------------------------

             Summary: Native rebalance for PDS partitions
                 Key: IGNITE-17087
                 URL: https://issues.apache.org/jira/browse/IGNITE-17087
             Project: Ignite
          Issue Type: Improvement
            Reporter: Ivan Bessonov


General idea of full rebalance is described in
https://issues.apache.org/jira/browse/IGNITE-17083

For persistent storages, it's possible to avoid copy-on-write rebalance 
algorithms if desired. Intuitively, this is the preferable option. Each storage 
chooses its own format.
h2. General idea

In this case, PDS has a checkpointing feature that saves a consistent state on 
disk. I expect SQL indexes to be in the same partition file as other data.

For every partition, its state on disk would look like this:
{code:java}
part-x.bin
part-x-1.bin
part-x-2.bin
...
part-x-n.bin{code}
part-x.bin is the baseline, and every other file is a delta that should be 
applied on top of the underlying layers to get consistent data. This is similar 
to full and incremental backups.
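
To illustrate the layering, here is a minimal in-memory sketch of how the latest version of a page could be resolved from a baseline and its deltas. The class and method names (LayeredPartition, addDelta, readPage) are hypothetical and pages are modeled as map entries; nothing here reflects the actual PDS file format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of one partition: a baseline plus ordered deltas. */
final class LayeredPartition {
    /** Layer 0 models part-x.bin; layer i > 0 models part-x-i.bin. */
    private final List<Map<Integer, byte[]>> layers = new ArrayList<>();

    LayeredPartition(Map<Integer, byte[]> baseline) {
        layers.add(new HashMap<>(baseline));
    }

    /** Appends a new delta layer on top of all existing ones. */
    void addDelta(Map<Integer, byte[]> delta) {
        layers.add(new HashMap<>(delta));
    }

    /** Returns the latest version of a page, or null if no layer contains it. */
    byte[] readPage(int pageIdx) {
        // Newer layers override older ones, so search from the top down.
        for (int i = layers.size() - 1; i >= 0; i--) {
            byte[] page = layers.get(i).get(pageIdx);
            if (page != null) {
                return page;
            }
        }
        return null;
    }
}
```

A page that is present in several layers is always served from the newest one, which is why sending every file as-is would transfer some pages more than once.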

When a rebalance snapshot is required, we could force a checkpoint and then 
*prohibit merging* of new deltas into the delta files that belong to the 
snapshot until the rebalance is finished. We must guarantee that a consistent 
state can be read from disk.
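
One possible way to express the "prohibit merging" rule is a small guard that the merge procedure consults before touching a file. This is only a sketch under assumptions: SnapshotMergeGuard and its methods are hypothetical names, file index 0 stands for the baseline, and a snapshot is assumed to cover files 0..lastDeltaIdx.

```java
import java.util.TreeMap;

/** Hypothetical guard that keeps files captured by active snapshots unmodified. */
final class SnapshotMergeGuard {
    /** Last file index covered by a snapshot -> number of active snapshots with that bound. */
    private final TreeMap<Integer, Integer> pins = new TreeMap<>();

    /** Registers a snapshot that covers files 0..lastDeltaIdx. */
    synchronized void beginSnapshot(int lastDeltaIdx) {
        pins.merge(lastDeltaIdx, 1, Integer::sum);
    }

    /** Releases a snapshot previously registered with the same bound. */
    synchronized void endSnapshot(int lastDeltaIdx) {
        pins.computeIfPresent(lastDeltaIdx, (k, n) -> n == 1 ? null : n - 1);
    }

    /** A merge may modify a file only if no active snapshot covers it. */
    synchronized boolean mayMergeInto(int targetFileIdx) {
        return pins.isEmpty() || targetFileIdx > pins.lastKey();
    }
}
```

Deltas created after the snapshot can still be merged among themselves; only files that the snapshot reads stay frozen until endSnapshot is called.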

Now, there are several strategies for transferring data:
 * File-based. We can send baseline and delta files as files. Two possible 
issues here:
 ** Files contain duplicated pages, so the volume of data will be bigger than 
necessary.
 ** The baseline file has to be truncated, because some delta pages go directly 
into the baseline file as an optimization.
 * Page-based. The latest state of every required page is sent separately. Two 
strategies here:
 ** Iterate pages in order of page indexes. Reads incur overhead, but writes 
are very efficient.
 ** Iterate pages in order of delta files, skipping already read pages in the 
process (like snapshots in GridGain, for example). Little overhead on reads, 
but writes won't be append-only.
I would argue that slower reads are more appropriate than slower writes. 
Generally speaking, any write should be slower than any read of the same size, 
right?
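
The difference between the two page-based orders can be sketched as follows. This is an illustration under assumptions, not the proposed implementation: layers are modeled as in-memory maps (index 0 is the baseline, higher indexes are newer deltas), and each result entry is a {pageIdx, layerIdx} pair that tells which layer the page would be read from. The names PageOrderSketch, byPageIndex and byDeltaFile are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical comparison of the two page-based send orders. */
final class PageOrderSketch {
    /** Ascending page index: the receiver writes sequentially, the sender jumps between files. */
    static List<int[]> byPageIndex(List<Map<Integer, byte[]>> layers) {
        Set<Integer> allPages = new TreeSet<>();
        for (Map<Integer, byte[]> layer : layers) {
            allPages.addAll(layer.keySet());
        }
        List<int[]> order = new ArrayList<>();
        for (int pageIdx : allPages) {
            // Find the newest layer that contains this page.
            for (int l = layers.size() - 1; l >= 0; l--) {
                if (layers.get(l).containsKey(pageIdx)) {
                    order.add(new int[] {pageIdx, l});
                    break;
                }
            }
        }
        return order;
    }

    /** Newest file first: the sender reads each file once, the receiver writes out of order. */
    static List<int[]> byDeltaFile(List<Map<Integer, byte[]>> layers) {
        List<int[]> order = new ArrayList<>();
        Set<Integer> sent = new HashSet<>();
        for (int l = layers.size() - 1; l >= 0; l--) {
            for (int pageIdx : new TreeSet<>(layers.get(l).keySet())) {
                // Skip pages whose newer version has already been sent.
                if (sent.add(pageIdx)) {
                    order.add(new int[] {pageIdx, l});
                }
            }
        }
        return order;
    }
}
```

Both methods send the same set of page versions; only the order differs, which is what moves the cost from the reading side to the writing side or back.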

Should we implement all strategies and give user a choice? It's hard to predict 
which one is better for which scenario. In the future, I think it would be 
convenient to implement many options, but at first we should stick to the 
simplest one.

There must be a common "infrastructure" or a framework to stream native 
rebalance snapshots. Data format should be as simple as possible.
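
As an example of how simple the format could be, here is a sketch of a length-prefixed page frame. The layout (page index, payload length, payload) and the PageFrame class are assumptions made for illustration only; the actual format is not defined in this issue.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical wire format for one page: [int pageIdx][int length][length bytes]. */
final class PageFrame {
    final int pageIdx;
    final byte[] data;

    PageFrame(int pageIdx, byte[] data) {
        this.pageIdx = pageIdx;
        this.data = data;
    }

    /** Writes this frame to the stream. */
    void writeTo(DataOutputStream out) throws IOException {
        out.writeInt(pageIdx);
        out.writeInt(data.length);
        out.write(data);
    }

    /** Reads the next frame; throws EOFException if the stream ends mid-frame. */
    static PageFrame readFrom(DataInputStream in) throws IOException {
        int pageIdx = in.readInt();
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Corrupt frame: negative length " + length);
        }
        byte[] data = new byte[length];
        in.readFully(data);
        return new PageFrame(pageIdx, data);
    }
}
```

A frame like this works for either page-based order, because each page carries its own index and the receiver doesn't depend on arrival order.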

NOTE: of course, it has to be mentioned that this approach might lead to 
inefficient storage space usage. It can be a problem in theory, but in practice 
full rebalance isn't expected to occur often, and even then we don't expect 
that users will rewrite the entire partition data in the span of a single 
rebalance.
h2. Possible problems

Given that "raw" data is sent, including SQL indexes, all incomplete indexes 
will be sent incomplete. Maybe we should also send a build state for each 
index so that the receiving side could continue from the right place, not from 
the beginning.

This problem will be resolved in the future. Currently we don't have indexes 
implemented.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
