Ivan, thanks for the analysis!

>> With having pre-calculated partition hash value, we can automatically
>> detect inconsistent partitions on every PME.
Great idea, seems this covers all broken sync cases.
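
To make the quoted idea concrete, here is a minimal sketch of a hash that is
kept up to date on every insert/update/remove. It is only an illustration:
the class name, the String/byte[] types and the in-memory map are my
assumptions, not Ignite code. The only thing taken from "idle verify" is that
the hash is a sum of key hash and value-bytes hash.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration, not Ignite code: a partition hash maintained on every update. */
final class IncrementalPartitionHash {
    /** Current entries; kept here only so the old contribution can be subtracted. */
    private final Map<String, byte[]> entries = new HashMap<>();

    /** Running sum of per-entry contributions (int overflow simply wraps). */
    private int partHash;

    /** Same shape as the "idle verify" loop: key hash plus hash of value bytes. */
    private static int contribution(String key, byte[] val) {
        return key.hashCode() + Arrays.hashCode(val);
    }

    /** Insert or update: remove the old contribution (if any), add the new one. */
    void put(String key, byte[] val) {
        byte[] old = entries.put(key, val.clone());

        if (old != null)
            partHash -= contribution(key, old);

        partHash += contribution(key, val);
    }

    /** Remove: subtract the contribution of the removed entry. */
    void remove(String key) {
        byte[] old = entries.remove(key);

        if (old != null)
            partHash -= contribution(key, old);
    }

    int hash() {
        return partHash;
    }
}
```

Since addition is commutative, two copies that applied the same updates in a
different order end up with the same value, so comparing hashes on PME would
not need any recalculation.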

It will check alive nodes immediately if the primary fails,
and will check a rejoining node once it has finished rebalancing (PME on
becoming an owner).
A recovered cluster will be checked on the activation PME (or even before that?).
Also, a warmed-up cluster will still be warm after the check.

Have I missed any cases that lead to broken sync, apart from bugs?

1) But how to keep this hash?
- It should be automatically persisted on each checkpoint (it should not
require recalculation on restore, snapshots should be covered too) (and
covered by WAL?).
- It should always be available in RAM for every partition (even for cold
partitions never updated or read on this node) so that it can be used
immediately once all operations are done on PME.

Can we have special pages to keep such hashes and never allow their
eviction?

2) PME is a rare operation on a production cluster, but it seems we have to
check consistency on a regular basis.
Since we have to finish all operations before the check, should we have
a fake PME for the maintenance check in this case?

On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:

> Hi Anton,
>
> Thanks for sharing your ideas.
> I think your approach should work in general. I'll just share my concerns
> about possible issues that may come up.
>
> 1) Equality of update counters doesn't imply equality of partitions
> content under load.
> For every update, primary node generates update counter and then update is
> delivered to backup node and gets applied with the corresponding update
> counter. For example, there are two transactions (A and B) that update
> partition X by the following scenario:
> - A updates key1 in partition X on primary node and increments counter to
> 10
> - B updates key2 in partition X on primary node and increments counter to
> 11
> - While A is still updating other keys, B is finally committed
> - Update of key2 arrives to backup node and sets update counter to 11
> An observer will see equal update counters (11), but the update of key1 is
> still missing in the backup partition.
> This is a fundamental problem which is being solved here:
> https://issues.apache.org/jira/browse/IGNITE-10078
> "Online verify" should operate with new complex update counters which take
> such "update holes" into account. Otherwise, online verify may provide
> false-positive inconsistency reports.
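
The A/B scenario above can be replayed with a small sketch of a counter that
remembers out-of-order updates. This is only an illustration under my own
assumptions (class and method names are hypothetical); it does not claim to
show how IGNITE-10078 implements it.

```java
import java.util.TreeSet;

/** Hypothetical sketch: an update counter that is aware of "update holes". */
final class GapAwareCounter {
    /** Low-water mark: every update with a counter <= lwm has been applied. */
    private long lwm;

    /** Counters applied out of order, i.e. above a hole. */
    private final TreeSet<Long> outOfOrder = new TreeSet<>();

    GapAwareCounter(long initial) {
        lwm = initial;
    }

    /** Registers an applied update and closes holes when possible. */
    void apply(long cntr) {
        if (cntr <= lwm)
            return; // Already covered.

        outOfOrder.add(cntr);

        // Advance the low-water mark while the next counter is present.
        while (!outOfOrder.isEmpty() && outOfOrder.first() == lwm + 1)
            lwm = outOfOrder.pollFirst();
    }

    /** Highest counter seen so far: what a naive observer would compare. */
    long highest() {
        return outOfOrder.isEmpty() ? lwm : outOfOrder.last();
    }

    /** True if some update below the highest counter is still missing. */
    boolean hasHoles() {
        return !outOfOrder.isEmpty();
    }
}
```

With a backup starting at 9, apply(11) gives highest() == 11 but hasHoles()
is true, so the partition should not be compared yet; after apply(10) the
hole is closed.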
>
> 2) Acquisition and comparison of update counters is fast, but partition
> hash calculation is long. We should check that update counter remains
> unchanged after every K keys handled.
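
A minimal sketch of such a guarded calculation, assuming the rows come from a
plain iterator and the counter can be read through a supplier (both are my
simplifications, and the names are hypothetical):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.OptionalInt;
import java.util.function.LongSupplier;

/** Hypothetical sketch: hash calculation that gives up early on concurrent updates. */
final class GuardedPartitionHash {
    /**
     * @param rows Partition rows as key/value-bytes entries.
     * @param updateCounter Reads the current update counter of the partition.
     * @param k How many keys to handle between counter checks.
     * @return Partition hash, or empty if the counter changed during calculation.
     */
    static OptionalInt hashIfUnchanged(
        Iterator<Map.Entry<String, byte[]>> rows,
        LongSupplier updateCounter,
        int k
    ) {
        long startCntr = updateCounter.getAsLong();

        int partHash = 0;
        int handled = 0;

        while (rows.hasNext()) {
            Map.Entry<String, byte[]> row = rows.next();

            partHash += row.getKey().hashCode();
            partHash += Arrays.hashCode(row.getValue());

            // Cheap check every k keys, so a long scan is not wasted completely.
            if (++handled % k == 0 && updateCounter.getAsLong() != startCntr)
                return OptionalInt.empty();
        }

        // Final check covers the tail after the last periodic check.
        return updateCounter.getAsLong() == startCntr ? OptionalInt.of(partHash) : OptionalInt.empty();
    }
}
```

A smaller k wastes less work on a hot partition at the price of more counter
reads.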
>
> 3)
>
> > Another hope is that we'll be able to pause/continue scan, for example,
> > we'll check 1/3 partitions today, 1/3 tomorrow, and in three days we'll
> > check the whole cluster.
>
> Totally makes sense.
> We may find ourselves in a situation where some "hot" partitions are
> still unprocessed, and every next attempt to calculate partition hash fails
> due to another concurrent update. We should be able to track progress of
> validation (% of calculation time wasted due to concurrent operations may
> be a good metric, 100% is the worst case) and provide option to stop/pause
> activity.
> I think, pause should return an "intermediate results report" with
> information about which partitions have been successfully checked. With
> such report, we can resume activity later: partitions from report will be
> just skipped.
>
> 4)
>
> > Since "Idle verify" uses regular page memory, I assume it replaces hot
> > data with persisted.
> > So, we have to warm up the cluster after each check.
> > Are there any chances to check without cooling the cluster?
>
> I don't see an easy way to achieve it with our page memory architecture.
> We definitely can't just read pages from disk directly: we need to
> synchronize page access with concurrent update operations and checkpoints.
> From my point of view, the correct way to solve this issue is improving
> our page replacement [1] mechanics by making it truly scan-resistant.
>
> P. S. There's another possible way of achieving online verify: instead of
> on-demand hash calculation, we can always keep up-to-date hash value for
> every partition. We'll need to update hash on every insert/update/remove
> operation, but there will be no reordering issues, since the function that we
> use for aggregating hash results (+) is commutative. With having
> pre-calculated partition hash value, we can automatically detect
> inconsistent partitions on every PME. What do you think?
>
> [1] -
> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
>
> Best Regards,
> Ivan Rakov
>
> On 29.04.2019 12:20, Anton Vinogradov wrote:
>
> Igniters and especially Ivan Rakov,
>
> "Idle verify" [1] is a really cool tool to make sure that the cluster is
> consistent.
>
> 1) But it requires operations to be paused during the cluster check.
> On some clusters, this check takes hours (3-4 hours in the cases I saw).
> I've checked the code of "idle verify" and it seems it is possible to make it
> "online" with some assumptions.
>
> Idea:
> Currently "Idle verify" checks that partition hashes, generated this way
>
>     while (it.hasNextX()) {
>         CacheDataRow row = it.nextX();
>         partHash += row.key().hashCode();
>         partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
>     }
>
> , are the same.
>
> What if we generate the same updateCounter-partitionHash pairs but
> compare hashes only when the counters are the same?
> So, for example, we ask the cluster to generate pairs for 64 partitions,
> then find that 55 have the same counters (were not updated during the
> check) and check them.
> The remaining (64-55 = 9) partitions will be re-requested and rechecked with
> an additional 55.
> This way we'll be able to check that the cluster is consistent even in case
> operations are in progress (just retrying the modified ones).
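
The retry idea can be sketched roughly as follows. All names here are
hypothetical and the "request" function stands in for asking the owners of a
partition for their pairs; this simplified version re-requests only the
modified partitions, and it ignores the update-counter caveat raised in
point 1) of the reply above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntFunction;

/** Hypothetical sketch of the retry idea; none of these names exist in Ignite. */
final class OnlineVerifySketch {
    /** updateCounter-partitionHash pair reported by one owner of a partition. */
    static final class Pair {
        final long updateCounter;
        final int partHash;

        Pair(long updateCounter, int partHash) {
            this.updateCounter = updateCounter;
            this.partHash = partHash;
        }
    }

    /** Outcome of a verification run, by partition id. */
    static final class Result {
        final Set<Integer> consistent = new TreeSet<>();
        final Set<Integer> inconsistent = new TreeSet<>();
        final Set<Integer> unchecked = new TreeSet<>();
    }

    /** Compares hashes only for partitions whose counters match; retries the rest. */
    static Result verify(int parts, IntFunction<List<Pair>> request, int maxRounds) {
        Result res = new Result();

        for (int p = 0; p < parts; p++)
            res.unchecked.add(p);

        for (int round = 0; round < maxRounds && !res.unchecked.isEmpty(); round++) {
            // Copy to avoid modifying the set while iterating over it.
            for (int p : new ArrayList<>(res.unchecked)) {
                List<Pair> pairs = request.apply(p);
                Pair first = pairs.get(0);

                boolean sameCntr = pairs.stream().allMatch(x -> x.updateCounter == first.updateCounter);

                if (!sameCntr)
                    continue; // Updated during the check: retry in the next round.

                boolean sameHash = pairs.stream().allMatch(x -> x.partHash == first.partHash);

                (sameHash ? res.consistent : res.inconsistent).add(p);
                res.unchecked.remove(p);
            }
        }

        return res;
    }
}
```

Partitions left in "unchecked" after maxRounds are the ones that were
modified on every attempt.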
>
> Risks and assumptions:
> Using this strategy we'll check the cluster's consistency ... eventually,
> and the check will take more time even on an idle cluster.
> In case operationsPerTimeToGeneratePartitionHashes > partitionsCount we'll
> definitely make no progress.
> But if the load is not high, we'll be able to check the whole cluster.
>
> Another hope is that we'll be able to pause/continue scan, for example,
> we'll check 1/3 partitions today, 1/3 tomorrow, and in three days we'll
> check the whole cluster.
>
> Have I missed something?
>
> 2) Since "Idle verify" uses regular page memory, I assume it replaces hot data
> with persisted.
> So, we have to warm up the cluster after each check.
> Are there any chances to check without cooling the cluster?
>
> [1]
> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>
>
