I've linked to IGNITE-10078 additional related/dependent tickets that will require investigation as soon as the main contribution is accepted.
Tue, 7 May 2019 at 18:10, Alexei Scherbakov <alexey.scherbak...@gmail.com>:

> Anton,
>
> It's ready for review, look for the Patch Available status.
> Yes, atomic caches are not fixed by this contribution. See [1]
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11797
>
> Tue, 7 May 2019 at 17:30, Anton Vinogradov <a...@apache.org>:
>
>> Alexei,
>>
>> Got it.
>> Could you please let me know once the PR is ready for review?
>> Currently I have some questions, but possibly they are caused by the non-final PR
>> (e.g. why does the atomic counter still ignore misses?).
>>
>> On Tue, May 7, 2019 at 4:43 PM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>>
>> > Anton,
>> >
>> > 1) Extended counters will indeed answer the question whether a partition can be
>> > safely restored to a synchronized state on all owners.
>> > The only condition: one of the owners has no missed updates.
>> > If not, the partition must be moved to the LOST state, see [1],
>> > TxPartitionCounterStateOnePrimaryTwoBackupsFailAll*Test,
>> > IgniteSystemProperties#IGNITE_FAIL_NODE_ON_UNRECOVERABLE_PARTITION_INCONSISTENCY
>> > This is a known issue and can happen if all partition owners were unavailable
>> > at some point.
>> > In such a case we could try to recover consistency using some complex recovery
>> > protocol, as you described. Related ticket: [2]
>> >
>> > 2) A bitset implementation is being considered as an option in GG Community
>> > Edition. No specific implementation dates at the moment.
>> >
>> > 3) As for "online" partition verification, I think the best option right now
>> > is to verify partition by partition, using read-only mode per partition group,
>> > under load.
>> > While verification is in progress, all write ops are waiting, not rejected.
>> > This is the only 100% reliable way to compare partitions - by touching the
>> > actual data; all other ways, like a pre-computed hash, are error-prone.
>> > There is already a ticket [3] for simplifying grid consistency verification
>> > which could be used as a basis for such functionality.
>> > As for avoiding cache pollution, we could try to read pages sequentially from
>> > disk, without lifting them into pagemem, and compute some kind of commutative
>> > hash. It's safe under the partition write lock.
>> >
>> > [1] https://issues.apache.org/jira/browse/IGNITE-11611
>> > [2] https://issues.apache.org/jira/browse/IGNITE-6324
>> > [3] https://issues.apache.org/jira/browse/IGNITE-11256
>> >
>> > Mon, 6 May 2019 at 16:12, Anton Vinogradov <a...@apache.org>:
>> >
>> > > Ivan,
>> > >
>> > > 1) I've checked the PR [1] and it looks like it does not solve the issue either.
>> > > AFAICS, the main goal there (in the PR) is to produce
>> > > PartitionUpdateCounter#sequential, which can be false for all backups; which
>> > > backup should win in that case?
>> > >
>> > > Is there any IEP or some other design page for this fix?
>> > >
>> > > It looks like extended counters should be able to recover the whole cluster
>> > > even in case all copies of the same partition are broken.
>> > > So, it seems, the counter should provide detailed info:
>> > > - the biggest applied updateCounter
>> > > - the list of all missed counters below the biggest applied one
>> > > - an optional hash
>> > >
>> > > In that case, we'll be able to perform some exchange between the broken copies.
>> > > For example, we'll find that copy1 missed key1, and copy2 missed key2.
>> > > It's pretty simple to fix both copies in that case.
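As an illustration of the "detailed info" such an extended counter would have to expose for a copy-to-copy exchange, here is a minimal standalone sketch. All class and method names are hypothetical; this is not Ignite's actual PartitionUpdateCounter API, just a toy model of the three pieces of information listed above.

    import java.util.Set;
    import java.util.TreeSet;

    /** Hypothetical, simplified view of an "extended" per-partition counter state. */
    class ExtendedCounterInfo {
        final long highestApplied;      // biggest applied updateCounter
        final Set<Long> missedBefore;   // counters below highestApplied that were never applied
        final long contentHash;         // optional commutative hash of the partition content

        ExtendedCounterInfo(long highestApplied, Set<Long> missedBefore, long contentHash) {
            this.highestApplied = highestApplied;
            this.missedBefore = new TreeSet<>(missedBefore);
            this.contentHash = contentHash;
        }

        /** Counters this copy is missing that the given peer copy has actually applied. */
        Set<Long> recoverableFrom(ExtendedCounterInfo peer) {
            Set<Long> needed = new TreeSet<>(missedBefore);

            // Everything above our highest counter that the peer managed to apply is needed too.
            for (long c = highestApplied + 1; c <= peer.highestApplied; c++)
                needed.add(c);

            needed.removeAll(peer.missedBefore);            // the peer missed these as well
            needed.removeIf(c -> c > peer.highestApplied);  // the peer never saw these

            return needed;
        }
    }

If, after exchanging such structures, every gap of every copy can be served by at least one peer, the copies can heal each other as described above; otherwise the partition would still have to be declared LOST.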
>> > > In case all misses can be resolved this way, we'll continue cluster activation
>> > > as if it had not been broken at all.
>> > >
>> > > 2) It seems I see a simpler solution for handling misses (than the one in the PR).
>> > > Once you have newUpdateCounter > curUpdateCounter + 1, you should add a byte
>> > > (or int or long - the smallest possible) value to a special structure.
>> > > This value will represent the delta between newUpdateCounter and curUpdateCounter
>> > > in a bitmask way.
>> > > In case you handle an updateCounter less than curUpdateCounter, you should update
>> > > the value in the structure responsible for this delta.
>> > > For example, when you have delta "2 to 6", you will have 00000000 initially and
>> > > 00011111 finally.
>> > > Each delta update should finish with a check that the delta is complete (value == 31
>> > > in this case). Once it is complete, it should be removed from the structure.
>> > > Deltas can and should be reused to solve the GC issue.
>> > >
>> > > What do you think about the proposed solution?
>> > >
>> > > 3) Hash computation can be an additional extension for extended counters,
>> > > just one more dimension to be extremely sure everything is ok.
>> > > Any objections?
>> > >
>> > > [1] https://github.com/apache/ignite/pull/5765
>> > >
>> > > On Mon, May 6, 2019 at 12:48 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>> > >
>> > > > Anton,
>> > > >
>> > > > Automatic quorum-based partition drop may work as a partial workaround for
>> > > > IGNITE-10078, but the discussed approach surely doesn't replace the
>> > > > IGNITE-10078 activity. We still don't know what to do when a quorum can't
>> > > > be reached (2 partitions have hash X, 2 have hash Y), and keeping extended
>> > > > update counters is the only way to resolve such a case.
>> > > > On the other hand, precalculated partition hash validation on PME can be a
>> > > > good addition to the IGNITE-10078 logic: we'll be able to detect situations
>> > > > when extended update counters are equal, but for some reason (a bug or
>> > > > whatsoever) partition contents are different.
>> > > >
>> > > > Best Regards,
>> > > > Ivan Rakov
>> > > >
>> > > > On 06.05.2019 12:27, Anton Vinogradov wrote:
>> > > > > Ivan, just to make sure ...
>> > > > > The discussed case will fully solve the issue [1] in case we also add some
>> > > > > strategy to reject partitions with missed updates (updateCnt==Ok, Hash!=Ok).
>> > > > > For example, we may use the Quorum strategy, where the majority wins.
>> > > > > Sounds correct?
>> > > > >
>> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-10078
>> > > > >
>> > > > > On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <a...@apache.org> wrote:
>> > > > >
>> > > > >> Ivan,
>> > > > >>
>> > > > >> Thanks for the detailed explanation.
>> > > > >> I'll try to implement the PoC to check the idea.
>> > > > >>
>> > > > >> On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>> > > > >>
>> > > > >>>> But how to keep this hash?
>> > > > >>> I think we can just adopt the way partition update counters are stored.
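To make the delta/bitmask idea from Anton's point 2 above concrete, here is a minimal standalone sketch. All names are hypothetical; this is a toy model rather than the counter structure from PR [1], and a single gap wider than 64 counters is not handled here.

    import java.util.Map;
    import java.util.TreeMap;

    /**
     * Sketch of tracking missed update counters with one small bitmask per gap
     * instead of a per-counter set of misses.
     */
    class GapTrackingCounter {
        /** One gap ("delta") of missed counters: [start, start + size - 1]. */
        private static final class Delta {
            final int size;   // number of missed counters in the gap (<= 64 in this sketch)
            long mask;        // bit i set => counter (start + i) has been applied

            Delta(int size) { this.size = size; }

            boolean complete() { return Long.bitCount(mask) == size; } // e.g. 0b11111 == 31 for size 5
        }

        private long cur;                                             // highest update counter observed so far
        private final TreeMap<Long, Delta> deltas = new TreeMap<>();  // gap start -> bitmask

        synchronized void onUpdate(long counter) {
            if (counter == cur + 1) {                       // common in-order case
                cur = counter;
            }
            else if (counter > cur + 1) {                   // a hole [cur + 1, counter - 1] appears
                deltas.put(cur + 1, new Delta((int)(counter - cur - 1)));
                cur = counter;
            }
            else {                                          // counter <= cur: fills part of some gap
                Map.Entry<Long, Delta> e = deltas.floorEntry(counter);
                if (e == null || counter >= e.getKey() + e.getValue().size)
                    return;                                 // duplicate or already applied, ignore

                Delta d = e.getValue();
                d.mask |= 1L << (counter - e.getKey());

                if (d.complete())
                    deltas.remove(e.getKey());              // gap fully filled, drop the delta for reuse
            }
        }

        /** True if there are no missed updates below the highest observed counter. */
        synchronized boolean sequential() { return deltas.isEmpty(); }
    }

With the example from the message: at cur == 1, an update with counter 7 opens the delta "2 to 6" (5 bits, all zero); applying counters 2..6 flips the mask to 0b11111 == 31, after which the delta is discarded.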
>> > > > >>> Update counters are:
>> > > > >>> 1) Kept and updated in heap, see
>> > > > >>> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during
>> > > > >>> regular cache operations, no page replacement latency issues)
>> > > > >>> 2) Synchronized with page memory (and with disk) on every checkpoint, see
>> > > > >>> GridCacheOffheapManager#saveStoreMetadata
>> > > > >>> 3) Stored in the partition meta page, see PagePartitionMetaIO#setUpdateCounter
>> > > > >>> 4) On node restart, we init the onheap counter with the value from disk (as of
>> > > > >>> the last checkpoint) and update it to the latest value during WAL logical
>> > > > >>> records replay
>> > > > >>>
>> > > > >>>> 2) PME is a rare operation on a production cluster, but, it seems, we have
>> > > > >>>> to check consistency in a regular way.
>> > > > >>>> Since we have to finish all operations before the check, should we have a
>> > > > >>>> fake PME for a maintenance check in this case?
>> > > > >>> From my experience, PME happens on prod clusters from time to time (several
>> > > > >>> times per week), which can be enough. In case it's needed to check consistency
>> > > > >>> more often than regular PMEs occur, we can implement a command that will
>> > > > >>> trigger a fake PME for consistency checking.
>> > > > >>>
>> > > > >>> Best Regards,
>> > > > >>> Ivan Rakov
>> > > > >>>
>> > > > >>> On 29.04.2019 18:53, Anton Vinogradov wrote:
>> > > > >>>> Ivan, thanks for the analysis!
>> > > > >>>>
>> > > > >>>>> With a pre-calculated partition hash value, we can automatically detect
>> > > > >>>>> inconsistent partitions on every PME.
>> > > > >>>> Great idea, it seems this covers all broken sync cases.
>> > > > >>>>
>> > > > >>>> It will check alive nodes in case the primary failed immediately, and it
>> > > > >>>> will check a rejoining node once it has finished rebalance (PME on becoming
>> > > > >>>> an owner).
>> > > > >>>> A recovered cluster will be checked on the activation PME (or even before
>> > > > >>>> that?).
>> > > > >>>> Also, a warmed-up cluster will still be warm after the check.
>> > > > >>>>
>> > > > >>>> Have I missed any cases leading to broken sync, except bugs?
>> > > > >>>>
>> > > > >>>> 1) But how to keep this hash?
>> > > > >>>> - It should be automatically persisted on each checkpoint (it should not
>> > > > >>>> require recalculation on restore, snapshots should be covered too) (and
>> > > > >>>> covered by WAL?).
>> > > > >>>> - It should always be available in RAM for every partition (even for cold
>> > > > >>>> partitions never updated/read on this node) to be immediately used once all
>> > > > >>>> operations are done on PME.
>> > > > >>>>
>> > > > >>>> Can we have special pages to keep such hashes and never allow their eviction?
>> > > > >>>>
>> > > > >>>> 2) PME is a rare operation on a production cluster, but, it seems, we have
>> > > > >>>> to check consistency in a regular way.
>> > > > >>>> Since we have to finish all operations before the check, should we have a
>> > > > >>>> fake PME for a maintenance check in this case?
>> > > > >>>>
>> > > > >>>> On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>> > > > >>>>
>> > > > >>>>> Hi Anton,
>> > > > >>>>>
>> > > > >>>>> Thanks for sharing your ideas.
>> > > > >>>>> I think your approach should work in general.
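A rough sketch of what "storing the hash the same way as the update counter" could look like, as a standalone toy model only. PartitionMeta and MetaPersistence are made-up names; the real flow lives in the Ignite classes referenced above (points 1-4) and is considerably more involved.

    import java.util.List;

    /** Toy model: per-partition meta kept in heap and synced to disk on checkpoint. */
    class PartitionMeta {
        volatile long updateCnt;   // analogue of the per-partition update counter
        volatile long partHash;    // pre-calculated commutative content hash

        /** Applied on every cache update (see the incremental-hash sketch further below). */
        void onUpdate(long newCnt, long hashDelta) {
            updateCnt = newCnt;
            partHash += hashDelta;
        }
    }

    /** Toy checkpoint/restore flow mirroring points 2-4 above. */
    class MetaPersistence {
        /** 2) + 3) On checkpoint, write both values into the partition meta record. */
        long[] saveStoreMetadata(PartitionMeta meta) {
            return new long[] { meta.updateCnt, meta.partHash };   // stands in for the meta page write
        }

        /** 4) On restart, init from the last checkpoint and roll forward by WAL logical records. */
        PartitionMeta restore(long[] checkpointed, List<long[]> walLogicalRecords) {
            PartitionMeta meta = new PartitionMeta();
            meta.updateCnt = checkpointed[0];
            meta.partHash = checkpointed[1];

            for (long[] rec : walLogicalRecords)      // rec = {updateCnt, hashDelta}
                if (rec[0] > meta.updateCnt)          // skip records already covered by the checkpoint
                    meta.onUpdate(rec[0], rec[1]);

            return meta;
        }
    }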
>> > > > >>>>> I'll just share my concerns about possible issues that may come up.
>> > > > >>>>>
>> > > > >>>>> 1) Equality of update counters doesn't imply equality of partition contents
>> > > > >>>>> under load.
>> > > > >>>>> For every update, the primary node generates an update counter, then the
>> > > > >>>>> update is delivered to the backup node and gets applied with the
>> > > > >>>>> corresponding update counter. For example, there are two transactions
>> > > > >>>>> (A and B) that update partition X according to the following scenario:
>> > > > >>>>> - A updates key1 in partition X on the primary node and increments the
>> > > > >>>>> counter to 10
>> > > > >>>>> - B updates key2 in partition X on the primary node and increments the
>> > > > >>>>> counter to 11
>> > > > >>>>> - While A is still updating other keys, B is finally committed
>> > > > >>>>> - The update of key2 arrives at the backup node and sets the update counter
>> > > > >>>>> to 11
>> > > > >>>>> An observer will see equal update counters (11), but the update of key1 is
>> > > > >>>>> still missing in the backup partition.
>> > > > >>>>> This is a fundamental problem which is being solved here:
>> > > > >>>>> https://issues.apache.org/jira/browse/IGNITE-10078
>> > > > >>>>> "Online verify" should operate with the new complex update counters which
>> > > > >>>>> take such "update holes" into account. Otherwise, online verify may produce
>> > > > >>>>> false-positive inconsistency reports.
>> > > > >>>>>
>> > > > >>>>> 2) Acquisition and comparison of update counters is fast, but partition hash
>> > > > >>>>> calculation is long. We should check that the update counter remains
>> > > > >>>>> unchanged after every K keys handled.
>> > > > >>>>>
>> > > > >>>>> 3)
>> > > > >>>>>> Another hope is that we'll be able to pause/continue the scan, for example,
>> > > > >>>>>> we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days
>> > > > >>>>>> we'll check the whole cluster.
>> > > > >>>>> Totally makes sense.
>> > > > >>>>> We may find ourselves in a situation where some "hot" partitions are still
>> > > > >>>>> unprocessed, and every next attempt to calculate their partition hash fails
>> > > > >>>>> due to another concurrent update. We should be able to track the progress of
>> > > > >>>>> validation (% of calculation time wasted due to concurrent operations may be
>> > > > >>>>> a good metric; 100% is the worst case) and provide an option to stop/pause
>> > > > >>>>> the activity.
>> > > > >>>>> I think pause should return an "intermediate results report" with
>> > > > >>>>> information about which partitions have been successfully checked. With such
>> > > > >>>>> a report, we can resume the activity later: partitions from the report will
>> > > > >>>>> just be skipped.
>> > > > >>>>>
>> > > > >>>>> 4)
>> > > > >>>>>> Since "Idle verify" uses regular pagemem, I assume it replaces hot data
>> > > > >>>>>> with persisted data.
>> > > > >>>>>> So, we have to warm up the cluster after each check.
>> > > > >>>>>> Is there any chance to check without cooling down the cluster?
>> > > > >>>>> I don't see an easy way to achieve it with our page memory architecture.
>> > > > >>>>> We definitely can't just read pages from disk directly: we need to
>> > > > >>>>> synchronize page access with concurrent update operations and checkpoints.
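Concern 2 above could look roughly like the following. A minimal sketch, assuming a plain key/value iterator over the partition and an accessor for its update counter; both are placeholders, not Ignite's actual iterator or counter API.

    import java.util.Arrays;
    import java.util.Map;
    import java.util.Optional;
    import java.util.function.LongSupplier;

    /** Sketch: compute a partition hash only if the update counter stays stable. */
    class OnlinePartitionHasher {
        private static final int CHECK_EVERY_KEYS = 1000;

        /** @return the partition hash, or empty if a concurrent update invalidated the pass. */
        Optional<Long> tryHash(Iterable<Map.Entry<byte[], byte[]>> partition, LongSupplier updateCntr) {
            long cntrBefore = updateCntr.getAsLong();
            long hash = 0;
            int handled = 0;

            for (Map.Entry<byte[], byte[]> row : partition) {
                hash += Arrays.hashCode(row.getKey());
                hash += Arrays.hashCode(row.getValue());

                // Re-check every K keys so a long scan does not silently report a stale result.
                if (++handled % CHECK_EVERY_KEYS == 0 && updateCntr.getAsLong() != cntrBefore)
                    return Optional.empty();
            }

            // Final check: nothing changed between the first and the last key.
            return updateCntr.getAsLong() == cntrBefore ? Optional.of(hash) : Optional.empty();
        }
    }

As concern 1 explains, a stable plain counter still does not strictly guarantee identical content under load; the extended counters from IGNITE-10078 are what closes that gap.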
>> > > > >>>>> From my point of view, the correct way to solve this issue is improving our
>> > > > >>>>> page replacement [1] mechanics by making it truly scan-resistant.
>> > > > >>>>>
>> > > > >>>>> P.S. There's another possible way of achieving online verify: instead of
>> > > > >>>>> on-demand hash calculation, we can always keep an up-to-date hash value for
>> > > > >>>>> every partition. We'll need to update the hash on every insert/update/remove
>> > > > >>>>> operation, but there will be no reordering issues since the function that we
>> > > > >>>>> use for aggregating hash results (+) is commutative. With a pre-calculated
>> > > > >>>>> partition hash value, we can automatically detect inconsistent partitions on
>> > > > >>>>> every PME. What do you think?
>> > > > >>>>>
>> > > > >>>>> [1] -
>> > > > >>>>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
>> > > > >>>>>
>> > > > >>>>> Best Regards,
>> > > > >>>>> Ivan Rakov
>> > > > >>>>>
>> > > > >>>>> On 29.04.2019 12:20, Anton Vinogradov wrote:
>> > > > >>>>>> Igniters and especially Ivan Rakov,
>> > > > >>>>>>
>> > > > >>>>>> "Idle verify" [1] is a really cool tool to make sure that the cluster is
>> > > > >>>>>> consistent.
>> > > > >>>>>>
>> > > > >>>>>> 1) But it requires operations to be paused during the cluster check.
>> > > > >>>>>> On some clusters this check takes hours (3-4 hours in cases I saw).
>> > > > >>>>>> I've checked the code of "idle verify" and it seems possible to make it
>> > > > >>>>>> "online" with some assumptions.
>> > > > >>>>>>
>> > > > >>>>>> Idea:
>> > > > >>>>>> Currently, "Idle verify" checks that partition hashes, generated this way:
>> > > > >>>>>> while (it.hasNextX()) {
>> > > > >>>>>>     CacheDataRow row = it.nextX();
>> > > > >>>>>>     partHash += row.key().hashCode();
>> > > > >>>>>>     partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
>> > > > >>>>>> }
>> > > > >>>>>> are the same.
>> > > > >>>>>>
>> > > > >>>>>> What if we generate the same updateCounter-partitionHash pairs, but compare
>> > > > >>>>>> hashes only in case the counters are the same?
>> > > > >>>>>> So, for example, we ask the cluster to generate pairs for 64 partitions,
>> > > > >>>>>> then find that 55 have the same counters (were not updated during the check)
>> > > > >>>>>> and check them.
>> > > > >>>>>> The rest (64 - 55 = 9) partitions will be re-requested and rechecked along
>> > > > >>>>>> with an additional 55.
>> > > > >>>>>> This way we'll be able to check that the cluster is consistent even in case
>> > > > >>>>>> operations are in progress (just retrying the modified ones).
>> > > > >>>>>>
>> > > > >>>>>> Risks and assumptions:
>> > > > >>>>>> Using this strategy we'll check the cluster's consistency ... eventually,
>> > > > >>>>>> and the check will take more time even on an idle cluster.
>> > > > >>>>>> In case operationsPerTimeToGeneratePartitionHashes > partitionsCount we'll
>> > > > >>>>>> definitely gain no progress.
>> > > > >>>>>> But in case the load is not high, we'll be able to check the whole cluster.
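Ivan's P.S. above hinges on the aggregation function (+) being commutative, so the per-partition hash can be maintained incrementally regardless of the order in which the primary and backups apply the same set of updates. A minimal standalone sketch, reusing the per-row hashing style of the "idle verify" loop quoted above; the class itself is hypothetical.

    import java.util.Arrays;
    import java.util.concurrent.atomic.AtomicLong;

    /** Sketch: keep a partition content hash up to date on every insert/update/remove. */
    class IncrementalPartitionHash {
        private final AtomicLong hash = new AtomicLong();

        private static long rowHash(byte[] key, byte[] valBytes) {
            return (long)Arrays.hashCode(key) + Arrays.hashCode(valBytes);
        }

        /** Insert: add the new row's contribution. */
        void onInsert(byte[] key, byte[] val) {
            hash.addAndGet(rowHash(key, val));
        }

        /** Update: remove the old contribution, add the new one. */
        void onUpdate(byte[] key, byte[] oldVal, byte[] newVal) {
            hash.addAndGet(rowHash(key, newVal) - rowHash(key, oldVal));
        }

        /** Remove: subtract the removed row's contribution. */
        void onRemove(byte[] key, byte[] oldVal) {
            hash.addAndGet(-rowHash(key, oldVal));
        }

        /**
         * Because + is commutative, copies that applied the same set of updates
         * converge to the same value regardless of the order of application.
         */
        long value() {
            return hash.get();
        }
    }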
>> > > > >>>>>>
>> > > > >>>>>> Another hope is that we'll be able to pause/continue the scan; for example,
>> > > > >>>>>> we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days
>> > > > >>>>>> we'll check the whole cluster.
>> > > > >>>>>>
>> > > > >>>>>> Have I missed something?
>> > > > >>>>>>
>> > > > >>>>>> 2) Since "Idle verify" uses regular pagemem, I assume it replaces hot data
>> > > > >>>>>> with persisted data.
>> > > > >>>>>> So, we have to warm up the cluster after each check.
>> > > > >>>>>> Is there any chance to check without cooling down the cluster?
>> > > > >>>>>>
>> > > > >>>>>> [1]
>> > > > >>>>>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>> >
>> > --
>> >
>> > Best regards,
>> > Alexei Scherbakov
>> >
>>
>
> --
>
> Best regards,
> Alexei Scherbakov
>

--

Best regards,
Alexei Scherbakov
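A sketch of the retry loop described in the original message, combined with the "intermediate results report" idea for pause/continue. Everything here is hypothetical: the Pair snapshot, the collectPairs callback standing in for a cluster-wide request, and the node-id strings; and, as noted in the thread, equal plain counters are only a heuristic until the extended counters from IGNITE-10078 are in place.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Function;

    /** Sketch: compare hashes only for partitions whose owners report equal counters; retry the rest. */
    class OnlineVerifyDriver {
        /** Hypothetical per-owner snapshot: update counter + partition content hash. */
        static final class Pair {
            final long updateCntr;
            final long hash;
            Pair(long updateCntr, long hash) { this.updateCntr = updateCntr; this.hash = hash; }
        }

        /**
         * @param remaining    partitions still to verify (e.g. taken from a previous run's report).
         * @param collectPairs asks every owner of the given partition for its (counter, hash) pair.
         * @return partitions verified in this run; what is left in remaining can be resumed later.
         */
        Set<Integer> verify(Set<Integer> remaining, Function<Integer, Map<String, Pair>> collectPairs, int maxRounds) {
            Set<Integer> verified = new HashSet<>();

            for (int round = 0; round < maxRounds && !remaining.isEmpty(); round++) {
                for (Integer part : new HashSet<>(remaining)) {
                    Map<String, Pair> pairs = collectPairs.apply(part);   // node id -> (counter, hash)

                    long distinctCntrs = pairs.values().stream().map(p -> p.updateCntr).distinct().count();
                    if (distinctCntrs != 1)
                        continue;   // partition is being updated right now - re-request it next round

                    long distinctHashes = pairs.values().stream().map(p -> p.hash).distinct().count();
                    if (distinctHashes != 1)
                        System.err.println("Partition " + part + " is inconsistent across owners");

                    remaining.remove(part);
                    verified.add(part);
                }
            }

            return verified;   // acts as the "intermediate results report" for pause/continue
        }
    }

If the update rate stays above the rate at which pairs can be collected (the operationsPerTimeToGeneratePartitionHashes > partitionsCount case above), the remaining set never shrinks and the loop simply exhausts maxRounds, which matches the "no progress" risk noted in the original message.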