Anton,

It's ready for review, look for Patch Available status.
Yes, atomic caches are not fixed by this contribution. See [1]

[1] https://issues.apache.org/jira/browse/IGNITE-11797

On Tue, May 7, 2019 at 17:30, Anton Vinogradov <a...@apache.org> wrote:

> Alexei,
>
> Got it.
> Could you please let me know once the PR is ready for review?
> Currently I have some questions, but possibly they are caused by the
> non-final PR (e.g. why does the atomic counter still ignore misses?).
>
> On Tue, May 7, 2019 at 4:43 PM Alexei Scherbakov <
> alexey.scherbak...@gmail.com> wrote:
>
> > Anton,
> >
> > 1) Extended counters will indeed answer the question of whether a
> > partition can be safely restored to a synchronized state on all owners.
> > The only condition: one of the owners has no missed updates.
> > If not, the partition must be moved to the LOST state, see [1],
> > TxPartitionCounterStateOnePrimaryTwoBackupsFailAll*Test,
> > IgniteSystemProperties#IGNITE_FAIL_NODE_ON_UNRECOVERABLE_PARTITION_INCONSISTENCY
> > This is a known issue and can happen if all partition owners were
> > unavailable at some point.
> > In such a case we could try to recover consistency using some complex
> > recovery protocol, as you described. Related ticket: [2]
> >
> > 2) A bitset implementation is being considered as an option in GG
> > Community Edition. No specific implementation dates at the moment.
> >
> > 3) As for "online" partition verification, I think the best option right
> > now is to do verification partition by partition, using a read-only mode
> > per group partition under load.
> > While verification is in progress, all write ops are waiting, not
> > rejected.
> > This is the only 100% reliable way to compare partitions - by touching
> > actual data; all other ways, like a pre-computed hash, are error-prone.
> > There is already a ticket [3] for simplifying grid consistency
> > verification which could be used as a basis for such functionality.
> > As for avoiding cache pollution, we could try reading pages sequentially
> > from disk without lifting them into pagemem and computing some kind of
> > commutative hash. It's safe under the partition write lock.
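The "commutative hash" mentioned in point 3 can be illustrated with a minimal sketch. This is not Ignite code; the class and method names are hypothetical. The point is that combining per-entry hashes with `+` makes the result independent of the order in which entries (or raw pages) are read from disk:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Minimal sketch (not Ignite code) of a commutative partition hash:
 * per-entry hashes are combined with '+', so the result does not depend
 * on the order in which entries are visited.
 */
public class CommutativeHash {
    /** Sums per-entry hashes; '+' is commutative, so any read order works. */
    public static long hash(List<byte[]> entries) {
        long acc = 0;
        for (byte[] e : entries)
            acc += Arrays.hashCode(e); // content-based hash of each entry
        return acc;
    }

    public static void main(String[] args) {
        List<byte[]> sequentialRead = List.of(new byte[] {1}, new byte[] {2}, new byte[] {3});
        List<byte[]> reorderedRead  = List.of(new byte[] {3}, new byte[] {1}, new byte[] {2});
        // The same entries visited in a different order yield the same hash.
        System.out.println(hash(sequentialRead) == hash(reorderedRead)); // true
    }
}
```

The same property is what makes the incrementally maintained per-partition hash discussed later in this thread feasible: an update can adjust the hash without replaying the whole partition in order.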
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11611
> > [2] https://issues.apache.org/jira/browse/IGNITE-6324
> > [3] https://issues.apache.org/jira/browse/IGNITE-11256
> >
> > On Mon, May 6, 2019 at 16:12, Anton Vinogradov <a...@apache.org> wrote:
> >
> > > Ivan,
> > >
> > > 1) I've checked the PR [1], and it looks like it does not solve the
> > > issue either.
> > > AFAICS, the main goal here (in the PR) is to produce
> > > PartitionUpdateCounter#sequential, which can be false for all backups;
> > > which backup should win in that case?
> > >
> > > Is there any IEP or some other design page for this fix?
> > >
> > > It looks like extended counters should be able to recover the whole
> > > cluster even in case all copies of the same partition are broken.
> > > So, it seems, the counter should provide detailed info:
> > > - the biggest applied updateCounter
> > > - a list of all missed counters before the biggest applied one
> > > - an optional hash
> > >
> > > In that case, we'll be able to perform some exchange between broken
> > > copies.
> > > For example, we'll find that copy1 missed key1 and copy2 missed key2.
> > > It's pretty simple to fix both copies in that case.
> > > In case all misses can be solved this way, we'll continue cluster
> > > activation as if it was not broken before.
> > >
> > > 2) It seems I see a simpler solution to handle misses (than in the PR).
> > > Once you have newUpdateCounter > curUpdateCounter + 1, you should add a
> > > byte (or int or long - the smallest possible) value to a special
> > > structure.
> > > This value will represent the delta between newUpdateCounter and
> > > curUpdateCounter in a bitmask way.
> > > In case you handle an updateCounter less than curUpdateCounter, you
> > > should update the value in the structure responsible for this delta.
> > > For example, when you have the delta "2 to 6", you will have 00000000
> > > initially and 00011111 finally.
> > > Each delta update should finish with a check whether the delta is
> > > completed (value == 31 in this case). Once it is finished, it should be
> > > removed from the structure.
> > > Deltas can and should be reused to solve the GC issue.
> > >
> > > What do you think about the proposed solution?
> > >
> > > 3) Hash computation can be an additional extension for extended
> > > counters, just one more dimension to be extremely sure everything is ok.
> > > Any objections?
> > >
> > > [1] https://github.com/apache/ignite/pull/5765
> > >
> > > On Mon, May 6, 2019 at 12:48 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > >
> > > > Anton,
> > > >
> > > > Automatic quorum-based partition drop may work as a partial
> > > > workaround for IGNITE-10078, but the discussed approach surely
> > > > doesn't replace the IGNITE-10078 activity. We still don't know what
> > > > to do when a quorum can't be reached (2 partitions have hash X, 2
> > > > have hash Y), and keeping extended update counters is the only way to
> > > > resolve such a case.
> > > > On the other hand, precalculated partition hash validation on PME can
> > > > be a good addition to the IGNITE-10078 logic: we'll be able to detect
> > > > situations when extended update counters are equal but, for some
> > > > reason (a bug or whatsoever), partition contents are different.
> > > >
> > > > Best Regards,
> > > > Ivan Rakov
> > > >
> > > > On 06.05.2019 12:27, Anton Vinogradov wrote:
> > > > > Ivan, just to make sure ...
> > > > > The discussed case will fully solve the issue [1] in case we also
> > > > > add some strategy to reject partitions with missed updates
> > > > > (updateCnt==Ok, Hash!=Ok).
> > > > > For example, we may use the Quorum strategy, where the majority
> > > > > wins.
> > > > > Sounds correct?
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-10078
> > > > >
> > > > > On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <a...@apache.org>
> > > > > wrote:
> > > > >
> > > > >> Ivan,
> > > > >>
> > > > >> Thanks for the detailed explanation.
> > > > >> I'll try to implement the PoC to check the idea.
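The bitmask-delta scheme from item 2 above can be sketched as follows. This is a hypothetical illustration, not the Ignite PR's implementation; class and method names are invented, and a single `long` mask limits each gap to 64 missed updates, which is fine for a sketch:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch of tracking missed update counters as per-gap bitmasks.
 * Hypothetical names; a long mask caps one gap at 64 missed updates.
 */
public class GapTracker {
    private long curCounter;

    // Maps the first missed counter of a gap to {bitmask, gapSize}.
    private final TreeMap<Long, long[]> gaps = new TreeMap<>();

    public GapTracker(long initial) {
        curCounter = initial;
    }

    /** Registers an applied update with the given counter. */
    public void apply(long counter) {
        if (counter == curCounter + 1)
            curCounter = counter; // in-order update, nothing to track
        else if (counter > curCounter + 1) {
            // Counters (curCounter+1 .. counter-1) are missed: remember a zeroed bitmask.
            gaps.put(curCounter + 1, new long[] {0L, counter - curCounter - 1});
            curCounter = counter;
        }
        else {
            // Late update filling a previously recorded gap (assumed to exist).
            Map.Entry<Long, long[]> gap = gaps.floorEntry(counter);
            long[] v = gap.getValue();
            v[0] |= 1L << (counter - gap.getKey()); // set the bit for this counter
            if (v[0] == (1L << v[1]) - 1)           // all bits set, e.g. 31 for a 5-wide gap
                gaps.remove(gap.getKey());          // delta completed: drop it
        }
    }

    /** True when no missed updates remain. */
    public boolean sequential() {
        return gaps.isEmpty();
    }

    public static void main(String[] args) {
        GapTracker t = new GapTracker(1);
        t.apply(7); // counters 2..6 are now missed: mask 00000000, size 5
        t.apply(2); t.apply(3); t.apply(4); t.apply(5);
        System.out.println(t.sequential()); // false: counter 6 is still missing
        t.apply(6); // mask reaches 11111 (== 31): the gap is removed
        System.out.println(t.sequential()); // true
    }
}
```

The `main` method walks exactly the "2 to 6" example from the message: the gap's mask starts at 00000000 and the entry is removed once it reaches 00011111 (31).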
> > > > >>
> > > > >> On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>>> But how to keep this hash?
> > > > >>> I think we can just adopt the way partition update counters are
> > > > >>> stored. Update counters are:
> > > > >>> 1) Kept and updated in heap, see
> > > > >>> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed
> > > > >>> during regular cache operations, no page replacement latency
> > > > >>> issues)
> > > > >>> 2) Synchronized with page memory (and with disk) on every
> > > > >>> checkpoint, see GridCacheOffheapManager#saveStoreMetadata
> > > > >>> 3) Stored in the partition meta page, see
> > > > >>> PagePartitionMetaIO#setUpdateCounter
> > > > >>> 4) On node restart, we init the onheap counter with the value from
> > > > >>> disk (for the moment of the last checkpoint) and update it to the
> > > > >>> latest value during WAL logical records replay
> > > > >>>
> > > > >>>> 2) PME is a rare operation on a production cluster, but, it
> > > > >>>> seems, we have to check consistency in a regular way.
> > > > >>>> Since we have to finish all operations before the check, should
> > > > >>>> we have a fake PME for a maintenance check in this case?
> > > > >>> From my experience, PME happens on prod clusters from time to time
> > > > >>> (several times per week), which can be enough. In case it's needed
> > > > >>> to check consistency more often than regular PMEs occur, we can
> > > > >>> implement a command that will trigger a fake PME for consistency
> > > > >>> checking.
> > > > >>>
> > > > >>> Best Regards,
> > > > >>> Ivan Rakov
> > > > >>>
> > > > >>> On 29.04.2019 18:53, Anton Vinogradov wrote:
> > > > >>>> Ivan, thanks for the analysis!
> > > > >>>>
> > > > >>>>>> With a pre-calculated partition hash value, we can
> > > > >>>>>> automatically detect inconsistent partitions on every PME.
> > > > >>>> Great idea; it seems this covers all broken sync cases.
> > > > >>>>
> > > > >>>> It will check alive nodes in case the primary failed immediately,
> > > > >>>> and will check a rejoining node once it has finished rebalance
> > > > >>>> (PME on becoming an owner).
> > > > >>>> A recovered cluster will be checked on the activation PME (or
> > > > >>>> even before that?).
> > > > >>>> Also, a warmed cluster will still be warm after the check.
> > > > >>>>
> > > > >>>> Have I missed any cases leading to broken sync, except bugs?
> > > > >>>>
> > > > >>>> 1) But how to keep this hash?
> > > > >>>> - It should be automatically persisted on each checkpoint (it
> > > > >>>> should not require recalculation on restore; snapshots should be
> > > > >>>> covered too) (and covered by WAL?).
> > > > >>>> - It should always be available in RAM for every partition (even
> > > > >>>> for cold partitions never updated/read on this node) to be
> > > > >>>> immediately used once all operations are done on PME.
> > > > >>>>
> > > > >>>> Can we have special pages to keep such hashes and never allow
> > > > >>>> their eviction?
> > > > >>>>
> > > > >>>> 2) PME is a rare operation on a production cluster, but, it
> > > > >>>> seems, we have to check consistency in a regular way.
> > > > >>>> Since we have to finish all operations before the check, should
> > > > >>>> we have a fake PME for a maintenance check in this case?
> > > > >>>>
> > > > >>>> On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>> Hi Anton,
> > > > >>>>
> > > > >>>> Thanks for sharing your ideas.
> > > > >>>> I think your approach should work in general. I'll just share my
> > > > >>>> concerns about possible issues that may come up.
> > > > >>>>
> > > > >>>> 1) Equality of update counters doesn't imply equality of
> > > > >>>> partition contents under load.
> > > > >>>> For every update, the primary node generates an update counter,
> > > > >>>> and then the update is delivered to the backup node and gets
> > > > >>>> applied with the corresponding update counter. For example, there
> > > > >>>> are two transactions (A and B) that update partition X by the
> > > > >>>> following scenario:
> > > > >>>> - A updates key1 in partition X on the primary node and
> > > > >>>> increments the counter to 10
> > > > >>>> - B updates key2 in partition X on the primary node and
> > > > >>>> increments the counter to 11
> > > > >>>> - While A is still updating other keys, B is finally committed
> > > > >>>> - The update of key2 arrives at the backup node and sets the
> > > > >>>> update counter to 11
> > > > >>>> An observer will see equal update counters (11), but the update
> > > > >>>> of key1 is still missing in the backup partition.
> > > > >>>> This is a fundamental problem which is being solved here:
> > > > >>>> https://issues.apache.org/jira/browse/IGNITE-10078
> > > > >>>> "Online verify" should operate with the new complex update
> > > > >>>> counters which take such "update holes" into account. Otherwise,
> > > > >>>> online verify may produce false-positive inconsistency reports.
> > > > >>>>
> > > > >>>> 2) Acquisition and comparison of update counters is fast, but
> > > > >>>> partition hash calculation is long. We should check that the
> > > > >>>> update counter remains unchanged after every K keys handled.
> > > > >>>>
> > > > >>>> 3)
> > > > >>>>
> > > > >>>>> Another hope is that we'll be able to pause/continue the scan;
> > > > >>>>> for example, we'll check 1/3 of partitions today, 1/3 tomorrow,
> > > > >>>>> and in three days we'll have checked the whole cluster.
> > > > >>>> Totally makes sense.
> > > > >>>> We may find ourselves in a situation where some "hot" partitions
> > > > >>>> are still unprocessed, and every next attempt to calculate a
> > > > >>>> partition hash fails due to another concurrent update. We should
> > > > >>>> be able to track the progress of validation (% of calculation
> > > > >>>> time wasted due to concurrent operations may be a good metric;
> > > > >>>> 100% is the worst case) and provide an option to stop/pause the
> > > > >>>> activity.
> > > > >>>> I think pause should return an "intermediate results report" with
> > > > >>>> information about which partitions have been successfully
> > > > >>>> checked. With such a report, we can resume the activity later:
> > > > >>>> partitions from the report will just be skipped.
> > > > >>>>
> > > > >>>> 4)
> > > > >>>>
> > > > >>>>> Since "Idle verify" uses the regular pagemem, I assume it
> > > > >>>>> replaces hot data with persisted data.
> > > > >>>>> So, we have to warm up the cluster after each check.
> > > > >>>>> Are there any chances to check without cooling the cluster?
> > > > >>>> I don't see an easy way to achieve it with our page memory
> > > > >>>> architecture. We definitely can't just read pages from disk
> > > > >>>> directly: we need to synchronize page access with concurrent
> > > > >>>> update operations and checkpoints.
> > > > >>>> From my point of view, the correct way to solve this issue is
> > > > >>>> improving our page replacement [1] mechanics by making it truly
> > > > >>>> scan-resistant.
> > > > >>>>
> > > > >>>> P.S. There's another possible way of achieving online verify:
> > > > >>>> instead of on-demand hash calculation, we can always keep an
> > > > >>>> up-to-date hash value for every partition.
> > > > >>>> We'll need to update the hash on every insert/update/remove
> > > > >>>> operation, but there will be no reordering issues, since the
> > > > >>>> function that we use for aggregating hash results (+) is
> > > > >>>> commutative. With a pre-calculated partition hash value, we can
> > > > >>>> automatically detect inconsistent partitions on every PME. What
> > > > >>>> do you think?
> > > > >>>>
> > > > >>>> [1] -
> > > > >>>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
> > > > >>>>
> > > > >>>> Best Regards,
> > > > >>>> Ivan Rakov
> > > > >>>>
> > > > >>>> On 29.04.2019 12:20, Anton Vinogradov wrote:
> > > > >>>>> Igniters, and especially Ivan Rakov,
> > > > >>>>>
> > > > >>>>> "Idle verify" [1] is a really cool tool to make sure that the
> > > > >>>>> cluster is consistent.
> > > > >>>>>
> > > > >>>>> 1) But it requires operations to be paused during the cluster
> > > > >>>>> check. On some clusters, this check requires hours (3-4 hours in
> > > > >>>>> cases I saw).
> > > > >>>>> I've checked the code of "idle verify" and it seems possible to
> > > > >>>>> make it "online" with some assumptions.
> > > > >>>>>
> > > > >>>>> Idea:
> > > > >>>>> Currently "Idle verify" checks that partition hashes, generated
> > > > >>>>> this way:
> > > > >>>>> while (it.hasNextX()) {
> > > > >>>>>     CacheDataRow row = it.nextX();
> > > > >>>>>     partHash += row.key().hashCode();
> > > > >>>>>     partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
> > > > >>>>> }
> > > > >>>>> , are the same.
> > > > >>>>>
> > > > >>>>> What if we generate the same updateCounter-partitionHash pairs,
> > > > >>>>> but compare hashes only in case the counters are the same?
> > > > >>>>> So, for example, we will ask the cluster to generate pairs for
> > > > >>>>> 64 partitions, then will find that 55 have the same counters
> > > > >>>>> (were not updated during the check) and check them.
> > > > >>>>> The rest (64-55 = 9) partitions will be re-requested and
> > > > >>>>> rechecked together with the additional 55.
> > > > >>>>> This way we'll be able to check that the cluster is consistent
> > > > >>>>> even in case operations are in progress (just retrying the
> > > > >>>>> modified ones).
> > > > >>>>>
> > > > >>>>> Risks and assumptions:
> > > > >>>>> Using this strategy we'll check the cluster's consistency ...
> > > > >>>>> eventually, and the check will take more time even on an idle
> > > > >>>>> cluster.
> > > > >>>>> In case operationsPerTimeToGeneratePartitionHashes >
> > > > >>>>> partitionsCount, we'll definitely gain no progress.
> > > > >>>>> But in case the load is not high, we'll be able to check the
> > > > >>>>> whole cluster.
> > > > >>>>>
> > > > >>>>> Another hope is that we'll be able to pause/continue the scan;
> > > > >>>>> for example, we'll check 1/3 of partitions today, 1/3 tomorrow,
> > > > >>>>> and in three days we'll have checked the whole cluster.
> > > > >>>>>
> > > > >>>>> Have I missed something?
> > > > >>>>>
> > > > >>>>> 2) Since "Idle verify" uses the regular pagemem, I assume it
> > > > >>>>> replaces hot data with persisted data.
> > > > >>>>> So, we have to warm up the cluster after each check.
> > > > >>>>> Are there any chances to check without cooling the cluster?
> > > > >>>>>
> > > > >>>>> [1]
> > > > >>>>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> >
> > --
> > Best regards,
> > Alexei Scherbakov

-- 
Best regards,
Alexei Scherbakov
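The retry scheme from the last quoted message (compare partition hashes only when the update counters of all copies match, and re-request the rest) can be sketched as a single verification round. This is a hypothetical model, not Ignite's actual verify code; the `State` record and `verifyRound` method are invented names, and it assumes both maps cover the same partition set:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of one round of "online verify": hashes of two partition copies are
 * compared only when their update counters match; otherwise the partition is
 * queued for another round. Hypothetical names and types.
 */
public class OnlineVerify {
    /** (updateCounter, partitionHash) pair reported by one owner of a partition. */
    record State(long counter, long hash) {}

    /**
     * Returns the partitions that still need re-checking; throws if a
     * partition has equal counters on both copies but different hashes.
     */
    static Set<Integer> verifyRound(Map<Integer, State> primary, Map<Integer, State> backup) {
        Set<Integer> retry = new HashSet<>();
        for (Map.Entry<Integer, State> e : primary.entrySet()) {
            State p = e.getValue(), b = backup.get(e.getKey());
            if (p.counter() != b.counter())
                retry.add(e.getKey()); // partition was updated mid-check: retry later
            else if (p.hash() != b.hash())
                throw new IllegalStateException("Partition " + e.getKey() + " is inconsistent");
            // else: counters and hashes equal -> partition verified this round
        }
        return retry;
    }

    public static void main(String[] args) {
        Map<Integer, State> primary = Map.of(0, new State(10, 42), 1, new State(7, 5));
        Map<Integer, State> backup  = Map.of(0, new State(10, 42), 1, new State(6, 4));
        // Partition 0 is verified; partition 1 was concurrently updated and must be retried.
        System.out.println(verifyRound(primary, backup)); // [1]
    }
}
```

A driver would call `verifyRound` repeatedly with fresh pairs for the returned set until it is empty, which matches the "64 partitions, 55 stable, 9 re-requested" example above; as the thread notes, under sustained load on the same partitions this loop may never terminate, hence the pause/resume discussion. Note that, per Ivan's point 1, plain counters are insufficient under load; a real implementation would compare the extended (gap-aware) counter state instead.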