I've linked to IGNITE-10078 additional related/dependent tickets that will require investigation as soon as the main contribution is accepted.
Tue, 7 May 2019 at 18:10, Alexei Scherbakov <alexey.scherbak...@gmail.com>:

> Anton,
>
> It's ready for review, look for the Patch Available status.
> Yes, atomic caches are not fixed by this contribution. See [1]
>
> [1] https://issues.apache.org/jira/browse/IGNITE-11797
>
> Tue, 7 May 2019 at 17:30, Anton Vinogradov <a...@apache.org>:
>
>> Alexei,
>>
>> Got it.
>> Could you please let me know once the PR is ready for review?
>> Currently I have some questions, but possibly they are caused by the non-final PR
>> (e.g. why does the atomic counter still ignore misses?).
>>
>> On Tue, May 7, 2019 at 4:43 PM Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>>
>> > Anton,
>> >
>> > 1) Extended counters will indeed answer the question whether a partition can be
>> > safely restored to a synchronized state on all owners.
>> > The only condition: one of the owners has no missed updates.
>> > If not, the partition must be moved to the LOST state, see [1],
>> > TxPartitionCounterStateOnePrimaryTwoBackupsFailAll*Test,
>> > IgniteSystemProperties#IGNITE_FAIL_NODE_ON_UNRECOVERABLE_PARTITION_INCONSISTENCY
>> > This is a known issue and can happen if all partition owners were unavailable
>> > at some point.
>> > In such a case we could try to recover consistency using some complex recovery
>> > protocol, as you described. Related ticket: [2]
>> >
>> > 2) A bitset implementation is being considered as an option in GG Community
>> > Edition. No specific implementation dates at the moment.
>> >
>> > 3) As for "online" partition verification, I think the best option right now
>> > is to verify partition by partition, using read-only mode per partition group,
>> > under load.
>> > While verification is in progress, all write ops are waiting, not rejected.
>> > This is the only 100% reliable way to compare partitions - by touching the
>> > actual data; all other ways, like a pre-computed hash, are error-prone.
>> > There is already a ticket [3] for simplifying grid consistency verification
>> > which could be used as a basis for such functionality.
>> > As for avoiding cache pollution, we could try to read pages sequentially from
>> > disk, without lifting them into pagemem, and compute some kind of commutative
>> > hash. It's safe under the partition write lock.
>> >
>> > [1] https://issues.apache.org/jira/browse/IGNITE-11611
>> > [2] https://issues.apache.org/jira/browse/IGNITE-6324
>> > [3] https://issues.apache.org/jira/browse/IGNITE-11256
>> >
>> > Mon, 6 May 2019 at 16:12, Anton Vinogradov <a...@apache.org>:
>> >
>> > > Ivan,
>> > >
>> > > 1) I've checked the PR [1] and it looks like it does not solve the issue either.
>> > > AFAICS, the main goal there (in the PR) is to produce
>> > > PartitionUpdateCounter#sequential, which can be false for all backups; which
>> > > backup should win in that case?
>> > >
>> > > Is there any IEP or some other design page for this fix?
>> > >
>> > > It looks like extended counters should be able to recover the whole cluster
>> > > even in case all copies of the same partition are broken.
>> > > So, it seems, the counter should provide detailed info:
>> > > - the biggest applied updateCounter
>> > > - the list of all missed counters below the biggest applied one
>> > > - an optional hash
>> > >
>> > > In that case, we'll be able to perform some exchange between the broken copies.
>> > > For example, we'll find that copy1 missed key1, and copy2 missed key2.
>> > > It's pretty simple to fix both copies in that case.
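As an illustration of the "detailed info" such an extended counter would have to expose for a copy-to-copy exchange, here is a minimal standalone sketch. All class and method names are hypothetical; this is not Ignite's actual PartitionUpdateCounter API, just a toy model of the three pieces of information listed above.

    import java.util.Set;
    import java.util.TreeSet;

    /** Hypothetical, simplified view of an "extended" per-partition counter state. */
    class ExtendedCounterInfo {
        final long highestApplied;      // biggest applied updateCounter
        final Set<Long> missedBefore;   // counters below highestApplied that were never applied
        final long contentHash;         // optional commutative hash of the partition content

        ExtendedCounterInfo(long highestApplied, Set<Long> missedBefore, long contentHash) {
            this.highestApplied = highestApplied;
            this.missedBefore = new TreeSet<>(missedBefore);
            this.contentHash = contentHash;
        }

        /** Counters this copy is missing that the given peer copy has actually applied. */
        Set<Long> recoverableFrom(ExtendedCounterInfo peer) {
            Set<Long> needed = new TreeSet<>(missedBefore);

            // Everything above our highest counter that the peer managed to apply is needed too.
            for (long c = highestApplied + 1; c <= peer.highestApplied; c++)
                needed.add(c);

            needed.removeAll(peer.missedBefore);            // the peer missed these as well
            needed.removeIf(c -> c > peer.highestApplied);  // the peer never saw these

            return needed;
        }
    }

If, after exchanging such structures, every gap of every copy can be served by at least one peer, the copies can heal each other as described above; otherwise the partition would still have to be declared LOST.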
>> > > In case all misses can be resolved this way, we'll continue cluster activation
>> > > as if it had not been broken at all.
>> > >
>> > > 2) It seems I see a simpler solution for handling misses (than the one in the PR).
>> > > Once you have newUpdateCounter > curUpdateCounter + 1, you should add a byte
>> > > (or int or long - the smallest possible) value to a special structure.
>> > > This value will represent the delta between newUpdateCounter and curUpdateCounter
>> > > in a bitmask way.
>> > > In case you handle an updateCounter less than curUpdateCounter, you should update
>> > > the value in the structure responsible for this delta.
>> > > For example, when you have delta "2 to 6", you will have 00000000 initially and
>> > > 00011111 finally.
>> > > Each delta update should finish with a check that the delta is complete (value == 31
>> > > in this case). Once it is complete, it should be removed from the structure.
>> > > Deltas can and should be reused to solve the GC issue.
>> > >
>> > > What do you think about the proposed solution?
>> > >
>> > > 3) Hash computation can be an additional extension for extended counters,
>> > > just one more dimension to be extremely sure everything is ok.
>> > > Any objections?
>> > >
>> > > [1] https://github.com/apache/ignite/pull/5765
>> > >
>> > > On Mon, May 6, 2019 at 12:48 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>> > >
>> > > > Anton,
>> > > >
>> > > > Automatic quorum-based partition drop may work as a partial workaround for
>> > > > IGNITE-10078, but the discussed approach surely doesn't replace the
>> > > > IGNITE-10078 activity. We still don't know what to do when a quorum can't
>> > > > be reached (2 partitions have hash X, 2 have hash Y), and keeping extended
>> > > > update counters is the only way to resolve such a case.
>> > > > On the other hand, precalculated partition hash validation on PME can be a
>> > > > good addition to the IGNITE-10078 logic: we'll be able to detect situations
>> > > > when extended update counters are equal, but for some reason (a bug or
>> > > > whatsoever) partition contents are different.
>> > > >
>> > > > Best Regards,
>> > > > Ivan Rakov
>> > > >
>> > > > On 06.05.2019 12:27, Anton Vinogradov wrote:
>> > > > > Ivan, just to make sure ...
>> > > > > The discussed case will fully solve the issue [1] in case we also add some
>> > > > > strategy to reject partitions with missed updates (updateCnt==Ok, Hash!=Ok).
>> > > > > For example, we may use the Quorum strategy, where the majority wins.
>> > > > > Sounds correct?
>> > > > >
>> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-10078
>> > > > >
>> > > > > On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <a...@apache.org> wrote:
>> > > > >
>> > > > >> Ivan,
>> > > > >>
>> > > > >> Thanks for the detailed explanation.
>> > > > >> I'll try to implement the PoC to check the idea.
>> > > > >>
>> > > > >> On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>> > > > >>
>> > > > >>>> But how to keep this hash?
>> > > > >>> I think we can just adopt the way partition update counters are stored.
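To make the delta/bitmask idea from Anton's point 2 above concrete, here is a minimal standalone sketch. All names are hypothetical; this is a toy model rather than the counter structure from PR [1], and a single gap wider than 64 counters is not handled here.

    import java.util.Map;
    import java.util.TreeMap;

    /**
     * Sketch of tracking missed update counters with one small bitmask per gap
     * instead of a per-counter set of misses.
     */
    class GapTrackingCounter {
        /** One gap ("delta") of missed counters: [start, start + size - 1]. */
        private static final class Delta {
            final int size;   // number of missed counters in the gap (<= 64 in this sketch)
            long mask;        // bit i set => counter (start + i) has been applied

            Delta(int size) { this.size = size; }

            boolean complete() { return Long.bitCount(mask) == size; } // e.g. 0b11111 == 31 for size 5
        }

        private long cur;                                             // highest update counter observed so far
        private final TreeMap<Long, Delta> deltas = new TreeMap<>();  // gap start -> bitmask

        synchronized void onUpdate(long counter) {
            if (counter == cur + 1) {                       // common in-order case
                cur = counter;
            }
            else if (counter > cur + 1) {                   // a hole [cur + 1, counter - 1] appears
                deltas.put(cur + 1, new Delta((int)(counter - cur - 1)));
                cur = counter;
            }
            else {                                          // counter <= cur: fills part of some gap
                Map.Entry<Long, Delta> e = deltas.floorEntry(counter);
                if (e == null || counter >= e.getKey() + e.getValue().size)
                    return;                                 // duplicate or already applied, ignore

                Delta d = e.getValue();
                d.mask |= 1L << (counter - e.getKey());

                if (d.complete())
                    deltas.remove(e.getKey());              // gap fully filled, drop the delta for reuse
            }
        }

        /** True if there are no missed updates below the highest observed counter. */
        synchronized boolean sequential() { return deltas.isEmpty(); }
    }

With the example from the message: at cur == 1, an update with counter 7 opens the delta "2 to 6" (5 bits, all zero); applying counters 2..6 flips the mask to 0b11111 == 31, after which the delta is discarded.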
>> > > > >>> Update counters are:
>> > > > >>> 1) Kept and updated in heap, see
>> > > > >>> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed during
>> > > > >>> regular cache operations, no page replacement latency issues)
>> > > > >>> 2) Synchronized with page memory (and with disk) on every checkpoint, see
>> > > > >>> GridCacheOffheapManager#saveStoreMetadata
>> > > > >>> 3) Stored in the partition meta page, see PagePartitionMetaIO#setUpdateCounter
>> > > > >>> 4) On node restart, we init the onheap counter with the value from disk (as of
>> > > > >>> the last checkpoint) and update it to the latest value during WAL logical
>> > > > >>> records replay
>> > > > >>>
>> > > > >>>> 2) PME is a rare operation on a production cluster, but, it seems, we have
>> > > > >>>> to check consistency in a regular way.
>> > > > >>>> Since we have to finish all operations before the check, should we have a
>> > > > >>>> fake PME for a maintenance check in this case?
>> > > > >>> From my experience, PME happens on prod clusters from time to time (several
>> > > > >>> times per week), which can be enough. In case it's needed to check consistency
>> > > > >>> more often than regular PMEs occur, we can implement a command that will
>> > > > >>> trigger a fake PME for consistency checking.
>> > > > >>>
>> > > > >>> Best Regards,
>> > > > >>> Ivan Rakov
>> > > > >>>
>> > > > >>> On 29.04.2019 18:53, Anton Vinogradov wrote:
>> > > > >>>> Ivan, thanks for the analysis!
>> > > > >>>>
>> > > > >>>>> With a pre-calculated partition hash value, we can automatically detect
>> > > > >>>>> inconsistent partitions on every PME.
>> > > > >>>> Great idea, it seems this covers all broken sync cases.
>> > > > >>>>
>> > > > >>>> It will check alive nodes in case the primary failed immediately, and it
>> > > > >>>> will check a rejoining node once it has finished rebalance (PME on becoming
>> > > > >>>> an owner).
>> > > > >>>> A recovered cluster will be checked on the activation PME (or even before
>> > > > >>>> that?).
>> > > > >>>> Also, a warmed-up cluster will still be warm after the check.
>> > > > >>>>
>> > > > >>>> Have I missed any cases leading to broken sync, except bugs?
>> > > > >>>>
>> > > > >>>> 1) But how to keep this hash?
>> > > > >>>> - It should be automatically persisted on each checkpoint (it should not
>> > > > >>>> require recalculation on restore, snapshots should be covered too) (and
>> > > > >>>> covered by WAL?).
>> > > > >>>> - It should always be available in RAM for every partition (even for cold
>> > > > >>>> partitions never updated/read on this node) to be immediately used once all
>> > > > >>>> operations are done on PME.
>> > > > >>>>
>> > > > >>>> Can we have special pages to keep such hashes and never allow their eviction?
>> > > > >>>>
>> > > > >>>> 2) PME is a rare operation on a production cluster, but, it seems, we have
>> > > > >>>> to check consistency in a regular way.
>> > > > >>>> Since we have to finish all operations before the check, should we have a
>> > > > >>>> fake PME for a maintenance check in this case?
>> > > > >>>>
>> > > > >>>> On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
>> > > > >>>>
>> > > > >>>>> Hi Anton,
>> > > > >>>>>
>> > > > >>>>> Thanks for sharing your ideas.
>> > > > >>>>> I think your approach should work in general.
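A rough sketch of what "storing the hash the same way as the update counter" could look like, as a standalone toy model only. PartitionMeta and MetaPersistence are made-up names; the real flow lives in the Ignite classes referenced above (points 1-4) and is considerably more involved.

    import java.util.List;

    /** Toy model: per-partition meta kept in heap and synced to disk on checkpoint. */
    class PartitionMeta {
        volatile long updateCnt;   // analogue of the per-partition update counter
        volatile long partHash;    // pre-calculated commutative content hash

        /** Applied on every cache update (see the incremental-hash sketch further below). */
        void onUpdate(long newCnt, long hashDelta) {
            updateCnt = newCnt;
            partHash += hashDelta;
        }
    }

    /** Toy checkpoint/restore flow mirroring points 2-4 above. */
    class MetaPersistence {
        /** 2) + 3) On checkpoint, write both values into the partition meta record. */
        long[] saveStoreMetadata(PartitionMeta meta) {
            return new long[] { meta.updateCnt, meta.partHash };   // stands in for the meta page write
        }

        /** 4) On restart, init from the last checkpoint and roll forward by WAL logical records. */
        PartitionMeta restore(long[] checkpointed, List<long[]> walLogicalRecords) {
            PartitionMeta meta = new PartitionMeta();
            meta.updateCnt = checkpointed[0];
            meta.partHash = checkpointed[1];

            for (long[] rec : walLogicalRecords)      // rec = {updateCnt, hashDelta}
                if (rec[0] > meta.updateCnt)          // skip records already covered by the checkpoint
                    meta.onUpdate(rec[0], rec[1]);

            return meta;
        }
    }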
>> > > > >>>>> I'll just share my concerns about possible issues that may come up.
>> > > > >>>>>
>> > > > >>>>> 1) Equality of update counters doesn't imply equality of partition contents
>> > > > >>>>> under load.
>> > > > >>>>> For every update, the primary node generates an update counter, then the
>> > > > >>>>> update is delivered to the backup node and gets applied with the
>> > > > >>>>> corresponding update counter. For example, there are two transactions
>> > > > >>>>> (A and B) that update partition X according to the following scenario:
>> > > > >>>>> - A updates key1 in partition X on the primary node and increments the
>> > > > >>>>> counter to 10
>> > > > >>>>> - B updates key2 in partition X on the primary node and increments the
>> > > > >>>>> counter to 11
>> > > > >>>>> - While A is still updating other keys, B is finally committed
>> > > > >>>>> - The update of key2 arrives at the backup node and sets the update counter
>> > > > >>>>> to 11
>> > > > >>>>> An observer will see equal update counters (11), but the update of key1 is
>> > > > >>>>> still missing in the backup partition.
>> > > > >>>>> This is a fundamental problem which is being solved here:
>> > > > >>>>> https://issues.apache.org/jira/browse/IGNITE-10078
>> > > > >>>>> "Online verify" should operate with the new complex update counters which
>> > > > >>>>> take such "update holes" into account. Otherwise, online verify may produce
>> > > > >>>>> false-positive inconsistency reports.
>> > > > >>>>>
>> > > > >>>>> 2) Acquisition and comparison of update counters is fast, but partition hash
>> > > > >>>>> calculation is long. We should check that the update counter remains
>> > > > >>>>> unchanged after every K keys handled.
>> > > > >>>>>
>> > > > >>>>> 3)
>> > > > >>>>>> Another hope is that we'll be able to pause/continue the scan, for example,
>> > > > >>>>>> we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days
>> > > > >>>>>> we'll check the whole cluster.
>> > > > >>>>> Totally makes sense.
>> > > > >>>>> We may find ourselves in a situation where some "hot" partitions are still
>> > > > >>>>> unprocessed, and every next attempt to calculate their partition hash fails
>> > > > >>>>> due to another concurrent update. We should be able to track the progress of
>> > > > >>>>> validation (% of calculation time wasted due to concurrent operations may be
>> > > > >>>>> a good metric; 100% is the worst case) and provide an option to stop/pause
>> > > > >>>>> the activity.
>> > > > >>>>> I think pause should return an "intermediate results report" with
>> > > > >>>>> information about which partitions have been successfully checked. With such
>> > > > >>>>> a report, we can resume the activity later: partitions from the report will
>> > > > >>>>> just be skipped.
>> > > > >>>>>
>> > > > >>>>> 4)
>> > > > >>>>>> Since "Idle verify" uses regular pagemem, I assume it replaces hot data
>> > > > >>>>>> with persisted data.
>> > > > >>>>>> So, we have to warm up the cluster after each check.
>> > > > >>>>>> Is there any chance to check without cooling down the cluster?
>> > > > >>>>> I don't see an easy way to achieve it with our page memory architecture.
>> > > > >>>>> We definitely can't just read pages from disk directly: we need to
>> > > > >>>>> synchronize page access with concurrent update operations and checkpoints.
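Concern 2 above could look roughly like the following. A minimal sketch, assuming a plain key/value iterator over the partition and an accessor for its update counter; both are placeholders, not Ignite's actual iterator or counter API.

    import java.util.Arrays;
    import java.util.Map;
    import java.util.Optional;
    import java.util.function.LongSupplier;

    /** Sketch: compute a partition hash only if the update counter stays stable. */
    class OnlinePartitionHasher {
        private static final int CHECK_EVERY_KEYS = 1000;

        /** @return the partition hash, or empty if a concurrent update invalidated the pass. */
        Optional<Long> tryHash(Iterable<Map.Entry<byte[], byte[]>> partition, LongSupplier updateCntr) {
            long cntrBefore = updateCntr.getAsLong();
            long hash = 0;
            int handled = 0;

            for (Map.Entry<byte[], byte[]> row : partition) {
                hash += Arrays.hashCode(row.getKey());
                hash += Arrays.hashCode(row.getValue());

                // Re-check every K keys so a long scan does not silently report a stale result.
                if (++handled % CHECK_EVERY_KEYS == 0 && updateCntr.getAsLong() != cntrBefore)
                    return Optional.empty();
            }

            // Final check: nothing changed between the first and the last key.
            return updateCntr.getAsLong() == cntrBefore ? Optional.of(hash) : Optional.empty();
        }
    }

As concern 1 explains, a stable plain counter still does not strictly guarantee identical content under load; the extended counters from IGNITE-10078 are what closes that gap.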
>> > > > >>>>> From my point of view, the correct way to solve this issue is improving our
>> > > > >>>>> page replacement [1] mechanics by making it truly scan-resistant.
>> > > > >>>>>
>> > > > >>>>> P.S. There's another possible way of achieving online verify: instead of
>> > > > >>>>> on-demand hash calculation, we can always keep an up-to-date hash value for
>> > > > >>>>> every partition. We'll need to update the hash on every insert/update/remove
>> > > > >>>>> operation, but there will be no reordering issues since the function that we
>> > > > >>>>> use for aggregating hash results (+) is commutative. With a pre-calculated
>> > > > >>>>> partition hash value, we can automatically detect inconsistent partitions on
>> > > > >>>>> every PME. What do you think?
>> > > > >>>>>
>> > > > >>>>> [1] -
>> > > > >>>>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
>> > > > >>>>>
>> > > > >>>>> Best Regards,
>> > > > >>>>> Ivan Rakov
>> > > > >>>>>
>> > > > >>>>> On 29.04.2019 12:20, Anton Vinogradov wrote:
>> > > > >>>>>> Igniters and especially Ivan Rakov,
>> > > > >>>>>>
>> > > > >>>>>> "Idle verify" [1] is a really cool tool to make sure that the cluster is
>> > > > >>>>>> consistent.
>> > > > >>>>>>
>> > > > >>>>>> 1) But it requires operations to be paused during the cluster check.
>> > > > >>>>>> On some clusters this check takes hours (3-4 hours in cases I saw).
>> > > > >>>>>> I've checked the code of "idle verify" and it seems possible to make it
>> > > > >>>>>> "online" with some assumptions.
>> > > > >>>>>>
>> > > > >>>>>> Idea:
>> > > > >>>>>> Currently, "Idle verify" checks that partition hashes, generated this way:
>> > > > >>>>>> while (it.hasNextX()) {
>> > > > >>>>>>     CacheDataRow row = it.nextX();
>> > > > >>>>>>     partHash += row.key().hashCode();
>> > > > >>>>>>     partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
>> > > > >>>>>> }
>> > > > >>>>>> are the same.
>> > > > >>>>>>
>> > > > >>>>>> What if we generate the same updateCounter-partitionHash pairs, but compare
>> > > > >>>>>> hashes only in case the counters are the same?
>> > > > >>>>>> So, for example, we ask the cluster to generate pairs for 64 partitions,
>> > > > >>>>>> then find that 55 have the same counters (were not updated during the check)
>> > > > >>>>>> and check them.
>> > > > >>>>>> The rest (64 - 55 = 9) partitions will be re-requested and rechecked along
>> > > > >>>>>> with an additional 55.
>> > > > >>>>>> This way we'll be able to check that the cluster is consistent even in case
>> > > > >>>>>> operations are in progress (just retrying the modified ones).
>> > > > >>>>>>
>> > > > >>>>>> Risks and assumptions:
>> > > > >>>>>> Using this strategy we'll check the cluster's consistency ... eventually,
>> > > > >>>>>> and the check will take more time even on an idle cluster.
>> > > > >>>>>> In case operationsPerTimeToGeneratePartitionHashes > partitionsCount we'll
>> > > > >>>>>> definitely gain no progress.
>> > > > >>>>>> But in case the load is not high, we'll be able to check the whole cluster.
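Ivan's P.S. above hinges on the aggregation function (+) being commutative, so the per-partition hash can be maintained incrementally regardless of the order in which the primary and backups apply the same set of updates. A minimal standalone sketch, reusing the per-row hashing style of the "idle verify" loop quoted above; the class itself is hypothetical.

    import java.util.Arrays;
    import java.util.concurrent.atomic.AtomicLong;

    /** Sketch: keep a partition content hash up to date on every insert/update/remove. */
    class IncrementalPartitionHash {
        private final AtomicLong hash = new AtomicLong();

        private static long rowHash(byte[] key, byte[] valBytes) {
            return (long)Arrays.hashCode(key) + Arrays.hashCode(valBytes);
        }

        /** Insert: add the new row's contribution. */
        void onInsert(byte[] key, byte[] val) {
            hash.addAndGet(rowHash(key, val));
        }

        /** Update: remove the old contribution, add the new one. */
        void onUpdate(byte[] key, byte[] oldVal, byte[] newVal) {
            hash.addAndGet(rowHash(key, newVal) - rowHash(key, oldVal));
        }

        /** Remove: subtract the removed row's contribution. */
        void onRemove(byte[] key, byte[] oldVal) {
            hash.addAndGet(-rowHash(key, oldVal));
        }

        /**
         * Because + is commutative, copies that applied the same set of updates
         * converge to the same value regardless of the order of application.
         */
        long value() {
            return hash.get();
        }
    }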
>> > > > >>>>>>
>> > > > >>>>>> Another hope is that we'll be able to pause/continue the scan; for example,
>> > > > >>>>>> we'll check 1/3 of the partitions today, 1/3 tomorrow, and in three days
>> > > > >>>>>> we'll check the whole cluster.
>> > > > >>>>>>
>> > > > >>>>>> Have I missed something?
>> > > > >>>>>>
>> > > > >>>>>> 2) Since "Idle verify" uses regular pagemem, I assume it replaces hot data
>> > > > >>>>>> with persisted data.
>> > > > >>>>>> So, we have to warm up the cluster after each check.
>> > > > >>>>>> Is there any chance to check without cooling down the cluster?
>> > > > >>>>>>
>> > > > >>>>>> [1]
>> > > > >>>>>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
>> >
>> > --
>> >
>> > Best regards,
>> > Alexei Scherbakov
>> >
>>
>
> --
>
> Best regards,
> Alexei Scherbakov
>

--

Best regards,
Alexei Scherbakov
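A sketch of the retry loop described in the original message, combined with the "intermediate results report" idea for pause/continue. Everything here is hypothetical: the Pair snapshot, the collectPairs callback standing in for a cluster-wide request, and the node-id strings; and, as noted in the thread, equal plain counters are only a heuristic until the extended counters from IGNITE-10078 are in place.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Function;

    /** Sketch: compare hashes only for partitions whose owners report equal counters; retry the rest. */
    class OnlineVerifyDriver {
        /** Hypothetical per-owner snapshot: update counter + partition content hash. */
        static final class Pair {
            final long updateCntr;
            final long hash;
            Pair(long updateCntr, long hash) { this.updateCntr = updateCntr; this.hash = hash; }
        }

        /**
         * @param remaining    partitions still to verify (e.g. taken from a previous run's report).
         * @param collectPairs asks every owner of the given partition for its (counter, hash) pair.
         * @return partitions verified in this run; what is left in remaining can be resumed later.
         */
        Set<Integer> verify(Set<Integer> remaining, Function<Integer, Map<String, Pair>> collectPairs, int maxRounds) {
            Set<Integer> verified = new HashSet<>();

            for (int round = 0; round < maxRounds && !remaining.isEmpty(); round++) {
                for (Integer part : new HashSet<>(remaining)) {
                    Map<String, Pair> pairs = collectPairs.apply(part);   // node id -> (counter, hash)

                    long distinctCntrs = pairs.values().stream().map(p -> p.updateCntr).distinct().count();
                    if (distinctCntrs != 1)
                        continue;   // partition is being updated right now - re-request it next round

                    long distinctHashes = pairs.values().stream().map(p -> p.hash).distinct().count();
                    if (distinctHashes != 1)
                        System.err.println("Partition " + part + " is inconsistent across owners");

                    remaining.remove(part);
                    verified.add(part);
                }
            }

            return verified;   // acts as the "intermediate results report" for pause/continue
        }
    }

If the update rate stays above the rate at which pairs can be collected (the operationsPerTimeToGeneratePartitionHashes > partitionsCount case above), the remaining set never shrinks and the loop simply exhausts maxRounds, which matches the "no progress" risk noted in the original message.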