Anton,

It's ready for review, look for Patch Available status.
Yes, atomic caches are not fixed by this contribution. See [1]

[1] https://issues.apache.org/jira/browse/IGNITE-11797

On Tue, May 7, 2019 at 17:30, Anton Vinogradov <a...@apache.org> wrote:

> Alexei,
>
> Got it.
> Could you please let me know once the PR is ready for review?
> Currently I have some questions, but possibly they are caused by the
> non-final PR (e.g. why does the atomic counter still ignore misses?).
>
> On Tue, May 7, 2019 at 4:43 PM Alexei Scherbakov <
> alexey.scherbak...@gmail.com> wrote:
>
> > Anton,
> >
> > 1) Extended counters will indeed answer the question of whether a
> > partition can be safely restored to a synchronized state on all owners.
> > The only condition: one of the owners has no missed updates.
> > If not, the partition must be moved to the LOST state, see [1],
> > TxPartitionCounterStateOnePrimaryTwoBackupsFailAll*Test,
> > IgniteSystemProperties#IGNITE_FAIL_NODE_ON_UNRECOVERABLE_PARTITION_INCONSISTENCY
> > This is a known issue and can happen if all partition owners were
> > unavailable at some point.
> > In such a case we could try to recover consistency using some complex
> > recovery protocol, as you described. Related ticket: [2]
> >
> > 2) A bitset implementation is being considered as an option in GG
> > Community Edition. No specific implementation dates at the moment.
> >
> > 3) As for "online" partition verification, I think the best option right
> > now is to do verification partition by partition, using a read-only mode
> > per group partition under load.
> > While verification is in progress, all write ops are waiting, not
> > rejected.
> > This is the only 100% reliable way to compare partitions - by touching
> > actual data; all other ways, like a pre-computed hash, are error-prone.
> > There is already a ticket [3] for simplifying grid consistency
> > verification which could be used as a basis for such functionality.
> > As for avoiding cache pollution, we could try reading pages sequentially
> > from disk without lifting them into pagemem and computing some kind of
> > commutative hash. It's safe under the partition write lock.
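The "commutative hash" mentioned in point 3 can be illustrated with a minimal sketch. This is not Ignite code; the class and method names are hypothetical. The point is that combining per-entry hashes with `+` makes the result independent of the order in which entries (or raw pages) are read from disk:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Minimal sketch (not Ignite code) of a commutative partition hash:
 * per-entry hashes are combined with '+', so the result does not depend
 * on the order in which entries are visited.
 */
public class CommutativeHash {
    /** Sums per-entry hashes; '+' is commutative, so any read order works. */
    public static long hash(List<byte[]> entries) {
        long acc = 0;
        for (byte[] e : entries)
            acc += Arrays.hashCode(e); // content-based hash of each entry
        return acc;
    }

    public static void main(String[] args) {
        List<byte[]> sequentialRead = List.of(new byte[] {1}, new byte[] {2}, new byte[] {3});
        List<byte[]> reorderedRead  = List.of(new byte[] {3}, new byte[] {1}, new byte[] {2});
        // The same entries visited in a different order yield the same hash.
        System.out.println(hash(sequentialRead) == hash(reorderedRead)); // true
    }
}
```

The same property is what makes the incrementally maintained per-partition hash discussed later in this thread feasible: an update can adjust the hash without replaying the whole partition in order.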
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-11611
> > [2] https://issues.apache.org/jira/browse/IGNITE-6324
> > [3] https://issues.apache.org/jira/browse/IGNITE-11256
> >
> > On Mon, May 6, 2019 at 16:12, Anton Vinogradov <a...@apache.org> wrote:
> >
> > > Ivan,
> > >
> > > 1) I've checked the PR [1], and it looks like it does not solve the
> > > issue either.
> > > AFAICS, the main goal here (in the PR) is to produce
> > > PartitionUpdateCounter#sequential, which can be false for all backups;
> > > which backup should win in that case?
> > >
> > > Is there any IEP or some other design page for this fix?
> > >
> > > It looks like extended counters should be able to recover the whole
> > > cluster even in case all copies of the same partition are broken.
> > > So, it seems, the counter should provide detailed info:
> > > - the biggest applied updateCounter
> > > - a list of all missed counters before the biggest applied one
> > > - an optional hash
> > >
> > > In that case, we'll be able to perform some exchange between broken
> > > copies.
> > > For example, we'll find that copy1 missed key1 and copy2 missed key2.
> > > It's pretty simple to fix both copies in that case.
> > > In case all misses can be solved this way, we'll continue cluster
> > > activation as if it was not broken before.
> > >
> > > 2) It seems I see a simpler solution to handle misses (than in the PR).
> > > Once you have newUpdateCounter > curUpdateCounter + 1, you should add a
> > > byte (or int or long - the smallest possible) value to a special
> > > structure.
> > > This value will represent the delta between newUpdateCounter and
> > > curUpdateCounter in a bitmask way.
> > > In case you handle an updateCounter less than curUpdateCounter, you
> > > should update the value in the structure responsible for this delta.
> > > For example, when you have the delta "2 to 6", you will have 00000000
> > > initially and 00011111 finally.
> > > Each delta update should finish with a check whether the delta is
> > > completed (value == 31 in this case). Once it is finished, it should be
> > > removed from the structure.
> > > Deltas can and should be reused to solve the GC issue.
> > >
> > > What do you think about the proposed solution?
> > >
> > > 3) Hash computation can be an additional extension for extended
> > > counters, just one more dimension to be extremely sure everything is ok.
> > > Any objections?
> > >
> > > [1] https://github.com/apache/ignite/pull/5765
> > >
> > > On Mon, May 6, 2019 at 12:48 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > >
> > > > Anton,
> > > >
> > > > Automatic quorum-based partition drop may work as a partial
> > > > workaround for IGNITE-10078, but the discussed approach surely
> > > > doesn't replace the IGNITE-10078 activity. We still don't know what
> > > > to do when a quorum can't be reached (2 partitions have hash X, 2
> > > > have hash Y), and keeping extended update counters is the only way to
> > > > resolve such a case.
> > > > On the other hand, precalculated partition hash validation on PME can
> > > > be a good addition to the IGNITE-10078 logic: we'll be able to detect
> > > > situations when extended update counters are equal but, for some
> > > > reason (a bug or whatsoever), partition contents are different.
> > > >
> > > > Best Regards,
> > > > Ivan Rakov
> > > >
> > > > On 06.05.2019 12:27, Anton Vinogradov wrote:
> > > > > Ivan, just to make sure ...
> > > > > The discussed case will fully solve the issue [1] in case we also
> > > > > add some strategy to reject partitions with missed updates
> > > > > (updateCnt==Ok, Hash!=Ok).
> > > > > For example, we may use the Quorum strategy, where the majority
> > > > > wins.
> > > > > Sounds correct?
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-10078
> > > > >
> > > > > On Tue, Apr 30, 2019 at 3:14 PM Anton Vinogradov <a...@apache.org>
> > > > > wrote:
> > > > >
> > > > >> Ivan,
> > > > >>
> > > > >> Thanks for the detailed explanation.
> > > > >> I'll try to implement the PoC to check the idea.
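The bitmask-delta scheme from item 2 above can be sketched as follows. This is a hypothetical illustration, not the Ignite PR's implementation; class and method names are invented, and a single `long` mask limits each gap to 64 missed updates, which is fine for a sketch:

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch of tracking missed update counters as per-gap bitmasks.
 * Hypothetical names; a long mask caps one gap at 64 missed updates.
 */
public class GapTracker {
    private long curCounter;

    // Maps the first missed counter of a gap to {bitmask, gapSize}.
    private final TreeMap<Long, long[]> gaps = new TreeMap<>();

    public GapTracker(long initial) {
        curCounter = initial;
    }

    /** Registers an applied update with the given counter. */
    public void apply(long counter) {
        if (counter == curCounter + 1)
            curCounter = counter; // in-order update, nothing to track
        else if (counter > curCounter + 1) {
            // Counters (curCounter+1 .. counter-1) are missed: remember a zeroed bitmask.
            gaps.put(curCounter + 1, new long[] {0L, counter - curCounter - 1});
            curCounter = counter;
        }
        else {
            // Late update filling a previously recorded gap (assumed to exist).
            Map.Entry<Long, long[]> gap = gaps.floorEntry(counter);
            long[] v = gap.getValue();
            v[0] |= 1L << (counter - gap.getKey()); // set the bit for this counter
            if (v[0] == (1L << v[1]) - 1)           // all bits set, e.g. 31 for a 5-wide gap
                gaps.remove(gap.getKey());          // delta completed: drop it
        }
    }

    /** True when no missed updates remain. */
    public boolean sequential() {
        return gaps.isEmpty();
    }

    public static void main(String[] args) {
        GapTracker t = new GapTracker(1);
        t.apply(7); // counters 2..6 are now missed: mask 00000000, size 5
        t.apply(2); t.apply(3); t.apply(4); t.apply(5);
        System.out.println(t.sequential()); // false: counter 6 is still missing
        t.apply(6); // mask reaches 11111 (== 31): the gap is removed
        System.out.println(t.sequential()); // true
    }
}
```

The `main` method walks exactly the "2 to 6" example from the message: the gap's mask starts at 00000000 and the entry is removed once it reaches 00011111 (31).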
> > > > >>
> > > > >> On Mon, Apr 29, 2019 at 8:22 PM Ivan Rakov <ivan.glu...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>>> But how to keep this hash?
> > > > >>> I think we can just adopt the way partition update counters are
> > > > >>> stored. Update counters are:
> > > > >>> 1) Kept and updated in heap, see
> > > > >>> IgniteCacheOffheapManagerImpl.CacheDataStoreImpl#pCntr (accessed
> > > > >>> during regular cache operations, no page replacement latency
> > > > >>> issues)
> > > > >>> 2) Synchronized with page memory (and with disk) on every
> > > > >>> checkpoint, see GridCacheOffheapManager#saveStoreMetadata
> > > > >>> 3) Stored in the partition meta page, see
> > > > >>> PagePartitionMetaIO#setUpdateCounter
> > > > >>> 4) On node restart, we init the onheap counter with the value from
> > > > >>> disk (for the moment of the last checkpoint) and update it to the
> > > > >>> latest value during WAL logical records replay
> > > > >>>
> > > > >>>> 2) PME is a rare operation on a production cluster, but, it
> > > > >>>> seems, we have to check consistency in a regular way.
> > > > >>>> Since we have to finish all operations before the check, should
> > > > >>>> we have a fake PME for a maintenance check in this case?
> > > > >>> From my experience, PME happens on prod clusters from time to time
> > > > >>> (several times per week), which can be enough. In case it's needed
> > > > >>> to check consistency more often than regular PMEs occur, we can
> > > > >>> implement a command that will trigger a fake PME for consistency
> > > > >>> checking.
> > > > >>>
> > > > >>> Best Regards,
> > > > >>> Ivan Rakov
> > > > >>>
> > > > >>> On 29.04.2019 18:53, Anton Vinogradov wrote:
> > > > >>>> Ivan, thanks for the analysis!
> > > > >>>>
> > > > >>>>>> With a pre-calculated partition hash value, we can
> > > > >>>>>> automatically detect inconsistent partitions on every PME.
> > > > >>>> Great idea; it seems this covers all broken sync cases.
> > > > >>>>
> > > > >>>> It will check alive nodes in case the primary failed immediately,
> > > > >>>> and will check a rejoining node once it has finished rebalance
> > > > >>>> (PME on becoming an owner).
> > > > >>>> A recovered cluster will be checked on the activation PME (or
> > > > >>>> even before that?).
> > > > >>>> Also, a warmed cluster will still be warm after the check.
> > > > >>>>
> > > > >>>> Have I missed any cases leading to broken sync, except bugs?
> > > > >>>>
> > > > >>>> 1) But how to keep this hash?
> > > > >>>> - It should be automatically persisted on each checkpoint (it
> > > > >>>> should not require recalculation on restore; snapshots should be
> > > > >>>> covered too) (and covered by WAL?).
> > > > >>>> - It should always be available in RAM for every partition (even
> > > > >>>> for cold partitions never updated/read on this node) to be
> > > > >>>> immediately used once all operations are done on PME.
> > > > >>>>
> > > > >>>> Can we have special pages to keep such hashes and never allow
> > > > >>>> their eviction?
> > > > >>>>
> > > > >>>> 2) PME is a rare operation on a production cluster, but, it
> > > > >>>> seems, we have to check consistency in a regular way.
> > > > >>>> Since we have to finish all operations before the check, should
> > > > >>>> we have a fake PME for a maintenance check in this case?
> > > > >>>>
> > > > >>>> On Mon, Apr 29, 2019 at 4:59 PM Ivan Rakov <ivan.glu...@gmail.com>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>> Hi Anton,
> > > > >>>>
> > > > >>>> Thanks for sharing your ideas.
> > > > >>>> I think your approach should work in general. I'll just share my
> > > > >>>> concerns about possible issues that may come up.
> > > > >>>>
> > > > >>>> 1) Equality of update counters doesn't imply equality of
> > > > >>>> partition contents under load.
> > > > >>>> For every update, the primary node generates an update counter,
> > > > >>>> and then the update is delivered to the backup node and gets
> > > > >>>> applied with the corresponding update counter. For example, there
> > > > >>>> are two transactions (A and B) that update partition X by the
> > > > >>>> following scenario:
> > > > >>>> - A updates key1 in partition X on the primary node and
> > > > >>>> increments the counter to 10
> > > > >>>> - B updates key2 in partition X on the primary node and
> > > > >>>> increments the counter to 11
> > > > >>>> - While A is still updating other keys, B is finally committed
> > > > >>>> - The update of key2 arrives at the backup node and sets the
> > > > >>>> update counter to 11
> > > > >>>> An observer will see equal update counters (11), but the update
> > > > >>>> of key1 is still missing in the backup partition.
> > > > >>>> This is a fundamental problem which is being solved here:
> > > > >>>> https://issues.apache.org/jira/browse/IGNITE-10078
> > > > >>>> "Online verify" should operate with the new complex update
> > > > >>>> counters which take such "update holes" into account. Otherwise,
> > > > >>>> online verify may produce false-positive inconsistency reports.
> > > > >>>>
> > > > >>>> 2) Acquisition and comparison of update counters is fast, but
> > > > >>>> partition hash calculation is long. We should check that the
> > > > >>>> update counter remains unchanged after every K keys handled.
> > > > >>>>
> > > > >>>> 3)
> > > > >>>>
> > > > >>>>> Another hope is that we'll be able to pause/continue the scan;
> > > > >>>>> for example, we'll check 1/3 of partitions today, 1/3 tomorrow,
> > > > >>>>> and in three days we'll have checked the whole cluster.
> > > > >>>> Totally makes sense.
> > > > >>>> We may find ourselves in a situation where some "hot" partitions
> > > > >>>> are still unprocessed, and every next attempt to calculate a
> > > > >>>> partition hash fails due to another concurrent update. We should
> > > > >>>> be able to track the progress of validation (% of calculation
> > > > >>>> time wasted due to concurrent operations may be a good metric;
> > > > >>>> 100% is the worst case) and provide an option to stop/pause the
> > > > >>>> activity.
> > > > >>>> I think pause should return an "intermediate results report" with
> > > > >>>> information about which partitions have been successfully
> > > > >>>> checked. With such a report, we can resume the activity later:
> > > > >>>> partitions from the report will just be skipped.
> > > > >>>>
> > > > >>>> 4)
> > > > >>>>
> > > > >>>>> Since "Idle verify" uses the regular pagemem, I assume it
> > > > >>>>> replaces hot data with persisted data.
> > > > >>>>> So, we have to warm up the cluster after each check.
> > > > >>>>> Are there any chances to check without cooling the cluster?
> > > > >>>> I don't see an easy way to achieve it with our page memory
> > > > >>>> architecture. We definitely can't just read pages from disk
> > > > >>>> directly: we need to synchronize page access with concurrent
> > > > >>>> update operations and checkpoints.
> > > > >>>> From my point of view, the correct way to solve this issue is
> > > > >>>> improving our page replacement [1] mechanics by making it truly
> > > > >>>> scan-resistant.
> > > > >>>>
> > > > >>>> P.S. There's another possible way of achieving online verify:
> > > > >>>> instead of on-demand hash calculation, we can always keep an
> > > > >>>> up-to-date hash value for every partition.
> > > > >>>> We'll need to update the hash on every insert/update/remove
> > > > >>>> operation, but there will be no reordering issues, since the
> > > > >>>> function that we use for aggregating hash results (+) is
> > > > >>>> commutative. With a pre-calculated partition hash value, we can
> > > > >>>> automatically detect inconsistent partitions on every PME. What
> > > > >>>> do you think?
> > > > >>>>
> > > > >>>> [1] -
> > > > >>>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Durable+Memory+-+under+the+hood#IgniteDurableMemory-underthehood-Pagereplacement(rotationwithdisk)
> > > > >>>>
> > > > >>>> Best Regards,
> > > > >>>> Ivan Rakov
> > > > >>>>
> > > > >>>> On 29.04.2019 12:20, Anton Vinogradov wrote:
> > > > >>>>> Igniters, and especially Ivan Rakov,
> > > > >>>>>
> > > > >>>>> "Idle verify" [1] is a really cool tool to make sure that the
> > > > >>>>> cluster is consistent.
> > > > >>>>>
> > > > >>>>> 1) But it requires operations to be paused during the cluster
> > > > >>>>> check. On some clusters, this check requires hours (3-4 hours in
> > > > >>>>> cases I saw).
> > > > >>>>> I've checked the code of "idle verify" and it seems possible to
> > > > >>>>> make it "online" with some assumptions.
> > > > >>>>>
> > > > >>>>> Idea:
> > > > >>>>> Currently "Idle verify" checks that partition hashes, generated
> > > > >>>>> this way:
> > > > >>>>> while (it.hasNextX()) {
> > > > >>>>>     CacheDataRow row = it.nextX();
> > > > >>>>>     partHash += row.key().hashCode();
> > > > >>>>>     partHash += Arrays.hashCode(row.value().valueBytes(grpCtx.cacheObjectContext()));
> > > > >>>>> }
> > > > >>>>> , are the same.
> > > > >>>>>
> > > > >>>>> What if we generate the same updateCounter-partitionHash pairs,
> > > > >>>>> but compare hashes only in case the counters are the same?
> > > > >>>>> So, for example, we will ask the cluster to generate pairs for
> > > > >>>>> 64 partitions, then will find that 55 have the same counters
> > > > >>>>> (were not updated during the check) and check them.
> > > > >>>>> The rest (64-55 = 9) partitions will be re-requested and
> > > > >>>>> rechecked together with the additional 55.
> > > > >>>>> This way we'll be able to check that the cluster is consistent
> > > > >>>>> even in case operations are in progress (just retrying the
> > > > >>>>> modified ones).
> > > > >>>>>
> > > > >>>>> Risks and assumptions:
> > > > >>>>> Using this strategy we'll check the cluster's consistency ...
> > > > >>>>> eventually, and the check will take more time even on an idle
> > > > >>>>> cluster.
> > > > >>>>> In case operationsPerTimeToGeneratePartitionHashes >
> > > > >>>>> partitionsCount, we'll definitely gain no progress.
> > > > >>>>> But in case the load is not high, we'll be able to check the
> > > > >>>>> whole cluster.
> > > > >>>>>
> > > > >>>>> Another hope is that we'll be able to pause/continue the scan;
> > > > >>>>> for example, we'll check 1/3 of partitions today, 1/3 tomorrow,
> > > > >>>>> and in three days we'll have checked the whole cluster.
> > > > >>>>>
> > > > >>>>> Have I missed something?
> > > > >>>>>
> > > > >>>>> 2) Since "Idle verify" uses the regular pagemem, I assume it
> > > > >>>>> replaces hot data with persisted data.
> > > > >>>>> So, we have to warm up the cluster after each check.
> > > > >>>>> Are there any chances to check without cooling the cluster?
> > > > >>>>>
> > > > >>>>> [1]
> > > > >>>>> https://apacheignite-tools.readme.io/docs/control-script#section-verification-of-partition-checksums
> >
> > --
> > Best regards,
> > Alexei Scherbakov

-- 
Best regards,
Alexei Scherbakov
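The retry scheme from the last quoted message (compare partition hashes only when the update counters of all copies match, and re-request the rest) can be sketched as a single verification round. This is a hypothetical model, not Ignite's actual verify code; the `State` record and `verifyRound` method are invented names, and it assumes both maps cover the same partition set:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of one round of "online verify": hashes of two partition copies are
 * compared only when their update counters match; otherwise the partition is
 * queued for another round. Hypothetical names and types.
 */
public class OnlineVerify {
    /** (updateCounter, partitionHash) pair reported by one owner of a partition. */
    record State(long counter, long hash) {}

    /**
     * Returns the partitions that still need re-checking; throws if a
     * partition has equal counters on both copies but different hashes.
     */
    static Set<Integer> verifyRound(Map<Integer, State> primary, Map<Integer, State> backup) {
        Set<Integer> retry = new HashSet<>();
        for (Map.Entry<Integer, State> e : primary.entrySet()) {
            State p = e.getValue(), b = backup.get(e.getKey());
            if (p.counter() != b.counter())
                retry.add(e.getKey()); // partition was updated mid-check: retry later
            else if (p.hash() != b.hash())
                throw new IllegalStateException("Partition " + e.getKey() + " is inconsistent");
            // else: counters and hashes equal -> partition verified this round
        }
        return retry;
    }

    public static void main(String[] args) {
        Map<Integer, State> primary = Map.of(0, new State(10, 42), 1, new State(7, 5));
        Map<Integer, State> backup  = Map.of(0, new State(10, 42), 1, new State(6, 4));
        // Partition 0 is verified; partition 1 was concurrently updated and must be retried.
        System.out.println(verifyRound(primary, backup)); // [1]
    }
}
```

A driver would call `verifyRound` repeatedly with fresh pairs for the returned set until it is empty, which matches the "64 partitions, 55 stable, 9 re-requested" example above; as the thread notes, under sustained load on the same partitions this loop may never terminate, hence the pause/resume discussion. Note that, per Ivan's point 1, plain counters are insufficient under load; a real implementation would compare the extended (gap-aware) counter state instead.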