Igniters,

I would like this issue to be part of release 2.7. There is not much time left, considering that I will still have to address comments after the review.
Will anyone help with the review?

PR: https://github.com/apache/ignite/pull/4520/files
Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
JIRA: https://issues.apache.org/jira/browse/IGNITE-7196

On Thu, 13 Sep 2018 at 11:27 Maxim Muzafarov <maxmu...@gmail.com> wrote:

> Igniters,
>
> I need your help with the review of this patch. In general, it will help
> to reduce PME duration. I've moved binary memory recovery to the moment
> of node startup, as Pavel and I discussed earlier in this topic.
>
> The changes are relatively small (+299 −95) and they are ready. I've left
> a comment in JIRA with high-level implementation details. I hope this
> will help with the review [1].
>
> Please, take a look.
> PR: https://github.com/apache/ignite/pull/4520/files
> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
> TC Run All: [2]
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7196?focusedCommentId=16613175&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16613175
> [2] https://ci.ignite.apache.org/viewLog.html?buildId=1835822&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll
>
> On Tue, 28 Aug 2018 at 17:59 Pavel Kovalenko <jokse...@gmail.com> wrote:
>
>> Hello Maxim,
>>
>> I think you're going in the right direction, but why do you perform
>> binary recovery only in the case when a node is in the baseline?
>> I think a node can also perform binary recovery before the first
>> exchange is started in the case of:
>> 1) First activation
>> 2) Baseline change
>>
>> In all of these cases you need only the locally available
>> CacheDescriptors to restore the binary state from WAL.
>> I would also like to see a test where Ignite is stopped in the middle
>> of a checkpoint and binary recovery succeeds before PME in that case
>> as well.
>>
>> On Wed, 22 Aug 2018 at 15:31, Pavel Kovalenko <jokse...@gmail.com> wrote:
>>
>> > Hello Maxim,
>> >
>> > Thank you for your work. I will review your changes within the next
>> > several days.
>> >
>> > 2018-08-22 11:12 GMT+03:00 Maxim Muzafarov <maxmu...@gmail.com>:
>> >
>> >> Pavel,
>> >>
>> >> Thank you for the detailed introduction to the root of the problem
>> >> in ticket IGNITE-7196.
>> >> As you mentioned before, we can divide this ticket into two
>> >> sub-tasks. So I will file a new issue for `Logical consistency
>> >> recovery`. I think that is a more convenient way to review and merge
>> >> each of these high-level steps (PRs).
>> >>
>> >> For now, let's focus on `Physical (binary) consistency recovery` as
>> >> the most critical part of this issue, and synchronize our vision,
>> >> and that of other community members, on the implementation details
>> >> of Step 1. From this point on, everything I'm saying is about binary
>> >> recovery only.
>> >>
>> >> I think the PR is almost ready (I need to check TC carefully).
>> >> Can you look at my changes? Am I going the right way?
>> >>
>> >> PR: https://github.com/apache/ignite/pull/4520
>> >> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
>> >> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>> >>
>> >> These are the milestones I've tried to address:
>> >>
>> >> 1. On first-time cluster activation no changes are required. We
>> >> should keep this behavior.
>> >>
>> >> 2. `startMemoryPolicies` earlier if needed.
>> >> On the master branch, data regions currently start when the
>> >> onActivate method is called. If we are trying to restore the binary
>> >> state when a node reconnects to the cluster, we should start the
>> >> regions earlier.
>> >>
>> >> 3. `suspendLogging` method added to switch off handlers.
>> >> Keep fast PDS cleanup when the joining node is not in the BLT (or
>> >> belongs to another BLT).
>> >> Restoring memory and initializing the metastorage require
>> >> `WALWriter` and `FileWriteHandle` to be started. They can lock
>> >> directories/files and block cleanups.
>> >>
>> >> 4. `isBaselineNode` check before resuming logging.
>> >> Same as point 3. The local node receives discovery data and
>> >> determines #isLocalNodeNotInBaseline. If the node is not in the
>> >> current baseline, it cleans up the local PDS, and we should reset
>> >> the state of the metastorage instance previously initialized at
>> >> startup.
>> >>
>> >> 5. `cacheProcessorStarted` callback added to restore memory at
>> >> startup.
>> >> We can restore memory only after we have info about all
>> >> `CacheGroupDescriptor`s. To achieve this, we should do the restore
>> >> at the moment `GridCacheProcessor` starts.
>> >>
>> >> 6. Move `fsync` for `MemoryRecoveryRecord` to node startup.
>> >> We are currently fsync'ing it inside a running PME. I suppose it is
>> >> used for recovery actions and/or controlling cluster state outside
>> >> the Ignite project (the methods `nodeStart` and `nodeStartedPointers`
>> >> are not used in the Ignite project code). I've moved them to node
>> >> startup, but maybe we should dump the timestamp only at the time of
>> >> node activation. Not sure.
>> >>
>> >> 7. `testBinaryRecoverBeforeExchange` test added to check restore
>> >> before PME.
>> >> Not sure that we should repeat the WAL reading logic as it was in
>> >> the `testApplyDeltaRecords` test for checking memory restore. I
>> >> suppose the minimally necessary condition is to check the
>> >> `lastRestored` and `checkpointHistory` states of the database
>> >> manager (they should happen before the node joins the cluster).
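[Editor's note] The milestones above describe a startup ordering: start data regions early, suspend WAL logging, clean up the PDS if the node is outside the baseline, otherwise restore binary memory once cache descriptors are known, and only then resume logging and join. A toy sketch of that ordering follows; all names here (`NodeStartupSketch`, `cleanupPds`, and so on) are illustrative, mirroring the mail's terminology rather than the real Ignite API:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the startup milestone ordering described in the mail above. */
public class NodeStartupSketch {
    final List<String> steps = new ArrayList<>();

    final boolean inBaseline;

    NodeStartupSketch(boolean inBaseline) {
        this.inBaseline = inBaseline;
    }

    List<String> start() {
        steps.add("startDataRegions");        // milestone 2: start regions before activation
        steps.add("suspendLogging");          // milestone 3: switch off WAL handlers
        steps.add("initMetastorage");

        if (!inBaseline) {
            steps.add("cleanupPds");          // node outside the BLT: wipe local PDS quickly
            steps.add("resetMetastorage");    // milestone 4: reset state initialized at startup
        }
        else
            steps.add("restoreBinaryMemory"); // milestone 5: after cache descriptors are known

        steps.add("resumeLogging");           // only after the baseline check (milestone 4)
        steps.add("joinRingAndStartPme");

        return steps;
    }
}
```

The point of the ordering is that logging resumes only after the baseline decision, so a cleanup never races with active WAL handles.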
>> >>
>> >> I will create a separate ticket and refactor these minor issues
>> >> after we finish with 7196:
>> >> - simplify registerMBeans methods for
>> >>   `Ignite\GridCacheDatabaseSharedManager`
>> >> - add @Override annotations where missing
>> >> - fix abbreviations regarding IdeaAbbrPlugin
>> >> - the createDataRegionConfiguration method can be removed and
>> >>   simplified
>> >> - rename startMemoryPolicies to startDataRegion
>> >> - fix output message formatting (code style)
>> >> - fix javadoc formatting
>> >> - make inner classes private
>> >> - add missing logging for cleanupCheckpointDirectory and
>> >>   cleanupWalDirectories
>> >>
>> >> On Fri, 3 Aug 2018 at 18:20 Pavel Kovalenko <jokse...@gmail.com> wrote:
>> >>
>> >> > Maxim,
>> >> >
>> >> > In addition to my answer, I propose to divide the whole task into
>> >> > 2 steps:
>> >> > 1) Physical consistency recovery before PME.
>> >> > 2) Logical consistency recovery before PME.
>> >> >
>> >> > The first step can be done before the discovery manager starts, as
>> >> > it needs no information about caches; it operates only with binary
>> >> > page data. In this step, we can do points a) and b) from answer 3)
>> >> > and fsync the state of the page memory.
>> >> > The second step is much harder to implement, as it requires major
>> >> > changes in Discovery SPI. The discovery node-join event is
>> >> > triggered on the coordinator after a node successfully joins the
>> >> > cluster. To avoid this, we should introduce a new node state and
>> >> > fire the discovery event only after logical recovery has finished.
>> >> > I doubt that logical recovery consumes much of the exchange time,
>> >> > but physical recovery does, as it performs fsync on disk after
>> >> > restore.
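[Editor's note] The two-step split proposed above amounts to classifying WAL records by what they require: physical (page-level) records need no cache metadata, so they can be replayed before discovery starts, while logical records need cache descriptors and must wait. A toy classifier follows; the `Kind` names are invented for illustration and do not correspond one-to-one to Ignite's actual WAL record types:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Toy classification of WAL records for the two recovery steps above. */
public class RecoverySteps {
    /** Hypothetical record kinds, loosely inspired by the discussion. */
    enum Kind { PAGE_DELTA, PAGE_SNAPSHOT, DATA_RECORD, TX_RECORD }

    /** Step 1: physical (binary) recovery uses page-level records only;
     *  it needs no cache descriptors, so it can run before discovery starts. */
    static boolean isPhysical(Kind k) {
        return k == Kind.PAGE_DELTA || k == Kind.PAGE_SNAPSHOT;
    }

    /** Step 2: logical recovery handles the rest; it requires cache
     *  descriptors, so it must wait until cache configuration is known
     *  (but, per the proposal, still ideally finishes before PME). */
    static List<Kind> logicalRecords(List<Kind> wal) {
        return wal.stream().filter(k -> !isPhysical(k)).collect(Collectors.toList());
    }
}
```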
>> >> >
>> >> > 2018-08-03 17:46 GMT+03:00 Pavel Kovalenko <jokse...@gmail.com>:
>> >> >
>> >> > > Hello Maxim,
>> >> > >
>> >> > > 1) Yes, the Discovery Manager starts after GridCacheProcessor,
>> >> > > which starts GridCacheDatabaseSharedManager, which invokes
>> >> > > readMetastorage on start.
>> >> > >
>> >> > > 2) Before we complete the local join future, we create an
>> >> > > Exchange future for the local node join and add it to
>> >> > > ExchangeManager. So, when the local join future completes, the
>> >> > > first PME on that node should already be running, or at least
>> >> > > there will be a first Exchange future in the Exchange worker
>> >> > > queue.
>> >> > >
>> >> > > 3) I don't exactly understand what you mean by "update obsolete
>> >> > > partition counter", but the process is the following:
>> >> > >     a) We restore the page memory state (which contains
>> >> > > partition update counters) from the last checkpoint.
>> >> > >     b) We repair the page memory state using physical delta
>> >> > > records from the WAL.
>> >> > >     c) We apply logical records since the last checkpoint
>> >> > > finish marker. During this iteration, if we meet a DataRecord we
>> >> > > increment the update counter on the appropriate partition.
>> >> > > NOTE: This phase currently happens during exchange.
>> >> > >     After that, partition exchange starts and information about
>> >> > > actual update counters is collected from the FullMessage. If
>> >> > > partition counters are outdated, rebalance will start after PME
>> >> > > ends.
>> >> > >
>> >> > > 4) I think yes, because everything you need is the cache
>> >> > > descriptors of the caches currently started on the grid. Some
>> >> > > information can be retrieved from the static configuration.
>> >> > > Information about dynamically started caches can be retrieved
>> >> > > from the Discovery Data Bag received when a node joins the ring.
>> >> > > See org.apache.ignite.internal.processors.cache.ClusterCachesInfo#onGridDataReceived
>> >> > >
>> >> > > 5) I think the main problem is that the metastorage and page
>> >> > > memory share one WAL. But if this phase happens before PME is
>> >> > > started on all cluster nodes, I think it's acceptable that they
>> >> > > will use 2 WAL iterations.
>> >> > >
>> >> > > 2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <nizhi...@apache.org>:
>> >> > >
>> >> > >> Hello, Maxim.
>> >> > >>
>> >> > >> > 1) Is it correct that readMetastore() happens after node
>> >> > >> > starts but before including node into the ring?
>> >> > >>
>> >> > >> I think yes.
>> >> > >> Some kind of meta-information can be required on node join.
>> >> > >>
>> >> > >> > 5) Does in our final solution for new joined node
>> >> > >> > readMetastore and restoreMemory should be performed in one
>> >> > >> > step?
>> >> > >>
>> >> > >> I think no.
>> >> > >>
>> >> > >> Meta-information can be required to perform the memory restore.
>> >> > >> So we have to restore the meta-information in a first step and
>> >> > >> restore the whole memory as a second step.
>> >> > >>
>> >> > >> On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
>> >> > >> > Hi Igniters,
>> >> > >> >
>> >> > >> > I'm working on bug [1] and have some questions about the
>> >> > >> > final implementation. I've probably already found answers to
>> >> > >> > some of them, but I want to be sure. Please help me clarify
>> >> > >> > the details.
>> >> > >> >
>> >> > >> > The key problem here is that we read the WAL and restore the
>> >> > >> > memory state of a newly joined node inside PME. Reading the
>> >> > >> > WAL can consume a huge amount of time, so the whole cluster
>> >> > >> > gets stuck and waits for the single node.
>> >> > >> >
>> >> > >> > 1) Is it correct that readMetastore() happens after the node
>> >> > >> > starts but before including the node into the ring?
>> >> > >> >
>> >> > >> > 2) After the onDone() method is called for LocalJoinFuture
>> >> > >> > on the local node, can we proceed with initiating PME on the
>> >> > >> > local node?
>> >> > >> >
>> >> > >> > 3) After reading the checkpoint and restoring memory for a
>> >> > >> > newly joined node, how and when do we update obsolete
>> >> > >> > partition update counters? At historical rebalance, right?
>> >> > >> >
>> >> > >> > 4) Should we restoreMemory for a newly joined node before
>> >> > >> > PME initiates on the other nodes in the cluster?
>> >> > >> >
>> >> > >> > 5) In our final solution, should readMetastore and
>> >> > >> > restoreMemory be performed in one step for a newly joined
>> >> > >> > node?
>> >> > >> >
>> >> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-7196
>> >> > >>
>> >> > >
>> >> >
>> >>
>> >> --
>> >> Maxim Muzafarov
>> >
>>
> --
> Maxim Muzafarov

--
Maxim Muzafarov
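[Editor's note] The three recovery phases from Pavel's answer 3) above (restore page memory from the last checkpoint, repair it with physical delta records, then apply logical records while incrementing partition update counters) can be sketched as follows. This is a simplified model with hypothetical types (`WalRecord`, `RecordType`), not the actual GridCacheDatabaseSharedManager code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of the three-phase recovery described in the thread. */
public class RecoverySketch {
    /** Hypothetical record kinds standing in for Ignite's WAL record types. */
    enum RecordType { PAGE_DELTA, DATA_RECORD, CHECKPOINT_END }

    /** Hypothetical WAL record: a kind plus the partition it touches. */
    record WalRecord(RecordType type, int partId) {}

    /** partId -> update counter, as restored from the last checkpoint. */
    final Map<Integer, Long> updateCounters = new HashMap<>();

    /** Phase a) restore the page memory state from the last checkpoint. */
    void restoreFromCheckpoint(Map<Integer, Long> checkpointCounters) {
        updateCounters.putAll(checkpointCounters);
    }

    /** Phase b) repair page memory using physical delta records (binary
     *  recovery). In Ignite this re-applies binary page deltas; here the
     *  page contents are not modeled, so it is a counting stub. */
    long applyPhysicalRecords(List<WalRecord> wal) {
        return wal.stream().filter(r -> r.type() == RecordType.PAGE_DELTA).count();
    }

    /** Phase c) apply logical records after the checkpoint finish marker,
     *  incrementing the update counter of each affected partition. */
    void applyLogicalRecords(List<WalRecord> wal) {
        boolean afterCheckpoint = false;

        for (WalRecord rec : wal) {
            if (rec.type() == RecordType.CHECKPOINT_END)
                afterCheckpoint = true;
            else if (afterCheckpoint && rec.type() == RecordType.DATA_RECORD)
                updateCounters.merge(rec.partId(), 1L, Long::sum);
        }
    }
}
```

After these phases, per the thread, PME collects the actual counters via the FullMessage and triggers rebalance for any partition whose counter is still behind.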