Pavel, Dmitry G.,

Thank you for your help and for reviewing this improvement!
Igniters,

I'd like to finalize this topic with the following changes [1]: `Update
Grid(Ignite)CacheDatabaseSharedManager according to accepted coding
guidelines`. I consciously excluded them from the original PR so as not to
overcomplicate the review. Please consider including these changes in the
master branch too:
- createDataRegionConfiguration has an unobvious name – it creates the
metastore data region config
- rename startMemoryPolicies to startDataRegion, or remove it
- fix the code style of the output message formats
- the register/unregister methods for the MBean can be simplified

I've prepared everything required for the review.

[1] https://issues.apache.org/jira/browse/IGNITE-10033
[2] https://reviews.ignite.apache.org/ignite/review/IGNT-CR-927
[3] https://github.com/apache/ignite/pull/5156
[4] https://ci.ignite.apache.org/viewLog.html?buildId=2184711&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll

On Tue, 18 Sep 2018 at 18:02 Maxim Muzafarov <maxmu...@gmail.com> wrote:

> Igniters,
>
> I would like this issue to be part of release 2.7. There is not much time
> left, considering that I will have to address comments after the review.
>
> Will anyone help with the review?
>
> PR: https://github.com/apache/ignite/pull/4520/files
> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>
> On Thu, 13 Sep 2018 at 11:27 Maxim Muzafarov <maxmu...@gmail.com> wrote:
>
>> Igniters,
>>
>> I need your help with the review of this patch. In general, it will help
>> to reduce PME duration. I've moved binary memory recovery to the moment
>> of node startup, as Pavel and I discussed earlier in this topic.
>>
>> Changes are relatively small (+299 −95) and they are ready. I've left a
>> comment in JIRA with the high-level implementation details; I hope this
>> will help with the review [1].
>>
>> Please, take a look.
>> PR: https://github.com/apache/ignite/pull/4520/files
>> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
>> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>> TC Run All: [2]
>>
>> [1] https://issues.apache.org/jira/browse/IGNITE-7196?focusedCommentId=16613175&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16613175
>> [2] https://ci.ignite.apache.org/viewLog.html?buildId=1835822&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll
>>
>> On Tue, 28 Aug 2018 at 17:59 Pavel Kovalenko <jokse...@gmail.com> wrote:
>>
>>> Hello Maxim,
>>>
>>> I think you're going in the right direction, but why do you perform
>>> binary recovery only when a node is in the baseline?
>>> I think a node can also perform binary recovery before the first
>>> exchange is started in case of:
>>> 1) First activation
>>> 2) Baseline change
>>>
>>> In all of these cases, you need just the locally available
>>> CacheDescriptors to restore the binary state from WAL.
>>> I would also like to see a test where Ignite is stopped in the middle
>>> of a checkpoint and binary recovery succeeds before PME in that case
>>> too.
>>>
>>> On Wed, 22 Aug 2018 at 15:31, Pavel Kovalenko <jokse...@gmail.com> wrote:
>>>
>>> > Hello Maxim,
>>> >
>>> > Thank you for your work. I will review your changes within the next
>>> > several days.
>>> >
>>> > 2018-08-22 11:12 GMT+03:00 Maxim Muzafarov <maxmu...@gmail.com>:
>>> >
>>> >> Pavel,
>>> >>
>>> >> Thank you for such a detailed introduction to the root of the
>>> >> problem of ticket IGNITE-7196.
>>> >> As you mentioned before, we can divide this ticket into two
>>> >> sub-tasks, so I will file a new issue for `Logical consistency
>>> >> recovery`. I think it's a more convenient way of reviewing and
>>> >> merging each of these high-level Steps (PRs).
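[Editor's note] The cases Pavel lists above can be condensed into a simple predicate. The sketch below is purely illustrative; `EarlyRecoveryCheck` and its parameters are made-up names, not Ignite's actual API:

```java
// Hypothetical sketch of the cases discussed above in which a node could
// run binary WAL recovery before the first exchange. All names here are
// illustrative assumptions, not the real Ignite implementation.
public final class EarlyRecoveryCheck {
    /**
     * @param inBaseline      Node is part of the current baseline topology.
     * @param firstActivation Cluster is being activated for the first time.
     * @param baselineChanged The baseline topology is being changed.
     * @return {@code true} if binary recovery may run before the first PME.
     */
    public static boolean canRecoverBeforeExchange(boolean inBaseline,
                                                   boolean firstActivation,
                                                   boolean baselineChanged) {
        // In all three cases only locally available CacheDescriptors are
        // needed to restore the binary state from WAL.
        return inBaseline || firstActivation || baselineChanged;
    }
}
```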
>>> >>
>>> >> Currently, let's focus on `Physical (binary) consistency recovery`
>>> >> as the most critical part of this issue and synchronize our vision,
>>> >> and that of the other community members, on the implementation
>>> >> details of Step 1. From this point on, everything I'm saying is
>>> >> about binary recovery only.
>>> >>
>>> >> I think the PR is almost ready (I need to check TC carefully).
>>> >> Can you look at my changes? Am I going the right way?
>>> >>
>>> >> PR: https://github.com/apache/ignite/pull/4520
>>> >> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
>>> >> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>>> >>
>>> >> These are the milestones I've tried to address:
>>> >> 1. On first-time cluster activation no changes are required. We
>>> >> should keep this behavior.
>>> >>
>>> >> 2. Call `startMemoryPolicies` earlier if needed.
>>> >> In the master branch, data regions currently start when the
>>> >> onActivate method is called. If we are trying to restore the binary
>>> >> state when a node reconnects to the cluster, we should start the
>>> >> regions earlier.
>>> >>
>>> >> 3. `suspendLogging` method added to switch off handlers.
>>> >> This keeps PDS cleanup fast when the joining node is not in the BLT
>>> >> (or belongs to another BLT). Restoring memory and initializing the
>>> >> metastorage require `WALWriter` and `FileWriteHandle` to be started.
>>> >> They can lock directories/files and block cleanups.
>>> >>
>>> >> 4. `isBaselineNode` check before resuming logging.
>>> >> The same as point 3. The local node gets discovery data and
>>> >> evaluates #isLocalNodeNotInBaseline. If the node is not in the
>>> >> current baseline, it cleans up the local PDS, and we should reset
>>> >> the state of the metastorage instance previously initialized at
>>> >> startup.
>>> >>
>>> >> 5. `cacheProcessorStarted` callback added to restore memory at
>>> >> startup.
>>> >> We can restore memory only after we have info about all the
>>> >> `CacheGroupDescriptor`s. To achieve this, we should do the restore
>>> >> at the moment `GridCacheProcessor` has started.
>>> >>
>>> >> 6. Move `fsync` for `MemoryRecoveryRecord` to node startup.
>>> >> We are fsync'ing it inside a running PME. I suppose it is used for
>>> >> recovery actions and/or controlling cluster state outside the Ignite
>>> >> project (the methods `nodeStart` and `nodeStartedPointers` are not
>>> >> used in the Ignite project code). I've moved them to node startup,
>>> >> but maybe we should dump the timestamp only at the time of node
>>> >> activation. Not sure.
>>> >>
>>> >> 7. `testBinaryRecoverBeforeExchange` test added to check the restore
>>> >> before PME.
>>> >> I'm not sure we should repeat the WAL-reading logic as it was done
>>> >> in the `testApplyDeltaRecords` test for checking the memory restore.
>>> >> I suppose the minimally necessary condition is to check the
>>> >> `lastRestored` and `checkpointHistory` states of the database
>>> >> manager (they should happen before the node joins the cluster).
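[Editor's note] For review purposes, the startup ordering implied by milestones 2-5 can be modeled roughly as below. This is a sketch with made-up step names, not the real `GridCacheDatabaseSharedManager` code; the actual method names and ordering are those described in the milestones above:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model (all names are assumptions) of the proposed startup
// sequence: suspend WAL logging, start data regions earlier than
// onActivate(), read the metastorage, wipe the local PDS if the node is
// not in the baseline, restore memory once all cache group descriptors
// are known, then resume logging.
public final class StartupRecoverySketch {
    public static List<String> start(boolean inBaseline) {
        List<String> steps = new ArrayList<>();
        steps.add("suspendLogging");      // keep WAL handlers from locking files
        steps.add("startDataRegions");    // earlier than onActivate()
        steps.add("readMetastorage");
        if (!inBaseline) {
            steps.add("cleanupLocalPds"); // node not in BLT: wipe local PDS
            steps.add("resetMetastorage");// drop state initialized at startup
        }
        else
            steps.add("restoreMemory");   // after GridCacheProcessor started
        steps.add("resumeLogging");
        return steps;
    }
}
```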
>>> >>
>>> >> I will create a separate ticket and refactor these minor issues
>>> >> after we finish with 7196:
>>> >> - simplify the registerMBeans methods for
>>> >> `Ignite\GridCacheDatabaseSharedManager`
>>> >> - add @Override annotations where missed
>>> >> - fix abbreviations according to IdeaAbbrPlugin
>>> >> - the createDataRegionConfiguration method can be removed and
>>> >> simplified
>>> >> - rename startMemoryPolicies to startDataRegion
>>> >> - fix the code style of the output message formats
>>> >> - fix javadoc formatting
>>> >> - make inner classes private
>>> >> - add missed logging for cleanupCheckpointDirectory and
>>> >> cleanupWalDirectories
>>> >>
>>> >> On Fri, 3 Aug 2018 at 18:20 Pavel Kovalenko <jokse...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> > Maxim,
>>> >> >
>>> >> > In addition to my answer, I propose to divide the whole task into
>>> >> > 2 steps:
>>> >> > 1) Physical consistency recovery before PME.
>>> >> > 2) Logical consistency recovery before PME.
>>> >> >
>>> >> > The first step can be done before the discovery manager starts, as
>>> >> > it doesn't need any information about caches; it operates only
>>> >> > with pages' binary data. On this step, we can do points a) and b)
>>> >> > from answer 3) and fsync the state of the page memory.
>>> >> > The second step is much harder to implement, as it requires major
>>> >> > changes in the Discovery SPI. The discovery node-join event is
>>> >> > triggered on the coordinator after a node successfully joins the
>>> >> > cluster. To avoid that, we should introduce a new node state and
>>> >> > fire the discovery event only after logical recovery has finished.
>>> >> > I doubt that logical recovery consumes a lot of exchange time, but
>>> >> > physical recovery does, as it performs fsync on disk after the
>>> >> > restore.
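[Editor's note] The physical/logical split Pavel proposes can be illustrated with a toy model. The record class and method names below are simplified stand-ins, not Ignite's WAL record types: the physical step rewrites raw page bytes from delta records and needs no cache information, while the logical step replays data records and bumps per-partition update counters:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy two-step recovery model (illustrative names, not Ignite code).
public final class TwoStepRecoveryModel {
    /** Simplified WAL record: a physical page delta or a logical data entry. */
    public static final class Rec {
        final boolean physical;
        final int id;       // page id for physical, partition id for logical
        final byte payload;

        public Rec(boolean physical, int id, byte payload) {
            this.physical = physical;
            this.id = id;
            this.payload = payload;
        }
    }

    /** Step 1: repair the page memory state from physical delta records. */
    public static Map<Integer, Byte> applyPhysical(List<Rec> wal, Map<Integer, Byte> pages) {
        for (Rec r : wal)
            if (r.physical)
                pages.put(r.id, r.payload);
        return pages;
    }

    /** Step 2: replay logical records, incrementing partition update counters. */
    public static Map<Integer, Long> applyLogical(List<Rec> wal) {
        Map<Integer, Long> cntrs = new HashMap<>();
        for (Rec r : wal)
            if (!r.physical)
                cntrs.merge(r.id, 1L, Long::sum);
        return cntrs;
    }
}
```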
>>> >> >
>>> >> > 2018-08-03 17:46 GMT+03:00 Pavel Kovalenko <jokse...@gmail.com>:
>>> >> >
>>> >> > > Hello Maxim,
>>> >> > >
>>> >> > > 1) Yes, the Discovery Manager starts after GridCacheProcessor,
>>> >> > > which starts GridCacheDatabaseSharedManager, which invokes
>>> >> > > readMetastorage on start.
>>> >> > >
>>> >> > > 2) Before we complete the local join future, we create and add
>>> >> > > an Exchange future for the local node join to the
>>> >> > > ExchangeManager. So, when the local join future completes, the
>>> >> > > first PME on that node should already be running, or at least
>>> >> > > there will be a first Exchange future in the Exchange worker
>>> >> > > queue.
>>> >> > >
>>> >> > > 3) I don't exactly understand what you mean by "update obsolete
>>> >> > > partition counter", but the process is the following:
>>> >> > > a) We restore the page memory state (which contains partition
>>> >> > > update counters) from the last checkpoint.
>>> >> > > b) We repair the page memory state using physical delta records
>>> >> > > from WAL.
>>> >> > > c) We apply logical records since the last checkpoint-finish
>>> >> > > marker. During this iteration, if we meet a DataRecord, we
>>> >> > > increment the update counter on the appropriate partition.
>>> >> > > NOTE: This phase currently happens during exchange.
>>> >> > > After that, partition exchange starts, and information about the
>>> >> > > actual update counters will be collected from the FullMessage.
>>> >> > > If partition counters are outdated, rebalance will start after
>>> >> > > the PME ends.
>>> >> > >
>>> >> > > 4) I think yes, because everything you need is the cache
>>> >> > > descriptors of the caches currently started on the grid. Some
>>> >> > > information can be retrieved from the static configuration.
>>> >> > > Information about dynamically started caches can be retrieved
>>> >> > > from the Discovery Data Bag received when a node joins the
>>> >> > > ring. See
>>> >> > > org.apache.ignite.internal.processors.cache.ClusterCachesInfo#onGridDataReceived
>>> >> > >
>>> >> > > 5) I think the main problem is that the metastorage and the page
>>> >> > > memory share one WAL. But if this phase happens before PME is
>>> >> > > started on all cluster nodes, I think it's acceptable that they
>>> >> > > will use 2 WAL iterations.
>>> >> > >
>>> >> > > 2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <nizhi...@apache.org>:
>>> >> > >
>>> >> > >> Hello, Maxim.
>>> >> > >>
>>> >> > >> > 1) Is it correct that readMetastore() happens after the node
>>> >> > >> > starts but before including the node into the ring?
>>> >> > >>
>>> >> > >> I think yes.
>>> >> > >> You can have some kind of meta-information required on node
>>> >> > >> join.
>>> >> > >>
>>> >> > >> > 5) In our final solution, should readMetastore and
>>> >> > >> > restoreMemory be performed in one step for a newly joined
>>> >> > >> > node?
>>> >> > >>
>>> >> > >> I think, no.
>>> >> > >>
>>> >> > >> Meta-information can be required to perform the memory restore.
>>> >> > >> So we have to restore the meta-information in the first step
>>> >> > >> and restore the whole memory as the second step.
>>> >> > >>
>>> >> > >> On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
>>> >> > >> > Hi Igniters,
>>> >> > >> >
>>> >> > >> > I'm working on bug [1] and have some questions about the
>>> >> > >> > final implementation. Probably, I've already found answers to
>>> >> > >> > some of them, but I want to be sure. Please help me clarify
>>> >> > >> > the details.
>>> >> > >> >
>>> >> > >> > The key problem here is that we are reading WAL and restoring
>>> >> > >> > the memory state of a newly joined node inside PME. Reading
>>> >> > >> > WAL can consume a huge amount of time, so the whole cluster
>>> >> > >> > gets stuck and waits for the single node.
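[Editor's note] The cost described above can be made concrete with a toy model (purely illustrative; the class, methods, and numbers are assumptions): PME is a cluster-wide barrier, so its duration is bounded below by the slowest node, and doing WAL recovery inside PME adds the worst-case recovery time to every node's wait:

```java
// Toy cost model (illustrative, not Ignite code) of recovery inside vs
// outside PME. The barrier waits for the slowest participant.
public final class PmeCostModel {
    /** Exchange time when each node recovers inside PME. */
    public static long exchangeWithRecoveryInside(long[] recoveryMs, long baseMs) {
        long worst = 0;
        for (long r : recoveryMs)
            worst = Math.max(worst, r); // barrier waits for the slowest node
        return baseMs + worst;
    }

    /** Exchange time when recovery happened at startup, before the join. */
    public static long exchangeWithRecoveryAtStartup(long baseMs) {
        return baseMs; // recovery no longer sits on the exchange critical path
    }
}
```

One slow node recovering for 5 seconds stretches the whole exchange to 5+ seconds for everyone, which is exactly what moving recovery to node startup avoids.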
>>> >> > >> >
>>> >> > >> > 1) Is it correct that readMetastore() happens after the node
>>> >> > >> > starts but before including the node into the ring?
>>> >> > >> >
>>> >> > >> > 2) After the onDone() method is called for the
>>> >> > >> > LocalJoinFuture on the local node, can we proceed with
>>> >> > >> > initiating PME on the local node?
>>> >> > >> >
>>> >> > >> > 3) After reading the checkpoint and restoring memory for a
>>> >> > >> > newly joined node, how and when do we update the obsolete
>>> >> > >> > partition update counters? At historical rebalance, right?
>>> >> > >> >
>>> >> > >> > 4) Should we restoreMemory for a newly joined node before PME
>>> >> > >> > initiates on the other nodes in the cluster?
>>> >> > >> >
>>> >> > >> > 5) In our final solution, should readMetastore and
>>> >> > >> > restoreMemory be performed in one step for a newly joined
>>> >> > >> > node?
>>> >> > >> >
>>> >> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-7196
>>> >> > >>
>>> >> > >
>>> >> > >
>>> >> >
>>> >> --
>>> >> --
>>> >> Maxim Muzafarov
>>> >
>>> >
>>>
>> --
>> --
>> Maxim Muzafarov
>>
> --
> --
> Maxim Muzafarov
>
--
--
Maxim Muzafarov