Hello Maxim,

Thank you for your work. I will review your changes within the next several days.
2018-08-22 11:12 GMT+03:00 Maxim Muzafarov <maxmu...@gmail.com>:
> Pavel,
>
> Thank you for such a detailed introduction to the root of the problem in
> ticket IGNITE-7196.
> As you mentioned before, we can divide this ticket into two sub-tasks, so I
> will file a new issue for `Logical consistency recovery`. I think that is a
> more convenient way to review and merge each of these high-level steps (PRs).
>
> For now, let's focus on `Physical (binary) consistency recovery` as the most
> critical part of this issue, and synchronize our vision, and the vision of
> other community members, on the implementation details of Step 1. From this
> point on, everything I say refers only to binary recovery.
>
> I think the PR is almost ready (I need to check TC carefully).
> Can you look at my changes? Am I going the right way?
>
> PR: https://github.com/apache/ignite/pull/4520
> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>
> These are the milestones I've tried to address:
>
> 1. On the first cluster activation no changes are required. We should keep
> this behavior.
>
> 2. Call `startMemoryPolicies` earlier if needed.
> On the master branch, data regions currently start when the onActivate
> method is called. If we want to restore the binary state when a node
> reconnects to the cluster, we should start the regions earlier.
>
> 3. A `suspendLogging` method added to switch off WAL handlers.
> This keeps PDS cleanup fast when the joining node is not in the BLT (or
> belongs to another BLT). Restoring memory and initializing the metastorage
> require `WALWriter` and `FileWriteHandle` to be started, and they can lock
> directories/files and block cleanups.
>
> 4. An `isBaselineNode` check before resuming logging.
> The same as point 3. The local node gets discovery data and determines
> #isLocalNodeNotInBaseline.
> If the node is not in the current baseline, it cleans up the local PDS, and
> we should reset the state of the metastorage instance previously initialized
> at startup.
>
> 5. A `cacheProcessorStarted` callback added to restore memory at startup.
> We can restore memory only after we have information about all
> `CacheGroupDescriptor`s. To achieve this, we should do the restore at the
> moment `GridCacheProcessor` has started.
>
> 6. Move the `fsync` for `MemoryRecoveryRecord` to node startup.
> We currently fsync it inside a running PME. I suppose it is used for
> recovery actions and/or for controlling cluster state outside the Ignite
> project (the methods `nodeStart` and `nodeStartedPointers` are not used in
> the Ignite code base). I've moved them to node startup, but maybe we should
> dump the timestamp only at node activation time. Not sure.
>
> 7. A `testBinaryRecoverBeforeExchange` test added to check restore before
> PME.
> I'm not sure we should repeat the WAL-reading logic from the
> `testApplyDeltaRecords` test to verify the memory restore. I suppose the
> minimal necessary condition is to check the `lastRestored` and
> `checkpointHistory` states of the database manager (they should happen
> before the node joins the cluster).
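The ordering constraint behind milestones 3 and 4 — WAL handlers must be released before the PDS cleanup can run, and logging resumed only for baseline nodes — can be sketched roughly as below. All class and method names here (`WalManager`, `cleanupPds`, `onNodeJoin`) are illustrative placeholders, not the actual Ignite API:

```java
// Hypothetical sketch of the suspend/cleanup/resume ordering from
// milestones 3 and 4. Names are illustrative, not the Ignite API.
import java.util.ArrayList;
import java.util.List;

public class RecoveryOrderSketch {
    /** Records the order of operations so the ordering can be inspected. */
    static final List<String> events = new ArrayList<>();

    static class WalManager {
        private boolean loggingActive = true;

        /** Release WAL file handles so directory cleanup is not blocked. */
        void suspendLogging() {
            loggingActive = false;
            events.add("suspendLogging");
        }

        /** Re-attach handlers once we know the node stays in the baseline. */
        void resumeLogging() {
            loggingActive = true;
            events.add("resumeLogging");
        }

        boolean isLoggingActive() { return loggingActive; }
    }

    /** PDS cleanup must not run while WAL handles lock the directory. */
    static void cleanupPds(WalManager wal) {
        if (wal.isLoggingActive())
            throw new IllegalStateException("WAL handles would block cleanup");
        events.add("cleanupPds");
    }

    static void onNodeJoin(WalManager wal, boolean inBaseline) {
        wal.suspendLogging();    // milestone 3: switch off handlers first

        if (!inBaseline)
            cleanupPds(wal);     // node not in BLT: fast local PDS cleanup
        else
            wal.resumeLogging(); // milestone 4: resume only for BLT nodes
    }

    public static void main(String[] args) {
        WalManager wal = new WalManager();
        onNodeJoin(wal, false);     // a node outside the current baseline
        System.out.println(events); // prints [suspendLogging, cleanupPds]
    }
}
```

The point of the guard in `cleanupPds` is only to make the constraint explicit: cleanup with live WAL handles is the failure mode milestone 3 avoids.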
>
> I will create a separate ticket and refactor these minor issues after we
> finish with 7196:
> - simplify the registerMBeans methods for `Ignite\GridCacheDatabaseSharedManager`
> - add @Override annotations where missed
> - fix abbreviations according to IdeaAbbrPlugin
> - the createDataRegionConfiguration method can be removed and simplified
> - rename startMemoryPolicies to startDataRegion
> - fix the format of output messages (code style)
> - fix javadoc formatting
> - make inner classes private
> - add missed logging for cleanupCheckpointDirectory, cleanupWalDirectories
>
> On Fri, 3 Aug 2018 at 18:20 Pavel Kovalenko <jokse...@gmail.com> wrote:
>
> > Maxim,
> >
> > In addition to my answer, I propose to divide the whole task into 2
> > steps:
> > 1) Physical consistency recovery before PME.
> > 2) Logical consistency recovery before PME.
> >
> > The first step can be done before the discovery manager starts, as it
> > does not need any information about caches; it operates only with pages'
> > binary data. In this step, we can do points a) and b) from answer 3) and
> > fsync the state of the page memory.
> > The second step is much harder to implement, as it requires major changes
> > in the Discovery SPI. The discovery node-join event is triggered on the
> > coordinator after a node successfully joins the cluster. To avoid that,
> > we should introduce a new node state and fire the discovery event only
> > after logical recovery has finished. I doubt that logical recovery
> > consumes much of the exchange time, but physical recovery does, as it
> > performs an fsync on disk after restore.
> >
> > 2018-08-03 17:46 GMT+03:00 Pavel Kovalenko <jokse...@gmail.com>:
> >
> > > Hello Maxim,
> > >
> > > 1) Yes, the Discovery Manager starts after GridCacheProcessor, which
> > > starts GridCacheDatabaseSharedManager, which invokes readMetastorage
> > > on start.
> > >
> > > 2) Before we complete the local join future, we create an Exchange
> > > future on local node join and add it to the ExchangeManager. So, when
> > > the local join future completes, the first PME on that node should
> > > already be running, or at least there will be a first Exchange future
> > > in the Exchange worker queue.
> > >
> > > 3) I don't exactly understand what you mean by "update obsolete
> > > partition counter", but the process is the following:
> > >    a) We restore the page memory state (which contains partition
> > > update counters) from the last checkpoint.
> > >    b) We repair the page memory state using physical delta records
> > > from the WAL.
> > >    c) We apply logical records since the last checkpoint finish
> > > marker. During this iteration, if we meet a DataRecord, we increment
> > > the update counter on the appropriate partition. NOTE: This phase
> > > currently happens during exchange.
> > > After that, partition exchange starts and information about actual
> > > update counters is collected from the FullMessage. If partition
> > > counters are outdated, rebalance will start after PME ends.
> > >
> > > 4) I think yes, because everything you need is the cache descriptors
> > > of the caches currently started on the grid. Some information can be
> > > retrieved from the static configuration. Information about dynamically
> > > started caches can be retrieved from the Discovery Data Bag received
> > > when a node joins the ring. See
> > > org.apache.ignite.internal.processors.cache.ClusterCachesInfo#onGridDataReceived
> > >
> > > 5) I think the main problem is that the metastorage and the page
> > > memory share one WAL. But if this phase happens before PME is started
> > > on all cluster nodes, I think it's acceptable that they use 2 WAL
> > > iterations.
> > >
> > > 2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <nizhi...@apache.org>:
> > >
> > >> Hello, Maxim.
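The three recovery phases a)–c) that Pavel lists in answer 3) above can be sketched, very roughly, as a single replay over the WAL records written since the last checkpoint. Every type and method here (`WalRecord`, `PageDeltaRecord`, `replay`, etc.) is a hypothetical placeholder for illustration, not Ignite's actual WAL API:

```java
// Illustrative sketch of the recovery phases from answer 3):
// (a) page memory is restored from the last checkpoint (assumed done),
// (b) physical (page delta) records repair the binary page state,
// (c) logical records bump partition update counters on each DataRecord.
// All names here are hypothetical, not the Ignite WAL API.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WalReplaySketch {
    interface WalRecord {}
    /** Binary page delta written after the checkpoint. */
    record PageDeltaRecord(long pageId, byte patch) implements WalRecord {}
    /** Logical data record belonging to a partition. */
    record DataRecord(int partId) implements WalRecord {}

    /** Stand-in for page memory: page id -> latest binary state. */
    static final Map<Long, Byte> pageMemory = new HashMap<>();
    /** Per-partition update counters rebuilt during logical replay. */
    static final Map<Integer, Long> updateCounters = new HashMap<>();

    static void replay(List<WalRecord> walSinceCheckpoint) {
        // Phase (a) happened before this point: pageMemory was loaded
        // from the last checkpoint on disk.

        // Phase (b): repair binary page state with physical delta records.
        for (WalRecord rec : walSinceCheckpoint)
            if (rec instanceof PageDeltaRecord d)
                pageMemory.put(d.pageId(), d.patch());

        // Phase (c): apply logical records; each DataRecord increments
        // the update counter of its partition.
        for (WalRecord rec : walSinceCheckpoint)
            if (rec instanceof DataRecord d)
                updateCounters.merge(d.partId(), 1L, Long::sum);
    }

    public static void main(String[] args) {
        replay(List.of(new PageDeltaRecord(1L, (byte) 7),
                       new DataRecord(0), new DataRecord(0),
                       new DataRecord(5)));
        System.out.println(updateCounters); // counters per partition
    }
}
```

The outdated counters produced by phase (c) are exactly what PME later compares against the FullMessage to decide whether rebalance is needed.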
> > >>
> > >> > 1) Is it correct that readMetastore() happens after the node starts
> > >> > but before including the node into the ring?
> > >>
> > >> I think yes.
> > >> You can have some kind of meta-information required on node join.
> > >>
> > >> > 5) In our final solution, should readMetastore and restoreMemory
> > >> > for a newly joined node be performed in one step?
> > >>
> > >> I think, no.
> > >>
> > >> Meta-information can be required to perform the memory restore.
> > >> So we have to restore the meta-information in a first step and
> > >> restore the whole memory in a second step.
> > >>
> > >> On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
> > >> > Hi Igniters,
> > >> >
> > >> > I'm working on bug [1] and have some questions about the final
> > >> > implementation. Probably I've already found answers to some of
> > >> > them, but I want to be sure. Please help me clarify the details.
> > >> >
> > >> > The key problem here is that we read the WAL and restore the memory
> > >> > state of a newly joined node inside PME. Reading the WAL can
> > >> > consume a huge amount of time, so the whole cluster gets stuck and
> > >> > waits for a single node.
> > >> >
> > >> > 1) Is it correct that readMetastore() happens after the node starts
> > >> > but before including the node into the ring?
> > >> >
> > >> > 2) After onDone() is called for the LocalJoinFuture on the local
> > >> > node, can we proceed with initiating PME on the local node?
> > >> >
> > >> > 3) After reading the checkpoint and restoring memory for a newly
> > >> > joined node, how and when do we update the obsolete partition
> > >> > update counters? At historical rebalance, right?
> > >> >
> > >> > 4) Should we restoreMemory for a newly joined node before PME is
> > >> > initiated on the other nodes in the cluster?
> > >> >
> > >> > 5) In our final solution, should readMetastore and restoreMemory
> > >> > for a newly joined node be performed in one step?
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-7196

> --
> Maxim Muzafarov