Hello Maxim,

Thank you for your work. I will review your changes within the next several days.
2018-08-22 11:12 GMT+03:00 Maxim Muzafarov <maxmu...@gmail.com>:
> Pavel,
>
> Thank you for such a detailed introduction to the root of the problem in
> ticket IGNITE-7196.
> As you mentioned before, we can divide this ticket into two sub-tasks, so I
> will file a new issue for `Logical consistency recovery`. I think that is a
> more convenient way to review and merge each of these high-level steps (PRs).
>
> For now, let's focus on `Physical (binary) consistency recovery` as the most
> critical part of this issue, and synchronize our vision, and the vision of
> other community members, on the implementation details of Step 1. From this
> point on, everything I say refers only to binary recovery.
>
> I think the PR is almost ready (I need to check TC carefully).
> Can you look at my changes? Am I going the right way?
>
> PR: https://github.com/apache/ignite/pull/4520
> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>
> These are the milestones I've tried to address:
>
> 1. On the first cluster activation no changes are required. We should keep
> this behavior.
>
> 2. Call `startMemoryPolicies` earlier if needed.
> On the master branch, data regions currently start when the onActivate
> method is called. If we want to restore the binary state when a node
> reconnects to the cluster, we should start the regions earlier.
>
> 3. A `suspendLogging` method added to switch off WAL handlers.
> This keeps PDS cleanup fast when the joining node is not in the BLT (or
> belongs to another BLT). Restoring memory and initializing the metastorage
> require `WALWriter` and `FileWriteHandle` to be started, and they can lock
> directories/files and block cleanups.
>
> 4. An `isBaselineNode` check before resuming logging.
> The same as point 3. The local node gets discovery data and determines
> #isLocalNodeNotInBaseline.
> If the node is not in the current baseline, it cleans up the local PDS, and
> we should reset the state of the metastorage instance previously initialized
> at startup.
>
> 5. A `cacheProcessorStarted` callback added to restore memory at startup.
> We can restore memory only after we have information about all
> `CacheGroupDescriptor`s. To achieve this, we should do the restore at the
> moment `GridCacheProcessor` has started.
>
> 6. Move the `fsync` for `MemoryRecoveryRecord` to node startup.
> We currently fsync it inside a running PME. I suppose it is used for
> recovery actions and/or for controlling cluster state outside the Ignite
> project (the methods `nodeStart` and `nodeStartedPointers` are not used in
> the Ignite code base). I've moved them to node startup, but maybe we should
> dump the timestamp only at node activation time. Not sure.
>
> 7. A `testBinaryRecoverBeforeExchange` test added to check restore before
> PME.
> I'm not sure we should repeat the WAL-reading logic from the
> `testApplyDeltaRecords` test to verify the memory restore. I suppose the
> minimal necessary condition is to check the `lastRestored` and
> `checkpointHistory` states of the database manager (they should happen
> before the node joins the cluster).
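The ordering constraint behind milestones 3 and 4 — WAL handlers must be released before the PDS cleanup can run, and logging resumed only for baseline nodes — can be sketched roughly as below. All class and method names here (`WalManager`, `cleanupPds`, `onNodeJoin`) are illustrative placeholders, not the actual Ignite API:

```java
// Hypothetical sketch of the suspend/cleanup/resume ordering from
// milestones 3 and 4. Names are illustrative, not the Ignite API.
import java.util.ArrayList;
import java.util.List;

public class RecoveryOrderSketch {
    /** Records the order of operations so the ordering can be inspected. */
    static final List<String> events = new ArrayList<>();

    static class WalManager {
        private boolean loggingActive = true;

        /** Release WAL file handles so directory cleanup is not blocked. */
        void suspendLogging() {
            loggingActive = false;
            events.add("suspendLogging");
        }

        /** Re-attach handlers once we know the node stays in the baseline. */
        void resumeLogging() {
            loggingActive = true;
            events.add("resumeLogging");
        }

        boolean isLoggingActive() { return loggingActive; }
    }

    /** PDS cleanup must not run while WAL handles lock the directory. */
    static void cleanupPds(WalManager wal) {
        if (wal.isLoggingActive())
            throw new IllegalStateException("WAL handles would block cleanup");
        events.add("cleanupPds");
    }

    static void onNodeJoin(WalManager wal, boolean inBaseline) {
        wal.suspendLogging();    // milestone 3: switch off handlers first

        if (!inBaseline)
            cleanupPds(wal);     // node not in BLT: fast local PDS cleanup
        else
            wal.resumeLogging(); // milestone 4: resume only for BLT nodes
    }

    public static void main(String[] args) {
        WalManager wal = new WalManager();
        onNodeJoin(wal, false);     // a node outside the current baseline
        System.out.println(events); // prints [suspendLogging, cleanupPds]
    }
}
```

The point of the guard in `cleanupPds` is only to make the constraint explicit: cleanup with live WAL handles is the failure mode milestone 3 avoids.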
>
> I will create a separate ticket and refactor these minor issues after we
> finish with 7196:
> - simplify the registerMBeans methods for `Ignite\GridCacheDatabaseSharedManager`
> - add @Override annotations where missed
> - fix abbreviations according to IdeaAbbrPlugin
> - the createDataRegionConfiguration method can be removed and simplified
> - rename startMemoryPolicies to startDataRegion
> - fix the format of output messages (code style)
> - fix javadoc formatting
> - make inner classes private
> - add missed logging for cleanupCheckpointDirectory, cleanupWalDirectories
>
> On Fri, 3 Aug 2018 at 18:20 Pavel Kovalenko <jokse...@gmail.com> wrote:
>
> > Maxim,
> >
> > In addition to my answer, I propose to divide the whole task into 2
> > steps:
> > 1) Physical consistency recovery before PME.
> > 2) Logical consistency recovery before PME.
> >
> > The first step can be done before the discovery manager starts, as it
> > does not need any information about caches; it operates only with pages'
> > binary data. In this step, we can do points a) and b) from answer 3) and
> > fsync the state of the page memory.
> > The second step is much harder to implement, as it requires major changes
> > in the Discovery SPI. The discovery node-join event is triggered on the
> > coordinator after a node successfully joins the cluster. To avoid that,
> > we should introduce a new node state and fire the discovery event only
> > after logical recovery has finished. I doubt that logical recovery
> > consumes much of the exchange time, but physical recovery does, as it
> > performs an fsync on disk after restore.
> >
> > 2018-08-03 17:46 GMT+03:00 Pavel Kovalenko <jokse...@gmail.com>:
> >
> > > Hello Maxim,
> > >
> > > 1) Yes, the Discovery Manager starts after GridCacheProcessor, which
> > > starts GridCacheDatabaseSharedManager, which invokes readMetastorage
> > > on start.
> > >
> > > 2) Before we complete the local join future, we create an Exchange
> > > future on local node join and add it to the ExchangeManager. So, when
> > > the local join future completes, the first PME on that node should
> > > already be running, or at least there will be a first Exchange future
> > > in the Exchange worker queue.
> > >
> > > 3) I don't exactly understand what you mean by "update obsolete
> > > partition counter", but the process is the following:
> > >    a) We restore the page memory state (which contains partition
> > > update counters) from the last checkpoint.
> > >    b) We repair the page memory state using physical delta records
> > > from the WAL.
> > >    c) We apply logical records since the last checkpoint finish
> > > marker. During this iteration, if we meet a DataRecord, we increment
> > > the update counter on the appropriate partition. NOTE: This phase
> > > currently happens during exchange.
> > > After that, partition exchange starts and information about actual
> > > update counters is collected from the FullMessage. If partition
> > > counters are outdated, rebalance will start after PME ends.
> > >
> > > 4) I think yes, because everything you need is the cache descriptors
> > > of the caches currently started on the grid. Some information can be
> > > retrieved from the static configuration. Information about dynamically
> > > started caches can be retrieved from the Discovery Data Bag received
> > > when a node joins the ring. See
> > > org.apache.ignite.internal.processors.cache.ClusterCachesInfo#onGridDataReceived
> > >
> > > 5) I think the main problem is that the metastorage and the page
> > > memory share one WAL. But if this phase happens before PME is started
> > > on all cluster nodes, I think it's acceptable that they use 2 WAL
> > > iterations.
> > >
> > > 2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <nizhi...@apache.org>:
> > >
> > >> Hello, Maxim.
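The three recovery phases a)–c) that Pavel lists in answer 3) above can be sketched, very roughly, as a single replay over the WAL records written since the last checkpoint. Every type and method here (`WalRecord`, `PageDeltaRecord`, `replay`, etc.) is a hypothetical placeholder for illustration, not Ignite's actual WAL API:

```java
// Illustrative sketch of the recovery phases from answer 3):
// (a) page memory is restored from the last checkpoint (assumed done),
// (b) physical (page delta) records repair the binary page state,
// (c) logical records bump partition update counters on each DataRecord.
// All names here are hypothetical, not the Ignite WAL API.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WalReplaySketch {
    interface WalRecord {}
    /** Binary page delta written after the checkpoint. */
    record PageDeltaRecord(long pageId, byte patch) implements WalRecord {}
    /** Logical data record belonging to a partition. */
    record DataRecord(int partId) implements WalRecord {}

    /** Stand-in for page memory: page id -> latest binary state. */
    static final Map<Long, Byte> pageMemory = new HashMap<>();
    /** Per-partition update counters rebuilt during logical replay. */
    static final Map<Integer, Long> updateCounters = new HashMap<>();

    static void replay(List<WalRecord> walSinceCheckpoint) {
        // Phase (a) happened before this point: pageMemory was loaded
        // from the last checkpoint on disk.

        // Phase (b): repair binary page state with physical delta records.
        for (WalRecord rec : walSinceCheckpoint)
            if (rec instanceof PageDeltaRecord d)
                pageMemory.put(d.pageId(), d.patch());

        // Phase (c): apply logical records; each DataRecord increments
        // the update counter of its partition.
        for (WalRecord rec : walSinceCheckpoint)
            if (rec instanceof DataRecord d)
                updateCounters.merge(d.partId(), 1L, Long::sum);
    }

    public static void main(String[] args) {
        replay(List.of(new PageDeltaRecord(1L, (byte) 7),
                       new DataRecord(0), new DataRecord(0),
                       new DataRecord(5)));
        System.out.println(updateCounters); // counters per partition
    }
}
```

The outdated counters produced by phase (c) are exactly what PME later compares against the FullMessage to decide whether rebalance is needed.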
> > >>
> > >> > 1) Is it correct that readMetastore() happens after the node starts
> > >> > but before including the node into the ring?
> > >>
> > >> I think yes.
> > >> You can have some kind of meta-information required on node join.
> > >>
> > >> > 5) In our final solution, should readMetastore and restoreMemory
> > >> > for a newly joined node be performed in one step?
> > >>
> > >> I think, no.
> > >>
> > >> Meta-information can be required to perform the memory restore.
> > >> So we have to restore the meta-information in a first step and
> > >> restore the whole memory in a second step.
> > >>
> > >> On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
> > >> > Hi Igniters,
> > >> >
> > >> > I'm working on bug [1] and have some questions about the final
> > >> > implementation. Probably I've already found answers to some of
> > >> > them, but I want to be sure. Please help me clarify the details.
> > >> >
> > >> > The key problem here is that we read the WAL and restore the memory
> > >> > state of a newly joined node inside PME. Reading the WAL can
> > >> > consume a huge amount of time, so the whole cluster gets stuck and
> > >> > waits for a single node.
> > >> >
> > >> > 1) Is it correct that readMetastore() happens after the node starts
> > >> > but before including the node into the ring?
> > >> >
> > >> > 2) After onDone() is called for the LocalJoinFuture on the local
> > >> > node, can we proceed with initiating PME on the local node?
> > >> >
> > >> > 3) After reading the checkpoint and restoring memory for a newly
> > >> > joined node, how and when do we update the obsolete partition
> > >> > update counters? At historical rebalance, right?
> > >> >
> > >> > 4) Should we restoreMemory for a newly joined node before PME is
> > >> > initiated on the other nodes in the cluster?
> > >> >
> > >> > 5) In our final solution, should readMetastore and restoreMemory
> > >> > for a newly joined node be performed in one step?
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-7196

> --
> Maxim Muzafarov