Igniters,

I would like this issue to be part of release 2.7. There is not much time
left, considering that I will also have to address review comments afterwards.

Will anyone help with the review?

PR: https://github.com/apache/ignite/pull/4520/files
Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
JIRA: https://issues.apache.org/jira/browse/IGNITE-7196

On Thu, 13 Sep 2018 at 11:27 Maxim Muzafarov <maxmu...@gmail.com> wrote:

> Igniters,
>
> I need your help with the review of this patch. In general, it will help
> to reduce PME duration. I've moved binary memory recovery to the moment of
> node startup, as Pavel and I discussed earlier in this topic.
>
> Changes are relatively small (+299 −95) and they are ready. I've left a
> comment in JIRA with high-level implementation details; I hope this will
> help with the review [1].
>
> Please, take a look.
> PR: https://github.com/apache/ignite/pull/4520/files
> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
> TC Run All: [2]
>
>
> [1]
> https://issues.apache.org/jira/browse/IGNITE-7196?focusedCommentId=16613175&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16613175
> [2]
> https://ci.ignite.apache.org/viewLog.html?buildId=1835822&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll
>
>
> On Tue, 28 Aug 2018 at 17:59 Pavel Kovalenko <jokse...@gmail.com> wrote:
>
>> Hello Maxim,
>>
>> I think you're going in the right direction, but why do you perform binary
>> recovery only in the case when the node is in the baseline?
>> I think that a node can also perform binary recovery before the first
>> exchange is started in the case of:
>> 1) First activation
>> 2) Baseline change
>>
>> In all of these cases, you need just the locally available CacheDescriptors
>> to restore the binary state from WAL.
>> I would also like to see a test where Ignite is stopped in the middle of a
>> checkpoint and binary recovery still succeeds before PME in that case.
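A toy sketch of the invariant such a test could assert (all names and data
structures below are hypothetical illustrations, not Ignite's real classes):
recovery must fall back to the last checkpoint that actually wrote its end
marker and re-apply the WAL tail from that checkpoint's position.

```java
import java.util.*;

// Hypothetical model of crash-in-the-middle-of-a-checkpoint recovery.
// A "checkpoint" is a page snapshot plus the WAL offset it was taken at;
// it only counts for recovery if its end marker was written before the crash.
class MidCheckpointRecovery {
    static final class Checkpoint {
        final Map<Integer, Integer> pages;   // page id -> page content
        final int walOffset;                 // WAL position at checkpoint start
        final boolean endMarkerWritten;      // false if the node died mid-checkpoint

        Checkpoint(Map<Integer, Integer> pages, int walOffset, boolean endMarkerWritten) {
            this.pages = pages;
            this.walOffset = walOffset;
            this.endMarkerWritten = endMarkerWritten;
        }
    }

    /** Restores binary state: last *complete* checkpoint plus the WAL records after it. */
    static Map<Integer, Integer> recover(List<Checkpoint> checkpoints, List<int[]> wal) {
        Checkpoint base = null;
        for (Checkpoint cp : checkpoints)
            if (cp.endMarkerWritten)
                base = cp;                              // keep the latest finished checkpoint
        Map<Integer, Integer> mem = base == null ? new HashMap<>() : new HashMap<>(base.pages);
        for (int i = base == null ? 0 : base.walOffset; i < wal.size(); i++)
            mem.put(wal.get(i)[0], wal.get(i)[1]);      // re-apply page write from WAL
        return mem;
    }
}
```

The half-written checkpoint must be ignored, and the recovered state must
equal what a full WAL replay would produce.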
>>
>>
>> ср, 22 авг. 2018 г. в 15:31, Pavel Kovalenko <jokse...@gmail.com>:
>>
>> > Hello Maxim,
>> >
>> > Thank you for your work. I will review your changes within the next
>> > several days.
>> >
>> > 2018-08-22 11:12 GMT+03:00 Maxim Muzafarov <maxmu...@gmail.com>:
>> >
>> >> Pavel,
>> >>
>> >> Thank you for such a detailed introduction to the root of the problem in
>> >> ticket IGNITE-7196.
>> >> As you mentioned before, we can divide this ticket into two sub-tasks,
>> >> so I will file a new issue for `Logical consistency recovery`. I think
>> >> that is a more convenient way to review and merge each of these
>> >> high-level steps (PRs).
>> >>
>> >> For now, let's focus on `Physical (binary) consistency recovery` as the
>> >> most critical part of this issue and synchronize our vision, and the
>> >> vision of other community members, on the implementation details of
>> >> Step 1. From this point on, everything I say is about binary recovery
>> >> only.
>> >>
>> >>
>> >> I think the PR is almost ready (I need to check TC carefully).
>> >> Can you look at my changes? Am I going in the right direction?
>> >>
>> >> PR: https://github.com/apache/ignite/pull/4520
>> >> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
>> >> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>> >>
>> >>
>> >> These are the milestones I've tried to address:
>> >> 1. On first-time cluster activation, no changes are required. We should
>> >> keep this behavior.
>> >>
>> >> 2. Call `startMemoryPolicies` earlier if needed.
>> >> Currently, on the master branch, data regions start when the onActivate
>> >> method is called. If we are trying to restore the binary state when a
>> >> node reconnects to the cluster, we should start the regions earlier.
>> >>
>> >> 3. Added a `suspendLogging` method to switch off handlers.
>> >> This keeps PDS cleanup fast when the joining node is not in the BLT (or
>> >> belongs to another BLT). Memory restore and metastorage initialization
>> >> require `WALWriter` and `FileWriteHandle` to be started; they can lock
>> >> directories/files and block cleanups.
>> >>
>> >> 4. Added an `isBaselineNode` check before resuming logging.
>> >> Same rationale as point 3. The local node gets discovery data and
>> >> determines #isLocalNodeNotInBaseline. If the node is not in the current
>> >> baseline, it cleans up the local PDS, and we should reset the state of
>> >> the metastorage instance initialized earlier at startup.
>> >>
>> >> 5. Added a `cacheProcessorStarted` callback to restore memory at startup.
>> >> We can restore memory only after we have info about all
>> >> `CacheGroupDescriptor`s. To achieve this, we should perform the restore
>> >> at the moment `GridCacheProcessor` has started.
>> >>
>> >> 6. Moved the `fsync` for `MemoryRecoveryRecord` to node startup.
>> >> Previously we fsync'ed it inside a running PME. I suppose it is used for
>> >> recovery actions and/or for controlling cluster state outside the Ignite
>> >> project (the methods `nodeStart` and `nodeStartedPointers` are not used
>> >> in the Ignite project code). I've moved it to node startup, but maybe we
>> >> should dump the timestamp only at the time of node activation. Not sure.
>> >>
>> >> 7. Added a `testBinaryRecoverBeforeExchange` test to check the restore
>> >> before PME.
>> >> I'm not sure we should repeat the WAL-reading logic from the
>> >> `testApplyDeltaRecords` test for checking the memory restore. I suppose
>> >> the minimally necessary condition is to check the `lastRestored` and
>> >> `checkpointHistory` states of the database manager (they should be set
>> >> before the node joins the cluster).
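To make the ordering in milestones 3 and 4 concrete, here is a minimal
sketch of the intended control flow at startup (method names are simplified
stand-ins, not Ignite's actual API):

```java
import java.util.*;

// Sketch of milestones 3-4: WAL handlers stay suspended while startup code
// touches PDS files, and the baseline check decides between binary recovery
// and cleanup. All names here are illustrative, not Ignite's real API.
class StartupRecoverySketch {
    private final List<String> log = new ArrayList<>();

    void suspendLogging()  { log.add("suspendLogging"); }   // WALWriter/FileWriteHandle not holding locks yet
    void resumeLogging()   { log.add("resumeLogging"); }
    void restoreBinaryMemory() { log.add("restoreBinaryMemory"); }
    void cleanupPdsAndResetMetastorage() { log.add("cleanupPdsAndResetMetastorage"); }

    /** Handlers are off while we touch PDS files, so cleanup is never blocked by file locks. */
    List<String> onNodeStart(boolean isBaselineNode) {
        suspendLogging();
        if (isBaselineNode)
            restoreBinaryMemory();              // replay WAL to a consistent binary state
        else
            cleanupPdsAndResetMetastorage();    // node joins outside the BLT: wipe and re-init
        resumeLogging();
        return log;
    }
}
```

Either branch ends with logging resumed, so the node proceeds to join the
ring in a uniform state.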
>> >>
>> >> I will create a separate ticket and refactor these minor issues after
>> >> we finish with 7196:
>> >> - simplify registerMBeans methods for
>> >> `Ignite\GridCacheDatabaseSharedManager`
>> >> - add @Override annotations where missing
>> >> - fix abbreviations according to IdeaAbbrPlugin
>> >> - createDataRegionConfiguration method can be removed and simplified
>> >> - rename startMemoryPolicies to startDataRegion
>> >> - fix the format of output messages (code style)
>> >> - fix javadoc formatting
>> >> - make inner classes private
>> >> - add missing logging for cleanupCheckpointDirectory and
>> >> cleanupWalDirectories
>> >>
>> >> On Fri, 3 Aug 2018 at 18:20 Pavel Kovalenko <jokse...@gmail.com>
>> wrote:
>> >>
>> >> > Maxim,
>> >> >
>> >> > In addition to my answer, I propose to divide the whole task into 2
>> >> > steps:
>> >> > 1) Physical consistency recovery before PME.
>> >> > 2) Logical consistency recovery before PME.
>> >> >
>> >> > The first step can be done before the discovery manager starts, as it
>> >> > needs no information about caches; it operates only on pages' binary
>> >> > data. In this step, we can do points a) and b) from answer 3) and
>> >> > fsync the state of the page memory.
>> >> > The second step is much harder to implement, as it requires major
>> >> > changes in Discovery SPI. The discovery node-join event is triggered
>> >> > on the coordinator after a node successfully joins the cluster. To
>> >> > avoid that, we should introduce a new node state and fire the
>> >> > discovery event only after logical recovery has finished. I doubt that
>> >> > logical recovery consumes a lot of the exchange time, but physical
>> >> > recovery does, as it performs an fsync on disk after the restore.
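The split can be illustrated by the inputs each step needs (a hypothetical
sketch, not Ignite code): physical recovery is a function of raw page bytes
and WAL deltas only, while logical recovery additionally needs cache
descriptors, which only become available with discovery data.

```java
import java.util.*;
import java.util.function.Function;

// Illustration of the proposed two-step split (all names hypothetical).
// Step 1 takes no cache metadata at all; step 2 needs a descriptor lookup,
// which is why it is the step entangled with Discovery SPI.
class TwoStepRecovery {
    /** Step 1: binary repair of pages; cache-agnostic by construction. */
    static Map<Integer, byte[]> physicalRecovery(Map<Integer, byte[]> pages,
                                                 List<Map.Entry<Integer, byte[]>> deltas) {
        Map<Integer, byte[]> out = new HashMap<>(pages);
        for (Map.Entry<Integer, byte[]> d : deltas)
            out.put(d.getKey(), d.getValue());  // overwrite raw page bytes
        return out;
    }

    /** Step 2: logical replay; routes each DataRecord through a cache descriptor. */
    static Map<String, Integer> logicalRecovery(List<Integer> dataRecordCacheIds,
                                                Function<Integer, String> cacheDescriptorById) {
        Map<String, Integer> updatesPerCache = new HashMap<>();
        for (int cacheId : dataRecordCacheIds)
            updatesPerCache.merge(cacheDescriptorById.apply(cacheId), 1, Integer::sum);
        return updatesPerCache;
    }
}
```

Because step 1 has no cache-metadata parameter at all, nothing prevents it
from running before the discovery manager starts.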
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > 2018-08-03 17:46 GMT+03:00 Pavel Kovalenko <jokse...@gmail.com>:
>> >> >
>> >> > > Hello Maxim,
>> >> > >
>> >> > > 1) Yes, the Discovery Manager starts after GridCacheProcessor,
>> >> > > which starts GridCacheDatabaseSharedManager, which invokes
>> >> > > readMetastorage on start.
>> >> > >
>> >> > > 2) Before we complete the local join future, we create an Exchange
>> >> > > future for the local node join and add it to the ExchangeManager.
>> >> > > So, when the local join future completes, the first PME on that node
>> >> > > should already be running, or at least the first Exchange future
>> >> > > will be in the Exchange worker queue.
>> >> > >
>> >> > > 3) I don't exactly understand what you mean by "update obsolete
>> >> > > partition counter", but the process is the following:
>> >> > >     a) We restore the page memory state (which contains partition
>> >> > > update counters) from the last checkpoint.
>> >> > >     b) We repair the page memory state using physical delta records
>> >> > > from WAL.
>> >> > >     c) We apply logical records since the last checkpoint finish
>> >> > > marker. During this iteration, if we meet a DataRecord, we increment
>> >> > > the update counter on the appropriate partition. NOTE: This phase
>> >> > > currently happens during the exchange.
>> >> > >     After that, partition exchange starts and information about the
>> >> > > actual update counters is collected from the FullMessage. If the
>> >> > > partition counters are outdated, rebalance starts after PME ends.
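The a)-c) sequence can be condensed into a few lines (a toy model with
hypothetical structures; "page memory" is reduced here to just the partition
update counters):

```java
import java.util.*;

// Toy model of phases a)-c): restore counters from the checkpoint snapshot,
// repair them with physical delta records, then replay logical DataRecords,
// each of which bumps the counter of the partition it touches.
class RecoveryPhases {
    static Map<Integer, Integer> recoverCounters(
            Map<Integer, Integer> checkpointCounters,  // a) partition -> counter at last checkpoint
            List<int[]> physicalDeltas,                // b) (partition, repairedCounter) from WAL
            List<Integer> logicalDataRecords) {        // c) partitions touched after the checkpoint
        Map<Integer, Integer> counters = new HashMap<>(checkpointCounters);
        for (int[] d : physicalDeltas)
            counters.put(d[0], d[1]);                  // binary repair overwrites the stored value
        for (int part : logicalDataRecords)
            counters.merge(part, 1, Integer::sum);     // each DataRecord increments the counter
        return counters;
    }
}
```

Only after phase c) are the counters comparable with those collected in the
FullMessage, which is what decides whether a rebalance is needed.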
>> >> > >
>> >> > > 4) I think yes, because all you need is the cache descriptors of
>> >> > > the caches currently started on the grid. Some information can be
>> >> > > retrieved from the static configuration. Information about
>> >> > > dynamically started caches can be retrieved from the Discovery Data
>> >> > > Bag received when a node joins the ring. See
>> >> > > org.apache.ignite.internal.processors.cache.ClusterCachesInfo#
>> >> > > onGridDataReceived
>> >> > >
>> >> > > 5) I think the main problem is that the metastorage and the page
>> >> > > memory share one WAL. But if this phase happens before PME is
>> >> > > started on all cluster nodes, I think it's acceptable that they use
>> >> > > 2 WAL iterations.
>> >> > >
>> >> > > 2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <nizhi...@apache.org>:
>> >> > >
>> >> > >> Hello, Maxim.
>> >> > >>
>> >> > >> > 1) Is it correct that readMetastore() happens after the node
>> >> > >> > starts but before including the node into the ring?
>> >> > >>
>> >> > >> I think yes.
>> >> > >> Some kind of meta information can be required on node join.
>> >> > >>
>> >> > >> > 5) In our final solution, should readMetastore and restoreMemory
>> >> > >> > be performed in one step for a newly joined node?
>> >> > >>
>> >> > >> I think, no.
>> >> > >>
>> >> > >> Meta information can be required to perform the memory restore.
>> >> > >> So we have to restore the meta information as a first step and
>> >> > >> restore the whole memory as a second step.
>> >> > >>
>> >> > >> В Пт, 03/08/2018 в 09:44 +0300, Maxim Muzafarov пишет:
>> >> > >> > Hi Igniters,
>> >> > >> >
>> >> > >> >
>> >> > >> > I'm working on bug [1] and have some questions about the final
>> >> > >> > implementation. Probably, I've already found answers on some of
>> >> > >> > them but I want to be sure. Please, help me to clarify details.
>> >> > >> >
>> >> > >> >
>> >> > >> > The key problem here is that we read the WAL and restore the
>> >> > >> > memory state of a newly joined node inside PME. Reading the WAL
>> >> > >> > can consume a huge amount of time, so the whole cluster gets
>> >> > >> > stuck waiting for a single node.
>> >> > >> >
>> >> > >> >
>> >> > >> > 1) Is it correct that readMetastore() happens after the node
>> >> > >> > starts but before including the node into the ring?
>> >> > >> >
>> >> > >> > 2) After onDone() is called for the LocalJoinFuture on the local
>> >> > >> > node, can we proceed with initiating PME on the local node?
>> >> > >> >
>> >> > >> > 3) After reading the checkpoint and restoring memory for a newly
>> >> > >> > joined node, how and when do we update the obsolete partition
>> >> > >> > update counters? At historical rebalance, right?
>> >> > >> >
>> >> > >> > 4) Should we restoreMemory for a newly joined node before PME
>> >> > >> > initiates on the other nodes in the cluster?
>> >> > >> >
>> >> > >> > 5) In our final solution, should readMetastore and restoreMemory
>> >> > >> > be performed in one step for a newly joined node?
>> >> > >> >
>> >> > >> >
>> >> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-7196
>> >> > >>
>> >> > >
>> >> > >
>> >> >
>> >> --
>> >> Maxim Muzafarov
>> >>
>> >
>> >
>>
> --
> Maxim Muzafarov
>
-- 
Maxim Muzafarov
