Pavel, Dmitry G.,

Thank you for your help and for reviewing this improvement!
Igniters,

I'd like to finalize this topic with the following changes [1]: `Update
Grid(Ignite)CacheDatabaseSharedManager according to accepted coding
guidelines`. I consciously excluded them from the original PR so as not to
overcomplicate the review. Please consider including these changes in the
master branch too:
- createDataRegionConfiguration has an unobvious name – it creates the
metastore data region config
- rename startMemoryPolicies to startDataRegion, or remove it
- fix the code style of the output message formats
- the register/unregister methods for the MBean can be simplified

I've prepared everything required for the review.

[1] https://issues.apache.org/jira/browse/IGNITE-10033
[2] https://reviews.ignite.apache.org/ignite/review/IGNT-CR-927
[3] https://github.com/apache/ignite/pull/5156
[4] https://ci.ignite.apache.org/viewLog.html?buildId=2184711&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll

On Tue, 18 Sep 2018 at 18:02 Maxim Muzafarov <maxmu...@gmail.com> wrote:

> Igniters,
>
> I would like this issue to be part of release 2.7. There is not much time
> left, considering that I will have to address comments after the review.
>
> Will anyone help with the review?
>
> PR: https://github.com/apache/ignite/pull/4520/files
> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>
> On Thu, 13 Sep 2018 at 11:27 Maxim Muzafarov <maxmu...@gmail.com> wrote:
>
>> Igniters,
>>
>> I need your help with the review of this patch. In general, it will help
>> to reduce PME duration. I've moved binary memory recovery to the moment
>> of node startup, as Pavel and I discussed earlier in this topic.
>>
>> Changes are relatively small (+299 −95) and they are ready. I've left a
>> comment in JIRA with the high-level implementation details; I hope this
>> will help with the review [1].
>>
>> Please, take a look.
>> PR: https://github.com/apache/ignite/pull/4520/files
>> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
>> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>> TC Run All: [2]
>>
>> [1] https://issues.apache.org/jira/browse/IGNITE-7196?focusedCommentId=16613175&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16613175
>> [2] https://ci.ignite.apache.org/viewLog.html?buildId=1835822&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_RunAll
>>
>> On Tue, 28 Aug 2018 at 17:59 Pavel Kovalenko <jokse...@gmail.com> wrote:
>>
>>> Hello Maxim,
>>>
>>> I think you're going in the right direction, but why do you perform
>>> binary recovery only when a node is in the baseline?
>>> I think a node can also perform binary recovery before the first
>>> exchange is started in case of:
>>> 1) First activation
>>> 2) Baseline change
>>>
>>> In all of these cases, you need just the locally available
>>> CacheDescriptors to restore the binary state from WAL.
>>> I would also like to see a test where Ignite is stopped in the middle
>>> of a checkpoint and binary recovery succeeds before PME in that case
>>> too.
>>>
>>> On Wed, 22 Aug 2018 at 15:31, Pavel Kovalenko <jokse...@gmail.com> wrote:
>>>
>>> > Hello Maxim,
>>> >
>>> > Thank you for your work. I will review your changes within the next
>>> > several days.
>>> >
>>> > 2018-08-22 11:12 GMT+03:00 Maxim Muzafarov <maxmu...@gmail.com>:
>>> >
>>> >> Pavel,
>>> >>
>>> >> Thank you for such a detailed introduction to the root of the
>>> >> problem of ticket IGNITE-7196.
>>> >> As you mentioned before, we can divide this ticket into two
>>> >> sub-tasks, so I will file a new issue for `Logical consistency
>>> >> recovery`. I think it's a more convenient way of reviewing and
>>> >> merging each of these high-level Steps (PRs).
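[Editor's note] The cases Pavel lists above can be condensed into a simple predicate. The sketch below is purely illustrative; `EarlyRecoveryCheck` and its parameters are made-up names, not Ignite's actual API:

```java
// Hypothetical sketch of the cases discussed above in which a node could
// run binary WAL recovery before the first exchange. All names here are
// illustrative assumptions, not the real Ignite implementation.
public final class EarlyRecoveryCheck {
    /**
     * @param inBaseline      Node is part of the current baseline topology.
     * @param firstActivation Cluster is being activated for the first time.
     * @param baselineChanged The baseline topology is being changed.
     * @return {@code true} if binary recovery may run before the first PME.
     */
    public static boolean canRecoverBeforeExchange(boolean inBaseline,
                                                   boolean firstActivation,
                                                   boolean baselineChanged) {
        // In all three cases only locally available CacheDescriptors are
        // needed to restore the binary state from WAL.
        return inBaseline || firstActivation || baselineChanged;
    }
}
```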
>>> >>
>>> >> Currently, let's focus on `Physical (binary) consistency recovery`
>>> >> as the most critical part of this issue and synchronize our vision,
>>> >> and that of the other community members, on the implementation
>>> >> details of Step 1. From this point on, everything I'm saying is
>>> >> about binary recovery only.
>>> >>
>>> >> I think the PR is almost ready (I need to check TC carefully).
>>> >> Can you look at my changes? Am I going the right way?
>>> >>
>>> >> PR: https://github.com/apache/ignite/pull/4520
>>> >> Upsource: https://reviews.ignite.apache.org/ignite/review/IGNT-CR-727
>>> >> JIRA: https://issues.apache.org/jira/browse/IGNITE-7196
>>> >>
>>> >> These are the milestones I've tried to address:
>>> >> 1. On first-time cluster activation no changes are required. We
>>> >> should keep this behavior.
>>> >>
>>> >> 2. Call `startMemoryPolicies` earlier if needed.
>>> >> In the master branch, data regions currently start when the
>>> >> onActivate method is called. If we are trying to restore the binary
>>> >> state when a node reconnects to the cluster, we should start the
>>> >> regions earlier.
>>> >>
>>> >> 3. `suspendLogging` method added to switch off handlers.
>>> >> This keeps PDS cleanup fast when the joining node is not in the BLT
>>> >> (or belongs to another BLT). Restoring memory and initializing the
>>> >> metastorage require `WALWriter` and `FileWriteHandle` to be started.
>>> >> They can lock directories/files and block cleanups.
>>> >>
>>> >> 4. `isBaselineNode` check before resuming logging.
>>> >> The same as point 3. The local node gets discovery data and
>>> >> evaluates #isLocalNodeNotInBaseline. If the node is not in the
>>> >> current baseline, it cleans up the local PDS, and we should reset
>>> >> the state of the metastorage instance previously initialized at
>>> >> startup.
>>> >>
>>> >> 5. `cacheProcessorStarted` callback added to restore memory at
>>> >> startup.
>>> >> We can restore memory only after we have info about all the
>>> >> `CacheGroupDescriptor`s. To achieve this, we should do the restore
>>> >> at the moment `GridCacheProcessor` has started.
>>> >>
>>> >> 6. Move `fsync` for `MemoryRecoveryRecord` to node startup.
>>> >> We are fsync'ing it inside a running PME. I suppose it is used for
>>> >> recovery actions and/or controlling cluster state outside the Ignite
>>> >> project (the methods `nodeStart` and `nodeStartedPointers` are not
>>> >> used in the Ignite project code). I've moved them to node startup,
>>> >> but maybe we should dump the timestamp only at the time of node
>>> >> activation. Not sure.
>>> >>
>>> >> 7. `testBinaryRecoverBeforeExchange` test added to check the restore
>>> >> before PME.
>>> >> I'm not sure we should repeat the WAL-reading logic as it was done
>>> >> in the `testApplyDeltaRecords` test for checking the memory restore.
>>> >> I suppose the minimally necessary condition is to check the
>>> >> `lastRestored` and `checkpointHistory` states of the database
>>> >> manager (they should happen before the node joins the cluster).
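[Editor's note] For review purposes, the startup ordering implied by milestones 2-5 can be modeled roughly as below. This is a sketch with made-up step names, not the real `GridCacheDatabaseSharedManager` code; the actual method names and ordering are those described in the milestones above:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model (all names are assumptions) of the proposed startup
// sequence: suspend WAL logging, start data regions earlier than
// onActivate(), read the metastorage, wipe the local PDS if the node is
// not in the baseline, restore memory once all cache group descriptors
// are known, then resume logging.
public final class StartupRecoverySketch {
    public static List<String> start(boolean inBaseline) {
        List<String> steps = new ArrayList<>();
        steps.add("suspendLogging");      // keep WAL handlers from locking files
        steps.add("startDataRegions");    // earlier than onActivate()
        steps.add("readMetastorage");
        if (!inBaseline) {
            steps.add("cleanupLocalPds"); // node not in BLT: wipe local PDS
            steps.add("resetMetastorage");// drop state initialized at startup
        }
        else
            steps.add("restoreMemory");   // after GridCacheProcessor started
        steps.add("resumeLogging");
        return steps;
    }
}
```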
>>> >>
>>> >> I will create a separate ticket and refactor these minor issues
>>> >> after we finish with 7196:
>>> >> - simplify the registerMBeans methods for
>>> >> `Ignite\GridCacheDatabaseSharedManager`
>>> >> - add @Override annotations where missed
>>> >> - fix abbreviations according to IdeaAbbrPlugin
>>> >> - the createDataRegionConfiguration method can be removed and
>>> >> simplified
>>> >> - rename startMemoryPolicies to startDataRegion
>>> >> - fix the code style of the output message formats
>>> >> - fix javadoc formatting
>>> >> - make inner classes private
>>> >> - add missed logging for cleanupCheckpointDirectory and
>>> >> cleanupWalDirectories
>>> >>
>>> >> On Fri, 3 Aug 2018 at 18:20 Pavel Kovalenko <jokse...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> > Maxim,
>>> >> >
>>> >> > In addition to my answer, I propose to divide the whole task into
>>> >> > 2 steps:
>>> >> > 1) Physical consistency recovery before PME.
>>> >> > 2) Logical consistency recovery before PME.
>>> >> >
>>> >> > The first step can be done before the discovery manager starts, as
>>> >> > it doesn't need any information about caches; it operates only
>>> >> > with pages' binary data. On this step, we can do points a) and b)
>>> >> > from answer 3) and fsync the state of the page memory.
>>> >> > The second step is much harder to implement, as it requires major
>>> >> > changes in the Discovery SPI. The discovery node-join event is
>>> >> > triggered on the coordinator after a node successfully joins the
>>> >> > cluster. To avoid that, we should introduce a new node state and
>>> >> > fire the discovery event only after logical recovery has finished.
>>> >> > I doubt that logical recovery consumes a lot of exchange time, but
>>> >> > physical recovery does, as it performs fsync on disk after the
>>> >> > restore.
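[Editor's note] The physical/logical split Pavel proposes can be illustrated with a toy model. The record class and method names below are simplified stand-ins, not Ignite's WAL record types: the physical step rewrites raw page bytes from delta records and needs no cache information, while the logical step replays data records and bumps per-partition update counters:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy two-step recovery model (illustrative names, not Ignite code).
public final class TwoStepRecoveryModel {
    /** Simplified WAL record: a physical page delta or a logical data entry. */
    public static final class Rec {
        final boolean physical;
        final int id;       // page id for physical, partition id for logical
        final byte payload;

        public Rec(boolean physical, int id, byte payload) {
            this.physical = physical;
            this.id = id;
            this.payload = payload;
        }
    }

    /** Step 1: repair the page memory state from physical delta records. */
    public static Map<Integer, Byte> applyPhysical(List<Rec> wal, Map<Integer, Byte> pages) {
        for (Rec r : wal)
            if (r.physical)
                pages.put(r.id, r.payload);
        return pages;
    }

    /** Step 2: replay logical records, incrementing partition update counters. */
    public static Map<Integer, Long> applyLogical(List<Rec> wal) {
        Map<Integer, Long> cntrs = new HashMap<>();
        for (Rec r : wal)
            if (!r.physical)
                cntrs.merge(r.id, 1L, Long::sum);
        return cntrs;
    }
}
```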
>>> >> >
>>> >> > 2018-08-03 17:46 GMT+03:00 Pavel Kovalenko <jokse...@gmail.com>:
>>> >> >
>>> >> > > Hello Maxim,
>>> >> > >
>>> >> > > 1) Yes, the Discovery Manager starts after GridCacheProcessor,
>>> >> > > which starts GridCacheDatabaseSharedManager, which invokes
>>> >> > > readMetastorage on start.
>>> >> > >
>>> >> > > 2) Before we complete the local join future, we create and add
>>> >> > > an Exchange future for the local node join to the
>>> >> > > ExchangeManager. So, when the local join future completes, the
>>> >> > > first PME on that node should already be running, or at least
>>> >> > > there will be a first Exchange future in the Exchange worker
>>> >> > > queue.
>>> >> > >
>>> >> > > 3) I don't exactly understand what you mean by "update obsolete
>>> >> > > partition counter", but the process is the following:
>>> >> > > a) We restore the page memory state (which contains partition
>>> >> > > update counters) from the last checkpoint.
>>> >> > > b) We repair the page memory state using physical delta records
>>> >> > > from WAL.
>>> >> > > c) We apply logical records since the last checkpoint-finish
>>> >> > > marker. During this iteration, if we meet a DataRecord, we
>>> >> > > increment the update counter on the appropriate partition.
>>> >> > > NOTE: This phase currently happens during exchange.
>>> >> > > After that, partition exchange starts, and information about the
>>> >> > > actual update counters will be collected from the FullMessage.
>>> >> > > If partition counters are outdated, rebalance will start after
>>> >> > > the PME ends.
>>> >> > >
>>> >> > > 4) I think yes, because everything you need is the cache
>>> >> > > descriptors of the caches currently started on the grid. Some
>>> >> > > information can be retrieved from the static configuration.
>>> >> > > Information about dynamically started caches can be retrieved
>>> >> > > from the Discovery Data Bag received when a node joins the
>>> >> > > ring. See
>>> >> > > org.apache.ignite.internal.processors.cache.ClusterCachesInfo#onGridDataReceived
>>> >> > >
>>> >> > > 5) I think the main problem is that the metastorage and the page
>>> >> > > memory share one WAL. But if this phase happens before PME is
>>> >> > > started on all cluster nodes, I think it's acceptable that they
>>> >> > > will use 2 WAL iterations.
>>> >> > >
>>> >> > > 2018-08-03 10:28 GMT+03:00 Nikolay Izhikov <nizhi...@apache.org>:
>>> >> > >
>>> >> > >> Hello, Maxim.
>>> >> > >>
>>> >> > >> > 1) Is it correct that readMetastore() happens after the node
>>> >> > >> > starts but before including the node into the ring?
>>> >> > >>
>>> >> > >> I think yes.
>>> >> > >> You can have some kind of meta-information required on node
>>> >> > >> join.
>>> >> > >>
>>> >> > >> > 5) In our final solution, should readMetastore and
>>> >> > >> > restoreMemory be performed in one step for a newly joined
>>> >> > >> > node?
>>> >> > >>
>>> >> > >> I think, no.
>>> >> > >>
>>> >> > >> Meta-information can be required to perform the memory restore.
>>> >> > >> So we have to restore the meta-information in the first step
>>> >> > >> and restore the whole memory as the second step.
>>> >> > >>
>>> >> > >> On Fri, 03/08/2018 at 09:44 +0300, Maxim Muzafarov wrote:
>>> >> > >> > Hi Igniters,
>>> >> > >> >
>>> >> > >> > I'm working on bug [1] and have some questions about the
>>> >> > >> > final implementation. Probably, I've already found answers to
>>> >> > >> > some of them, but I want to be sure. Please help me clarify
>>> >> > >> > the details.
>>> >> > >> >
>>> >> > >> > The key problem here is that we are reading WAL and restoring
>>> >> > >> > the memory state of a newly joined node inside PME. Reading
>>> >> > >> > WAL can consume a huge amount of time, so the whole cluster
>>> >> > >> > gets stuck and waits for the single node.
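[Editor's note] The cost described above can be made concrete with a toy model (purely illustrative; the class, methods, and numbers are assumptions): PME is a cluster-wide barrier, so its duration is bounded below by the slowest node, and doing WAL recovery inside PME adds the worst-case recovery time to every node's wait:

```java
// Toy cost model (illustrative, not Ignite code) of recovery inside vs
// outside PME. The barrier waits for the slowest participant.
public final class PmeCostModel {
    /** Exchange time when each node recovers inside PME. */
    public static long exchangeWithRecoveryInside(long[] recoveryMs, long baseMs) {
        long worst = 0;
        for (long r : recoveryMs)
            worst = Math.max(worst, r); // barrier waits for the slowest node
        return baseMs + worst;
    }

    /** Exchange time when recovery happened at startup, before the join. */
    public static long exchangeWithRecoveryAtStartup(long baseMs) {
        return baseMs; // recovery no longer sits on the exchange critical path
    }
}
```

One slow node recovering for 5 seconds stretches the whole exchange to 5+ seconds for everyone, which is exactly what moving recovery to node startup avoids.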
>>> >> > >> >
>>> >> > >> > 1) Is it correct that readMetastore() happens after the node
>>> >> > >> > starts but before including the node into the ring?
>>> >> > >> >
>>> >> > >> > 2) After the onDone() method is called for the
>>> >> > >> > LocalJoinFuture on the local node, can we proceed with
>>> >> > >> > initiating PME on the local node?
>>> >> > >> >
>>> >> > >> > 3) After reading the checkpoint and restoring memory for a
>>> >> > >> > newly joined node, how and when do we update the obsolete
>>> >> > >> > partition update counters? At historical rebalance, right?
>>> >> > >> >
>>> >> > >> > 4) Should we restoreMemory for a newly joined node before PME
>>> >> > >> > initiates on the other nodes in the cluster?
>>> >> > >> >
>>> >> > >> > 5) In our final solution, should readMetastore and
>>> >> > >> > restoreMemory be performed in one step for a newly joined
>>> >> > >> > node?
>>> >> > >> >
>>> >> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-7196
>>> >> > >>
>>> >> > >
>>> >> > >
>>> >> >
>>> >> --
>>> >> --
>>> >> Maxim Muzafarov
>>> >
>>> >
>>>
>> --
>> --
>> Maxim Muzafarov
>>
> --
> --
> Maxim Muzafarov
>
--
--
Maxim Muzafarov