Hi Andre,

I would like to see Ignite support the graceful shutdown you get with
deactivation, but without the need to manually reactivate the cluster
afterwards.

We run a fairly agile process and it is not uncommon to have multiple
deploys to production in a week. This is largely automated (essentially
push-button) and it works well, except for the WAL rescan on startup.

Today there are two approaches we can take for a deployment:

1. Stop the nodes (which is what we currently do), leaving the WAL and
persistent store inconsistent. This requires a rescan of the WAL before the
grid is auto-reactivated on startup. The time to do this increases with the
size of the persistent store - it does not appear to be related to the size
of the WAL.
2. Deactivate the grid, which leaves the WAL and persistent store in a
consistent state. This requires manual re-activation on restart (roughly as
sketched below), but does not incur the increasing WAL recovery cost.

Would an option like the one below be possible?

3. Suspend the grid, which performs the same steps deactivation does to
make the WAL and persistent store consistent, but which leaves the grid
activated so the manual activation process is not required on restart.
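
For reference, the deactivate/reactivate steps we script for option 2 today
look roughly like the sketch below (a minimal illustration using the
Ignite.NET ICluster.SetActive API discussed later in this thread; the helper
names and the surrounding deploy steps are placeholders, not our actual
deployment code):

    // Minimal sketch of option 2, assuming Ignite.NET with native persistence.
    using Apache.Ignite.Core;

    public static class DeployHelper
    {
        // Before stopping the nodes: deactivate so that a checkpoint runs and
        // the WAL and persistent store are left in a consistent state.
        public static void DeactivateBeforeShutdown(IIgnite ignite)
        {
            ignite.GetCluster().SetActive(false);
            // ... stop the node processes as part of the deploy ...
        }

        // After restart: because the cluster was deactivated manually it will
        // not auto-activate when baseline topology is re-established, so we
        // have to activate it ourselves.
        public static void ActivateAfterRestart(IIgnite ignite)
        {
            var cluster = ignite.GetCluster();

            if (!cluster.IsActive())
                cluster.SetActive(true);
        }
    }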

Thanks,
Raymond.


On Thu, Jan 21, 2021 at 4:02 AM andrei <aealexsand...@gmail.com> wrote:

> Hi,
>
> Yes, that was to be expected. The main autoactivation scenario is cluster
> restart. If you are using manual deactivation, you should also manually
> activate your cluster.
>
> BR,
> Andrei
> On 1/20/2021 5:50 AM, Raymond Wilson wrote:
>
> We have been experimenting with using deactivation to shut down the grid, to
> reduce the time the grid takes to start up again.
>
> It appears there is a downside to this: once deactivated, the grid does not
> auto-activate when baseline topology is re-established, which means we will
> need to run through a bootstrapping step of confirming the grid has restarted
> correctly before activating it once again (a rough sketch of this is below).
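>
> A minimal sketch of what that bootstrapping step might look like is below
> (this is an assumption on my part rather than an established recipe; the
> polling interval and the lack of timeout handling are placeholders):
>
>     // Sketch: wait until every baseline node has rejoined, then activate.
>     using System;
>     using System.Collections.Generic;
>     using System.Linq;
>     using System.Threading;
>     using Apache.Ignite.Core;
>
>     public static class ActivationHelper
>     {
>         public static void ActivateWhenBaselineComplete(IIgnite ignite)
>         {
>             var cluster = ignite.GetCluster();
>
>             while (true)
>             {
>                 // Consistent IDs recorded in the baseline topology.
>                 var baseline = new HashSet<object>(
>                     cluster.GetBaselineTopology().Select(n => n.ConsistentId));
>
>                 // Consistent IDs of the server nodes currently online.
>                 var online = new HashSet<object>(
>                     cluster.ForServers().GetNodes().Select(n => n.ConsistentId));
>
>                 if (baseline.IsSubsetOf(online))
>                 {
>                     cluster.SetActive(true);
>                     return;
>                 }
>
>                 Thread.Sleep(TimeSpan.FromSeconds(5));
>             }
>         }
>     }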
>
> The baseline topology documentation at
> https://ignite.apache.org/docs/latest/clustering/baseline-topology does
> not cover this condition.
>
> Is this expected?
>
> Thanks,
> Raymond.
>
>
> On Wed, Jan 13, 2021 at 11:49 PM Pavel Tupitsyn <ptupit...@apache.org>
> wrote:
>
>> Raymond,
>>
>> Please use ICluster.SetActive [1] instead; the API linked above is
>> obsolete.
>>
>>
>> [1]
>> https://ignite.apache.org/releases/latest/dotnetdoc/api/Apache.Ignite.Core.Cluster.ICluster.html?#Apache_Ignite_Core_Cluster_ICluster_SetActive_System_Boolean_
>>
>> On Wed, Jan 13, 2021 at 11:54 AM Raymond Wilson <
>> raymond_wil...@trimble.com> wrote:
>>
>>> Of course. Obvious! :)
>>>
>>> Sent from my iPhone
>>>
>>> On 13/01/2021, at 9:15 PM, Zhenya Stanilovsky <arzamas...@mail.ru>
>>> wrote:
>>>
>>> Is there an API version of the cluster deactivation?
>>>
>>>
>>>
>>> https://github.com/apache/ignite/blob/master/modules/platforms/dotnet/Apache.Ignite.Core.Tests/Cache/PersistentStoreTestObsolete.cs#L131
>>>
>>>
>>> On Wed, Jan 13, 2021 at 8:28 PM Zhenya Stanilovsky <arzamas...@mail.ru>
>>> wrote:
>>>
>>>
>>>
>>>
>>>
>>> Hi Zhenya,
>>>
>>> Thanks for confirming that performing checkpoints more often will help here.
>>>
>>> Hi Raymond!
>>>
>>>
>>> I have now set this configuration up, so I will experiment with the settings
>>> a little.
>>>
>>> On a related note, is there any way to automatically trigger a
>>> checkpoint, for instance as a pre-shutdown activity?
>>>
>>>
>>> If you shut down your cluster gracefully (i.e. with deactivation [1]), a
>>> subsequent start will not trigger WAL reads.
>>>
>>> [1]
>>> https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
>>>
>>>
>>> Checkpoints seem to be much faster than the process of applying WAL
>>> updates.
>>>
>>> Raymond.
>>>
>>> On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky <arzamas...@mail.ru>
>>> wrote:
>>>
>>>
>>>
>>>
>>>
>>>
>>> We have noticed that the startup time for our server nodes has been slowly
>>> increasing as the amount of data stored in the persistent store grows.
>>>
>>> This appears to be closely related to recovery of WAL changes that were
>>> not checkpointed at the time the node was stopped.
>>>
>>> After enabling debug logging we see that the WAL file is scanned and, for
>>> every cache, all partitions in the cache are examined; if there are any
>>> uncommitted changes in the WAL file then the partition is updated (I assume
>>> this requires reading the partition itself as part of this process).
>>>
>>> We now have ~150 GB of data in our persistent store and we see WAL recovery
>>> take between 5 and 10 minutes to complete, during which the node is
>>> unavailable.
>>>
>>> We use fairly large WAL segments (512 MB each) and 10 segments, with WAL
>>> archiving enabled.
>>>
>>> We expect the data in persistent storage to grow to terabytes, and if the
>>> startup time continues to grow with storage size then this makes deploys
>>> and restarts difficult.
>>>
>>> Until now we have been using the default checkpoint interval of 3 minutes,
>>> which may mean we have significant uncheckpointed data in the WAL files. We
>>> are moving to a 1 minute checkpoint interval, but don't yet know if this
>>> will improve startup times (a sketch of the relevant configuration settings
>>> is below). We also use the default 1024 partitions per cache, though some
>>> partitions may be large.
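>>>
>>> For concreteness, here is a minimal sketch of where these knobs live in the
>>> Ignite.NET configuration (the values shown are the ones described above;
>>> everything else is omitted, and this is an illustration rather than our
>>> exact production configuration):
>>>
>>>     // Sketch: persistence with explicit WAL and checkpoint settings.
>>>     using System;
>>>     using Apache.Ignite.Core;
>>>     using Apache.Ignite.Core.Configuration;
>>>
>>>     var cfg = new IgniteConfiguration
>>>     {
>>>         DataStorageConfiguration = new DataStorageConfiguration
>>>         {
>>>             DefaultDataRegionConfiguration = new DataRegionConfiguration
>>>             {
>>>                 Name = "default",
>>>                 PersistenceEnabled = true
>>>             },
>>>             // WAL archiving is left at its defaults (WalArchivePath).
>>>             WalSegmentSize = 512 * 1024 * 1024,            // 512 MB segments
>>>             WalSegments = 10,                              // 10 active segments
>>>             CheckpointFrequency = TimeSpan.FromMinutes(1)  // down from the 3 minute default
>>>         }
>>>     };
>>>
>>>     using (var ignite = Ignition.Start(cfg))
>>>     {
>>>         // ...
>>>     }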
>>>
>>> Can anyone confirm whether this is expected behaviour, and are there any
>>> recommendations for resolving it?
>>>
>>> Will reducing the checkpointing interval help?
>>>
>>>
>>> Yes, it will help. Check
>>> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>>
>>> Is the entire content of a partition read while applying WAL changes?
>>>
>>>
>>> I don't think so; maybe someone else can comment here?
>>>
>>> Does anyone else have this issue?
>>>
>>> Thanks,
>>> Raymond.

-- 
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wil...@trimble.com
