>Hi Zhenya,
> 
>Thanks for confirming that performing checkpoints more often will help here.
Hi Raymond!
> 
>I have established this configuration, so I will experiment with the settings a little.
> 
>On a related note, is there any way to automatically trigger a checkpoint, for 
>instance as a pre-shutdown activity?
 
If you shut down your cluster gracefully, i.e. with deactivation [1], the subsequent 
start will not trigger WAL reading.
 
[1] 
https://www.gridgain.com/docs/latest/administrators-guide/control-script#deactivating-cluster
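 
For example, here is a minimal sketch of deactivating from a thick client before 
stopping the server nodes (the "client-config.xml" path is only a placeholder, not 
something from your setup); the control script in [1] does the same from the 
command line, e.g. control.sh --deactivate:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterState;

public class GracefulStop {
    public static void main(String[] args) {
        // Connect as a client node to the running cluster
        // ("client-config.xml" is only a placeholder).
        Ignite ignite = Ignition.start("client-config.xml");

        // Deactivate before stopping the server nodes, so the next
        // start does not need to replay the WAL. With persistence
        // enabled, deactivation does not remove the data on disk.
        ignite.cluster().state(ClusterState.INACTIVE);

        ignite.close();
    }
}

On older versions, ignite.cluster().active(false) does the same thing.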
 
>Checkpoints seem to be much faster than the process of applying WAL updates.
> 
>Raymond.  
>On Wed, Jan 13, 2021 at 8:07 PM Zhenya Stanilovsky < arzamas...@mail.ru > 
>wrote:
>>
>>
>>
>> 
>>>We have noticed that the startup time for our server nodes has been slowly 
>>>increasing as the amount of data stored in the persistent store grows.
>>> 
>>>This appears to be closely related to recovery of WAL changes that were not 
>>>checkpointed at the time the node was stopped.
>>> 
>>>After enabling debug logging we see that the WAL file is scanned, and for 
>>>every cache, all partitions in the cache are examined, and if there are any 
>>>uncommitted changes in the WAL file then the partition is updated (I assume 
>>>this requires reading of the partition itself as a part of this process).
>>> 
>>>We now have ~150 GB of data in our persistent store and we see WAL recovery 
>>>take between 5 and 10 minutes to complete, during which the node is unavailable.
>>> 
>>>We use fairly large WAL files (512 MB) and 10 segments, with WAL 
>>>archiving enabled.
>>> 
>>>We anticipate that the data in persistent storage will grow to terabytes, and if 
>>>the startup time continues to grow with storage size then this makes deployments 
>>>and restarts difficult.
>>> 
>>>Until now we have been using the default checkpoint frequency of 3 minutes, which 
>>>may mean we have significant un-checkpointed data in the WAL files. We are moving 
>>>to a 1-minute checkpoint interval but don't yet know if this improves startup 
>>>times. We also use the default 1024 partitions per cache, though some 
>>>partitions may be large. 
>>> 
>>>Can anyone confirm whether this is expected behaviour, and recommend how to 
>>>resolve it?
>>> 
>>>Will reducing the checkpointing interval help?
>> 
>>Yes, it will help. Check  
>>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
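 
To make that concrete, here is a minimal sketch of the relevant 
DataStorageConfiguration settings; the 60-second value and the WAL numbers simply 
mirror what is described in this thread, they are not a recommendation:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CheckpointTuning {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration()
            // Checkpoint every 60 s instead of the 180 s default, so less
            // un-checkpointed data accumulates in the WAL between restarts.
            .setCheckpointFrequency(60_000L)
            // 512 MB WAL segments and 10 segments, as described above.
            .setWalSegmentSize(512 * 1024 * 1024)
            .setWalSegments(10);

        // Persistence must be enabled for checkpointing/WAL recovery to apply.
        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storageCfg);

        Ignition.start(cfg);
    }
}
 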
>>>Is the entire content of a partition read while applying WAL changes?
>> 
>>I don't think so; maybe someone else can suggest something here?
>>>Does anyone else have this issue?
>>> 
>>>Thanks,
>>>Raymond.
>>> 
>>>  --
>>>
>>>Raymond Wilson
>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>11 Birmingham Drive |  Christchurch, New Zealand
>>>raymond_wil...@trimble.com
>>> 
>> 
>> 
>> 
>>  
> 
>  --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
> 
 
 
 
 
