>Hi Zhenya,
>
>Thanks for the pointers - I will look into them.
>
>I have been doing some additional reading into this and discovered we are
>using an NFS 4.0 client, which seems to be the first 'no-no'; we will look at
>updating to use the NFS 4.1 client.
>
>We have modified our default timer cadence for checkpointing from 3 minutes to
>1 minute, which seems to be giving us better performance. We will continue to
>measure the impact that has.
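>
>For reference, a rough sketch of that change against the Java
>DataStorageConfiguration API (our actual configuration is done through the C#
>client, which I believe exposes an equivalent property):
>
>    import org.apache.ignite.configuration.DataStorageConfiguration;
>    import org.apache.ignite.configuration.IgniteConfiguration;
>
>    // Sketch only: reduce the checkpoint timer from the 180,000ms default to 60,000ms.
>    DataStorageConfiguration dsCfg = new DataStorageConfiguration();
>    dsCfg.setCheckpointFrequency(60_000L);
>
>    IgniteConfiguration cfg = new IgniteConfiguration();
>    cfg.setDataStorageConfiguration(dsCfg);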
>
>Lastly, I'm planning to merge our two data regions into a single region to
>reduce 'too many dirty pages' checkpoints due to high write activity in a
>small region.
>
>Would using larger page sizes (eg: 16kb) be useful with EFS?
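>
>For reference, my understanding is that it would be configured roughly like
>this (sketch only; I believe the page size cannot be changed for an existing
>persistence store, so it would need to be set before data is first written):
>
>    import org.apache.ignite.configuration.DataStorageConfiguration;
>
>    // Sketch: use 16 KiB pages instead of the 4 KiB default.
>    DataStorageConfiguration dsCfg = new DataStorageConfiguration();
>    dsCfg.setPageSize(16 * 1024);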
Hi, Raymond.
I have no info about it; it would be helpful if you share your research.
Thanks!
>
>Raymond.
>On Tue, Jan 12, 2021 at 8:27 PM Zhenya Stanilovsky < arzamas...@mail.ru >
>wrote:
>>hope it would be helpful too:
>>https://www.jeffgeerling.com/blog/2018/getting-best-performance-out-amazon-efs
>>https://docs.aws.amazon.com/efs/latest/ug/storage-classes.html
>>>
>>>Hi Zhenya,
>>>
>>>The matching checkpoint finished log is this:
>>>
>>>2020-12-15 19:07:39,253 [106] INF [MutableCacheComputeServer] Checkpoint
>>>finished [cpId=e2c31b43-44df-43f1-b162-6b6cefa24e28, pages=33421,
>>>markPos=FileWALPointer [idx=6339, fileOff=243287334, len=196573],
>>>walSegmentsCleared=0, walSegmentsCovered=[], markDuration=218ms,
>>>pagesWrite=1150ms, fsync=37104ms, total=38571ms]
>>>
>>>Regarding your comment that 3/4 of pages in the whole data region need to be dirty
>>>to trigger this, can you confirm this is 3/4 of the maximum size of the data
>>>region, or of the currently used size (eg: if Min is 1Gb, and Max is 4Gb,
>>>and used is 2Gb, would 1.5Gb of dirty pages trigger this?)
>>>
>>>Are data regions independently checkpointed, or are they checkpointed as a
>>>whole, so that a 'too many dirty pages' condition affects all data regions
>>>in terms of write blocking?
>>>
>>>Can you comment on my query regarding whether we should set the Min and Max size of the
>>>data region to be the same? Ie: don't bother with growing the data region
>>>memory use on demand, just allocate the maximum?
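>>>
>>>Ie: something like the following sketch (Java API shown for illustration; the
>>>region name here is just a placeholder):
>>>
>>>    import org.apache.ignite.configuration.DataRegionConfiguration;
>>>
>>>    // Sketch: pre-allocate the region at its maximum size rather than growing on demand.
>>>    DataRegionConfiguration region = new DataRegionConfiguration()
>>>        .setName("ProcessedResults")                 // hypothetical region name
>>>        .setPersistenceEnabled(true)
>>>        .setInitialSize(4L * 1024 * 1024 * 1024)     // 4Gb
>>>        .setMaxSize(4L * 1024 * 1024 * 1024);        // same as initial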
>>>
>>>In terms of the checkpoint lock hold time metric, among the checkpoints citing
>>>'too many dirty pages' there is one instance, apart from the one I have
>>>provided earlier, that violates this limit, ie:
>>>
>>>2020-12-17 18:56:39,086 [104] INF [MutableCacheComputeServer] Checkpoint
>>>started [checkpointId=e9ccf0ca-f813-4f91-ac93-5483350fdf66,
>>>startPtr=FileWALPointer [idx=7164, fileOff=389224517, len=196573],
>>>checkpointBeforeLockTime=276ms, checkpointLockWait=0ms,
>>>checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=39ms,
>>>walCpRecordFsyncDuration=254ms, writeCheckpointEntryDuration=32ms,
>>>splitAndSortCpPagesDuration=276ms, pages=77774, reason=' too many dirty
>>>pages ']
>>>
>>>This is out of a population of 16 instances I can find. The remainder have
>>>lock times of 16-17ms.
>>>
>>>Regarding writes of pages to the persistent store, does the checkpointing
>>>system parallelise writes across partitions to maximise throughput?
>>>
>>>Thanks,
>>>Raymond.
>>>
>>>
>>>On Thu, Dec 31, 2020 at 1:17 AM Zhenya Stanilovsky < arzamas...@mail.ru >
>>>wrote:
>>>>
>>>>All write operations will be blocked for this timeout:
>>>>checkpointLockHoldTime=32ms (write lock holding). If you observe a huge amount
>>>>of such messages with reason='too many dirty pages', maybe you need to
>>>>store some data in non-persistent regions, for example, or reduce indexes (if
>>>>you use them). And please attach the other part of the cp message, starting with:
>>>>Checkpoint finished.
>>>>
>>>>
>>>>
>>>>>In (
>>>>>https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood
>>>>> ), there is a mention of a dirty pages limit that is a factor that can
>>>>>trigger checkpoints.
>>>>>
>>>>>I also found this issue:
>>>>>http://apache-ignite-users.70518.x6.nabble.com/too-many-dirty-pages-td28572.html
>>>>> where "too many dirty pages" is a reason given for initiating a
>>>>>checkpoint.
>>>>>
>>>>>After reviewing our logs I found this: (one example)
>>>>>
>>>>>2020-12-15 19:07:00,999 [106] INF [MutableCacheComputeServer] Checkpoint
>>>>>started [checkpointId=e2c31b43-44df-43f1-b162-6b6cefa24e28,
>>>>>startPtr=FileWALPointer [idx=6339, fileOff=243287334, len=196573],
>>>>>checkpointBeforeLockTime=99ms, checkpointLockWait=0ms,
>>>>>checkpointListenersExecuteTime=16ms, checkpointLockHoldTime=32ms,
>>>>>walCpRecordFsyncDuration=113ms, writeCheckpointEntryDuration=27ms,
>>>>>splitAndSortCpPagesDuration=45ms, pages=33421, reason=' too many dirty
>>>>>pages ']
>>>>>
>>>>>Which suggests we may have the issue where writes are frozen until the
>>>>>checkpoint is completed.
>>>>>
>>>>>Looking at the AI 2.8.1 source code, the dirty page limit fraction appears
>>>>>to be 0.1 (10%), via this entry in GridCacheDatabaseSharedManager.java:
>>>>>
>>>>>    /**
>>>>>     * Threshold to calculate limit for pages list on-heap caches.
>>>>>     * <p>
>>>>>     * Note: When a checkpoint is triggered, we need some amount of page memory to store pages list on-heap cache.
>>>>>     * If a checkpoint is triggered by "too many dirty pages" reason and pages list cache is rather big, we can get
>>>>>     * {@code IgniteOutOfMemoryException}. To prevent this, we can limit the total amount of cached page list buckets,
>>>>>     * assuming that checkpoint will be triggered if no more then 3/4 of pages will be marked as dirty (there will be
>>>>>     * at least 1/4 of clean pages) and each cached page list bucket can be stored to up to 2 pages (this value is not
>>>>>     * static, but depends on PagesCache.MAX_SIZE, so if PagesCache.MAX_SIZE > PagesListNodeIO#getCapacity it can take
>>>>>     * more than 2 pages). Also some amount of page memory needed to store page list metadata.
>>>>>     */
>>>>>    private static final double PAGE_LIST_CACHE_LIMIT_THRESHOLD = 0.1;
>>>>>
>>>>>This raises two questions:
>>>>>
>>>>>1. The data region where most writes are occurring has 4Gb allocated to
>>>>>it, though it is permitted to start at a much lower level. 4Gb should be
>>>>>1,000,000 pages, 10% of which should be 100,000 dirty pages.
>>>>>
>>>>>The 'limit holder' is calculated like this:
>>>>>
>>>>>    /**
>>>>>     * @return Holder for page list cache limit for given data region.
>>>>>     */
>>>>>    public AtomicLong pageListCacheLimitHolder(DataRegion dataRegion) {
>>>>>        if (dataRegion.config().isPersistenceEnabled()) {
>>>>>            return pageListCacheLimits.computeIfAbsent(dataRegion.config().getName(), name -> new AtomicLong(
>>>>>                (long)(((PageMemoryEx)dataRegion.pageMemory()).totalPages() * PAGE_LIST_CACHE_LIMIT_THRESHOLD)));
>>>>>        }
>>>>>
>>>>>        return null;
>>>>>    }
>>>>>
>>>>>... but I am unsure if totalPages() is referring to the current size of
>>>>>the data region, or the size it is permitted to grow to. ie: Could the
>>>>>'dirty page limit' be a sliding limit based on the growth of the data
>>>>>region? Is it better to set the initial and maximum sizes of data regions
>>>>>to be the same number?
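>>>>>
>>>>>(To put rough numbers on this: if totalPages() reflects the 4Gb maximum, the
>>>>>computed limit would be about 1,048,576 * 0.1 ≈ 104,857 pages, whereas if it
>>>>>reflects a currently-allocated 2Gb it would be about 524,288 * 0.1 ≈ 52,428
>>>>>pages, ie: a limit that slides upward as the region grows.)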
>>>>>
>>>>>2. We have two data regions, one supporting inbound arrival of data (with
>>>>>low numbers of writes), and one supporting storage of processed results
>>>>>from the arriving data (with many more writes).
>>>>>
>>>>>The block on writes due to the number of dirty pages appears to affect all
>>>>>data regions, not just the one which has violated the dirty page limit. Is
>>>>>that correct? If so, is this something that can be improved?
>>>>>
>>>>>Thanks,
>>>>>Raymond.
>>>>>
>>>>>On Wed, Dec 30, 2020 at 9:17 PM Raymond Wilson <
>>>>>raymond_wil...@trimble.com > wrote:
>>>>>>I'm working on automatically dumping the JVM thread stacks if we
>>>>>>detect long delays in put (PutIfAbsent) operations. Hopefully this will
>>>>>>provide more information.
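>>>>>>
>>>>>>Roughly along these lines on the JVM side (a sketch only; the helper name and
>>>>>>threshold are illustrative, and our actual client-side detection is in C#, so
>>>>>>it will look different):
>>>>>>
>>>>>>    import java.lang.management.ManagementFactory;
>>>>>>    import java.lang.management.ThreadInfo;
>>>>>>    import org.apache.ignite.IgniteCache;
>>>>>>
>>>>>>    // Sketch: time a put and dump all JVM thread stacks if it takes too long.
>>>>>>    static <K, V> void timedPutIfAbsent(IgniteCache<K, V> cache, K key, V val) {
>>>>>>        long start = System.nanoTime();
>>>>>>        cache.putIfAbsent(key, val);
>>>>>>        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
>>>>>>
>>>>>>        if (elapsedMs > 5_000) { // illustrative threshold: 5 seconds
>>>>>>            for (ThreadInfo ti : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true))
>>>>>>                System.err.print(ti);
>>>>>>        }
>>>>>>    }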
>>>>>>On Wed, Dec 30, 2020 at 7:48 PM Zhenya Stanilovsky < arzamas...@mail.ru >
>>>>>>wrote:
>>>>>>>
>>>>>>>Don't think so, checkpointing worked perfectly well before this
>>>>>>>fix too.
>>>>>>>We need additional info to start digging into your problem; can you share
>>>>>>>Ignite logs somewhere?
>>>>>>>
>>>>>>>>I noticed an entry in the Ignite 2.9.1 changelog:
>>>>>>>>* Improved checkpoint concurrent behaviour
>>>>>>>>I am having trouble finding the relevant Jira ticket for this in the
>>>>>>>>2.9.1 Jira area at
>>>>>>>>https://issues.apache.org/jira/browse/IGNITE-13876?jql=project%20%3D%20IGNITE%20AND%20fixVersion%20%3D%202.9.1%20and%20status%20%3D%20Resolved
>>>>>>>>
>>>>>>>>Perhaps this change may improve the checkpointing issue we are seeing?
>>>>>>>>
>>>>>>>>Raymond.
>>>>>>>>
>>>>>>>>On Tue, Dec 29, 2020 at 8:35 PM Raymond Wilson <
>>>>>>>>raymond_wil...@trimble.com > wrote:
>>>>>>>>>Hi Zhenya,
>>>>>>>>>
>>>>>>>>>1. We currently use AWS EFS for primary storage, with provisioned IOPS
>>>>>>>>>to provide sufficient IO. Our Ignite cluster currently tops out at
>>>>>>>>>~10% usage (with at least 5 nodes writing to it, including WAL and WAL
>>>>>>>>>archive), so we are not saturating the EFS interface. We use the
>>>>>>>>>default page size (experiments with larger page sizes showed
>>>>>>>>>instability when checkpointing due to free page starvation, so we
>>>>>>>>>reverted to the default size).
>>>>>>>>>
>>>>>>>>>2. Thanks for the detail, we will look for that in thread dumps when
>>>>>>>>>we can create them.
>>>>>>>>>
>>>>>>>>>3. We are using the default CP buffer size, which is max(256Mb,
>>>>>>>>>DataRegionSize / 4) according to the Ignite documentation, so this
>>>>>>>>>should have more than enough checkpoint buffer space to cope with
>>>>>>>>>writes. As additional information, the cache which is displaying very
>>>>>>>>>slow writes is in a data region with relatively slow write traffic.
>>>>>>>>>There is a primary (default) data region with large write traffic, and
>>>>>>>>>the vast majority of pages being written in a checkpoint will be for
>>>>>>>>>that default data region.
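>>>>>>>>>
>>>>>>>>>If we ever did need to size the checkpoint buffer explicitly, my
>>>>>>>>>understanding is that it is a per-region setting, eg (Java API sketch, with a
>>>>>>>>>placeholder region name):
>>>>>>>>>
>>>>>>>>>    import org.apache.ignite.configuration.DataRegionConfiguration;
>>>>>>>>>
>>>>>>>>>    // Sketch: explicitly size the checkpoint page buffer for one data region.
>>>>>>>>>    DataRegionConfiguration region = new DataRegionConfiguration()
>>>>>>>>>        .setName("Default-Region")                          // hypothetical name
>>>>>>>>>        .setPersistenceEnabled(true)
>>>>>>>>>        .setMaxSize(4L * 1024 * 1024 * 1024)                // 4Gb
>>>>>>>>>        .setCheckpointPageBufferSize(1024L * 1024 * 1024);  // 1Gb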
>>>>>>>>>
>>>>>>>>>4. Yes, this is very surprising. Anecdotally from our logs it appears
>>>>>>>>>write traffic into the low write traffic cache is blocked during
>>>>>>>>>checkpoints.
>>>>>>>>>
>>>>>>>>>Thanks,
>>>>>>>>>Raymond.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>On Tue, Dec 29, 2020 at 7:31 PM Zhenya Stanilovsky <
>>>>>>>>>arzamas...@mail.ru > wrote:
>>>>>>>>>>* Additionally to Ilya's reply, you can check the vendor's page for
>>>>>>>>>>additional info; everything on that page is applicable to Ignite too [1].
>>>>>>>>>>Increasing the number of threads leads to concurrent io usage, thus if you
>>>>>>>>>>have something like nvme it's up to you, but in the case of sas it would
>>>>>>>>>>possibly be better to reduce this param.
>>>>>>>>>>* The log will show you something like:
>>>>>>>>>>Parking thread=%Thread name% for timeout(ms)= %time% and the appropriate:
>>>>>>>>>>Unparking thread=
>>>>>>>>>>* No additional logging of cp buffer usage is provided. The cp buffer
>>>>>>>>>>needs to be more than 10% of the overall persistent DataRegions size.
>>>>>>>>>>* 90 seconds or longer seems like a problem in io or system
>>>>>>>>>>tuning; it's a very bad score, I'm afraid.
>>>>>>>>>>[1]
>>>>>>>>>>https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>Hi,
>>>>>>>>>>>
>>>>>>>>>>>We have been investigating some issues which appear to be related to
>>>>>>>>>>>checkpointing. We currently use Ignite 2.8.1 with the C# client.
>>>>>>>>>>>
>>>>>>>>>>>I have been trying to gain clarity on how certain aspects of the
>>>>>>>>>>>Ignite configuration relate to the checkpointing process:
>>>>>>>>>>>
>>>>>>>>>>>1. Number of checkpointing threads. This defaults to 4, but I don't
>>>>>>>>>>>understand how it applies to the checkpointing process. Are more
>>>>>>>>>>>threads generally better (eg: because it makes the disk IO parallel
>>>>>>>>>>>across the threads), or does it only have a positive effect if you
>>>>>>>>>>>have many data storage regions? Or something else? If this could be
>>>>>>>>>>>clarified in the documentation (or a pointer to it which Google has
>>>>>>>>>>>not yet found), that would be good.
>>>>>>>>>>>
>>>>>>>>>>>2. Checkpoint frequency. This is defaulted to 180 seconds. I was
>>>>>>>>>>>thinking that reducing this time would result in smaller, less
>>>>>>>>>>>disruptive checkpoints. Setting it to 60 seconds seems pretty safe,
>>>>>>>>>>>but is there a practical lower limit that should be used for use
>>>>>>>>>>>cases with new data constantly being added, eg: 5 seconds, 10
>>>>>>>>>>>seconds?
>>>>>>>>>>>
>>>>>>>>>>>3. Write exclusivity constraints during checkpointing. I understand
>>>>>>>>>>>that while a checkpoint is occurring ongoing writes will be
>>>>>>>>>>>supported into the caches being checkpointed, and if those are
>>>>>>>>>>>writes to existing pages then those will be duplicated into the
>>>>>>>>>>>checkpoint buffer. If this buffer becomes full or stressed then
>>>>>>>>>>>Ignite will throttle, and perhaps block, writes until the checkpoint
>>>>>>>>>>>is complete. If this is the case then Ignite will emit logging
>>>>>>>>>>>(warning or informational?) that writes are being throttled.
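>>>>>>>>>>>
>>>>>>>>>>>For reference on points 1 and 3 above, a sketch of where those settings live
>>>>>>>>>>>in the Java DataStorageConfiguration (I assume the C# client exposes
>>>>>>>>>>>equivalent properties):
>>>>>>>>>>>
>>>>>>>>>>>    import org.apache.ignite.configuration.DataStorageConfiguration;
>>>>>>>>>>>
>>>>>>>>>>>    // Sketch: checkpoint thread count and write throttling referenced above.
>>>>>>>>>>>    DataStorageConfiguration dsCfg = new DataStorageConfiguration()
>>>>>>>>>>>        .setCheckpointThreads(4)           // the default; dirty pages are written in parallel across these threads
>>>>>>>>>>>        .setWriteThrottlingEnabled(true);  // throttle writes gradually rather than stalling when limits are hit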
>>>>>>>>>>>
>>>>>>>>>>>We have cases where simple puts to caches (a few requests per
>>>>>>>>>>>second) are taking up to 90 seconds to execute when there is an
>>>>>>>>>>>active check point occurring, where the check point has been
>>>>>>>>>>>triggered by the checkpoint timer. When a checkpoint is not
>>>>>>>>>>>occurring the time to do this is usually in the milliseconds. The
>>>>>>>>>>>checkpoints themselves can take 90 seconds or longer, and are
>>>>>>>>>>>updating up to 30,000-40,000 pages, across a pair of data storage
>>>>>>>>>>>regions, one with 4Gb in-memory space allocated (which should be
>>>>>>>>>>>1,000,000 pages at the standard 4kb page size), and one small region
>>>>>>>>>>>with 128Mb. There is no 'throttling' logging being emitted that we
>>>>>>>>>>>can tell, so the checkpoint buffer (which should be 1Gb for the
>>>>>>>>>>>first data region and 256 Mb for the second smaller region in this
>>>>>>>>>>>case) does not look like it can fill up during the checkpoint.
>>>>>>>>>>>
>>>>>>>>>>>It seems like the checkpoint is affecting the put operations, but I
>>>>>>>>>>>don't understand why that may be given the documented checkpointing
>>>>>>>>>>>process, and the checkpoint itself (at least via Informational
>>>>>>>>>>>logging) is not advertising any restrictions.
>>>>>>>>>>>
>>>>>>>>>>>Thanks,
>>>>>>>>>>>Raymond.
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>>Raymond Wilson
>>>>>>>>>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>>Raymond Wilson
>>>>>>>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>>>>>>>11 Birmingham Drive | Christchurch, New Zealand
>>>>>>>>>+64-21-2013317 Mobile
>>>>>>>>>raymond_wil...@trimble.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>>Raymond Wilson
>>>>>>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>>>>>>11 Birmingham Drive | Christchurch, New Zealand
>>>>>>>>+64-21-2013317 Mobile
>>>>>>>>raymond_wil...@trimble.com
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>Raymond Wilson
>>>>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>>>>11 Birmingham Drive | Christchurch, New Zealand
>>>>>>+64-21-2013317 Mobile
>>>>>>raymond_wil...@trimble.com
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>Raymond Wilson
>>>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>>>11 Birmingham Drive | Christchurch, New Zealand
>>>>>+64-21-2013317 Mobile
>>>>>raymond_wil...@trimble.com
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>>
>>>Raymond Wilson
>>>Solution Architect, Civil Construction Software Systems (CCSS)
>>>11 Birmingham Drive | Christchurch, New Zealand
>>>+64-21-2013317 Mobile
>>>raymond_wil...@trimble.com
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
> --
>
>Raymond Wilson
>Solution Architect, Civil Construction Software Systems (CCSS)
>11 Birmingham Drive | Christchurch, New Zealand
>raymond_wil...@trimble.com
>
>