I'm working on adding automatic JVM thread stack dumps when we detect long delays in put (PutIfAbsent) operations. Hopefully this will provide more information.
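As a sketch of what we have in mind (the class name, the 5-second threshold, and the wrapping approach are all illustrative, not an existing Ignite API), something like this could wrap the put path and dump all thread stacks when an operation runs long:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.function.Supplier;

public class SlowPutWatchdog {
    // Assumed delay threshold before we consider a put "slow".
    private static final long THRESHOLD_MS = 5_000;

    // Wrap a cache operation; dump all stacks if it exceeds the threshold.
    public static <T> T timed(Supplier<T> op) {
        long start = System.nanoTime();
        T result = op.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (elapsedMs > THRESHOLD_MS)
            dumpAllStacks();
        return result;
    }

    static void dumpAllStacks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // ThreadInfo.toString() includes the stack trace, though it
        // truncates deep stacks; StackTraceElement[] from getStackTrace()
        // can be printed instead for full depth.
        for (ThreadInfo ti : mx.dumpAllThreads(true, true))
            System.err.print(ti);
    }
}
```

The idea is that a dump captured mid-checkpoint would show which lock or checkpoint phase the put is parked behind.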
On Wed, Dec 30, 2020 at 7:48 PM Zhenya Stanilovsky <arzamas...@mail.ru> wrote:

> Don't think so; checkpointing worked perfectly well before this fix. We
> need additional info to start digging into your problem. Can you share
> Ignite logs somewhere?
>
> I noticed an entry in the Ignite 2.9.1 changelog:
>
> - Improved checkpoint concurrent behaviour
>
> I am having trouble finding the relevant Jira ticket for this in the 2.9.1
> Jira area at
> https://issues.apache.org/jira/browse/IGNITE-13876?jql=project%20%3D%20IGNITE%20AND%20fixVersion%20%3D%202.9.1%20and%20status%20%3D%20Resolved
>
> Perhaps this change may improve the checkpointing issue we are seeing?
>
> Raymond.
>
> On Tue, Dec 29, 2020 at 8:35 PM Raymond Wilson <raymond_wil...@trimble.com> wrote:
>
> Hi Zhenya,
>
> 1. We currently use AWS EFS for primary storage, with provisioned IOPS to
> provide sufficient IO. Our Ignite cluster currently tops out at ~10% usage
> (with at least 5 nodes writing to it, including WAL and WAL archive), so we
> are not saturating the EFS interface. We use the default page size
> (experiments with larger page sizes showed instability when checkpointing
> due to free page starvation, so we reverted to the default size).
>
> 2. Thanks for the detail; we will look for that in the thread dumps when
> we can create them.
>
> 3. We are using the default CP buffer size, which is max(256 MB,
> DataRegionSize / 4) according to the Ignite documentation, so this should
> give more than enough checkpoint buffer space to cope with writes. As
> additional information, the cache that is displaying very slow writes is
> in a data region with relatively low write traffic. There is a primary
> (default) data region with heavy write traffic, and the vast majority of
> pages written in a checkpoint will be for that default data region.
>
> 4. Yes, this is very surprising.
> Anecdotally, from our logs it appears write traffic into the low-traffic
> cache is blocked during checkpoints.
>
> Thanks,
> Raymond.
>
> On Tue, Dec 29, 2020 at 7:31 PM Zhenya Stanilovsky <arzamas...@mail.ru> wrote:
>
> 1. Additionally to Ilya's reply, you can check the vendor's page for
> additional info; everything on that page is applicable to Ignite too [1].
> Increasing the thread count leads to concurrent IO usage, so if you have
> something like NVMe it's up to you, but in the case of SAS it may be
> better to reduce this param.
>
> 2. The log will show you something like:
>
> Parking thread=%Thread name% for timeout(ms)=%time%
>
> and the corresponding:
>
> Unparking thread=
>
> 3. No additional logging of CP buffer usage is provided. The CP buffer
> needs to be more than 10% of the overall persistent DataRegions size.
>
> 4. 90 seconds or longer: seems like a problem in IO or system tuning;
> it's a very bad score, I think.
>
> [1]
> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning
>
> Hi,
>
> We have been investigating some issues which appear to be related to
> checkpointing. We currently use Ignite 2.8.1 with the C# client.
>
> I have been trying to gain clarity on how certain aspects of the Ignite
> configuration relate to the checkpointing process:
>
> 1. Number of checkpointing threads. This defaults to 4, but I don't
> understand how it applies to the checkpointing process. Are more threads
> generally better (eg: because it makes the disk IO parallel across the
> threads), or does it only have a positive effect if you have many data
> storage regions? Or something else? If this could be clarified in the
> documentation (or a pointer to it which Google has not yet found), that
> would be good.
>
> 2. Checkpoint frequency. This defaults to 180 seconds. I was thinking
> that reducing this time would result in smaller, less disruptive
> checkpoints.
> Setting it to 60 seconds seems pretty safe, but is there a practical
> lower limit that should be used for use cases with new data constantly
> being added, eg: 5 seconds, 10 seconds?
>
> 3. Write exclusivity constraints during checkpointing. I understand that
> while a checkpoint is occurring, ongoing writes will be supported into
> the caches being checkpointed, and if those are writes to existing pages
> then those will be duplicated into the checkpoint buffer. If this buffer
> becomes full or stressed then Ignite will throttle, and perhaps block,
> writes until the checkpoint is complete. If this is the case then Ignite
> will emit logging (warning or informational?) that writes are being
> throttled.
>
> We have cases where simple puts to caches (a few requests per second) are
> taking up to 90 seconds to execute when there is an active checkpoint
> occurring, where the checkpoint has been triggered by the checkpoint
> timer. When a checkpoint is not occurring, the time to do this is usually
> in the milliseconds. The checkpoints themselves can take 90 seconds or
> longer, and are updating up to 30,000-40,000 pages, across a pair of data
> storage regions: one with 4 GB of in-memory space allocated (which should
> be 1,000,000 pages at the standard 4 KB page size), and one small region
> with 128 MB. There is no 'throttling' logging being emitted that we can
> tell, so the checkpoint buffer (which should be 1 GB for the first data
> region and 256 MB for the second, smaller region in this case) does not
> look like it can fill up during the checkpoint.
>
> It seems like the checkpoint is affecting the put operations, but I don't
> understand why that may be, given the documented checkpointing process,
> and the checkpoint itself (at least via informational logging) is not
> advertising any restrictions.
>
> Thanks,
> Raymond.
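The figures quoted above can be sanity-checked with back-of-envelope arithmetic: 30,000-40,000 dirty pages at the default 4 KB page size is only about 120-160 MB of checkpoint writes, far below a 1 GB checkpoint buffer. A minimal sketch, taking the upper page estimate from the thread and the 1 GB buffer size derived from the max(256 MB, DataRegionSize / 4) formula quoted above:

```java
public class CheckpointMath {
    static final long PAGE_SIZE = 4 * 1024;            // default 4 KB page size, bytes
    static final long DIRTY_PAGES = 40_000;            // upper estimate from the thread
    static final long CP_BUFFER = 1024L * 1024 * 1024; // ~1 GB buffer for the 4 GB region

    // Total bytes a checkpoint of DIRTY_PAGES pages must write.
    static long checkpointBytes() {
        return DIRTY_PAGES * PAGE_SIZE;
    }

    public static void main(String[] args) {
        long mb = checkpointBytes() / (1024 * 1024);
        System.out.println("Checkpoint write volume ~" + mb + " MB");
        System.out.println("Fits in buffer: " + (checkpointBytes() < CP_BUFFER));
    }
}
```

This supports the observation in the thread: the dirty-page volume alone should not fill the checkpoint buffer, which is consistent with no throttling warnings appearing in the logs.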
--
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
+64-21-2013317 Mobile
raymond_wil...@trimble.com