I'm working on adding automatic JVM thread stack dumps when we detect long delays in put (PutIfAbsent) operations. Hopefully this will provide more information.
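As a sketch of what we have in mind (the class name, the 5-second threshold, and the wrapping approach are all illustrative, not an existing Ignite API), something like this could wrap the put path and dump all thread stacks when an operation runs long:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.function.Supplier;

public class SlowPutWatchdog {
    // Assumed delay threshold before we consider a put "slow".
    private static final long THRESHOLD_MS = 5_000;

    // Wrap a cache operation; dump all stacks if it exceeds the threshold.
    public static <T> T timed(Supplier<T> op) {
        long start = System.nanoTime();
        T result = op.get();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        if (elapsedMs > THRESHOLD_MS)
            dumpAllStacks();
        return result;
    }

    static void dumpAllStacks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // ThreadInfo.toString() includes the stack trace, though it
        // truncates deep stacks; StackTraceElement[] from getStackTrace()
        // can be printed instead for full depth.
        for (ThreadInfo ti : mx.dumpAllThreads(true, true))
            System.err.print(ti);
    }
}
```

The idea is that a dump captured mid-checkpoint would show which lock or checkpoint phase the put is parked behind.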
On Wed, Dec 30, 2020 at 7:48 PM Zhenya Stanilovsky <arzamas...@mail.ru> wrote:

> Don't think so; checkpointing worked perfectly well before this fix. We
> need additional info to start digging into your problem. Can you share
> Ignite logs somewhere?
>
> I noticed an entry in the Ignite 2.9.1 changelog:
>
> - Improved checkpoint concurrent behaviour
>
> I am having trouble finding the relevant Jira ticket for this in the 2.9.1
> Jira area at
> https://issues.apache.org/jira/browse/IGNITE-13876?jql=project%20%3D%20IGNITE%20AND%20fixVersion%20%3D%202.9.1%20and%20status%20%3D%20Resolved
>
> Perhaps this change may improve the checkpointing issue we are seeing?
>
> Raymond.
>
> On Tue, Dec 29, 2020 at 8:35 PM Raymond Wilson <raymond_wil...@trimble.com> wrote:
>
> Hi Zhenya,
>
> 1. We currently use AWS EFS for primary storage, with provisioned IOPS to
> provide sufficient IO. Our Ignite cluster currently tops out at ~10% usage
> (with at least 5 nodes writing to it, including WAL and WAL archive), so we
> are not saturating the EFS interface. We use the default page size
> (experiments with larger page sizes showed instability when checkpointing
> due to free page starvation, so we reverted to the default size).
>
> 2. Thanks for the detail; we will look for that in the thread dumps when
> we can create them.
>
> 3. We are using the default CP buffer size, which is max(256 MB,
> DataRegionSize / 4) according to the Ignite documentation, so this should
> give more than enough checkpoint buffer space to cope with writes. As
> additional information, the cache that is displaying very slow writes is
> in a data region with relatively low write traffic. There is a primary
> (default) data region with heavy write traffic, and the vast majority of
> pages written in a checkpoint will be for that default data region.
>
> 4. Yes, this is very surprising.
> Anecdotally, from our logs it appears write traffic into the low-traffic
> cache is blocked during checkpoints.
>
> Thanks,
> Raymond.
>
> On Tue, Dec 29, 2020 at 7:31 PM Zhenya Stanilovsky <arzamas...@mail.ru> wrote:
>
> 1. Additionally to Ilya's reply, you can check the vendor's page for
> additional info; everything on that page is applicable to Ignite too [1].
> Increasing the thread count leads to concurrent IO usage, so if you have
> something like NVMe it's up to you, but in the case of SAS it may be
> better to reduce this param.
>
> 2. The log will show you something like:
>
> Parking thread=%Thread name% for timeout(ms)=%time%
>
> and the corresponding:
>
> Unparking thread=
>
> 3. No additional logging of CP buffer usage is provided. The CP buffer
> needs to be more than 10% of the overall persistent DataRegions size.
>
> 4. 90 seconds or longer: seems like a problem in IO or system tuning;
> it's a very bad score, I think.
>
> [1]
> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning
>
> Hi,
>
> We have been investigating some issues which appear to be related to
> checkpointing. We currently use Ignite 2.8.1 with the C# client.
>
> I have been trying to gain clarity on how certain aspects of the Ignite
> configuration relate to the checkpointing process:
>
> 1. Number of checkpointing threads. This defaults to 4, but I don't
> understand how it applies to the checkpointing process. Are more threads
> generally better (eg: because it makes the disk IO parallel across the
> threads), or does it only have a positive effect if you have many data
> storage regions? Or something else? If this could be clarified in the
> documentation (or a pointer to it which Google has not yet found), that
> would be good.
>
> 2. Checkpoint frequency. This defaults to 180 seconds. I was thinking
> that reducing this time would result in smaller, less disruptive
> checkpoints.
> Setting it to 60 seconds seems pretty safe, but is there a practical
> lower limit that should be used for use cases with new data constantly
> being added, eg: 5 seconds, 10 seconds?
>
> 3. Write exclusivity constraints during checkpointing. I understand that
> while a checkpoint is occurring, ongoing writes will be supported into
> the caches being checkpointed, and if those are writes to existing pages
> then those will be duplicated into the checkpoint buffer. If this buffer
> becomes full or stressed then Ignite will throttle, and perhaps block,
> writes until the checkpoint is complete. If this is the case then Ignite
> will emit logging (warning or informational?) that writes are being
> throttled.
>
> We have cases where simple puts to caches (a few requests per second) are
> taking up to 90 seconds to execute when there is an active checkpoint
> occurring, where the checkpoint has been triggered by the checkpoint
> timer. When a checkpoint is not occurring, the time to do this is usually
> in the milliseconds. The checkpoints themselves can take 90 seconds or
> longer, and are updating up to 30,000-40,000 pages, across a pair of data
> storage regions: one with 4 GB of in-memory space allocated (which should
> be 1,000,000 pages at the standard 4 KB page size), and one small region
> with 128 MB. There is no 'throttling' logging being emitted that we can
> tell, so the checkpoint buffer (which should be 1 GB for the first data
> region and 256 MB for the second, smaller region in this case) does not
> look like it can fill up during the checkpoint.
>
> It seems like the checkpoint is affecting the put operations, but I don't
> understand why that may be, given the documented checkpointing process,
> and the checkpoint itself (at least via informational logging) is not
> advertising any restrictions.
>
> Thanks,
> Raymond.
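The figures quoted above can be sanity-checked with back-of-envelope arithmetic: 30,000-40,000 dirty pages at the default 4 KB page size is only about 120-160 MB of checkpoint writes, far below a 1 GB checkpoint buffer. A minimal sketch, taking the upper page estimate from the thread and the 1 GB buffer size derived from the max(256 MB, DataRegionSize / 4) formula quoted above:

```java
public class CheckpointMath {
    static final long PAGE_SIZE = 4 * 1024;            // default 4 KB page size, bytes
    static final long DIRTY_PAGES = 40_000;            // upper estimate from the thread
    static final long CP_BUFFER = 1024L * 1024 * 1024; // ~1 GB buffer for the 4 GB region

    // Total bytes a checkpoint of DIRTY_PAGES pages must write.
    static long checkpointBytes() {
        return DIRTY_PAGES * PAGE_SIZE;
    }

    public static void main(String[] args) {
        long mb = checkpointBytes() / (1024 * 1024);
        System.out.println("Checkpoint write volume ~" + mb + " MB");
        System.out.println("Fits in buffer: " + (checkpointBytes() < CP_BUFFER));
    }
}
```

This supports the observation in the thread: the dirty-page volume alone should not fill the checkpoint buffer, which is consistent with no throttling warnings appearing in the logs.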
--
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
+64-21-2013317 Mobile
raymond_wil...@trimble.com