Hi Zhenya,

1. We currently use AWS EFS for primary storage, with provisioned IOPS to
provide sufficient IO. Our Ignite cluster currently tops out at roughly 10%
of that provisioned IO capacity (with at least five nodes writing to it,
including WAL and WAL archive), so we are not saturating the EFS interface.
We use the default page size; experiments with larger page sizes showed
instability during checkpointing due to free page starvation, so we
reverted to the default size.
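
For concreteness, here is a minimal sketch of the storage-related part of
our configuration as set through the C# client. The paths, region name and
sizes are illustrative assumptions rather than our exact production values;
the point is simply that the page size is left at the 4kb default and the
storage, WAL and WAL archive all sit on the EFS mount:

    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Configuration;

    // Sketch only: paths, names and sizes below are illustrative, not our
    // exact production values.
    var cfg = new IgniteConfiguration
    {
        DataStorageConfiguration = new DataStorageConfiguration
        {
            // Default 4kb page size; larger pages caused free page
            // starvation during checkpoints in our tests.
            PageSize = 4096,

            // Storage, WAL and WAL archive all live on the EFS mount.
            StoragePath = "/mnt/efs/ignite/storage",
            WalPath = "/mnt/efs/ignite/wal",
            WalArchivePath = "/mnt/efs/ignite/wal-archive",

            DefaultDataRegionConfiguration = new DataRegionConfiguration
            {
                Name = "Default",
                PersistenceEnabled = true,
                MaxSize = 4L * 1024 * 1024 * 1024 // 4Gb
            }
        }
    };

(The node is then started with Ignition.Start(cfg).)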

2. Thanks for the detail, we will look for that in thread dumps when we can
create them.

3. We are using the default CP buffer size, which is max(256Mb,
DataRegionSize / 4) according to the Ignite documentation, so there should
be more than enough checkpoint buffer space to cope with writes. For
additional context, the cache that is showing very slow writes is in a
data region with relatively little write traffic. There is a primary
(default) data region with heavy write traffic, and the vast majority of
pages written during a checkpoint will belong to that default data region.
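
If it would remove any doubt about the buffer sizing, we could also pin the
checkpoint page buffer explicitly per region rather than relying on the
default formula. A minimal sketch, assuming the two regions described in my
original mail below (a 4Gb default region and a small 128Mb region); the
region names here are hypothetical:

    using Apache.Ignite.Core.Configuration;

    // Sketch: explicitly sizing the checkpoint page buffer per data region
    // instead of relying on the default formula. Region names are
    // hypothetical; sizes match the regions described in the original mail.
    var storageCfg = new DataStorageConfiguration
    {
        DefaultDataRegionConfiguration = new DataRegionConfiguration
        {
            Name = "Default",
            PersistenceEnabled = true,
            MaxSize = 4L * 1024 * 1024 * 1024,                  // 4Gb region
            CheckpointPageBufferSize = 1L * 1024 * 1024 * 1024  // pin at 1Gb
        },
        DataRegionConfigurations = new[]
        {
            new DataRegionConfiguration
            {
                Name = "LowTrafficRegion",                      // hypothetical
                PersistenceEnabled = true,
                MaxSize = 128L * 1024 * 1024,                   // 128Mb region
                CheckpointPageBufferSize = 256L * 1024 * 1024   // pin at 256Mb
            }
        }
    };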

4. Yes, this is very surprising. Anecdotally, our logs suggest that write
traffic into the low-traffic cache is blocked while a checkpoint is in
progress.

Thanks,
Raymond.



On Tue, Dec 29, 2020 at 7:31 PM Zhenya Stanilovsky <arzamas...@mail.ru>
wrote:

>
>    1. In addition to Ilya's reply, you can check the vendor's page for
>    additional info; everything on that page applies to Ignite too [1].
>    Increasing the thread count increases concurrent IO usage, so if you
>    have something like NVMe it is up to you, but with SAS it is probably
>    better to reduce this parameter.
>    2. The log will show you something like:
>
>    Parking thread=%Thread name% for timeout(ms)= %time%
>
>    and the corresponding:
>
>    Unparking thread=
>
>    3. No additional logging of checkpoint buffer usage is provided. The
>    checkpoint buffer needs to be more than 10% of the overall persistent
>    DataRegion size.
>    4. 90 seconds or longer: that looks like an IO or system tuning
>    problem; it is a very poor result.
>
> [1]
> https://www.gridgain.com/docs/latest/perf-troubleshooting-guide/persistence-tuning
>
>
>
>
>
> Hi,
>
> We have been investigating some issues which appear to be related to
> checkpointing. We are currently using Apache Ignite 2.8.1 with the C#
> client.
>
> I have been trying to gain clarity on how certain aspects of the Ignite
> configuration relate to the checkpointing process:
>
> 1. Number of checkpointing threads. This defaults to 4, but I don't
> understand how it applies to the checkpointing process. Are more threads
> generally better (e.g. because the disk IO is parallelised across the
> threads), or do they only have a positive effect if you have many data
> storage regions? Or something else? If this could be clarified in the
> documentation (or a pointer to it which Google has not yet found), that
> would be good.
>
> 2. Checkpoint frequency. This defaults to 180 seconds. I was thinking
> that reducing this time would result in smaller, less disruptive
> checkpoints. Setting it to 60 seconds seems pretty safe, but is there a
> practical lower limit for use cases where new data is constantly being
> added, e.g. 5 or 10 seconds?
>
> 3. Write exclusivity constraints during checkpointing. I understand that
> while a checkpoint is occurring, ongoing writes to the caches being
> checkpointed are still supported, and if those are writes to existing
> pages then those pages will be copied into the checkpoint buffer. If this
> buffer becomes full or stressed then Ignite will throttle, and perhaps
> block, writes until the checkpoint is complete. If that is the case then
> Ignite will emit logging (warning or informational?) that writes are
> being throttled.
>
> We have cases where simple puts to caches (a few requests per second) are
> taking up to 90 seconds to execute while a checkpoint triggered by the
> checkpoint timer is active. When a checkpoint is not occurring, the same
> puts usually complete in milliseconds. The checkpoints themselves can take
> 90 seconds or longer and update up to 30,000-40,000 pages across a pair of
> data storage regions, one with 4Gb of in-memory space allocated (which
> should be roughly 1,000,000 pages at the standard 4kb page size), and one
> small region with 128Mb. There is no 'throttling' logging being emitted
> that we can tell, so the checkpoint buffer (which should be 1Gb for the
> first data region and 256Mb for the second, smaller region in this case)
> does not appear to be filling up during the checkpoint.
>
> It seems like the checkpoint is affecting the put operations, but I don't
> understand why that would be, given the documented checkpointing process,
> and the checkpoint itself (at least via informational logging) is not
> reporting any restrictions.
>
> Thanks,
> Raymond.
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)


-- 
<http://www.trimble.com/>
Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
+64-21-2013317 Mobile
raymond_wil...@trimble.com

