Raymond,

When a checkpoint is triggered, some free page slots must be available in
offheap memory to save metadata (for example, free-list metadata, partition
counter gaps, etc.). The number of required pages depends on the number of
caches, the number of partitions, the workload, and the number of CPUs. In
the worst case you may need up to 256 * caches * partitions * CPUs pages
just to store free-list bucket metadata. This number can't be calculated
statically, so the exact amount can't be reserved in advance. Currently,
1/4 of offheap memory is reserved for this purpose (a checkpoint is
triggered when the number of dirty pages reaches 3/4 of the total number
of pages), but sometimes that is not enough.
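
As a rough illustration of how large that worst case can get, here is a
tiny C# calculation (plain arithmetic only, not an Ignite API; the cache
and partition counts are made-up example values):

    using System;

    // Worst-case free-list metadata pages per the estimate above.
    // The cache and partition counts are hypothetical example values.
    int caches = 4;
    int partitionsPerCache = 1024;           // a common affinity setting
    int cpus = Environment.ProcessorCount;   // say 8

    long worstCasePages = 256L * caches * partitionsPerCache * cpus;
    Console.WriteLine(worstCasePages);       // 8388608 if cpus == 8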

In your case, a 64 MB data region is allocated. The page size is 16 KB, so
you have a total of about 4000 pages (the real page size in offheap is a
little bigger than the configured page size). The checkpoint is triggered
by the "too many dirty pages" event, so 3/4 of the pages are already dirty
and only about 1000 pages are left to store metadata, which is too few.
With a 4 KB page size the number of clean pages is about 4000, so your
reproducer can pass under some circumstances.
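
A quick back-of-the-envelope check of that page budget in C# (approximate;
the small per-page overhead is ignored here):

    using System;

    const long regionSize = 64L * 1024 * 1024;   // 64 MB data region

    long total16k = regionSize / (16 * 1024);    // ~4096 pages
    long clean16k = total16k / 4;                // ~1024 pages free for metadata
    long total4k  = regionSize / (4 * 1024);     // ~16384 pages
    long clean4k  = total4k / 4;                 // ~4096 pages free for metadata

    Console.WriteLine($"16 KB pages: {total16k} total, ~{clean16k} clean at checkpoint");
    Console.WriteLine($" 4 KB pages: {total4k} total, ~{clean4k} clean at checkpoint");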

Increase the data region size to solve the problem.
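
With the .NET client that means raising DataRegionConfiguration.MaxSize
for the affected region. A minimal sketch, assuming a placeholder region
name and placeholder sizes (adjust both to your actual setup):

    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Configuration;

    var cfg = new IgniteConfiguration
    {
        DataStorageConfiguration = new DataStorageConfiguration
        {
            PageSize = 16 * 1024,  // keep 16 KB pages if you prefer them
            DataRegionConfigurations = new[]
            {
                new DataRegionConfiguration
                {
                    Name = "MyDataRegion",             // placeholder: use your region's name
                    PersistenceEnabled = true,
                    InitialSize = 256L * 1024 * 1024,  // placeholder sizes
                    MaxSize = 1024L * 1024 * 1024
                }
            }
        }
    };

    using (var ignite = Ignition.Start(cfg))
    {
        // start caches, run the workload, etc.
    }

Keeping the default 4 KB page size also leaves more pages for checkpoint
metadata at the same region size, as you observed, but enlarging the
region is the more reliable fix.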


On Tue, Jun 16, 2020 at 05:39, Raymond Wilson <raymond_wil...@trimble.com> wrote:

> I have spent some more time on the reproducer. It is now very simple and
> reliably reproduces the issue with a simple loop adding slowly growing
> entries into a cache with no continuous query or filters. I have attached
> the source files and the log I obtain when running it.
>
> Running from a clean slate (no existing persistent data) this reproducer
> exhibits the out of memory error when adding an element 4150 bytes in size.
>
> I did find this SO article (
> https://stackoverflow.com/questions/55937768/ignite-report-igniteoutofmemoryexception-out-of-memory-in-data-region)
> that describes the same problem. The solution offered was to increase the
> empty page pool size so it is larger than the biggest element being added.
> The empty page pool size should always be bigger than the largest element
> added in the reproducer up to the point of failure, where 4150 bytes is the
> largest size being added. I tried increasing it to 200, but it made no
> difference.
>
> The reproducer is using a page size of 16384 bytes.
>
> If I set the page size to the default 4096 bytes, this reproducer does not
> show the error up to the 19999 byte size limit the reproducer tests.
> If I set the page size to 8192 bytes, the reproducer reliably fails
> with the error at the item of 6941 bytes.
>
> This feels like a bug in handling non-default page sizes. Would you
> recommend switching from 16384 bytes to 4096 for our page size? The reason
> I opted for the larger size is that we may have elements ranging in size
> from hundreds of bytes to 100 KB, and sometimes larger.
>
> Thanks,
> Raymond.
>
>
> On Thu, Jun 11, 2020 at 4:25 PM Raymond Wilson <raymond_wil...@trimble.com>
> wrote:
>
>> Just a correction to the context of the data region running out of memory:
>> This one does not have a queue of items or a continuous query operating on
>> a cache within it.
>>
>> Thanks,
>> Raymond.
>>
>> On Thu, Jun 11, 2020 at 4:12 PM Raymond Wilson <
>> raymond_wil...@trimble.com> wrote:
>>
>>> Pavel,
>>>
>>> I have run into a different instance of an out of memory error in a data
>>> region, in a different context from the one I wrote the reproducer for. In
>>> this case, there is an activity which queues items for processing at a
>>> point in the future and which does use a continuous query; however, there is
>>> also significant vanilla put/get activity against a range of other caches.
>>>
>>> This data region was permitted to grow to 1 GB and has persistence
>>> enabled. We are now using Ignite 2.8.
>>>
>>> I would like to understand if this is a possible failure mode given that
>>> the data region has persistence enabled. The underlying cause appears to be
>>> 'Unable to find a page for eviction'. Should this be expected on data
>>> regions with persistence?
>>>
>>> I have included the error below.
>>>
>>> This is the initial error reported by Ignite:
>>>
>>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will
>>> be halted immediately due to the failure: [failureCtx=FailureContext
>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>>> failedToPrepare=5417]
>>> Out of memory in data region [name=Default-Immutable, initSize=128.0
>>> MiB, maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>>   ^-- Increase maximum off-heap memory size
>>> (DataRegionConfiguration.maxSize)
>>>   ^-- Enable Ignite persistence
>>> (DataRegionConfiguration.persistenceEnabled)
>>>   ^-- Enable eviction or expiration policies]]
>>>
>>> Following this error is a lock dump, where this is the only thread with
>>> a lock: (I am assuming the structureId member with the value
>>> 'Spatial-SubGridSegment-Mutable-602' refers to a remote actor holding a
>>> lock against an item in the local node.)
>>>
>>> Thread=[name=sys-stripe-11-#12%TRex-Immutable%, id=26], state=RUNNABLE
>>> Locked pages = [284060547022916[0001025a00000044](r=0|w=1)]
>>> Locked pages log: name=sys-stripe-11-#12%TRex-Immutable%
>>> time=(1591836815071, 2020-06-11 12:53:35.071)
>>> L=1 -> Write lock pageId=284060547022916,
>>> structureId=Spatial-SubGridSegment-Mutable-602 [pageIdHex=0001025a00000044,
>>> partId=602, pageIdx=68, flags=00000001]
>>>
>>> Following the lock dump is this final error before the Ignite node stops:
>>>
>>> 2020-06-11 12:53:35,082 [98] ERR [ImmutableCacheComputeServer] JVM will
>>> be halted immediately due to the failure: [failureCtx=FailureContext
>>> [type=CRITICAL_ERROR, err=class o.a.i.i.mem.IgniteOutOfMemoryException:
>>> Failed to find a page for eviction [segmentCapacity=13612, loaded=5417,
>>> maxDirtyPages=4063, dirtyPages=5417, cpPages=0, pinnedInSegment=0,
>>> failedToPrepare=5417]
>>> Out of memory in data region [name=Default-Immutable, initSize=128.0
>>> MiB, maxSize=1.0 GiB, persistenceEnabled=true] Try the following:
>>>   ^-- Increase maximum off-heap memory size
>>> (DataRegionConfiguration.maxSize)
>>>   ^-- Enable Ignite persistence
>>> (DataRegionConfiguration.persistenceEnabled)
>>>   ^-- Enable eviction or expiration policies]]
>>>
>>>
>>>
>>>
>>> On Wed, May 13, 2020 at 2:15 AM Raymond Wilson <
>>> raymond_wil...@trimble.com> wrote:
>>>
>>>> Hi Pavel,
>>>>
>>>> The reproducer is not the actual use case, which is too big to use;
>>>> it's a small example using the same mechanisms. I have not used a data
>>>> streamer before; I'll read up on it.
>>>>
>>>> I'll try running the reproducer again against 2.8 (I used 2.7.6 for the
>>>> reproducer).
>>>>
>>>> Thanks,
>>>> Raymond.
>>>>
>>>>
>>>> On Tue, May 12, 2020 at 11:18 PM Pavel Tupitsyn <ptupit...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Raymond,
>>>>>
>>>>> First, I could not reproduce the issue. The attached program runs to
>>>>> completion on my machine.
>>>>>
>>>>> Second, I see a few issues with the attached code:
>>>>> - Cache.PutIfAbsent is used instead of DataStreamer
>>>>> - ICacheEntryEventFilter is used to remove cache entries, and is
>>>>> called twice - on add and on remove
>>>>>
>>>>> My recommendation is to use a "classic" combination of Data Streamer,
>>>>> Continuous Query, and Expiry Policy.
>>>>> Set expiry policy to a few seconds, and you won't keep much data in
>>>>> memory. Ignite will handle the removal for you.
>>>>> Let me know if I should prepare an example.
>>>>>
>>>>> Also it is not clear why persistence is needed for such a "buffer"
>>>>> cache - items are removed almost immediately;
>>>>> it would be much more efficient to disable persistence.
>>>>>
>>>>> Thanks,
>>>>> Pavel
>>>>>
>>>>> On Tue, May 12, 2020 at 12:23 PM Raymond Wilson <
>>>>> raymond_wil...@trimble.com> wrote:
>>>>>
>>>>>> Well, it appears I was wrong. It reappeared. :(
>>>>>>
>>>>>> I thought I had sent a reply to this thread but cannot find it, so I
>>>>>> am resending it now.
>>>>>>
>>>>>> Attached is a C# reproducer that throws Ignite out of memory errors
>>>>>> in the situation I outlined above, where cache operations run against a
>>>>>> small cache with persistence enabled.
>>>>>>
>>>>>> Let me know if you're able to reproduce it on your local systems.
>>>>>>
>>>>>> Thanks,
>>>>>> Raymond.
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 3, 2020 at 1:31 PM Raymond Wilson <
>>>>>> raymond_wil...@trimble.com> wrote:
>>>>>>
>>>>>>> It's possible this is user (me) error.
>>>>>>>
>>>>>>> I discovered I had set the cache size to be 64Mb in the server, but
>>>>>>> 65Mb (typo!) in the client. Making these two values consistent appeared 
>>>>>>> to
>>>>>>> prevent the error.
>>>>>>>
>>>>>>> Raymond.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 3, 2020 at 12:58 PM Raymond Wilson <
>>>>>>> raymond_wil...@trimble.com> wrote:
>>>>>>>
>>>>>>>> I'm using Ignite v2.7.5 with C# client.
>>>>>>>>
>>>>>>>> I have an error where Ignite throws an out of memory exception,
>>>>>>>> like this:
>>>>>>>>
>>>>>>>> 2020-03-03 12:02:58,036 [287] ERR [MutableCacheComputeServer] JVM
>>>>>>>> will be halted immediately due to the failure: 
>>>>>>>> [failureCtx=FailureContext
>>>>>>>> [type=CRITICAL_ERROR, err=class 
>>>>>>>> o.a.i.i.mem.IgniteOutOfMemoryException: Out
>>>>>>>> of memory in data region [name=TAGFileBufferQueue, initSize=64.0 MiB,
>>>>>>>> maxSize=64.0 MiB, persistenceEnabled=true] Try the following:
>>>>>>>>   ^-- Increase maximum off-heap memory size
>>>>>>>> (DataRegionConfiguration.maxSize)
>>>>>>>>   ^-- Enable Ignite persistence
>>>>>>>> (DataRegionConfiguration.persistenceEnabled)
>>>>>>>>   ^-- Enable eviction or expiration policies]]
>>>>>>>>
>>>>>>>> I don't have an eviction policy set (is this even a valid
>>>>>>>> recommendation when using persistence?)
>>>>>>>>
>>>>>>>> Increasing the off heap memory size for the data region does
>>>>>>>> prevent this error, but I want to minimise the in-memory size for this
>>>>>>>> buffer as it is essentially just a queue.
>>>>>>>>
>>>>>>>> The suggestion of enabling data persistence is strange as this data
>>>>>>>> region already has persistence enabled.
>>>>>>>>
>>>>>>>> My assumption is that Ignite manages the memory in this cache by
>>>>>>>> saving and loading values as required.
>>>>>>>>
>>>>>>>> The test workflow in this failure is one where ~14,500 objects
>>>>>>>> totalling ~440 MB in size (average object size = ~30 KB) are added to the
>>>>>>>> cache, and are then drained by a processor using a continuous query.
>>>>>>>> Elements are removed from the cache as the processor completes them.
>>>>>>>>
>>>>>>>> Is this kind of out of memory error supposed to be possible when
>>>>>>>> using persistent data regions?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Raymond.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Solution Architect, Civil Construction Software Systems (CCSS)
> 11 Birmingham Drive | Christchurch, New Zealand
> +64-21-2013317 Mobile
> raymond_wil...@trimble.com
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>
