[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Korotkov updated IGNITE-24992:
-
Description:
In-memory cluster.
The RANDOM_2_LRU or RANDOM_LRU eviction policy is applied.
A put of large objects that occupy several pages can loop indefinitely in
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a
page to evict.
The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with
at least one touch". For large (fragmented) objects only the last page is
touched (see the {{PageEvictionTracker.touchPage()}} call in the
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects
exist, the data region has a very small fraction of "touched" pages
eligible for eviction. It appears that 5000 random attempts are not enough
to get 5 candidate pages to evict, so
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.
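To illustrate why random sampling can come up empty, here is a minimal sketch of the selection step as a toy model. It is not the Ignite implementation: the class {{SamplingModel}}, its method names, and the figure of 2000 pages per object are assumptions made for illustration. Only the sample size of 5, the limit of 5000 attempts, and the "last page only" touch rule come from the description above.

```java
import java.util.Random;

/** Toy model of sampling-based candidate selection; hypothetical, not Ignite code. */
class SamplingModel {
    /** Candidate pages needed for one eviction decision (from the description). */
    static final int SAMPLE_SIZE = 5;

    /** Random attempts allowed before giving up (from the description). */
    static final int SPIN_LIMIT = 5000;

    /**
     * Draws random page slots until SAMPLE_SIZE touched pages are found.
     * Returns the number of draws used, or -1 if SPIN_LIMIT draws were not enough.
     */
    static int drawsToFillSample(boolean[] touched, Random rnd) {
        int found = 0;

        for (int draws = 1; draws <= SPIN_LIMIT; draws++) {
            if (touched[rnd.nextInt(touched.length)] && ++found == SAMPLE_SIZE)
                return draws;
        }

        return -1;
    }

    /** Builds a region that holds only large objects: the last page of each object is touched. */
    static boolean[] regionOfLargeObjects(int objects, int pagesPerObject) {
        boolean[] touched = new boolean[objects * pagesPerObject];

        for (int i = 0; i < objects; i++)
            touched[i * pagesPerObject + pagesPerObject - 1] = true;

        return touched;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);

        // 12 objects as in the reproducer; 2000 pages per object is an assumed size.
        boolean[] region = regionOfLargeObjects(12, 2000);

        int runs = 1000;
        int failures = 0;

        for (int i = 0; i < runs; i++) {
            if (drawsToFillSample(region, rnd) < 0)
                failures++;
        }

        System.out.println("Selections that found no 5 candidates: " + failures + " of " + runs);
    }
}
```

In this model the share of failed selections depends only on the fraction of touched slots, so raising {{pagesPerObject}} makes failures more frequent.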
The system striped pool can starve for a long time: up to 14 hours in one real
production environment, until the nodes were manually restarted.
A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java].
It hangs after the 12th put:
{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
Page-based evictions started. Consider increasing 'maxSize' on Data Region
configuration: default
[2025-04-02T16:34:23,110][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [workerName=sys-stripe-7,
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%,
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
Possible failure suppressed accordingly to a configured handler
[hnd=NoOpFailureHandler [super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]]]
org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]
at
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
~[classes/:?]
at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
~[cl
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Korotkov updated IGNITE-24992:
-
Description:
In-memory cluster.
The RANDOM_2_LRU or RANDOM_LRU eviction policy is applied.
A put of large objects that occupy several pages can loop indefinitely in
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a
page to evict.
The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with
at least one touch". For large (fragmented) objects only the last page is
touched (see the {{PageEvictionTracker.touchPage()}} call in the
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects
exist, the data region has a very small fraction of "touched" pages
eligible for eviction. It appears that 5000 random attempts are not enough
to get 5 candidate pages to evict, so
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.
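The chance of such a failure can also be estimated in closed form. The sketch below assumes that each random attempt independently hits a touched page with probability p, which is a simplification and not a statement about the actual tracker code. Under that assumption the number of hits in 5000 attempts is binomial, and the selection fails when fewer than 5 hits occur. The class and method names are hypothetical.

```java
/** Binomial estimate of a failed candidate selection; hypothetical helper, not Ignite code. */
class SampleFailureProbability {
    /**
     * Probability of fewer than k successes in n independent trials,
     * each succeeding with probability p.
     */
    static double probFewerThan(int k, int n, double p) {
        if (p <= 0.0)
            return k > 0 ? 1.0 : 0.0;

        if (p >= 1.0)
            return n < k ? 1.0 : 0.0;

        // term starts as P(0 successes) and is advanced to P(i + 1 successes)
        // through the ratio of consecutive binomial terms.
        double term = Math.pow(1.0 - p, n);
        double sum = 0.0;

        for (int i = 0; i < k; i++) {
            sum += term;
            term *= (double) (n - i) / (i + 1) * p / (1.0 - p);
        }

        return sum;
    }

    public static void main(String[] args) {
        // Touched-page fractions chosen for illustration only.
        double[] fractions = {0.01, 0.002, 0.001, 0.0005, 0.0001};

        for (double p : fractions) {
            System.out.printf("touched fraction %.4f -> P(fewer than 5 hits in 5000) = %.4f%n",
                p, probFewerThan(5, 5000, p));
        }
    }
}
```

The estimate moves quickly towards 1 once the expected number of hits, 5000 * p, drops to a few units, which is the regime the description attributes to regions that hold only large objects.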
The system striped pool can starve for a long time: up to 14 hours in one real
production environment, until the nodes were manually restarted.
A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java].
It hangs after the 12th put:
{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
Page-based evictions started. Consider increasing 'maxSize' on Data Region
configuration: default
[2025-04-02T16:34:23,110][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [workerName=sys-stripe-7,
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%,
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
Possible failure suppressed accordingly to a configured handler
[hnd=NoOpFailureHandler [super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]]]
org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]
at
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
~[classes/:?]
at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(G
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Korotkov updated IGNITE-24992:
-
Description:
In-memory cluster.
The RANDOM_2_LRU eviction policy is applied.
A put of large objects that occupy several pages can loop indefinitely in
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a
page to evict.
The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with
at least one touch". For large (fragmented) objects only the last page is
touched (see the {{PageEvictionTracker.touchPage()}} call in the
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects
exist, the data region has a very small fraction of "touched" pages
eligible for eviction. It appears that 5000 random attempts are not enough
to get 5 candidate pages to evict, so
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.
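A rough back-of-the-envelope relation between object size and the touched fraction may help when judging whether a workload is exposed. The sketch below is an idealized model: it assumes a region that holds only objects of one size, ignores page headers and other per-page overhead, and takes the usable bytes per page as a parameter. The class {{TouchedFraction}} and the sizes in {{main}} are hypothetical.

```java
/** Idealized relation between object size and touched-page fraction; not Ignite code. */
class TouchedFraction {
    /** Pages needed for one object, rounding up (per-page overhead is ignored). */
    static long pagesPerObject(long objectBytes, int pagePayloadBytes) {
        return (objectBytes + pagePayloadBytes - 1) / pagePayloadBytes;
    }

    /** Fraction of touched pages when only the last page of each object is touched. */
    static double touchedFraction(long objectBytes, int pagePayloadBytes) {
        return 1.0 / pagesPerObject(objectBytes, pagePayloadBytes);
    }

    /** Average number of random draws needed to collect sampleSize touched pages. */
    static double expectedDraws(int sampleSize, double fraction) {
        return sampleSize / fraction;
    }

    public static void main(String[] args) {
        int pagePayload = 4096; // assumed usable bytes per page

        // Object sizes chosen for illustration only.
        long[] sizes = {64L * 1024, 1024L * 1024, 8L * 1024 * 1024};

        for (long size : sizes) {
            double f = touchedFraction(size, pagePayload);

            System.out.printf("object %d bytes -> %d pages, touched fraction %.6f, expected draws for 5 candidates %.0f%n",
                size, pagesPerObject(size, pagePayload), f, expectedDraws(5, f));
        }
    }
}
```

When the expected number of draws approaches or exceeds 5000, this model suggests the selection will fail often.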
A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java].
It hangs after the 12th put.
***
The system striped pool can starve for a long time (up to 14 hours in one real
production environment, until the nodes were manually restarted), with the following
errors logged:
{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
Page-based evictions started. Consider increasing 'maxSize' on Data Region
configuration: default
[2025-04-02T16:34:23,110][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [workerName=sys-stripe-7,
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%,
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
Possible failure suppressed accordingly to a configured handler
[hnd=NoOpFailureHandler [super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]]]
org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]
at
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
~[classes/:?]
at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(Grid
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Korotkov updated IGNITE-24992:
-
Description:
In-memory cluster.
The RANDOM_2_LRU eviction policy is applied.
A put of large objects that occupy several pages can loop indefinitely in
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a
page to evict.
The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with
at least one touch". For large (fragmented) objects only the last page is
touched (see the {{PageEvictionTracker.touchPage()}} call in the
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects
exist, the data region has a very small fraction of "touched" pages
eligible for eviction. It appears that 5000 random attempts are not enough
to get 5 candidate pages to evict, so
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.
A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java].
It hangs after the 12th put.
***
The system striped pool can starve for a long time (up to 14 hours in one real
production environment, until the nodes were manually restarted), with the following
errors logged:
{noformat}
[2025-04-02T16:34:23,108][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
Page-based evictions started. Consider increasing 'maxSize' on Data Region
configuration: default
[2025-04-02T16:34:23,110][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [workerName=sys-stripe-7,
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%,
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
Possible failure suppressed accordingly to a configured handler
[hnd=NoOpFailureHandler [super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]]]
org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]
at
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
~[classes/:?]
at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
~[classes/:?]
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Korotkov updated IGNITE-24992:
-
Description:
In-memory cluster.
The RANDOM_2_LRU eviction policy is applied.
A put of large objects that occupy several pages can loop indefinitely in
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a
page to evict.
The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with
at least one touch". For large (fragmented) objects only the last page is
touched (see the {{PageEvictionTracker.touchPage()}} call in the
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects
exist, the data region has a very small fraction of "touched" pages
eligible for eviction. It appears that 5000 random attempts are not enough
to get 5 candidate pages to evict, so
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.
The system striped pool can starve for a long time: up to 14 hours in one real
production environment, until the nodes were manually restarted.
A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java].
It hangs after the 12th put:
{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
Page-based evictions started. Consider increasing 'maxSize' on Data Region
configuration: default
[2025-04-02T16:34:23,110][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [workerName=sys-stripe-7,
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%,
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
Possible failure suppressed accordingly to a configured handler
[hnd=NoOpFailureHandler [super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]]]
org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]
at
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
~[classes/:?]
at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
~[classes/:?]
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Korotkov updated IGNITE-24992:
-
Description:
In-memory cluster.
The RANDOM_2_LRU eviction policy is applied.
A put of large objects that occupy several pages can loop indefinitely in
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a
page to evict.
The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with
at least one touch". For large (fragmented) objects only the last page is
touched (see the {{PageEvictionTracker.touchPage()}} call in the
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects
exist, the data region has a very small fraction of "touched" pages
eligible for eviction. It appears that 5000 random attempts are not enough
to get 5 candidate pages to evict, so
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.
A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java].
It hangs after the 12th put.
***
The system striped pool can starve for a long time (up to 14 hours in one real
production environment, until the nodes were manually restarted), with the following
errors logged:
{noformat}
[2025-04-02T16:34:23,108][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
Page-based evictions started. Consider increasing 'maxSize' on Data Region
configuration: default
[2025-04-02T16:34:23,110][WARN
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [workerName=sys-stripe-7,
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%,
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
Possible failure suppressed accordingly to a configured handler
[hnd=NoOpFailureHandler [super=AbstractFailureHandler
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]]]
org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7,
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1,
finished=false, heartbeatTs=1743586463106]
at
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
~[classes/:?]
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
~[classes/:?]
at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
~[classes/:?]
at
org.apac
