[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[ https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Korotkov updated IGNITE-24992:

Description:

In-memory cluster. RANDOM_2_LRU or RANDOM_LRU eviction policy is applied.

A put of large objects that occupy several pages can loop indefinitely in {{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because {{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with at least one touch". For large (fragmented) objects only the last page is touched (see the {{PageEvictionTracker.touchPage()}} call in the {{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects exist, the data region has only a very small fraction of "touched" pages eligible for eviction. It appears that 5000 random attempts are not enough to get 5 candidate pages to evict, so Random2LruPageEvictionTracker.evictDataPage() fails.

The system striped pool can starve for a long time: up to 14 hours in one case in a real production environment, until the nodes were manually restarted.

A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. It hangs after the 12th put:

{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager] Page-based evictions started. Consider increasing 'maxSize' on Data Region configuration: default
[2025-04-02T16:34:23,110][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker] Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G] Blocked system-critical thread has been detected.
This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, blockedFor=25s] [2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources] Possible failure suppressed accordingly to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106]]] org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106] at org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152) ~[classes/:?] at org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271) ~[classes/:?] 
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306) ~[classes/:?] at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877) ~[cl
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[ https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Korotkov updated IGNITE-24992:

Description:

In-memory cluster. RANDOM_2_LRU or RANDOM_LRU eviction policy is applied.

A put of large objects that occupy several pages can loop indefinitely in {{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because {{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with at least one touch". For large (fragmented) objects only the last page is touched (see the {{PageEvictionTracker.touchPage()}} call in the {{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects exist, the data region has only a very small fraction of "touched" pages eligible for eviction. It appears that 5000 random attempts are not enough to get 5 candidate pages to evict, so Random2LruPageEvictionTracker.evictDataPage() fails.

The system striped pool can starve for a long time: up to 14 hours in one case in a real production environment, until the nodes were manually restarted.

A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. It hangs after the 12th put:

{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager] Page-based evictions started. Consider increasing 'maxSize' on Data Region configuration: default
[2025-04-02T16:34:23,110][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker] Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G] Blocked system-critical thread has been detected.
This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, blockedFor=25s] [2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources] Possible failure suppressed accordingly to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106]]] org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106] at org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152) ~[classes/:?] at org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271) ~[classes/:?] 
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306) ~[classes/:?] at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(G
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[ https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Korotkov updated IGNITE-24992:

Description:

In-memory cluster. RANDOM_2_LRU eviction policy is applied.

A put of large objects that occupy several pages can loop indefinitely in {{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because {{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with at least one touch". For large (fragmented) objects only the last page is touched (see the {{PageEvictionTracker.touchPage()}} call in the {{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects exist, the data region has only a very small fraction of "touched" pages eligible for eviction. It appears that 5000 random attempts are not enough to get 5 candidate pages to evict, so Random2LruPageEvictionTracker.evictDataPage() fails.

A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. It hangs after the 12th put.

***

The system striped pool can starve for a long time (up to 14 hours in one case in a real production environment, until the nodes were manually restarted) with the following errors logged:

{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager] Page-based evictions started. Consider increasing 'maxSize' on Data Region configuration: default
[2025-04-02T16:34:23,110][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker] Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G] Blocked system-critical thread has been detected.
This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, blockedFor=25s] [2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources] Possible failure suppressed accordingly to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106]]] org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106] at org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152) ~[classes/:?] at org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271) ~[classes/:?] 
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306) ~[classes/:?] at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(Grid
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[ https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Korotkov updated IGNITE-24992:

Description:

In-memory cluster. RANDOM_2_LRU eviction policy is applied.

A put of large objects that occupy several pages can loop indefinitely in {{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because {{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with at least one touch". For large (fragmented) objects only the last page is touched (see the {{PageEvictionTracker.touchPage()}} call in the {{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects exist, the data region has only a very small fraction of "touched" pages eligible for eviction. It appears that 5000 random attempts are not enough to get 5 candidate pages to evict, so Random2LruPageEvictionTracker.evictDataPage() fails.

A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. It hangs after the 12th put.

***

The system striped pool can starve for a long time (up to 14 hours in one case in a real production environment, until the nodes were manually restarted) with the following errors logged:

{noformat}
[2025-04-02T16:34:23,108][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager] Page-based evictions started. Consider increasing 'maxSize' on Data Region configuration: default
[2025-04-02T16:34:23,110][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker] Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G] Blocked system-critical thread has been detected.
This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, blockedFor=25s] [2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources] Possible failure suppressed accordingly to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106]]] org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106] at org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152) ~[classes/:?] at org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271) ~[classes/:?] 
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306) ~[classes/:?] at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877) ~[classes/:?]
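
To illustrate the sampling argument from the description, here is a rough back-of-the-envelope sketch. It is not Ignite code. It assumes that each of the 5000 attempts picks a tracked slot uniformly at random and independently, that a slot counts as a hit only if its page has been touched, and that 5 hits are needed. The class name EvictionSamplingOdds, the method probAtLeast, the touched-page count and the slot counts are all hypothetical and chosen only for illustration.

```java
public final class EvictionSamplingOdds {
    /**
     * P(X >= minHits) for X ~ Binomial(attempts, hitProb),
     * computed as 1 minus the lower tail P(X < minHits).
     */
    public static double probAtLeast(int attempts, int minHits, double hitProb) {
        if (minHits <= 0) return 1.0;
        if (hitProb <= 0.0) return 0.0;
        if (hitProb >= 1.0) return attempts >= minHits ? 1.0 : 0.0;

        double lowerTail = 0.0;
        double logChoose = 0.0; // log of C(attempts, k), updated incrementally

        for (int k = 0; k < minHits && k <= attempts; k++) {
            if (k > 0)
                logChoose += Math.log(attempts - k + 1) - Math.log(k);

            // log of C(attempts, k) * p^k * (1 - p)^(attempts - k)
            double logTerm = logChoose
                + k * Math.log(hitProb)
                + (attempts - k) * Math.log1p(-hitProb);

            lowerTail += Math.exp(logTerm);
        }

        return Math.max(0.0, 1.0 - lowerTail);
    }

    public static void main(String[] args) {
        int attempts = 5000; // limit reported in the "Too many attempts" warning
        int sampleSize = 5;  // candidate pages needed, per the description
        long touched = 12;   // assumption: one touched page per entry put so far

        // Assumption: hypothetical numbers of tracked slots in the data region.
        long[] trackedSlots = {10_000L, 100_000L, 1_000_000L};

        for (long slots : trackedSlots) {
            double p = (double) touched / slots;
            System.out.printf("slots=%d hitProb=%.6f P(at least %d hits)=%.6g%n",
                slots, p, sampleSize, probAtLeast(attempts, sampleSize, p));
        }
    }
}
```

Under these assumptions the chance of collecting 5 candidates falls off quickly once the expected number of hits (attempts multiplied by the touched fraction) drops below the sample size, which matches the reported behaviour when almost no pages are touched.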
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[ https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Korotkov updated IGNITE-24992:

Description:

In-memory cluster. RANDOM_2_LRU eviction policy is applied.

A put of large objects that occupy several pages can loop indefinitely in {{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because {{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with at least one touch". For large (fragmented) objects only the last page is touched (see the {{PageEvictionTracker.touchPage()}} call in the {{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects exist, the data region has only a very small fraction of "touched" pages eligible for eviction. It appears that 5000 random attempts are not enough to get 5 candidate pages to evict, so Random2LruPageEvictionTracker.evictDataPage() fails.

The system striped pool can starve for a long time: up to 14 hours in one case in a real production environment, until the nodes were manually restarted.

A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. It hangs after the 12th put:

{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager] Page-based evictions started. Consider increasing 'maxSize' on Data Region configuration: default
[2025-04-02T16:34:23,110][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker] Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G] Blocked system-critical thread has been detected.
This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, blockedFor=25s] [2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources] Possible failure suppressed accordingly to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106]]] org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106] at org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152) ~[classes/:?] at org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271) ~[classes/:?] 
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306) ~[classes/:?] at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877) ~[classes/:?]
[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
[ https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Korotkov updated IGNITE-24992:

Description:

In-memory cluster. RANDOM_2_LRU eviction policy is applied.

A put of large objects that occupy several pages can loop indefinitely in `IgniteCacheDatabaseSharedManager.ensureFreeSpace()` because Random2LruPageEvictionTracker.evictDataPage() keeps failing to find a page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with at least one touch". For large (fragmented) objects only the last page is touched (see the PageEvictionTracker.touchPage() call in the AbstractFreeList#WriteRowHandler.addRow() method). So if only large objects exist, the data region has only a very small fraction of "touched" pages eligible for eviction. It appears that 5000 random attempts are not enough to get 5 candidate pages to evict, so Random2LruPageEvictionTracker.evictDataPage() fails.

A reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. It hangs after the 12th put.

***

The system striped pool can starve for a long time (up to 14 hours in one case in a real production environment, until the nodes were manually restarted) with the following errors logged:

{noformat}
[2025-04-02T16:34:23,108][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager] Page-based evictions started. Consider increasing 'maxSize' on Data Region configuration: default
[2025-04-02T16:34:23,110][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker] Too many attempts to choose data page: 5000
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G] Blocked system-critical thread has been detected.
This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, blockedFor=25s] [2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources] Possible failure suppressed accordingly to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106]]] org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106] at org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152) ~[classes/:?] at org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225) ~[classes/:?] at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271) ~[classes/:?] 
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316) ~[classes/:?] at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306) ~[classes/:?] at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877) ~[classes/:?] at org.apac