[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used

2025-04-04 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-24992:
-
Description: 
In-memory cluster.

RANDOM_2_LRU or RANDOM_LRU eviction policy is applied.

A put of a large object that occupies several pages can loop indefinitely in 
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because 
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a 
page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with 
at least one touch". For large (fragmented) objects only the last page is 
touched (see the {{PageEvictionTracker.touchPage()}} call in the 
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects 
exist, the data region has a very small fraction of "touched" pages 
eligible for eviction. It appears that 5000 random attempts are not enough 
to find 5 candidate pages to evict, so 
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.
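
As a rough illustration of why the sampling can fail, the sketch below models 
the search as independent uniform draws over the pages of the region and 
computes the probability that fewer than 5 touched pages are found in 5000 
draws. This is a simplified model, not the Ignite code: the class 
{{EvictionSamplingOdds}}, the method {{failureProbability}} and the example 
touched fractions are hypothetical, and the real tracker may sample 
differently.

```java
// Simplified model (an assumption, not taken from the Ignite sources):
// each attempt picks a page uniformly at random, and the page is a usable
// candidate only if it has been touched.
class EvictionSamplingOdds {
    /**
     * Probability that fewer than {@code needed} touched pages are found in
     * {@code attempts} independent draws, i.e. P(X < needed) for
     * X ~ Binomial(attempts, touchedFraction).
     */
    static double failureProbability(double touchedFraction, int attempts, int needed) {
        double sum = 0.0;
        double binom = 1.0; // C(attempts, k), starting at k = 0

        for (int k = 0; k < needed; k++) {
            sum += binom
                * Math.pow(touchedFraction, k)
                * Math.pow(1.0 - touchedFraction, attempts - k);

            // C(attempts, k + 1) = C(attempts, k) * (attempts - k) / (k + 1)
            binom = binom * (attempts - k) / (k + 1);
        }

        return Math.min(1.0, sum);
    }

    public static void main(String[] args) {
        int attempts = 5000; // attempt limit quoted in the description
        int needed = 5;      // candidate pages required per eviction

        // Hypothetical touched fractions, chosen only for illustration.
        for (double f : new double[] {1e-2, 1e-3, 1e-4}) {
            System.out.printf("touched fraction %.4f -> failure probability %.6f%n",
                f, failureProbability(f, attempts, needed));
        }
    }
}
```

Under this model a failure becomes likely once the expected number of hits 
(attempts multiplied by the touched fraction) drops to around the 5 candidates 
needed, which is the regime a region filled with multi-page objects would be 
in.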


The system striped pool can starve for a long time: up to 14 hours in one real 
production environment, until the nodes were manually restarted.

Reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. 

It hangs after the 12th put:

{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
 Page-based evictions started. Consider increasing 'maxSize' on Data Region 
configuration: default
[2025-04-02T16:34:23,110][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
 Too many attempts to choose data page: 5000



[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
 Blocked system-critical thread has been detected. This can lead to 
cluster-wide undefined behaviour [workerName=sys-stripe-7, 
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, 
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
 Possible failure suppressed accordingly to a configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]]]
 org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]
at 
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
 ~[classes/:?]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
 ~[cl

[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used

2025-04-02 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-24992:
-
Description: 
In-memory cluster.

RANDOM_2_LRU or RANDOM_LRU eviction policy is applied.

A put of a large object that occupies several pages can loop indefinitely in 
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because 
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a 
page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with 
at least one touch". For large (fragmented) objects only the last page is 
touched (see the {{PageEvictionTracker.touchPage()}} call in the 
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects 
exist, the data region has a very small fraction of "touched" pages 
eligible for eviction. It appears that 5000 random attempts are not enough 
to find 5 candidate pages to evict, so 
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.


The system striped pool can starve for a long time: up to 14 hours in one real 
production environment, until the nodes were manually restarted.

Reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. 

It hangs after the 12th put:

{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
 Page-based evictions started. Consider increasing 'maxSize' on Data Region 
configuration: default
[2025-04-02T16:34:23,110][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
 Too many attempts to choose data page: 5000



[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
 Blocked system-critical thread has been detected. This can lead to 
cluster-wide undefined behaviour [workerName=sys-stripe-7, 
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, 
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
 Possible failure suppressed accordingly to a configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]]]
 org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]
at 
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
 ~[classes/:?]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(G

[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used

2025-04-02 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-24992:
-
Description: 
In-memory cluster.

RANDOM_2_LRU eviction policy is applied.

A put of a large object that occupies several pages can loop indefinitely in 
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because 
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a 
page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with 
at least one touch". For large (fragmented) objects only the last page is 
touched (see the {{PageEvictionTracker.touchPage()}} call in the 
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects 
exist, the data region has a very small fraction of "touched" pages 
eligible for eviction. It appears that 5000 random attempts are not enough 
to find 5 candidate pages to evict, so 
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.

Reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. 

It hangs after the 12th put.

 

***

The system striped pool can starve for a long time (up to 14 hours in one real 
production environment, until the nodes were manually restarted), with the 
following errors logged:
{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
 Page-based evictions started. Consider increasing 'maxSize' on Data Region 
configuration: default
[2025-04-02T16:34:23,110][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
 Too many attempts to choose data page: 5000



[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
 Blocked system-critical thread has been detected. This can lead to 
cluster-wide undefined behaviour [workerName=sys-stripe-7, 
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, 
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
 Possible failure suppressed accordingly to a configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]]]
 org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]
at 
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
 ~[classes/:?]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(Grid

[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used

2025-04-02 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-24992:
-
Description: 
In-memory cluster.

RANDOM_2_LRU eviction policy is applied.

A put of a large object that occupies several pages can loop indefinitely in 
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because 
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a 
page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with 
at least one touch". For large (fragmented) objects only the last page is 
touched (see the {{PageEvictionTracker.touchPage()}} call in the 
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects 
exist, the data region has a very small fraction of "touched" pages 
eligible for eviction. It appears that 5000 random attempts are not enough 
to find 5 candidate pages to evict, so 
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.

Reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. 

It hangs after the 12th put.

 

***

The system striped pool can starve for a long time (up to 14 hours in one real 
production environment, until the nodes were manually restarted), with the 
following errors logged:
{noformat}
[2025-04-02T16:34:23,108][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
 Page-based evictions started. Consider increasing 'maxSize' on Data Region 
configuration: default
[2025-04-02T16:34:23,110][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
 Too many attempts to choose data page: 5000



[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
 Blocked system-critical thread has been detected. This can lead to 
cluster-wide undefined behaviour [workerName=sys-stripe-7, 
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, 
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
 Possible failure suppressed accordingly to a configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]]]
 org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]
at 
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
 ~[classes/:?]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
 ~[classes/:?]
   

[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used

2025-04-02 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-24992:
-
Description: 
In-memory cluster.

RANDOM_2_LRU eviction policy is applied.

A put of a large object that occupies several pages can loop indefinitely in 
{{IgniteCacheDatabaseSharedManager.ensureFreeSpace()}} because 
{{Random2LruPageEvictionTracker.evictDataPage()}} keeps failing to find a 
page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with 
at least one touch". For large (fragmented) objects only the last page is 
touched (see the {{PageEvictionTracker.touchPage()}} call in the 
{{AbstractFreeList#WriteRowHandler.addRow()}} method). So if only large objects 
exist, the data region has a very small fraction of "touched" pages 
eligible for eviction. It appears that 5000 random attempts are not enough 
to find 5 candidate pages to evict, so 
{{Random2LruPageEvictionTracker.evictDataPage()}} fails.


The system striped pool can starve for a long time: up to 14 hours in one real 
production environment, until the nodes were manually restarted.

Reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. 

It hangs after the 12th put:

{noformat}
...
>>> Key put: 21, total entries put: 12
[2025-04-02T16:34:23,108][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
 Page-based evictions started. Consider increasing 'maxSize' on Data Region 
configuration: default
[2025-04-02T16:34:23,110][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
 Too many attempts to choose data page: 5000



[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
 Blocked system-critical thread has been detected. This can lead to 
cluster-wide undefined behaviour [workerName=sys-stripe-7, 
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, 
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
 Possible failure suppressed accordingly to a configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]]]
 org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]
at 
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
 ~[classes/:?]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
 ~[classes/:?]

[jira] [Updated] (IGNITE-24992) Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used

2025-04-02 Thread Sergey Korotkov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-24992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Korotkov updated IGNITE-24992:
-
Description: 
In-memory cluster.

RANDOM_2_LRU eviction policy is applied.

A put of a large object that occupies several pages can loop indefinitely in 
`IgniteCacheDatabaseSharedManager.ensureFreeSpace()` because 
Random2LruPageEvictionTracker.evictDataPage() keeps failing to find a page to 
evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with 
at least one touch". For large (fragmented) objects only the last page is 
touched (see the PageEvictionTracker.touchPage() call in the 
AbstractFreeList#WriteRowHandler.addRow() method). So if only large objects 
exist, the data region has a very small fraction of "touched" pages 
eligible for eviction. It appears that 5000 random attempts are not enough 
to find 5 candidate pages to evict, so 
Random2LruPageEvictionTracker.evictDataPage() fails.

Reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. 

It hangs after the 12th put.

 

***

The system striped pool can starve for a long time (up to 14 hours in one real 
production environment, until the nodes were manually restarted), with the 
following errors logged:
{noformat}
[2025-04-02T16:34:23,108][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
 Page-based evictions started. Consider increasing 'maxSize' on Data Region 
configuration: default
[2025-04-02T16:34:23,110][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
 Too many attempts to choose data page: 5000



[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
 Blocked system-critical thread has been detected. This can lead to 
cluster-wide undefined behaviour [workerName=sys-stripe-7, 
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, 
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
 Possible failure suppressed accordingly to a configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]]]
 org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]
at 
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
 ~[classes/:?]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
 ~[classes/:?]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
 ~[classes/:?]
at 
org.apac