[ https://issues.apache.org/jira/browse/IGNITE-11974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189325#comment-17189325 ]
Vladislav Pyatkov commented on IGNITE-11974:
--------------------------------------------

[~Arroyo] I looked at your files and made the same observations:
1) Partition eviction restarts over and over, possibly in an infinite loop.
2) The partitions cannot be evicted successfully because they are reserved.
3) They keep going around this circle:
{code}
// Re-offer partition if clear was unsuccessful due to partition reservation.
if (!success)
    evictPartitionAsync(grpEvictionCtx.grp, part);
{code}
which is taken whenever this expression is true:
{code}
if (getReservations(state) != 0 || groupReserved())
    return false;
{code}
I don't think we can understand where the problem is without first implementing a unit test that reproduces it. Also, I know that various unclosed iterators can lead to leaked reservations. Could you check your code for an iterator left unclosed after a ScanQuery or SqlQuery? (See the cursor-closing sketch appended at the end of this page.)

> infinite loop and 100% cpu in GridDhtPartitionsEvictor: Eviction in progress ...
> --------------------------------------------------------------------------------
>
>                 Key: IGNITE-11974
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11974
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.5
>            Reporter: Igor Kamyshnikov
>            Priority: Blocker
>         Attachments: eviction-in-progress-dumps.zip, image-2019-07-10-16-07-37-185.png, server-node-restarts-1.png
>
> Note: RCA was not done.
> Sometimes Ignite server nodes fall into an infinite loop and consume 100% CPU:
> {noformat}
> "sys-#260008" #260285 prio=5 os_prio=0 tid=0x00007fabb020a800 nid=0x1e850 runnable [0x00007fab26fef000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
>         at java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
>         at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:84)
>         at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:73)
>         at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6695)
>         at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>         at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>         - <0x0000000649b9cba0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
> The following appears in the logs every 2 minutes:
> {noformat}
> #### INFO 2019-07-08 12:21:45.081 (1562581305081) [sys-#98168] [GridDhtPartitionsEvictor] Eviction in progress [grp=CUSTPRODINVOICEDISCUSAGE, remainingCnt=102]
> {noformat}
> remainingCnt remains the same once it reached 102 (the very first such line in the logs had the value 101).
> Some other facts:
> We have a heap dump taken at *topVer = 900*. The problem appeared after *topVer = 790*, but it looks like it had been silently waiting since *topVer = 641* (about 24 hours earlier).
> There were 259 topology changes between 900 and 641.
> All 102 GridDhtLocalPartitions can be found in the heap dump:
> {noformat}
> select * from "org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition" t where delayedRenting = true
> {noformat}
> They all have status = 65537, which means (according to org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition#state):
> reservations(65537) = 1
> getPartState(65537) = OWNING
> There are also 26968 instances of org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl$$Lambda$70, which are created by the org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl#checkEvictions method.
> 26418 of the 26968 refer to an AtomicInteger instance with value = 102:
> 26418 / 102 = 259 = 900 - 641 (see the topology info above).
> The key thing seen from the heap dump is that topVer = 641 or topVer = 642 was the last topology where these 102 partitions were assigned to the current Ignite server node.
> {noformat}
> select
>      t.this
>     ,t.this['clientEvtChange'] as clientEvtChange
>     ,t.this['topVer.topVer'] as topVer
>     ,t.this['assignment.elementData'][555]['elementData'][0]['hostNames.elementData'][0] as primary_part
>     ,t.this['assignment.elementData'][555]['elementData'][1]['hostNames.elementData'][0] as secondary_part
> from org.apache.ignite.internal.processors.affinity.HistoryAffinityAssignment t
> where length(t.this['assignment.elementData']) = 1024
> order by topVer
> {noformat}
> !image-2019-07-10-16-07-37-185.png!
> The connection of a client node at topVer = 790 somehow triggered the GridDhtPartitionsEvictor loop to execute.
> Summary:
> 1) 102 partitions have one reservation each and are in OWNING state.
> 2) They were backup partitions.
> 3) For some reason their eviction had been silently delayed (because of the reservations), but each topology change seemed to trigger another eviction attempt.
> 4) Something managed to make org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor#evictPartitionAsync run without ever exiting.
> Additional info:
> topVer = 641 was in a chain of server node restarts (not sure whether rebalancing actually succeeded):
> !server-node-restarts-1.png!
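For reference, a minimal, self-contained sketch of the status decoding quoted from the heap dump above. It assumes the low 16 bits of the packed state hold the reservation count and the next 16 bits hold the GridDhtPartitionState ordinal (with MOVING, OWNING, RENTING, EVICTED, LOST declared in that order); the actual encoding inside GridDhtLocalPartition may differ between Ignite versions, but under this assumption 65537 (0x10001) decodes to exactly the values reported:
{code}
// Illustrative decoding only: the bit layout and the class name below are
// assumptions, not the verbatim GridDhtLocalPartition implementation.
public class PartitionStatusDecode {
    // Ordinal order assumed to match GridDhtPartitionState.
    enum PartState { MOVING, OWNING, RENTING, EVICTED, LOST }

    public static void main(String[] args) {
        long status = 65537L;                       // 0x0001_0001 as seen in the heap dump
        int reservations = (int)(status & 0xFFFF);  // -> 1

        PartState state = PartState.values()[(int)((status >>> 16) & 0xFFFF)]; // -> OWNING

        System.out.println("reservations=" + reservations + ", state=" + state);
    }
}
{code}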
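Regarding the unclosed-iterator question in the comment above: QueryCursor instances returned by IgniteCache.query(...) implement AutoCloseable, and a partially consumed cursor that is never closed is a typical way to keep a partition reserved. Below is a minimal sketch of the leak pattern and of the safe try-with-resources pattern; the cache name "myCache", the key/value types and the queries are illustrative placeholders, not taken from the reporter's application:
{code}
// Sketch only: "myCache" and the queries are placeholders.
import javax.cache.Cache;
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class CursorClosingExample {
    static void scanAndQuery(Ignite ignite) {
        IgniteCache<Integer, String> cache = ignite.cache("myCache");

        // Leak pattern: the cursor is iterated partially and never closed, so
        // whatever the scan reserved on the server side is never released.
        //
        // QueryCursor<Cache.Entry<Integer, String>> cur = cache.query(new ScanQuery<Integer, String>());
        // cur.iterator().next(); // abandoned without cur.close()

        // Safe pattern: try-with-resources closes the cursor even if iteration
        // stops early or throws.
        try (QueryCursor<Cache.Entry<Integer, String>> cur =
                 cache.query(new ScanQuery<Integer, String>())) {
            for (Cache.Entry<Integer, String> e : cur)
                process(e.getKey(), e.getValue());
        }

        // The same applies to SQL queries.
        try (QueryCursor<List<?>> cur =
                 cache.query(new SqlFieldsQuery("select _key from String"))) {
            for (List<?> row : cur)
                process((Integer)row.get(0), null);
        }
    }

    static void process(Integer key, String val) {
        // Placeholder for application logic.
    }
}
{code}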