[ https://issues.apache.org/jira/browse/GEODE-654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051853#comment-15051853 ]
Vincent Ford commented on GEODE-654: ------------------------------------ The best method to address this issue is to modify the LIFO implementation in the NewLIFOClockHand class. The proposed changes are to always remove an entry from the LIFO/LRU queue so that the last entry should change as the queue is processed regardless of whether it can be evicted or not. > Interaction of GII and LIFO eviction can cause high CPU and recovery issues > --------------------------------------------------------------------------- > > Key: GEODE-654 > URL: https://issues.apache.org/jira/browse/GEODE-654 > Project: Geode > Issue Type: Bug > Components: core > Reporter: Vincent Ford > Assignee: Vincent Ford > > This was seen in a customer engagement with GemFire but also effects Geode > and the usage of LIFO eviction algorithm. The GemFire support team led by > David Wisler identified the following pattern. In the specific case the GII > was blocked from completing processing and also blocked a put into the > processing queue of a Gateway ( not part of Geode but would effect any usage > of LIFO eviction ). > (** 1 **) > "Pooled High Priority Message Processor 4" daemon prio=10 > tid=0x00002ba27c008800 nid=0x3fc5 waiting for monitor entry > [0x00002ba1894f2000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > com.gemstone.gemfire.internal.cache.InitialImageOperation$RequestImageMessage.chunkEntries(InitialImageOperation.java:1555) > - waiting to lock <0x00000005b81413d0> (a > com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry) > 1554 synchronized(mapEntry) { // bug #46042 must sync to make sure the tag > goes with the value > 1555 VersionSource<?> id = stamp.getMemberID(); > 1556 if (id == null) { id = myId; } > 1557 foundIds.add(id); > 1558 // if the recipient passed a version vector, use it to > filter out > 1559 // entries the recipient already has > (** 2 **) > "pool-26-thread-4" prio=10 tid=0x00002ba390042800 nid=0x4073 waiting for > monitor entry [0x00002ba18b4d0000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:205) > - waiting to lock <0x00000005b7624058> (a > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue) > at > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:440) > at > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:416) > at > com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:964) > at > com.gemstone.gemfire.internal.cache.LocalRegion.notifyGatewayHubs(LocalRegion.java:6295) > at > com.gemstone.gemfire.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5858) > at > com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3203) > - locked <0x00000005b81413d0> (a > com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry) > Below it is clear that we are not getting out of the eviction loop, for some > reason the method "evictEntry" is returning 0 as the amount of evicted bytes > and that's why following thread keeps swapping between BLOCKED and RUNNABLE > and making no progress. > [2015-09-08 12:39:09] > "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f waiting for > monitor entry [0x00002ba18b34b000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.evictEntry(AbstractLRURegionMap.java:302) > - locked <0x00000005b9b7c8c8> (a > com.gemstone.gemfire.internal.cache.VersionedThinDiskLRURegionEntry) > at > com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:522) > at > com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272) > at > com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45) > at > com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669) > at > com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375) > at > com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118) > at > com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050) > at > com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747) > at > com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729) > at > com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286) > at > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220) > at > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212) > - locked <0x00000005b7624058> (a > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue) > [2015-09-08 12:40:14] > "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f runnable > [0x00002ba18b34b000] > java.lang.Thread.State: RUNNABLE > at > com.gemstone.gemfire.internal.cache.lru.NewLIFOClockHand.getLRUEntry(NewLIFOClockHand.java:40) > - locked <0x00000005b8c55c90> (a > com.gemstone.gemfire.internal.cache.lru.NewLRUClockHand$HeadLock) > at > com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:520) > at > com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272) > at > com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45) > at > com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669) > at > com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375) > at > com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118) > at > com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050) > at > com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747) > at > com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729) > at > com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286) > at > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220) > at > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212) > - locked <0x00000005b7624058> (a > com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue) > From the AbstractLRURegionMap... > 527 while (_getCCHelper().mustEvict(stats, _getOwner(), > bytesToEvict)) { > 528 LRUEntry removalEntry = > (LRUEntry)_getLruList().getLRUEntry(); > 529 if (removalEntry != null) { > 530 if (evictEntry(removalEntry, stats) != 0) { > 531 if (debug) { > 532 debugLogging("evicted entry > key(2)=" + removalEntry.getKey() > 533 + " total entry size is > now: " + getTotalEntrySize() > 534 + " bytesToEvict :" + > bytesToEvict); > 535 // if (NewLRUClockHand.debug && > logWriter != null) { > 536 // _getLruList().dumpList(logWriter); > 537 // } > 538 } > 539 stats.incEvictions(); > 540 if (_isOwnerALocalRegion()) { > 541 if (_getOwner() instanceof > BucketRegion) { > 542 > ((BucketRegion)_getOwner()).incEvictions(1); > 543 } > 544 } > 545 if (debug) > 546 debugLogging("evictions=" + > stats.getEvictions()); > 547 _getCCHelper().afterEviction(); > 548 > 549 } > 550 } > 551 > 552 else { > 553 if (debug && getTotalEntrySize() != 0) { > 554 debugLogging("leaving evict loop > early"); > 555 } > 556 > 557 break; > 558 } > 559 } > Summary--- > 1. Serial gateway sender queues use LIFO_MEMORY as their eviction algorithm > and others may choose this option as well > 2. Due to the above, when we mustEvict, we call getLRUEntry implementation of > NewLIFEClockHand > 3. Unfortunately, this LIFO based implementation doesn't seem to work in all > cases during gII. This implementation always returns simply the last entry in > the queue without directly removing it. It is possible to have the last > element on the list (returned by getLRUEntry for LIFO_MEMORY) be in a state > that causes us to not evict any bytes, but then just continue looping forever > expecting state to change. > 4. This appears to be a corner case race condition and will be difficult to > test or reproduce given the timings and data distribution required to > encounter the issue. For example to encounter the distributed system created > by Geode must be in GII between members, have entry in queue that can not be > evicted ( part of transaction, certain internal tokens...), queue is in state > requiring eviction. -- This message was sent by Atlassian JIRA (v6.3.4#6332)