[ 
https://issues.apache.org/jira/browse/GEODE-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darrel Schneider resolved GEODE-654.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.0.0-incubating

The fix is on geode in git rev: f7670e1688a95a026a0376724bfe2277cb16eb94

> Interaction of GII and LIFO eviction can cause high CPU and recovery issues
> ---------------------------------------------------------------------------
>
>                 Key: GEODE-654
>                 URL: https://issues.apache.org/jira/browse/GEODE-654
>             Project: Geode
>          Issue Type: Bug
>          Components: core
>            Reporter: Vincent Ford
>            Assignee: Vincent Ford
>             Fix For: 1.0.0-incubating
>
>
> This was seen in a customer engagement with GemFire but also effects Geode 
> and the usage of  LIFO eviction algorithm.  The GemFire support team led by 
> David Wisler identified the following pattern. In the specific case the GII 
> was blocked from completing processing and also blocked a put into the 
> processing queue of a Gateway ( not part of Geode but would effect any usage 
> of LIFO eviction ).
> (** 1 **)
> "Pooled High Priority Message Processor 4" daemon prio=10 
> tid=0x00002ba27c008800 nid=0x3fc5 waiting for monitor entry 
> [0x00002ba1894f2000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at 
> com.gemstone.gemfire.internal.cache.InitialImageOperation$RequestImageMessage.chunkEntries(InitialImageOperation.java:1555)
>       - waiting to lock <0x00000005b81413d0> (a 
> com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> 1554  synchronized(mapEntry) { // bug #46042 must sync to make sure the tag 
> goes with the value
> 1555          VersionSource<?> id = stamp.getMemberID();
> 1556          if (id == null) { id = myId; }
> 1557                  foundIds.add(id);
> 1558                  // if the recipient passed a version vector, use it to 
> filter out
> 1559                  // entries the recipient already has
> (** 2 **)
> "pool-26-thread-4" prio=10 tid=0x00002ba390042800 nid=0x4073 waiting for 
> monitor entry [0x00002ba18b4d0000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:205)
>       - waiting to lock <0x00000005b7624058> (a 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:440)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:416)
>       at 
> com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:964)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.notifyGatewayHubs(LocalRegion.java:6295)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5858)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3203)
>       - locked <0x00000005b81413d0> (a 
> com.gemstone.gemfire.internal.cache.VersionedStatsRegionEntry)
> Below it is clear that we are not getting out of the eviction loop, for some 
> reason the method "evictEntry" is returning 0 as the amount of evicted bytes 
> and that's why following  thread keeps swapping between BLOCKED and RUNNABLE 
> and making no progress.
> [2015-09-08 12:39:09]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f waiting for 
> monitor entry [0x00002ba18b34b000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.evictEntry(AbstractLRURegionMap.java:302)
>       - locked <0x00000005b9b7c8c8> (a 
> com.gemstone.gemfire.internal.cache.VersionedThinDiskLRURegionEntry)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:522)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
>       at 
> com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
>       - locked <0x00000005b7624058> (a 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> [2015-09-08 12:40:14]
> "pool-26-thread-2" prio=10 tid=0x00002ba390005000 nid=0x406f runnable 
> [0x00002ba18b34b000]
>    java.lang.Thread.State: RUNNABLE
>       at 
> com.gemstone.gemfire.internal.cache.lru.NewLIFOClockHand.getLRUEntry(NewLIFOClockHand.java:40)
>       - locked <0x00000005b8c55c90> (a 
> com.gemstone.gemfire.internal.cache.lru.NewLRUClockHand$HeadLock)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.lruUpdateCallback(AbstractLRURegionMap.java:520)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:3272)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractLRURegionMap.basicPut(AbstractLRURegionMap.java:45)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5669)
>       at 
> com.gemstone.gemfire.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:375)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:118)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.basicPut(LocalRegion.java:5050)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1747)
>       at 
> com.gemstone.gemfire.internal.cache.LocalRegion.put(LocalRegion.java:1729)
>       at 
> com.gemstone.gemfire.internal.cache.AbstractRegion.put(AbstractRegion.java:286)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:220)
>       at 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:212)
>       - locked <0x00000005b7624058> (a 
> com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderQueue)
> From the AbstractLRURegionMap...
> 527           while (_getCCHelper().mustEvict(stats, _getOwner(), 
> bytesToEvict)) {
> 528                   LRUEntry removalEntry = 
> (LRUEntry)_getLruList().getLRUEntry();
> 529                   if (removalEntry != null) {
> 530                           if (evictEntry(removalEntry, stats) != 0) {
> 531                                   if (debug) {
> 532                                           debugLogging("evicted entry 
> key(2)=" + removalEntry.getKey()
> 533                                                   + " total entry size is 
> now: " + getTotalEntrySize()
> 534                                                   + " bytesToEvict :" + 
> bytesToEvict);
> 535                   //                if (NewLRUClockHand.debug && 
> logWriter != null) {
> 536                   //                  _getLruList().dumpList(logWriter);
> 537                   //                }
> 538                                   }
> 539                                   stats.incEvictions();
> 540                                   if (_isOwnerALocalRegion()) {
> 541                                           if (_getOwner() instanceof 
> BucketRegion) {
> 542                                                   
> ((BucketRegion)_getOwner()).incEvictions(1);
> 543                                           }
> 544                                   }
> 545                                   if (debug)
> 546                                           debugLogging("evictions=" + 
> stats.getEvictions());
> 547                                           _getCCHelper().afterEviction();
> 548
> 549                           }
> 550                   }
> 551
> 552                   else {
> 553                           if (debug && getTotalEntrySize() != 0) {
> 554                                   debugLogging("leaving evict loop 
> early");
> 555                           }
> 556
> 557                           break;
> 558                   }
> 559           }
> Summary---
> 1. Serial gateway sender queues use LIFO_MEMORY as their eviction algorithm 
> and others may choose this option as well
> 2. Due to the above, when we mustEvict, we call getLRUEntry implementation of 
> NewLIFEClockHand 
> 3. Unfortunately, this LIFO based implementation doesn't seem to work in all 
> cases during gII. This implementation always returns simply the last entry in 
> the queue without directly removing it. It is possible to have the last 
> element on the list (returned by getLRUEntry for LIFO_MEMORY) be in a state 
> that causes us to not evict any bytes, but then just continue looping forever 
> expecting state to change.
> 4. This appears to be a corner case race condition and will be difficult to 
> test or reproduce given the timings and data distribution required to 
> encounter the issue. For example to encounter the distributed system created 
> by Geode must be in GII between members, have entry in queue that can not be 
> evicted ( part of transaction, certain internal tokens...), queue is in state 
> requiring eviction. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to