[jira] [Commented] (GEODE-8278) Gateway sender queues using heap memory way above configured value after server restart

Barrett Oglesby (Jira) Wed, 16 Dec 2020 14:12:37 -0800


    [ https://issues.apache.org/jira/browse/GEODE-8278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250673#comment-17250673 ]


Barrett Oglesby commented on GEODE-8278:
----------------------------------------

I see some code that doesn't look quite right.

First, this code in AbstractRegionMap.initialImagePut:
{noformat}
if (oldIsTombstone) {
  owner.unscheduleTombstone(oldRe);
  if (newValue != Token.TOMBSTONE) {
    lruEntryCreate(oldRe);
  } else {
    lruEntryUpdate(oldRe);
  }
}
{noformat}
It only updates the LRU statistics if the previous entry was a tombstone.
That doesn't seem correct.

I changed it to:
{noformat}
if (oldIsTombstone) {
  owner.unscheduleTombstone(oldRe);
}
if (newValue != Token.TOMBSTONE) {
  lruEntryCreate(oldRe);
} else {
  lruEntryUpdate(oldRe);
}
{noformat}
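The effect of moving the LRU accounting out of the tombstone check can be 
sketched in isolation. This is a minimal, hypothetical model and not Geode 
code: the class LruAccountingSketch, the Accounting enum, and all method names 
are assumptions introduced only to compare the two branch structures shown 
above.

```java
// Hypothetical model of the two branch structures; none of these names exist in Geode.
class LruAccountingSketch {
  enum Accounting { CREATE, UPDATE, NONE }

  // Original structure: LRU accounting happens only inside the tombstone check.
  static Accounting original(boolean oldIsTombstone, boolean newIsTombstone) {
    if (oldIsTombstone) {
      return newIsTombstone ? Accounting.UPDATE : Accounting.CREATE;
    }
    return Accounting.NONE;
  }

  // Modified structure: LRU accounting happens regardless of the old entry's state.
  static Accounting modified(boolean oldIsTombstone, boolean newIsTombstone) {
    return newIsTombstone ? Accounting.UPDATE : Accounting.CREATE;
  }

  // Counts how many entries in a batch of non-tombstone values receive any accounting.
  static int countAccounted(boolean[] oldIsTombstone, boolean useModified) {
    int accounted = 0;
    for (boolean old : oldIsTombstone) {
      Accounting a = useModified ? modified(old, false) : original(old, false);
      if (a != Accounting.NONE) {
        accounted++;
      }
    }
    return accounted;
  }

  public static void main(String[] args) {
    // Only the third old entry in this batch is a tombstone.
    boolean[] batch = {false, false, true, false};
    System.out.println("original accounted: " + countAccounted(batch, false));
    System.out.println("modified accounted: " + countAccounted(batch, true));
  }
}
```

In this model, a batch in which only one old entry is a tombstone gets 
accounting for one entry under the original structure and for all four under 
the modified one.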
Then, VMLRURegionMap.resetThreadLocals is called twice.

Once by AbstractRegionMap.initialImagePut in the finally block:
{noformat}
java.lang.Exception
        at org.apache.geode.internal.cache.VMLRURegionMap.resetThreadLocals(VMLRURegionMap.java:609)
        at org.apache.geode.internal.cache.AbstractRegionMap.initialImagePut(AbstractRegionMap.java:949)
        at org.apache.geode.internal.cache.InitialImageOperation.processChunk(InitialImageOperation.java:941)
{noformat}
And once in VMLRURegionMap.lruUpdateCallback (after the entry has been 
processed by InitialImageOperation.processChunk):
{noformat}
java.lang.Exception
        at org.apache.geode.internal.cache.VMLRURegionMap.resetThreadLocals(VMLRURegionMap.java:609)
        at org.apache.geode.internal.cache.VMLRURegionMap.lruUpdateCallback(VMLRURegionMap.java:374)
        at org.apache.geode.internal.cache.InitialImageOperation.processChunk(InitialImageOperation.java:954)
{noformat}
The VMLRURegionMap.resetThreadLocals method clears a few thread locals, 
including lruDelta, which lruUpdateCallback uses to determine whether to evict.
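
Why an early reset would suppress eviction can be illustrated with a small 
stand-alone sketch. It is a simplified model under stated assumptions, not the 
Geode implementation: the class LruDeltaSketch, its fields, the 100,000-byte 
limit, and the one-entry-per-callback eviction rule are all hypothetical. Only 
the names lruDelta, resetThreadLocals, and lruUpdateCallback are borrowed from 
the description above.

```java
// Hypothetical stand-in for the pattern described above; not Geode's VMLRURegionMap.
class LruDeltaSketch {
  // Per-thread size delta recorded by the put and consumed by the callback.
  private final ThreadLocal<Integer> lruDelta = ThreadLocal.withInitial(() -> 0);
  private final long limitBytes;
  long accountedBytes; // what the LRU statistics believe is on the heap
  long heapBytes;      // what this model actually holds on the heap
  int evictions;

  LruDeltaSketch(long limitBytes) {
    this.limitBytes = limitBytes;
  }

  void resetThreadLocals() {
    lruDelta.set(0);
  }

  // Models a put whose finally block may reset the thread locals too early.
  void put(int entrySize, boolean resetInFinally) {
    try {
      heapBytes += entrySize;
      lruDelta.set(lruDelta.get() + entrySize);
    } finally {
      if (resetInFinally) {
        resetThreadLocals();
      }
    }
  }

  // Reads the delta, updates the statistics, and evicts one value if over the limit.
  void lruUpdateCallback() {
    int delta = lruDelta.get();
    resetThreadLocals();
    accountedBytes += delta;
    if (delta > 0 && accountedBytes > limitBytes) {
      accountedBytes -= delta;
      heapBytes -= delta;
      evictions++;
    }
  }

  // Puts ten entries of 20666 bytes each, invoking the callback after every put.
  static LruDeltaSketch run(boolean resetInFinally) {
    LruDeltaSketch map = new LruDeltaSketch(100_000);
    for (int i = 0; i < 10; i++) {
      map.put(20_666, resetInFinally);
      map.lruUpdateCallback();
    }
    return map;
  }

  public static void main(String[] args) {
    LruDeltaSketch early = run(true);
    LruDeltaSketch late = run(false);
    System.out.println("reset in finally: evictions=" + early.evictions
        + " heapBytes=" + early.heapBytes);
    System.out.println("no early reset:   evictions=" + late.evictions
        + " heapBytes=" + late.heapBytes);
  }
}
```

In this model the early reset means the callback always reads a delta of 0, so 
nothing is ever evicted and all ten values stay on the heap. Without the early 
reset the callback sees each delta and starts evicting once the limit is 
crossed.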

I think the first call in the finally block of 
AbstractRegionMap.initialImagePut is not correct.

Here is that code:
{noformat}
} finally {
  if (done && !deferLRUCallback) {
    lruUpdateCallback();
  } else if (!cleared) {
    resetThreadLocals();
  }
}
{noformat}
I changed it to:
{noformat}
} finally {
  if (!deferLRUCallback) {
    if (done) {
      lruUpdateCallback();
    } else if (!cleared) {
      resetThreadLocals();
    }
  }
}
{noformat}
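To see exactly where the two versions of the finally block diverge, the 
branches can be enumerated over all flag combinations. The sketch below is 
only an illustration: the class FinallyBranchSketch and the Action enum are 
hypothetical, and the methods return a label for the call that would be made 
rather than calling anything in Geode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical comparison of the two finally-block structures; not Geode code.
class FinallyBranchSketch {
  enum Action { LRU_UPDATE_CALLBACK, RESET_THREAD_LOCALS, NONE }

  // Branch structure of the original finally block.
  static Action original(boolean done, boolean deferLRUCallback, boolean cleared) {
    if (done && !deferLRUCallback) {
      return Action.LRU_UPDATE_CALLBACK;
    } else if (!cleared) {
      return Action.RESET_THREAD_LOCALS;
    }
    return Action.NONE;
  }

  // Branch structure of the modified finally block.
  static Action modified(boolean done, boolean deferLRUCallback, boolean cleared) {
    if (!deferLRUCallback) {
      if (done) {
        return Action.LRU_UPDATE_CALLBACK;
      } else if (!cleared) {
        return Action.RESET_THREAD_LOCALS;
      }
    }
    return Action.NONE;
  }

  // Lists every flag combination for which the two structures choose different actions.
  static List<String> differences() {
    List<String> result = new ArrayList<>();
    boolean[] values = {false, true};
    for (boolean done : values) {
      for (boolean defer : values) {
        for (boolean cleared : values) {
          Action before = original(done, defer, cleared);
          Action after = modified(done, defer, cleared);
          if (before != after) {
            result.add("done=" + done + " deferLRUCallback=" + defer
                + " cleared=" + cleared + ": " + before + " -> " + after);
          }
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    differences().forEach(System.out::println);
  }
}
```

The two structures differ only when deferLRUCallback is true and cleared is 
false: the original resets the thread locals, while the modified version 
leaves them untouched so that a later lruUpdateCallback can still read them.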
I'm not sure these changes are valid, but with them I see eviction occurring 
during GII.

With these changes, histograms after recovery show:
{noformat}
 num     #instances         #bytes  class name
----------------------------------------------
   1:          8740      113626040  [B
   2:         47952        4749896  [C
  15:          5000         320000  org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryHeapStringKey1
  19:          5000         280000  org.apache.geode.internal.cache.entries.VMThinDiskLRURegionEntryHeapLongKey
  32:          5003         120072  org.apache.geode.internal.cache.VMCachedDeserializable
  46:           505          52520  org.apache.geode.internal.cache.wan.GatewaySenderEventImpl
Total        443319      133451360

 num     #instances         #bytes  class name
----------------------------------------------
   1:         12201      184764920  [B
   2:         47927        4770984  [C
  15:          5000         320000  org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryHeapStringKey1
  19:          5000         280000  org.apache.geode.internal.cache.entries.VMThinDiskLRURegionEntryHeapLongKey
  23:          8796         211104  org.apache.geode.internal.cache.VMCachedDeserializable
  47:           514          53456  org.apache.geode.internal.cache.wan.GatewaySenderEventImpl
Total        449229      204864160
{noformat}
The top histogram shows the GII provider; the bottom histogram shows the GII 
requester.

They show that the GII provider has recovered only keys, since it holds just 
5003 VMCachedDeserializable instances.

They also show that the GII requester has evicted entries, since it holds only 
8796 VMCachedDeserializable instances.

Here is some logging for an entry that is processed without causing eviction. 
It does update the total bytes (total=39151873):
{noformat}
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk about to initialImagePut key=3000; value=VMCachedDeserializable@2010910258
Pooled High Priority Message Processor 27: VMLRURegionMap.lruEntryUpdate region=/__PR/_B__ny__PARALLEL__GATEWAY__SENDER__QUEUE_62; re=VMThinDiskLRURegionEntryHeapLongKey@45652743 (key=3000)
Pooled High Priority Message Processor 27: VMLRURegionMap.setDelta region=/__PR/_B__ny__PARALLEL__GATEWAY__SENDER__QUEUE_62; lruDelta=20666
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk done initialImagePut key=3000; value=VMCachedDeserializable@2010910258
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk about to lruUpdateCallback key=3000
Pooled High Priority Message Processor 27: VMLRURegionMap.getDelta value=20666
Pooled High Priority Message Processor 27: VMLRURegionMap.lruUpdateCallback region=/__PR/_B__ny__PARALLEL__GATEWAY__SENDER__QUEUE_62; bytesToEvict=20666
Pooled High Priority Message Processor 27: VMLRURegionMap.changeTotalEntrySize region=/__PR/_B__ny__PARALLEL__GATEWAY__SENDER__QUEUE_62; delta=20666
Pooled High Priority Message Processor 27: MemoryLRUStatistics.updateCounter delta=20666; total=39151873
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk done lruUpdateCallback key=3000
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk done processing key=3000
{noformat}
Here is some logging during GII where the entry does cause eviction 
(VMLRURegionMap.lruUpdateCallback evicted...):
{noformat}
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk about to initialImagePut key=2900; value=VMCachedDeserializable@1084176828
Pooled High Priority Message Processor 27: VMLRURegionMap.lruEntryUpdate region=/__PR/_B__ny__PARALLEL__GATEWAY__SENDER__QUEUE_75; re=VMThinDiskLRURegionEntryHeapLongKey@527429f5 (key=2900)
Pooled High Priority Message Processor 27: VMLRURegionMap.setDelta region=/__PR/_B__ny__PARALLEL__GATEWAY__SENDER__QUEUE_75; lruDelta=20666
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk done initialImagePut key=2900; value=VMCachedDeserializable@1084176828
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk about to lruUpdateCallback key=2900
Pooled High Priority Message Processor 27: VMLRURegionMap.getDelta value=20666
Pooled High Priority Message Processor 27: VMLRURegionMap.lruUpdateCallback region=/__PR/_B__ny__PARALLEL__GATEWAY__SENDER__QUEUE_75; bytesToEvict=20666
Pooled High Priority Message Processor 27: MemoryLRUStatistics.updateCounter delta=-20674; total=78634526
Pooled High Priority Message Processor 27: VMLRURegionMap.lruUpdateCallback evicted region=/__PR/_B__ny__PARALLEL__GATEWAY__SENDER__QUEUE_75; key=2900
Pooled High Priority Message Processor 27: VMLRURegionMap.changeTotalEntrySize region=/__PR/_B__ny__PARALLEL__GATEWAY__SENDER__QUEUE_75; delta=20666
Pooled High Priority Message Processor 27: MemoryLRUStatistics.updateCounter delta=20666; total=78655184
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk done lruUpdateCallback key=2900
Pooled High Priority Message Processor 27: InitialImageOperation.processChunk done processing key=2900
{noformat}


> Gateway sender queues using heap memory way above configured value after 
> server restart
> ---------------------------------------------------------------------------------------
>
>                 Key: GEODE-8278
>                 URL: https://issues.apache.org/jira/browse/GEODE-8278
>             Project: Geode
>          Issue Type: Bug
>          Components: eviction
>            Reporter: Alberto Gomez
>            Assignee: Alberto Gomez
>            Priority: Major
>
> In a Geode system with the following characteristics:
>  * WAN replication
>  * redundant partitioned regions
>  * overflow configured for the gateway sender queues by means of persistence 
> and a maximum queue memory setting
>  * gateway receivers stopped in one site (B)
>  * Operations sent to the site that does not have the gateway receivers 
> stopped (A)
> When operations are sent to site A, the gateway sender queues start to grow 
> as expected and the heap memory consumed by the queues does not grow 
> indefinitely given that there is overflow to disk when the limit is reached.
> But, if a server is restarted, the restarted server will show a much higher 
> heap memory used than the memory used by this server before it was restarted 
> or by the other servers.
> This can even prevent the server from restarting if the heap memory it 
> requires is above the configured limit.
> According to the memory analyzer the entries taking up the memory are 
> subclasses of ```VMThinDiskLRURegionEntryHeap```.
> The number of instances of this type is the same on the restarted server as 
> on the servers that were not restarted, but on the restarted server they take 
> much more memory. The reason seems to be that the ```value``` member attribute 
> of the instances on the restarted server contains 
> ```VMCachedDeserializable``` objects, while on the servers that were not 
> restarted it contains either ```null``` or ```GatewaySenderEventImpl``` 
> objects, which use much less memory than the ```VMCachedDeserializable``` 
> ones.
> If redundancy is not configured for the region, the problem does not appear, 
> i.e. the heap memory used by the restarted server is similar to what it used 
> before the restart.
> If the node that was not restarted is then restarted, the previously restarted 
> node seems to release the extra memory (my guess is that it is processing the 
> other process queue).
> Also, if traffic is sent again to the Geode cluster, then it seems eviction 
> kicks in and after some short time, the memory of the restarted server goes 
> down to the level it had before it had been restarted.
> As a summary, the problem seems to be that if a server does GII 
> (getInitialImage) from another server, eviction does not occur for gateway 
> sender queue entries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
