horizonzy commented on issue #3408:
URL: https://github.com/apache/bookkeeper/issues/3408#issuecomment-1196586078

   After a long investigation, we found that the problem is on the bookie side: 
the `ReplicationWorker`s send too many requests.
   
   The bookie that was shut down held many ledgers, so when it went down the 
`Auditor` marked many ledgers as underreplicated. 
   Many `ReplicationWorker`s then started replicating those ledgers. The config 
`rereplicationEntryBatchSize` is 500, so every `ReplicationWorker` sends 
500 read requests per batch to the bookie servers. The bookie servers receive a 
large number of requests and allocate direct memory for them.
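
   To make the request fan-out concrete, here is a minimal sketch of the upper 
bound on in-flight reads. The batch size of 500 is the value mentioned above; 
the worker count and the class and method names are hypothetical, picked only 
for illustration, and the sketch assumes every worker has one full batch 
outstanding at the same time.

```java
// Rough upper bound on read requests in flight across the cluster.
// Assumption: each ReplicationWorker has exactly one full batch outstanding.
final class ReplicationReadLoad {

    // workers * batchSize, computed as long to avoid int overflow.
    static long maxInFlightReads(int workers, int batchSize) {
        return (long) workers * batchSize;
    }

    public static void main(String[] args) {
        int batchSize = 500; // rereplicationEntryBatchSize from this comment
        int workers = 10;    // hypothetical number of ReplicationWorkers
        System.out.println("max in-flight reads: "
                + maxInFlightReads(workers, batchSize));
    }
}
```

   Each of those reads needs a direct buffer on the bookie that serves it, so 
the bound grows linearly with both the worker count and the batch size.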
   
   Releases do not keep up with allocations, so the number of `PoolChunk`s 
keeps growing until direct memory usage reaches maxDirectMemory.
   
   @gaozhangmin supplied two heap dump files. The `less` file was dumped when 
the replication started; the `more` file was dumped after the replication had 
been running for a while.
   
   
[less.hprof.zip](https://github.com/apache/bookkeeper/files/9197159/less.hprof.zip)
   
[more.hprof.zip](https://github.com/apache/bookkeeper/files/9197161/more.hprof.zip)
   
   I found 244 `PoolChunk` instances in `more` and 120 in `less`. Each 
`PoolChunk` holds 4M of direct memory in the bookie, so `more` uses 
124 * 4M = 496M more direct memory than `less`.  
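
   The arithmetic behind that comparison, as a small sketch. The chunk counts 
(120 and 244) and the 4M chunk size come from the heap dumps above; reading 
"4M" as 4 MiB is an assumption, and the class and method names are made up for 
this example.

```java
// Direct memory growth between the two heap dumps, derived from PoolChunk counts.
final class PoolChunkGrowth {

    // Assumption: "4M" per PoolChunk means 4 MiB.
    static final long CHUNK_SIZE_BYTES = 4L * 1024 * 1024;

    // Extra direct memory held by the additional chunks, in bytes.
    static long growthBytes(int chunksBefore, int chunksAfter) {
        return (long) (chunksAfter - chunksBefore) * CHUNK_SIZE_BYTES;
    }

    public static void main(String[] args) {
        long growth = growthBytes(120, 244); // `less` vs `more`
        System.out.println((growth / (1024 * 1024)) + " MiB"); // prints "496 MiB"
    }
}
```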
   
   We also found another issue: if the user configures `DbLedgerStorage`, it 
reserves 1/2 of the direct memory at startup for the readCache and writeCache. 
That memory is unpooled, but it still occupies direct memory. 
   
   As a result, the direct memory pool has only 1/2 of the direct memory left 
to allocate from, which makes OOM more likely.
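
   A rough sketch of what that halving means for the pool, assuming the 1/2 
split described above and 4 MiB per `PoolChunk`. The 2048 MiB limit is a 
hypothetical value, not taken from the reported environment, and the class and 
method names are illustrative only.

```java
// How many 4 MiB PoolChunks fit once half of direct memory is taken by the caches.
final class PooledDirectMemoryBudget {

    static final long MIB = 1024L * 1024;
    static final long CHUNK_SIZE_BYTES = 4 * MIB; // assumption: "4M" means 4 MiB

    // Direct memory left for the pooled allocator after the caches take 1/2.
    static long pooledBudgetBytes(long maxDirectMemoryBytes) {
        return maxDirectMemoryBytes - maxDirectMemoryBytes / 2;
    }

    // Number of whole chunks that fit into the remaining budget.
    static long maxChunks(long maxDirectMemoryBytes) {
        return pooledBudgetBytes(maxDirectMemoryBytes) / CHUNK_SIZE_BYTES;
    }

    public static void main(String[] args) {
        long maxDirect = 2048 * MIB; // hypothetical maxDirectMemory
        System.out.println("pool budget: "
                + pooledBudgetBytes(maxDirect) / MIB + " MiB");
        System.out.println("max chunks: " + maxChunks(maxDirect));
    }
}
```

   With these example numbers the pool can hold at most 256 chunks, so a burst 
of replication reads hits the limit with half the headroom it would otherwise 
have.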
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@bookkeeper.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
