rdhabalia opened a new pull request #7499:
URL: https://github.com/apache/pulsar/pull/7499


   ### Motivation
   
   We have seen several scenarios in which a broker suddenly sees a huge 
spike in heap-memory usage, consumes all of its allocated heap memory, and 
eventually crashes with OOM. One of these scenarios is that the broker can't 
handle back-pressure when bookie add-entry requests time out. 
   The broker limits max-pending messages per topic, but it doesn't limit the 
total number of pending messages across all topics. If a broker is serving many 
topics with a high publish rate and, for some reason, starts seeing add-entry 
timeouts from the bk-client, it allocates a large number of non-recyclable 
objects. This causes heavy GC, and the broker eventually crashes with OOM. We 
saw many brokers crash at the same time due to bk network partitioning or high 
bk add-entry latency. The problem is easy to reproduce by simulating bookie 
behavior that causes the `Bookie operation timeout` error at the broker while 
publishing at a 30K-40K rate across 1K topics.
   Therefore, we need a mechanism to handle bookie back-pressure at the broker 
by limiting the number of pending messages across all topics in the broker.
   
   Broker-Error: Add-entry timing out at bk-client
   ```
   org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - 
[prop/cluster/ns/persistent/t1] Created new ledger 123456
   13:25:04.468 [BookKeeperClientWorker-OrderedExecutor-20-0] WARN  
org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (123456, 1): 
Bookie operation timeout
   13:25:04.469 [BookKeeperClientWorker-OrderedExecutor-20-0] WARN  
org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (123456, 2): 
Bookie operation timeout
   13:25:04.469 [BookKeeperClientWorker-OrderedExecutor-20-0] WARN  
org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (123456, 3): 
Bookie operation timeout
   13:25:04.469 [BookKeeperClientWorker-OrderedExecutor-20-0] WARN  
org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (123456, 4): 
Bookie operation timeout
   ```
   Broker sees sudden spike in heap memory usage and crashes
   
![Snip20200709_56](https://user-images.githubusercontent.com/2898254/87123893-7d0c3600-c23c-11ea-9172-55ccd78f8d64.png)
   
   ### Modification
   - add a configuration to restrict the total number of pending publish 
messages across all topics in a broker: `maxConcurrentPendingPublishMessages`
   - by default this feature is disabled (value = 0) and does not change 
any existing behavior
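
To illustrate the idea of a broker-wide limit, here is a minimal sketch of a shared counter that caps in-flight publishes across all topics. It is not the code from this PR: the class `PendingPublishLimiter` and its methods are hypothetical names chosen for illustration, and how the broker reacts when the limit is reached (for example, pausing reads from producer connections) is left to the caller. The only detail taken from the description above is that a limit of 0 means the feature is disabled.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical broker-wide limiter for pending publish messages.
 * Illustration only; this is not the implementation in the PR.
 */
public class PendingPublishLimiter {
    // Assumed to be populated from maxConcurrentPendingPublishMessages; 0 disables the limit.
    private final long maxPending;
    // Number of publishes sent to storage but not yet completed, across all topics.
    private final AtomicLong pending = new AtomicLong();

    public PendingPublishLimiter(long maxPending) {
        if (maxPending < 0) {
            throw new IllegalArgumentException("maxPending must be >= 0");
        }
        this.maxPending = maxPending;
    }

    /**
     * Tries to reserve one slot for a publish.
     * Returns false when the broker-wide limit is reached, so the caller
     * can apply back-pressure instead of buffering more messages.
     */
    public boolean tryAcquire() {
        if (maxPending == 0) {
            // Limit disabled: only track the count, never reject.
            pending.incrementAndGet();
            return true;
        }
        while (true) {
            long current = pending.get();
            if (current >= maxPending) {
                return false;
            }
            if (pending.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Releases one slot; call once when a publish completes or fails. */
    public void release() {
        pending.decrementAndGet();
    }

    /** Current number of pending publishes across all topics. */
    public long pendingCount() {
        return pending.get();
    }
}
```

In this sketch every topic would share one limiter instance, call `tryAcquire()` before handing a message to the storage layer, and call `release()` from the add-entry callback whether the write succeeded or timed out. That keeps the number of buffered messages bounded even while add-entry calls are stalling.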

