TakaHiR07 opened a new pull request, #4529:
URL: https://github.com/apache/bookkeeper/pull/4529

   
   ### Motivation
   
   One of our clusters hit a case where reading the ledger LAC timed out and 
the ledger could not be recovered, which made the topic unavailable. After 
adding extra logging to the bookie server, we found that the bottleneck was in 
EntryLocationIndex#getLastEntryInLedgerInternal, which spent 2.5 minutes 
scanning RocksDB.  
   
![企业微信截图_ccc8453e-19b6-4c6a-8d7c-d63f2b7bf09e](https://github.com/user-attachments/assets/23d5e04c-1886-44ce-9a2d-78e02e7dcc97)
   
   Currently it is hard to tell that the bottleneck is in 
getLastEntryInLedgerInternal. If getLastEntryInLedgerInternal throws a 
NoEntry exception, read-locations-index-time does not record the long 
latency, so the existing metrics give no indication that the time was spent in 
getLastEntryInLedgerInternal.
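
   The gap described above can be illustrated with a minimal Java sketch. It is not the BookKeeper code: `LatencyRecorder`, `Lookup`, `NoEntryException` and `timedLookup` are hypothetical stand-ins invented for this example. It only shows the general pattern of recording elapsed time in a `finally` block so that a slow lookup is measured even when it ends in an exception.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LastEntryLatencySketch {

    /** Hypothetical stand-in for a stats logger; not a BookKeeper class. */
    static final class LatencyRecorder {
        final AtomicLong successCount = new AtomicLong();
        final AtomicLong failureCount = new AtomicLong();
        final AtomicLong totalNanos = new AtomicLong();

        void registerSuccessfulEvent(long nanos) {
            successCount.incrementAndGet();
            totalNanos.addAndGet(nanos);
        }

        void registerFailedEvent(long nanos) {
            failureCount.incrementAndGet();
            totalNanos.addAndGet(nanos);
        }
    }

    /** Hypothetical exception standing in for the "no entry" case. */
    static final class NoEntryException extends Exception {
        NoEntryException(long ledgerId) {
            super("No entry found for ledger " + ledgerId);
        }
    }

    /** Hypothetical lookup that may scan an index for a long time before failing. */
    interface Lookup {
        long find(long ledgerId) throws NoEntryException;
    }

    /**
     * Runs the lookup and records its latency whether it returns or throws.
     * The finally block is what makes the failed (and possibly slow) path visible.
     */
    static long timedLookup(Lookup lookup, long ledgerId, LatencyRecorder recorder)
            throws NoEntryException {
        long start = System.nanoTime();
        boolean succeeded = false;
        try {
            long entryId = lookup.find(ledgerId);
            succeeded = true;
            return entryId;
        } finally {
            long elapsed = System.nanoTime() - start;
            if (succeeded) {
                recorder.registerSuccessfulEvent(elapsed);
            } else {
                recorder.registerFailedEvent(elapsed);
            }
        }
    }

    public static void main(String[] args) {
        LatencyRecorder recorder = new LatencyRecorder();
        try {
            // Simulate a lookup that finds nothing and throws.
            timedLookup(id -> { throw new NoEntryException(id); }, 1L, recorder);
        } catch (NoEntryException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
        System.out.println("failed lookups recorded: " + recorder.failureCount.get());
    }
}
```

   With this shape, a lookup that scans for minutes and then throws still contributes to the failure count and the total time, which is the information that was missing in the incident above.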
   
   
![企业微信截图_599c4094-1e4e-4edb-8c64-35bed13e0917](https://github.com/user-attachments/assets/ff4b7a08-3683-473b-b392-985334424855)
   
![企业微信截图_8d312886-7c02-4ef7-9e3a-a94525a1c8e1](https://github.com/user-attachments/assets/934064b4-8aee-4f8c-a748-e2b1eac3a040)
   
   
   Once getLastEntry becomes a bottleneck, in the worst case the ledger 
becomes unavailable and the Pulsar topic with it, so I think it is important 
to add this metric.
   
   
   
   
   ### Changes
   
   Add a metric for the latency of getLastEntryInLedgerInternal.
   
   

