TakaHiR07 opened a new pull request, #4467:
URL: https://github.com/apache/bookkeeper/pull/4467

   Fix https://github.com/apache/bookkeeper/issues/4457, 
https://github.com/apache/pulsar/issues/22986
   
   
   ### Motivation
   
   There is a dead lock issue in pulsar-3.0, which is easily to reproduce when 
some bookies become unavailable. 
   
   The issue occurred when openLedger and do recoverAdd in 
ReadOnlyLedgerHandle. 
   - this issue is not exist in branch-4.14, but exist after branch-4.16
   - this issue only exist in ReadOnlyLedgerHandle, but not exist in 
LedgerHandle. Because LedgerHandle#handleBookieFailure's implementation is 
different from ReadOnlyLedgerHandle. It would not lock metadataLock and then 
lock other pendingAddOp of the same ledger.
   
   We have analysed the stack in issue. The deadlock process can be simplify as:
        1. In thread pulsar-io-2, request2 lock pendingAddOp2 in 
writeComplete(), because of channel disconnect
        2. In thread pulsar-io-1, request1 lock pendingAddOp1 in 
writeComplete(), because of channel disconnect. And lock metadataLock in 
ReadOnlyLedgerHandle#handleBookieFailure.
        3. In thread pulsar-io-2, request2 try to apply metadataLock
        4. In thread pulsar-io-1, request1 try to apply pendingAddOp2 because 
of unsetSuccessAndSendWriteRequest
   
   
   Therefore, the deadlock occur. And I guess it occur after this pr 
https://github.com/apache/bookkeeper/pull/3784, because the pr add synchronized 
in pendingAddOp. But I think this is not the root reason.
   
   As our known for bookkeeper, handleBookieFailure and the other operations of 
entry should always run in specific thread for ledgerId. However, we saw that 
the two pendingAddOps of the same ledger is run in different "pulsar-io" 
thread. This is not reasonable.
   
   After diving into code, we can find that "pulsar-io" threadPool is used for 
PerChannelBookieClient connect, and "BookKeeperClientWorker" threadPool is used 
for handling write or read entry. But if connect fail and throw exception, 
bookieClient.completeAdd still use "pulsar-io" to execute. That is the root 
reason of the dead lock issue. If we can make sure all the pendingAddOps of the 
same ledger run in the specific executor, the dead lock will not occur.
   
   
   
   ### Changes
   
   If channel is not ready for addEntry, execute completeAddOperation in 
specific orderedExecutor of ledger, instead of in eventLoopGroup. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to