BewareMyPower opened a new pull request, #17645:
URL: https://github.com/apache/pulsar/pull/17645

   Fixes #14848
   
   ### Motivation
   
   There were several fixes on `ClientTest.testReferenceCount` but it's still 
very flaky.
   
   The root cause is even after a `Reader` went out of the scope and 
destructed, there was still a `Reader` object existed in the thread of the 
event loop. See
   
https://github.com/apache/pulsar/blob/845daf5cac23a4dda4a209d91c85804a0bcaf28a/pulsar-client-cpp/lib/ReaderImpl.cc#L88
   
   To verify this point, I added some logs and saw:
   
   ```
   2022-09-14 03:52:28.427 INFO  [140046042864960] Reader:39 | Reader ctor 
0x7fffd2a7c110
   # ...
   2022-09-14 03:52:28.444 INFO  [140046039774976] Reader:42 | Reader ctor 
0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3)
   # ...
   2022-09-14 03:52:28.445 INFO  [140046042864960] ClientTest:217 | Reference 
count of the reader: 4
   # ...
   2022-09-14 03:52:28.445 INFO  [140046042864960] ClientImpl:490 | Closing 
Pulsar client with 1 producers and 2 consumers
   2022-09-14 03:52:28.445 INFO  [140046039774976] Reader:55 | Reader dtor 
0x7f5f0273d720 ReaderImpl(0x7f5efc00a9d0, 3)
   ```
   
   The first `Reader` object 0x7fffd2a7c110 was constructed in main thread
   140046042864960. However, it destructed before another `Reader` object 
0x0x7f5f0273d720 that was constructed in event loop thread
   140046039774976. When the callback passed to `createReaderAsync` completed 
the promise, the `createReader` immediately returns, at the same time the 
`Reader` object in the callback was still in the scope and not destructed.
   
   Since `Reader` holds a `shared_ptr<ReaderImpl>` and `ReaderImpl` holds a 
`shared_ptr<ConsumerImpl>`, if we check the reference count too quickly, the 
reference count of the underlying consumer is still positive because the 
`Reader` was not destructed at the moment.
   
   ### Modifications
   
   Since we cannot determine the precise destructed time point because that 
`Reader` object is in the event loop thread, we have to wait for a while. This 
PR adds a `waitUntil` utility function to wait for at most some time until the 
condition is met. Then wait until the reference count becomes 0 after the 
`Reader` object goes out of scope.
   
   Replace `ASSERT_EQ` with `EXPECT_EQ` to let the test continue if it failed.
   
   ### Verifying this change
   
   Following the steps here to reproduce:
   https://github.com/apache/pulsar/issues/14848#issuecomment-1246375370
   
   The test never failed even with `--gtest_repeat=100`.
   
   ### Documentation
   
   <!-- DO NOT REMOVE THIS SECTION. CHECK THE PROPER BOX ONLY. -->
   
   - [ ] `doc-required` 
   (Your PR needs to update docs and you will update later)
   
   - [x] `doc-not-needed` 
   (Please explain why)
   
   - [ ] `doc` 
   (Your PR contains doc changes)
   
   - [ ] `doc-complete`
   (Docs have been already added)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to