justintime4tea opened a new issue, #24919:
URL: https://github.com/apache/pulsar/issues/24919

   ### Search before reporting
   
   - [x] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### What issue do you find in Pulsar docs?
   
   Ran into a rather nasty to debug bug when writing integration tests for a 
node-js application that uses "pulsar-client" NPM package. Essentially, the 
issue is though it seems like the "pulsar-client" supports multiple instances 
of a Pulsar client, the C++ side of things doesn't fully isolate these 
instances and a single `static LogCallback *logCallback` is created for ALL 
clients with each client overwriting this static callback with whatever handler 
the last executed `setLogCallbackHandler` or `Pulsar.Client` constructor is 
given. This isn't really documented other than some comments I see in the C++ 
code.
   
   ## Why this is an issue
   This `static LogCallback *logCallback` being shared results in a potential 
race-condition whereby some final message from one of the multiple C++ pulsar 
client is attempted to be logged and a crash occurs. Since neither the `Close` 
or `~Client()` destructor clear/set `static Client::logCallback` to null on the 
C++ side, it's possible that `static Client::logCallback` becomes a dangling 
pointer pointing to a `ThreadSafeFunction` that was part of a previous client, 
which has since been destroyed.
   
   ## Repro
   **Looking at the Client::Close method and the Client::~Client destructor:**
   - Client::~Client() {} (The destructor is empty)
   - Napi::Value Client::Close(const Napi::CallbackInfo &info) (This function 
only calls pulsar_client_close_async. It does not touch Client::logCallback.)
   
   **Conceptual repro steps:**
   
   1. You create Client A, which sets the static Client::logCallback to point 
to its logger.
   2. You create Client B, which replaces the static Client::logCallback to 
point to its logger. (Client A's original logger function is now abandoned).
   3. You call await ClientB.close(). The Close method is called, and 
eventually ~Client() is called.
   4. Neither Close nor the destructor clears the logCallback.
   5. logCallback is now a dangling pointer, pointing to a ThreadSafeFunction 
that was part of Client B, which is now destroyed.
   6. A background thread from Client A tries to log a message. It uses the 
static logCallback, accesses the dangling pointer, and the process crashes.
   
   ### What is your suggestion?
   
   I suggest that, at bare minimum and baring the C++ code being "corrected", 
documentation is updated so that consumers of pulsar client libraries are made 
aware of this. I'm not sure exactly how, because it doesn't seem like you'd be 
safe even just setting the hanlder a single time due to the nature of the the 
"close" and "destructor" code in C++.
   
   
   ### Any reference?
   
   **Diagnostics**
   Though I don't have a code repro I do have a breakdown of how I stumbled 
upon this - In integration tests where there are many pulsar clients being 
created and closed, along with the underlying Node JS app being tested 
creating/closing it's own pulsar client, the race condition is triggered and a 
SIGABRT is displayed near the end of the test output with the following message:
   
   ```
   libc++abi: terminating due to uncaught exception of type Napi::Error:
   ```
   
   Running the tests using lldb to capture the coredump results in the 
following which lead me to the issue:
   
   ```
       frame #8: 0x00000001922fbbd8 libc++abi.dylib`__cxa_throw + 92
       frame #9: 0x0000000151a485d4 
pulsar.node`Napi::Function::Call(std::initializer_list<napi_value__*> const&) 
const + 196
       frame #10: 0x0000000151a4fc10 pulsar.node`LogMessageProxy(Napi::Env, 
Napi::Function, LogMessage*) + 272
       frame #11: 0x0000000151a4c1ec 
pulsar.node`Napi::ThreadSafeFunction::CallJS(napi_env__*, napi_value__*, void*, 
void*) + 68
       frame #12: 0x00000001000c6b94 node`v8impl::(anonymous 
namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) + 472
   ```
   
   For whatever reason, this only occurs on node versions below v24. I tested 
in many v22.x versions and v23.x versions but I can't seem to reproduce this 
issue with node v24.
   
   The full dump for thread #1:
   
   ```
   * thread #1
     * frame #0: 0x000000019230a388 libsystem_kernel.dylib`__pthread_kill + 8
       frame #1: 0x000000019234388c libsystem_pthread.dylib`pthread_kill + 296
       frame #2: 0x000000019224ca3c libsystem_c.dylib`abort + 124
       frame #3: 0x00000001922f9384 libc++abi.dylib`abort_message + 132
       frame #4: 0x00000001922e7cd8 
libc++abi.dylib`demangling_terminate_handler() + 316
       frame #5: 0x0000000191f6cde4 libobjc.A.dylib`_objc_terminate() + 172
       frame #6: 0x00000001922f8698 libc++abi.dylib`std::__terminate(void 
(*)()) + 16
       frame #7: 0x00000001922fbc30 
libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88
       frame #8: 0x00000001922fbbd8 libc++abi.dylib`__cxa_throw + 92
       frame #9: 0x0000000151a485d4 
pulsar.node`Napi::Function::Call(std::initializer_list<napi_value__*> const&) 
const + 196
       frame #10: 0x0000000151a4fc10 pulsar.node`LogMessageProxy(Napi::Env, 
Napi::Function, LogMessage*) + 272
       frame #11: 0x0000000151a4c1ec 
pulsar.node`Napi::ThreadSafeFunction::CallJS(napi_env__*, napi_value__*, void*, 
void*) + 68
       frame #12: 0x00000001000c6b94 node`v8impl::(anonymous 
namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) + 472
       frame #13: 0x0000000100f01478 node`uv__async_io + 304
       frame #14: 0x0000000100f1524c node`uv__io_poll + 1512
       frame #15: 0x0000000100f019e0 node`uv_run + 408
       frame #16: 0x0000000100093004 node`node::Environment::CleanupHandles() + 
188
       frame #17: 0x000000010009377c node`node::Environment::RunCleanup() + 292
       frame #18: 0x0000000100009ff0 
node`node::FreeEnvironment(node::Environment*) + 128
       frame #19: 0x000000010014d26c node`node::NodeMainInstance::Run() + 304
       frame #20: 0x00000001000bf5f4 node`node::Start(int, char**) + 600
       frame #21: 0x0000000191fa2b98 dyld`start + 6076
   ```
   
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to