On Thu, Apr 10, 2025 at 2:05 PM 'Attila Szegedi' via v8-dev
<[email protected]> wrote:
>
> Hi folks,
>
> I'm Attila, and I'm an engineer at Datadog working on our Node.js 
> profiler[0]. (As a side note, I'd be happy to sometimes give a larger 
> overview of how we do V8 profiling at Datadog.)
>
> TL;DR: if we interrupt V8 with a signal while it's executing 
> HandleScope::DeleteExtensions(), a call to Isolate::GetCurrentContext() will 
> crash even though Isolate::InContext() returns true.
>
> The whole story:
>
> We mostly rely on V8 built-in CpuProfiler APIs, except we also try to gather 
> some more data on each sample, so we intercept the PROF signal handler and do 
> some work before and after delegating to the V8's original PROF handler. E.g. 
> we also gather elapsed thread CPU time and recently we started to gather the 
> value of node::AsyncHooksGetExecutionAsyncId(Isolate*) (I get this is a V8 
> and not a Node list – my question ultimately has nothing to do with Node.js 
> per se.)
>
> Now I know you'd say that invoking random methods from a signal handler is 
> obviously a recipe for disaster, but in this case this API is just a few 
> dependent loads through a chain of pointers that doesn't change much if at 
> all. If I pseudo-unrolled what this Node.js method does, then it'd be 
> something like this (omitting a bunch of defensive checks for isEmpty etc.)
>
> if (isolate->InContext()) {
>   auto ctx = isolate->GetCurrentContext();
>   auto env = ctx ->GetAlignedPointerFromEmbedderData(NodeEnvironmentIndex);
>   return env->async_hooks_->async_id_fields[AsyncHooks::kAsyncIdCounter];
> }
>
> The thing is, it does work – a very obvious case when it wouldn't work is if 
> the PROF signal handler interrupted V8 while it's collecting garbage, but we 
> do guard against it by installing GC prologue and epilogue callbacks and not 
> touching the engine when we know it's GCing.
>
> However, we see one case when it doesn't work, and I can't figure it out on 
> my own so I'm kindly asking for some eyeballs here. Namely, we're seeing a 
> crash with a stack trace of:
>
> Isolate::GetCurrentContext()
> node::AsyncHooksGetExecutionAsyncId(Isolate*)
> dd::SignalHandler::HandleProfilerSignal(…)
> v8::internal::HandleScope::DeleteExtensions(Isolate*)
> node::Environment::RunTimers(…)
> uv__run_timers
> uv_run
>
> So node::Environment::RunTimers has a HandleScope instance[1]; this 
> HandleScope is being destroyed when it's going out of C++ scope, which should 
> be just before the method returns. As part of the destruction process, 
> DeleteExtensions is invoked. Our PROF handler interrupts here, and calls 
> node::AsyncHooksGetExecutionAsyncId() which will invoke Isolate::InContext() 
> and receive true, but then it'll crash in Isolate::GetCurrentContext().
>
> We see repeated crashes with this exact stack trace. There's something in 
> HandleScope::DeleteExtensions() that causes Isolate::GetCurrentContext() to 
> crash even though Isolate::InContext() returned true. If anyone knows off the 
> top of their head what causes this, it'd be appreciated. If we can fix it, 
> even better. I'd settle for being able to observe this state and bail out 
> without crashing.
>
> Again, I do understand that signal handlers can't expect to observe data in 
> an internally consistent state. FWIW, I have a hunch that Isolate::InContext 
> should really be returning false in this situation since at the time the 
> handle scope is destroyed, there's no JS code executing at all, it's about to 
> return to libuv's run_timers. There are a bunch of Call()s[2] in JS functions 
> in a loop in that method, but by the time the handle scope is destroyed, 
> execution is outside of 'em all. The code of RunTimers seems sufficiently 
> complex that this shouldn't be explainable by compiler inlining and 
> reordering.
>
> Again, any insight appreciated, and I'm happy to work more on this with 
> anyone.
>
> Thanks,
>   Attila.
>
> --
> [0]: https://github.com/DataDog/pprof-nodejs
> [1]: 
> https://github.com/nodejs/node/blob/9bbbe60f6b0001964cb62182ce17e7f3980ccf06/src/env.cc#L1490
> [2]: 
> https://github.com/nodejs/node/blob/9bbbe60f6b0001964cb62182ce17e7f3980ccf06/src/env.cc#L1505

HandleScope::DeleteExtensions depends on thread-local state, grep
around for HandleScopeData. Sooner or later you're going to call into
V8 right when it's updating said state. Node doesn't enter/exit
isolates often, but still.

That Environment::RunTimers shows up in the stack trace is not
unexpected; it changes the active context and that's another point
where you're racing the thread from the signal handler.

Is this a new or existing problem? I remember someone from DD reaching
out to me about this or a very similar problem 1-2 years ago, asking
if I was interested in coming in as a consultant (I work on V8, node
and libuv) but I didn't have time back then.

-- 
-- 
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
--- 
You received this message because you are subscribed to the Google Groups 
"v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/v8-dev/CAHQurc-aw9wjd40aXhnLUyTkhJ_cVY%2BFb24kjmVu1n%2BaRZvT2w%40mail.gmail.com.

Reply via email to