On Thu, Apr 10, 2025 at 2:05 PM 'Attila Szegedi' via v8-dev <[email protected]> wrote: > > Hi folks, > > I'm Attila, and I'm an engineer at Datadog working on our Node.js > profiler[0]. (As a side note, I'd be happy to sometimes give a larger > overview of how we do V8 profiling at Datadog.) > > TL;DR: if we interrupt V8 with a signal while it's executing > HandleScope::DeleteExtensions(), a call to Isolate::GetCurrentContext() will > crash even though Isolate::InContext() returns true. > > The whole story: > > We mostly rely on V8 built-in CpuProfiler APIs, except we also try to gather > some more data on each sample, so we intercept the PROF signal handler and do > some work before and after delegating to the V8's original PROF handler. E.g. > we also gather elapsed thread CPU time and recently we started to gather the > value of node::AsyncHooksGetExecutionAsyncId(Isolate*) (I get this is a V8 > and not a Node list – my question ultimately has nothing to do with Node.js > per se.) > > Now I know you'd say that invoking random methods from a signal handler is > obviously a recipe for disaster, but in this case this API is just a few > dependent loads through a chain of pointers that doesn't change much if at > all. If I pseudo-unrolled what this Node.js method does, then it'd be > something like this (omitting a bunch of defensive checks for isEmpty etc.) > > if (isolate->InContext()) { > auto ctx = isolate->GetCurrentContext(); > auto env = ctx ->GetAlignedPointerFromEmbedderData(NodeEnvironmentIndex); > return env->async_hooks_->async_id_fields[AsyncHooks::kAsyncIdCounter]; > } > > The thing is, it does work – a very obvious case when it wouldn't work is if > the PROF signal handler interrupted V8 while it's collecting garbage, but we > do guard against it by installing GC prologue and epilogue callbacks and not > touching the engine when we know it's GCing. > > However, we see one case when it doesn't work, and I can't figure it out on > my own so I'm kindly asking for some eyeballs here. Namely, we're seeing a > crash with a stack trace of: > > Isolate::GetCurrentContext() > node::AsyncHooksGetExecutionAsyncId(Isolate*) > dd::SignalHandler::HandleProfilerSignal(…) > v8::internal::HandleScope::DeleteExtensions(Isolate*) > node::Environment::RunTimers(…) > uv__run_timers > uv_run > > So node::Environment::RunTimers has a HandleScope instance[1]; this > HandleScope is being destroyed when it's going out of C++ scope, which should > be just before the method returns. As part of the destruction process, > DeleteExtensions is invoked. Our PROF handler interrupts here, and calls > node::AsyncHooksGetExecutionAsyncId() which will invoke Isolate::InContext() > and receive true, but then it'll crash in Isolate::GetCurrentContext(). > > We see repeated crashes with this exact stack trace. There's something in > HandleScope::DeleteExtensions() that causes Isolate::GetCurrentContext() to > crash even though Isolate::InContext() returned true. If anyone knows off the > top of their head what causes this, it'd be appreciated. If we can fix it, > even better. I'd settle for being able to observe this state and bail out > without crashing. > > Again, I do understand that signal handlers can't expect to observe data in > an internally consistent state. FWIW, I have a hunch that Isolate::InContext > should really be returning false in this situation since at the time the > handle scope is destroyed, there's no JS code executing at all, it's about to > return to libuv's run_timers. There are a bunch of Call()s[2] in JS functions > in a loop in that method, but by the time the handle scope is destroyed, > execution is outside of 'em all. The code of RunTimers seems sufficiently > complex that this shouldn't be explainable by compiler inlining and > reordering. > > Again, any insight appreciated, and I'm happy to work more on this with > anyone. > > Thanks, > Attila. > > -- > [0]: https://github.com/DataDog/pprof-nodejs > [1]: > https://github.com/nodejs/node/blob/9bbbe60f6b0001964cb62182ce17e7f3980ccf06/src/env.cc#L1490 > [2]: > https://github.com/nodejs/node/blob/9bbbe60f6b0001964cb62182ce17e7f3980ccf06/src/env.cc#L1505
HandleScope::DeleteExtensions depends on thread-local state, grep around for HandleScopeData. Sooner or later you're going to call into V8 right when it's updating said state. Node doesn't enter/exit isolates often, but still. That Environment::RunTimers shows up in the stack trace is not unexpected; it changes the active context and that's another point where you're racing the thread from the signal handler. Is this a new or existing problem? I remember someone from DD reaching out to me about this or a very similar problem 1-2 years ago, asking if I was interested in coming in as a consultant (I work on V8, node and libuv) but I didn't have time back then. -- -- v8-dev mailing list [email protected] http://groups.google.com/group/v8-dev --- You received this message because you are subscribed to the Google Groups "v8-dev" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion visit https://groups.google.com/d/msgid/v8-dev/CAHQurc-aw9wjd40aXhnLUyTkhJ_cVY%2BFb24kjmVu1n%2BaRZvT2w%40mail.gmail.com.
