On Fri, 6 Sep 2024 08:32:01 GMT, Thomas Stuefe <stu...@openjdk.org> wrote:

>> src/hotspot/share/nmt/nmtCommon.hpp line 153:
>> 
>>> 151:       intx const current =  os::current_thread_id();
>>> 152:       intx const owner = Atomic::load(&_owner);
>>> 153:       assert(current != owner, "Lock is not reentrant");
>> 
>> ThreadCritical is reentrant though. We would need to do a lot of testing 
>> and/or code analysis to ensure we don't have the possibility of needing a 
>> reentrant lock with NMT. The most likely problems would be if we triggered a 
>> crash whilst in a locked section of code.
>
> We have a healthy amount of tests already. We need a wide use of 
> assert_locked and need to take a close look at error reporting. Locking, in 
> NMT, is rather simple.
> 
> Error handling: My pragmatic "good enough for this rare scenario" proposal 
> would be: 
> 
> At the entry of error reporting, if the NMT mutex is locked by the crashing 
> thread, just unlock it manually and try to continue. The chance is high that 
> this will actually work well enough to survive the few mallocs we may do 
> during error reporting. 
> 
> Reasoning: We use NMT mutex as we did ThreadCritical: rather widely scoped. 
> The time windows where things are out of sync and reentrant use of the same 
> functionality can bite us are small. Moreover, NMT's two subsystems - malloc 
> tracker and mmap tracker - are mostly independent from each other, and even 
> leaving the mmap tracker in an inconsistent state will not prevent us from 
> doing malloc during error reporting. We don't use mmap during error 
> reporting, just malloc. Finally, moreover, NMT malloc code is comparatively 
> simple and stable; mmap code is more complex, and we plan to rework parts of 
> it in the near future.
> 
> TL;DR I think just manually releasing the lock in error reporting has a high 
> chance of succeeding. Especially if we disable the NMT report in the hs-err 
> file I am not even sure this is necessary, though. We have safeguards against 
> error reporting locking up (step timeouts and secondary crash guards).

Thinking through this more, we can probably shorten my advice to this:

When entering error handling, if NMT lock is held by crashing thread and NMT 
level is detail, switch back to NMT level summary.

That's it.

Reasoning: we only do mallocs inside hs-err generation. For malloc, we only 
lock inside NMT if the NMT level is detail. And this should never change: 
malloc is too hot to justify a lock in summary mode.

The only point where possible crashes or deadlocks can occur in error reporting 
is when doing the NMT report itself - but chances for this to happen are 
remote, and we have secondary crash handling and error step timeouts to deal 
with this if it happens.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20852#discussion_r1746910791

Reply via email to