On Fri, 6 Sep 2024 08:32:01 GMT, Thomas Stuefe <stu...@openjdk.org> wrote:
>> src/hotspot/share/nmt/nmtCommon.hpp line 153: >> >>> 151: intx const current = os::current_thread_id(); >>> 152: intx const owner = Atomic::load(&_owner); >>> 153: assert(current != owner, "Lock is not reentrant"); >> >> ThreadCritical is reentrant though. We would need to do a lot of testing >> and/or code analysis to ensure we don't have the possibility of needing a >> reentrant lock with NMT. The most likely problems would be if we triggered a >> crash whilst in a locked section of code. > > We have a healthy amount of tests already. We need a wide use of > assert_locked and need to take a close look at error reporting. Locking, in > NMT, is rather simple. > > Error handling: My pragmatic "good enough for this rare scenario" proposal > would be: > > At the entry of error reporting, if the NMT mutex is locked by the crashing > thread, just unlock it manually and try to continue. The chance is high that > this will actually work well enough to survive the few mallocs we may do > during error reporting. > > Reasoning: We use NMT mutex as we did ThreadCritical: rather widely scoped. > The time windows where things are out of sync and reentrant use of the same > functionality can bite us are small. Moreover, NMT's two subsystems - malloc > tracker and mmap tracker - are mostly independent from each other, and even > leaving the mmap tracker in an inconsistent state will not prevent us from > doing malloc during error reporting. We don't use mmap during error > reporting, just malloc. Finally, moreover, NMT malloc code is comparatively > simple and stable; mmap code is more complex, and we plan to rework parts of > it in the near future. > > TL;DR I think just manually releasing the lock in error reporting has a high > chance of succeeding. Especially if we disable the NMT report in the hs-err > file I am not even sure this is necessary, though. We have safeguards against > error reporting locking up (step timeouts and secondary crash guards). Thinking through this more, we can probably shorten my advice to this: When entering error handling, if NMT lock is held by crashing thread and NMT level is detail, switch back to NMT level summary. That's it. Reasoning: we only do mallocs inside hs-err generation. For malloc, we only lock inside NMT if the NMT level is detail. And this should never change: malloc is too hot to justify a lock in summary mode. The only point where possible crashes or deadlocks can occur in error reporting is when doing the NMT report itself - but chances for this to happen are remote, and we have secondary crash handling and error step timeouts to deal with this if it happens. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/20852#discussion_r1746910791