Hi David, out of interest, would pthread_getcpuclockid also crash when fed an pthread_t of an existing thread which has been suspended and not yet started (on Solaris, AIX)? Not that it would have much to report :)
-- 1643 // exclude externally visible JavaThreads 1644 if (thread->is_Java_thread() && !thread->is_hidden_from_external_view()) { 1645 return; 1646 } 1647 1648 // NonJavaThread instances may not be fully initialized yet, so we need to 1649 // skip any that aren't - check for zero stack_size() 1650 if (!thread->is_Java_thread() && thread->stack_size() == 0) { 1651 return; 1652 } So do we really want to skip the check for externally hidden java threads? If not, would not this b better: 1648 // Thread instances may not be fully initialized yet, so we need to 1649 // skip any that aren't - check for zero stack_size() 1650 if (thread->stack_size() == 0) { 1651 return; 1652 } -- I agree an "is_initialized" flag would be better. We do have the ThreadState in the associated OsThread, could that be used? Thanks, Thomas On Tue, Nov 20, 2018 at 9:53 AM David Holmes <david.hol...@oracle.com> wrote: > > After discussions with Kim I've decided to split out the NJT list update > into a separate RFE: > > https://bugs.openjdk.java.net/browse/JDK-8214097 > > So only the change in management.cpp needs reviewing and testing. > > Updated webrev: > > http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/ > > Thanks, > David > > On 20/11/2018 10:01 am, David Holmes wrote: > > Bug: https://bugs.openjdk.java.net/browse/JDK-8212207 > > webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/ > > > > There is an internal management API that reports CPU times for > > NonJavaThreads (NJTs). That functionality requires a valid/live target > > thread so that we can use its pthread_t identity to obtain its CPU clock > > via pthread_getcpuclockid(). > > > > There is an iteration mechanism for NJTs in which the NJT is registered > > during its constructor and de-registered during its destructor. A thread > > that has only been constructed has not yet executed and so is not a > > valid target for this management API. This seems to be the cause of > > failures reported in this bug (and JDK-8213434). Registering a NJT only > > when it starts executing is an appealing fix for this, but that impacts > > all current users of the NJT list and straight-away causes a problem > > with the BarrierSet initialization logic. So I don't attempt that. > > > > Instead the first part of the fix is for ThreadTimesClosure::do_thread > > to skip threads that have not yet executed - which we can recognize by > > seeing an uninitialized (i.e. zero) stackbase. > > > > A second part of the fix, which can be deferred to a separate RFE for > > NJT lifecycle management if desired, tackles the problem of encountering > > a terminated thread during iteration - which can also lead to SEGVs. > > This can arise because NJT's are not actually "destructed", even if they > > terminate, and so they never get removed from the NJT list. Calling > > destructors is problematic because the code using these NJTs assume they > > are always valid. So the fix in this case is to move the de-registering > > from the NJT list out of the destructor and into the Thread::call_run() > > method so it is done before a thread actually terminates. This can be > > considered a first step in cleaning up the NJT lifecycle, where the > > remaining steps touch on a lot of areas and so need to be handled > > separately e.g. see JDK-8087340 for shutting down WorkGang GC worker > > threads. > > > > Testing: tiers 1 -3 > > > > I should point out that I've been unable to reproduce this failure > > locally, even after thousands of runs. I'm hoping Zhengyu can test this > > in the conditions reported in JDK-8213434. > > > > Thanks, > > David