ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0

Thomas Stüfe Tue, 27 Nov 2018 23:37:02 -0800

Hi David,


On Wed, Nov 28, 2018 at 8:26 AM David Holmes <david.hol...@oracle.com> wrote:
>
> Hi Thomas,
>
> On 28/11/2018 4:30 pm, Thomas Stüfe wrote:
> > Hi David,
> >
> > I admit I still do not really get it..
> >
> > E.g. for GC threads:
> >
> > We have
> >
> > static ConcurrentMarkSweepThread*
> > ConcurrentMarkSweepThread::start(CMSCollector* collector);
> >
> > which creates a ConcurrentMarkSweepThread and then calls
> > os::create_thread() to hook it up with an OSThread and start it.
> >
> > ConcurrentMarkSweepThread derives from NonJavaThread, which in its
> > constructor adds the thread object to the list.
> >
> > So there is a time gap where the NJT is part of the list, but
> > Thread::_osthread is still NULL.
> >
> > In ThreadTimesClosure::do_thread(), we call
> > os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),
> > &clockid);
> >
> > Should we then not crash when dereferencing the NULL
> > osthread()->pthread_id()? Why do we crash inside
> > pthread_getcpuclockid?
> >
> > If I look further into os::create_thread(), I see that there is
> > another, smaller time gap where we create OSThread and anchor it into
> > the Thread object:
> >
> >    thread->set_osthread(osthread);
> >
> > and then later, after pthread_create is thru, set its thread id:
> >
> >      // Store pthread info into the OSThread
> >      osthread->set_pthread_id(tid);
> >
> > When OSThread is created, we call OSThread::pd_initialize() and set
> > its _threadid to 0. We do this in the constructor, before anchoring
> > the OSThread in its Thread.
> >
> > So for my understanding, we should have two situations:
> >
> > (1)- NJT is in list, but its _osthread member is NULL. In that case I
> > would expect a different crash.
> > (2)- NJT is in list, _osthread is set, but its _thread_id is NULL.
>
> Yes both situations may be possible. In the related bug report only the
> 0 thread_id was observed.

Oh, was it 0? That was not clear from the bug report. I wondered
whether it could have been a random value.

>
> > (Modulo any concurrency issues, e.g. could the
> > "thread->set_osthread(osthread);" be visible before OSThread is
> > initialized?
> >
> > Depending on what is happening, a fix for (1) would probably be a
> > querying for osthread==NULL, a fix for (2) would be to guard os::---
> > functions - well, at least os::thread_cpu_time() - to disregard
> > Threads with pthread_t == 0. Both fixes seem better to me than
> > querying the stacksize() when walking the list - that seems a bit
> > awkward.
>
> I find it awkward that the list is populated with partially constructed
> threads in the first place.

Sure. That is awkward too.

> This particular code is only interested in
> live threads that can safely be queried. The stack_size() check is a
> surrogate for "this is a live thread that can be queried".

I understand. But would a check in os::thread_cpu_time() for
osthread==NULL || osthread->pthread_id == 0 not be safer and help in
more situations?

Hence my former questions. If pthread_id were not 0 but either some
random value or a pthread_id of a long-terminated thread, that would
not help.

>
> > --
> > P.s.
> >
> > ConcurrentGCThread::ConcurrentGCThread() :
> >    _should_terminate(false), _has_terminated(false) {
> > };
> >
> > I was surprised to see no invocation to the base class ctor in the
> > initializer list. I was not aware that this was even possible. For
> > code clearness, I would prefer the call to the base class ctor to be
> > explicit.)
>
> I assume it is implicit. But yes it should be explicit.
>
> Cheers,
> David
>

..Thomas

> > --
> >
> > Cheers, Thomas
> >
> >
> > On Wed, Nov 28, 2018 at 3:58 AM David Holmes <david.hol...@oracle.com> 
> > wrote:
> >>
> >> Sorry for the delayed response.
> >>
> >> On 21/11/2018 3:01 pm, Kim Barrett wrote:
> >>>> On Nov 20, 2018, at 3:50 AM, David Holmes <david.hol...@oracle.com> 
> >>>> wrote:
> >>>>
> >>>> After discussions with Kim I've decided to split out the NJT list update 
> >>>> into a separate RFE:
> >>>>
> >>>> https://bugs.openjdk.java.net/browse/JDK-8214097
> >>>>
> >>>> So only the change in management.cpp needs reviewing and testing.
> >>>>
> >>>> Updated webrev:
> >>>>
> >>>> http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
> >>>
> >>> Looks good.
> >>
> >> Thanks Kim.
> >>
> >> I've decided to stick with this simple fix for NJTs only.
> >>
> >> David
> >>
> >>>>
> >>>> Thanks,
> >>>> David
> >>>>
> >>>> On 20/11/2018 10:01 am, David Holmes wrote:
> >>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
> >>>>> webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
> >>>>> There is an internal management API that reports CPU times for 
> >>>>> NonJavaThreads (NJTs). That functionality requires a valid/live target 
> >>>>> thread so that we can use its pthread_t identity to obtain its CPU 
> >>>>> clock via pthread_getcpuclockid().
> >>>>> There is an iteration mechanism for NJTs in which the NJT is registered 
> >>>>> during its constructor and de-registered during its destructor. A 
> >>>>> thread that has only been constructed has not yet executed and so is 
> >>>>> not a valid target for this management API. This seems to be the cause 
> >>>>> of failures reported in this bug (and JDK-8213434). Registering a NJT 
> >>>>> only when it starts executing is an appealing fix for this, but that 
> >>>>> impacts all current users of the NJT list and straight-away causes a 
> >>>>> problem with the BarrierSet initialization logic. So I don't attempt 
> >>>>> that.
> >>>>> Instead the first part of the fix is for ThreadTimesClosure::do_thread 
> >>>>> to skip threads that have not yet executed - which we can recognize by 
> >>>>> seeing an uninitialized (i.e. zero) stackbase.
> >>>>> A second part of the fix, which can be deferred to a separate RFE for 
> >>>>> NJT lifecycle management if desired, tackles the problem of 
> >>>>> encountering a terminated thread during iteration - which can also lead 
> >>>>> to SEGVs. This can arise because NJT's are not actually "destructed", 
> >>>>> even if they terminate, and so they never get removed from the NJT 
> >>>>> list. Calling destructors is problematic because the code using these 
> >>>>> NJTs assume they are always valid. So the fix in this case is to move 
> >>>>> the de-registering from the NJT list out of the destructor and into the 
> >>>>> Thread::call_run() method so it is done before a thread actually 
> >>>>> terminates. This can be considered a first step in cleaning up the NJT 
> >>>>> lifecycle, where the remaining steps touch on a lot of areas and so 
> >>>>> need to be handled separately e.g. see JDK-8087340 for shutting down 
> >>>>> WorkGang GC worker threads.
> >>>>> Testing: tiers 1 -3
> >>>>> I should point out that I've been unable to reproduce this failure 
> >>>>> locally, even after thousands of runs. I'm hoping Zhengyu can test this 
> >>>>> in the conditions reported in JDK-8213434.
> >>>>> Thanks,
> >>>>> David
> >>>
> >>>

Re: RFR (S) 8212207: runtime/InternalApi/ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0

Reply via email to