Hi David,

I admit I still do not really get it..

E.g. for GC threads:

We have

static ConcurrentMarkSweepThread*
ConcurrentMarkSweepThread::start(CMSCollector* collector);

which creates a ConcurrentMarkSweepThread and then calls
os::create_thread() to hook it up with an OSThread and start it.

ConcurrentMarkSweepThread derives from NonJavaThread, which in its
constructor adds the thread object to the list.

So there is a time gap where the NJT is part of the list, but
Thread::_osthread is still NULL.

In ThreadTimesClosure::do_thread(), we call
os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),
&clockid);

Should we then not crash when dereferencing the NULL
osthread()->pthread_id()? Why do we crash inside
pthread_getcpuclockid?

If I look further into os::create_thread(), I see that there is
another, smaller time gap where we create OSThread and anchor it into
the Thread object:

  thread->set_osthread(osthread);

and then later, after pthread_create is thru, set its thread id:

    // Store pthread info into the OSThread
    osthread->set_pthread_id(tid);

When OSThread is created, we call OSThread::pd_initialize() and set
its _threadid to 0. We do this in the constructor, before anchoring
the OSThread in its Thread.

So for my understanding, we should have two situations:

(1)- NJT is in list, but its _osthread member is NULL. In that case I
would expect a different crash.
(2)- NJT is in list, _osthread is set, but its _thread_id is NULL.

(Modulo any concurrency issues, e.g. could the
"thread->set_osthread(osthread);" be visible before OSThread is
initialized?

Depending on what is happening, a fix for (1) would probably be a
querying for osthread==NULL, a fix for (2) would be to guard os::---
functions - well, at least os::thread_cpu_time() - to disregard
Threads with pthread_t == 0. Both fixes seem better to me than
querying the stacksize() when walking the list - that seems a bit
awkward.

--
P.s.

ConcurrentGCThread::ConcurrentGCThread() :
  _should_terminate(false), _has_terminated(false) {
};

I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)
--

Cheers, Thomas


On Wed, Nov 28, 2018 at 3:58 AM David Holmes <david.hol...@oracle.com> wrote:
>
> Sorry for the delayed response.
>
> On 21/11/2018 3:01 pm, Kim Barrett wrote:
> >> On Nov 20, 2018, at 3:50 AM, David Holmes <david.hol...@oracle.com> wrote:
> >>
> >> After discussions with Kim I've decided to split out the NJT list update 
> >> into a separate RFE:
> >>
> >> https://bugs.openjdk.java.net/browse/JDK-8214097
> >>
> >> So only the change in management.cpp needs reviewing and testing.
> >>
> >> Updated webrev:
> >>
> >> http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/
> >
> > Looks good.
>
> Thanks Kim.
>
> I've decided to stick with this simple fix for NJTs only.
>
> David
>
> >>
> >> Thanks,
> >> David
> >>
> >> On 20/11/2018 10:01 am, David Holmes wrote:
> >>> Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
> >>> webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
> >>> There is an internal management API that reports CPU times for 
> >>> NonJavaThreads (NJTs). That functionality requires a valid/live target 
> >>> thread so that we can use its pthread_t identity to obtain its CPU clock 
> >>> via pthread_getcpuclockid().
> >>> There is an iteration mechanism for NJTs in which the NJT is registered 
> >>> during its constructor and de-registered during its destructor. A thread 
> >>> that has only been constructed has not yet executed and so is not a valid 
> >>> target for this management API. This seems to be the cause of failures 
> >>> reported in this bug (and JDK-8213434). Registering a NJT only when it 
> >>> starts executing is an appealing fix for this, but that impacts all 
> >>> current users of the NJT list and straight-away causes a problem with the 
> >>> BarrierSet initialization logic. So I don't attempt that.
> >>> Instead the first part of the fix is for ThreadTimesClosure::do_thread to 
> >>> skip threads that have not yet executed - which we can recognize by 
> >>> seeing an uninitialized (i.e. zero) stackbase.
> >>> A second part of the fix, which can be deferred to a separate RFE for NJT 
> >>> lifecycle management if desired, tackles the problem of encountering a 
> >>> terminated thread during iteration - which can also lead to SEGVs. This 
> >>> can arise because NJT's are not actually "destructed", even if they 
> >>> terminate, and so they never get removed from the NJT list. Calling 
> >>> destructors is problematic because the code using these NJTs assume they 
> >>> are always valid. So the fix in this case is to move the de-registering 
> >>> from the NJT list out of the destructor and into the Thread::call_run() 
> >>> method so it is done before a thread actually terminates. This can be 
> >>> considered a first step in cleaning up the NJT lifecycle, where the 
> >>> remaining steps touch on a lot of areas and so need to be handled 
> >>> separately e.g. see JDK-8087340 for shutting down WorkGang GC worker 
> >>> threads.
> >>> Testing: tiers 1 -3
> >>> I should point out that I've been unable to reproduce this failure 
> >>> locally, even after thousands of runs. I'm hoping Zhengyu can test this 
> >>> in the conditions reported in JDK-8213434.
> >>> Thanks,
> >>> David
> >
> >

Reply via email to