ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0

David Holmes Wed, 28 Nov 2018 00:03:02 -0800

On 28/11/2018 5:46 pm, Thomas Stüfe wrote:

Are you sure?


No. I'm completely confusing myself - too many different "thread id's" :(

On Linux, pthread id is set in the parent thread, after pthread_create
returns. Only the kernel thread id is set by the child (I find this
duality confusing). thread_cpu_time uses pthread_id. So, on Linux, it
may or may not be set before the thread stack boundaries are
initialized, depending on whether the native entry function ran before
the parent thread continued.


Yes you are right. The code is still broken but for a different reason.

I may just fix this all in JDK-8214097 - Rework thread initializationand teardown logic - by only adding to the list after the thread has runits own initialization logic. Then the guard I just added to themanagement code can be deleted again.


Thanks,
David

..Thomas

On Wed, Nov 28, 2018 at 8:35 AM David Holmes <david.hol...@oracle.com> wrote:


#$%$#@! I missed the fact that the thread id is set _after_ we do
record_stack_base_and-size(). I thought that was in call_run() not
native_thread_entry().

<sigh> New bug being filed.

Thanks,
David

On 28/11/2018 5:26 pm, David Holmes wrote:

Hi Thomas,

On 28/11/2018 4:30 pm, Thomas Stüfe wrote:

Hi David,

I admit I still do not really get it..

E.g. for GC threads:

We have

static ConcurrentMarkSweepThread*
ConcurrentMarkSweepThread::start(CMSCollector* collector);

which creates a ConcurrentMarkSweepThread and then calls
os::create_thread() to hook it up with an OSThread and start it.

ConcurrentMarkSweepThread derives from NonJavaThread, which in its
constructor adds the thread object to the list.

So there is a time gap where the NJT is part of the list, but
Thread::_osthread is still NULL.

In ThreadTimesClosure::do_thread(), we call
os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(),

&clockid);

Should we then not crash when dereferencing the NULL
osthread()->pthread_id()? Why do we crash inside
pthread_getcpuclockid?

If I look further into os::create_thread(), I see that there is
another, smaller time gap where we create OSThread and anchor it into
the Thread object:

    thread->set_osthread(osthread);

and then later, after pthread_create is thru, set its thread id:

      // Store pthread info into the OSThread
      osthread->set_pthread_id(tid);

When OSThread is created, we call OSThread::pd_initialize() and set
its _threadid to 0. We do this in the constructor, before anchoring
the OSThread in its Thread.

So for my understanding, we should have two situations:

(1)- NJT is in list, but its _osthread member is NULL. In that case I
would expect a different crash.
(2)- NJT is in list, _osthread is set, but its _thread_id is NULL.


Yes both situations may be possible. In the related bug report only the
0 thread_id was observed.

(Modulo any concurrency issues, e.g. could the
"thread->set_osthread(osthread);" be visible before OSThread is
initialized?

Depending on what is happening, a fix for (1) would probably be a
querying for osthread==NULL, a fix for (2) would be to guard os::---
functions - well, at least os::thread_cpu_time() - to disregard
Threads with pthread_t == 0. Both fixes seem better to me than
querying the stacksize() when walking the list - that seems a bit
awkward.


I find it awkward that the list is populated with partially constructed
threads in the first place. This particular code is only interested in
live threads that can safely be queried. The stack_size() check is a
surrogate for "this is a live thread that can be queried".

--
P.s.

ConcurrentGCThread::ConcurrentGCThread() :
    _should_terminate(false), _has_terminated(false) {
};

I was surprised to see no invocation to the base class ctor in the
initializer list. I was not aware that this was even possible. For
code clearness, I would prefer the call to the base class ctor to be
explicit.)


I assume it is implicit. But yes it should be explicit.

Cheers,
David

--

Cheers, Thomas


On Wed, Nov 28, 2018 at 3:58 AM David Holmes <david.hol...@oracle.com>
wrote:


Sorry for the delayed response.

On 21/11/2018 3:01 pm, Kim Barrett wrote:

On Nov 20, 2018, at 3:50 AM, David Holmes <david.hol...@oracle.com>
wrote:

After discussions with Kim I've decided to split out the NJT list
update into a separate RFE:

https://bugs.openjdk.java.net/browse/JDK-8214097

So only the change in management.cpp needs reviewing and testing.

Updated webrev:

http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/


Looks good.


Thanks Kim.

I've decided to stick with this simple fix for NJTs only.

David


Thanks,
David

On 20/11/2018 10:01 am, David Holmes wrote:

Bug: https://bugs.openjdk.java.net/browse/JDK-8212207
webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/
There is an internal management API that reports CPU times for
NonJavaThreads (NJTs). That functionality requires a valid/live
target thread so that we can use its pthread_t identity to obtain
its CPU clock via pthread_getcpuclockid().
There is an iteration mechanism for NJTs in which the NJT is
registered during its constructor and de-registered during its
destructor. A thread that has only been constructed has not yet
executed and so is not a valid target for this management API.
This seems to be the cause of failures reported in this bug (and
JDK-8213434). Registering a NJT only when it starts executing is
an appealing fix for this, but that impacts all current users of
the NJT list and straight-away causes a problem with the
BarrierSet initialization logic. So I don't attempt that.
Instead the first part of the fix is for
ThreadTimesClosure::do_thread to skip threads that have not yet
executed - which we can recognize by seeing an uninitialized (i.e.
zero) stackbase.
A second part of the fix, which can be deferred to a separate RFE
for NJT lifecycle management if desired, tackles the problem of
encountering a terminated thread during iteration - which can also
lead to SEGVs. This can arise because NJT's are not actually
"destructed", even if they terminate, and so they never get
removed from the NJT list. Calling destructors is problematic
because the code using these NJTs assume they are always valid. So
the fix in this case is to move the de-registering from the NJT
list out of the destructor and into the Thread::call_run() method
so it is done before a thread actually terminates. This can be
considered a first step in cleaning up the NJT lifecycle, where
the remaining steps touch on a lot of areas and so need to be
handled separately e.g. see JDK-8087340 for shutting down WorkGang
GC worker threads.
Testing: tiers 1 -3
I should point out that I've been unable to reproduce this failure
locally, even after thousands of runs. I'm hoping Zhengyu can test
this in the conditions reported in JDK-8213434.
Thanks,
David

Re: RFR (S) 8212207: runtime/InternalApi/ThreadCpuTimesDeadlock.java crashes with SEGV in pthread_getcpuclockid+0x0

Reply via email to