Hi David,
On Wed, Nov 28, 2018 at 8:26 AM David Holmes <david.hol...@oracle.com> wrote: > > Hi Thomas, > > On 28/11/2018 4:30 pm, Thomas Stüfe wrote: > > Hi David, > > > > I admit I still do not really get it.. > > > > E.g. for GC threads: > > > > We have > > > > static ConcurrentMarkSweepThread* > > ConcurrentMarkSweepThread::start(CMSCollector* collector); > > > > which creates a ConcurrentMarkSweepThread and then calls > > os::create_thread() to hook it up with an OSThread and start it. > > > > ConcurrentMarkSweepThread derives from NonJavaThread, which in its > > constructor adds the thread object to the list. > > > > So there is a time gap where the NJT is part of the list, but > > Thread::_osthread is still NULL. > > > > In ThreadTimesClosure::do_thread(), we call > > os::thread_cpu_time(thread)->fast_cpu_time(thread)->os::Linux::pthread_getcpuclockid(thread->osthread()->pthread_id(), > > &clockid); > > > > Should we then not crash when dereferencing the NULL > > osthread()->pthread_id()? Why do we crash inside > > pthread_getcpuclockid? > > > > If I look further into os::create_thread(), I see that there is > > another, smaller time gap where we create OSThread and anchor it into > > the Thread object: > > > > thread->set_osthread(osthread); > > > > and then later, after pthread_create is thru, set its thread id: > > > > // Store pthread info into the OSThread > > osthread->set_pthread_id(tid); > > > > When OSThread is created, we call OSThread::pd_initialize() and set > > its _threadid to 0. We do this in the constructor, before anchoring > > the OSThread in its Thread. > > > > So for my understanding, we should have two situations: > > > > (1)- NJT is in list, but its _osthread member is NULL. In that case I > > would expect a different crash. > > (2)- NJT is in list, _osthread is set, but its _thread_id is NULL. > > Yes both situations may be possible. In the related bug report only the > 0 thread_id was observed. Oh, was it 0? That was not clear from the bug report. I wondered whether it could have been a random value. > > > (Modulo any concurrency issues, e.g. could the > > "thread->set_osthread(osthread);" be visible before OSThread is > > initialized? > > > > Depending on what is happening, a fix for (1) would probably be a > > querying for osthread==NULL, a fix for (2) would be to guard os::--- > > functions - well, at least os::thread_cpu_time() - to disregard > > Threads with pthread_t == 0. Both fixes seem better to me than > > querying the stacksize() when walking the list - that seems a bit > > awkward. > > I find it awkward that the list is populated with partially constructed > threads in the first place. Sure. That is awkward too. > This particular code is only interested in > live threads that can safely be queried. The stack_size() check is a > surrogate for "this is a live thread that can be queried". I understand. But would a check in os::thread_cpu_time() for osthread==NULL || osthread->pthread_id == 0 not be safer and help in more situations? Hence my former questions. If pthread_id were not 0 but either some random value or a pthread_id of a long-terminated thread, that would not help. > > > -- > > P.s. > > > > ConcurrentGCThread::ConcurrentGCThread() : > > _should_terminate(false), _has_terminated(false) { > > }; > > > > I was surprised to see no invocation to the base class ctor in the > > initializer list. I was not aware that this was even possible. For > > code clearness, I would prefer the call to the base class ctor to be > > explicit.) > > I assume it is implicit. But yes it should be explicit. > > Cheers, > David > ..Thomas > > -- > > > > Cheers, Thomas > > > > > > On Wed, Nov 28, 2018 at 3:58 AM David Holmes <david.hol...@oracle.com> > > wrote: > >> > >> Sorry for the delayed response. > >> > >> On 21/11/2018 3:01 pm, Kim Barrett wrote: > >>>> On Nov 20, 2018, at 3:50 AM, David Holmes <david.hol...@oracle.com> > >>>> wrote: > >>>> > >>>> After discussions with Kim I've decided to split out the NJT list update > >>>> into a separate RFE: > >>>> > >>>> https://bugs.openjdk.java.net/browse/JDK-8214097 > >>>> > >>>> So only the change in management.cpp needs reviewing and testing. > >>>> > >>>> Updated webrev: > >>>> > >>>> http://cr.openjdk.java.net/~dholmes/8212207/webrev.v2/ > >>> > >>> Looks good. > >> > >> Thanks Kim. > >> > >> I've decided to stick with this simple fix for NJTs only. > >> > >> David > >> > >>>> > >>>> Thanks, > >>>> David > >>>> > >>>> On 20/11/2018 10:01 am, David Holmes wrote: > >>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8212207 > >>>>> webrev: http://cr.openjdk.java.net/~dholmes/8212207/webrev/ > >>>>> There is an internal management API that reports CPU times for > >>>>> NonJavaThreads (NJTs). That functionality requires a valid/live target > >>>>> thread so that we can use its pthread_t identity to obtain its CPU > >>>>> clock via pthread_getcpuclockid(). > >>>>> There is an iteration mechanism for NJTs in which the NJT is registered > >>>>> during its constructor and de-registered during its destructor. A > >>>>> thread that has only been constructed has not yet executed and so is > >>>>> not a valid target for this management API. This seems to be the cause > >>>>> of failures reported in this bug (and JDK-8213434). Registering a NJT > >>>>> only when it starts executing is an appealing fix for this, but that > >>>>> impacts all current users of the NJT list and straight-away causes a > >>>>> problem with the BarrierSet initialization logic. So I don't attempt > >>>>> that. > >>>>> Instead the first part of the fix is for ThreadTimesClosure::do_thread > >>>>> to skip threads that have not yet executed - which we can recognize by > >>>>> seeing an uninitialized (i.e. zero) stackbase. > >>>>> A second part of the fix, which can be deferred to a separate RFE for > >>>>> NJT lifecycle management if desired, tackles the problem of > >>>>> encountering a terminated thread during iteration - which can also lead > >>>>> to SEGVs. This can arise because NJT's are not actually "destructed", > >>>>> even if they terminate, and so they never get removed from the NJT > >>>>> list. Calling destructors is problematic because the code using these > >>>>> NJTs assume they are always valid. So the fix in this case is to move > >>>>> the de-registering from the NJT list out of the destructor and into the > >>>>> Thread::call_run() method so it is done before a thread actually > >>>>> terminates. This can be considered a first step in cleaning up the NJT > >>>>> lifecycle, where the remaining steps touch on a lot of areas and so > >>>>> need to be handled separately e.g. see JDK-8087340 for shutting down > >>>>> WorkGang GC worker threads. > >>>>> Testing: tiers 1 -3 > >>>>> I should point out that I've been unable to reproduce this failure > >>>>> locally, even after thousands of runs. I'm hoping Zhengyu can test this > >>>>> in the conditions reported in JDK-8213434. > >>>>> Thanks, > >>>>> David > >>> > >>>