Benoit Sigoure created HBASE-18372:
--------------------------------------

             Summary: Potential infinite busy loop in HMaster's 
ProcedureExecutor
                 Key: HBASE-18372
                 URL: https://issues.apache.org/jira/browse/HBASE-18372
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 1.3.1
         Environment: Kernel 3.10.0-327.10.1.el7.x86_64
JVM 1.8.0_102
            Reporter: Benoit Sigoure


While investigating an issue today with [~timoha] we saw the HMaster 
consistently burning 1.5 cores of CPU cycles.  Upon looking more closely, it 
was actually all 8 threads of {{ProcedureExecutor}} thread pool taking 
constantly ~15% of a CPU core each (I identified this by looking at individual 
threads in {{top}} and cross-referencing the thread IDs with the thread IDs in 
a JVM stack trace).  The HMaster log or output didn't contain anything 
suspicious and it was hard for us to ascertain what exactly was happening.  It 
just looked like these threads were regularly spinning, doing nothing.  We just 
saw a lot of {{futex}} system calls happening all the time, and all the threads 
of the thread pool regularly taking turns in waking up and going back to sleep.

My reading of the code in {{procedure2/ProcedureExecutor.java}} is that this 
can happen if the threads in the thread pool have been interrupted for some 
reason:

{code}
  private void execLoop() {
    while (isRunning()) {
      Procedure proc = runnables.poll();
      if (proc == null) continue;
{code}
and then in {master/procedure/MasterProcedureScheduler.java}:
{code}
  @Override
  public Procedure poll() {
    return poll(-1);
  }

  @edu.umd.cs.findbugs.annotations.SuppressWarnings("WA_AWAIT_NOT_IN_LOOP")
  Procedure poll(long waitNsec) {
    Procedure pollResult = null;
    schedLock.lock();
    try {
      if (queueSize == 0) {
        if (waitNsec < 0) {
          schedWaitCond.await();
[...]
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      schedLock.unlock();
    }
    return pollResult;
  }
{code}

so my theory is the threads in the thread pool have all been interrupted (maybe 
by a procedure that ran earlier and left its thread interrupted) and so we are 
perpetually looping in {{execLoop}}, which ends up calling 
{{schedWaitCond.await();}}, which ends up throwing an {{InterruptedException}}, 
which ends up resetting the interrupt status of the thread, and rinse and 
repeat.

But again I wasn't able to get any cold hard evidence that this is what was 
happening.  There was just no other evidence that could explain this behavior, 
and I wasn't able to guess what else could be causing this that was consistent 
with what we saw and what I understood from reading the code.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to