[ https://issues.apache.org/jira/browse/HBASE-26012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viraj Jasani updated HBASE-26012: --------------------------------- Fix Version/s: 2.4.5 3.0.0-alpha-2 2.3.6 2.5.0 3.0.0-alpha-1 > Improve logging and dequeue logic in DelayQueue > ----------------------------------------------- > > Key: HBASE-26012 > URL: https://issues.apache.org/jira/browse/HBASE-26012 > Project: HBase > Issue Type: Improvement > Reporter: Viraj Jasani > Assignee: Viraj Jasani > Priority: Major > Fix For: 3.0.0-alpha-1, 2.5.0, 2.3.6, 3.0.0-alpha-2, 2.4.5 > > > In Remote Procedure dispatcher, before submitting (sub)Procedure to Thread > pool, we enqueue it as DelayedWithTimeout object on DelayQueue. > TimeoutExecutorThread keeps dequeuing elements from this DelayQueue and > submit the Procedure to the threadpool. The expiration of DelayedWithTimeout > is determined by getDelay(TimeUnit): > > {code:java} > @Override > public long getDelay(final TimeUnit unit) { > return DelayedUtil.getRemainingTime(unit, getTimeout()); > } > {code} > {code:java} > /** > * @return Time remaining as milliseconds. > */ > public static long getRemainingTime(final TimeUnit resultUnit, final long > timeout) { > final long currentTime = EnvironmentEdgeManager.currentTime(); > if (currentTime >= timeout) { > return 0; > } > return resultUnit.convert(timeout - currentTime, TimeUnit.MILLISECONDS); > } > {code} > Hence, in order for the elements to get dequeued on time, it is necessary > that EnvironmentEdgeManager.currentTime() returns the current time in millis. > As part of unit test, if we use our own custom EnvironmentEdge and inject it > using EnvironmentEdgeManager.injectEdge before creating any tables, it is > possible that we continue returning same value (based on custom impl) with > EnvironmentEdgeManager.currentTime(). If that is the case, getRemainingTime > as mentioned above, will never return 0 and hence, the procedure wrapped in > DelayedWithTimeout might never be dequeued from DelayQueue because it's delay > will not expire. > As of today, our system goes in hanging state while waiting for table regions > to be available (as mentioned above, DelayedWithTimeout object never gets > dequeued from DelayQueue). > Thread dump might show something like this consistently: > {code:java} > "ProcedureDispatcherTimeoutThread" #319 daemon prio=5 os_prio=31 > tid=0x00007fcaf0cae800 nid=0x21d03 waiting on condition [0x0000700019293000] > java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00000007225a0090> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) > at java.util.concurrent.DelayQueue.take(DelayQueue.java:223) > at > org.apache.hadoop.hbase.procedure2.util.DelayedUtil.takeWithoutInterrupt(DelayedUtil.java:82) > at > org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$TimeoutExecutorThread.run(RemoteProcedureDispatcher.java:314) > Locked ownable synchronizers: > - None > {code} > Although running into situation like this is not likely possible unless > custom EnvironmentEdge is used as mentioned above, we should improve our > dequeue logic as well as log important message to show where we are stuck. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)