[
https://issues.apache.org/jira/browse/TEZ-4501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
László Bodor updated TEZ-4501:
------------------------------
Description:
The relevant part from AsyncDispatcher is:
{code}
synchronized (waitForDrained) {
while (!drained && eventHandlingThread.isAlive() &&
System.currentTimeMillis() < endTime) {
waitForDrained.wait(1000);
LOG.info(
"Waiting for AsyncDispatcher to drain. Current queue size: {},
handler thread state: {}",
eventQueue.size(), eventHandlingThread.getState());
}
}
{code}
Here is a log snippet from TestLocalMode when looping for
TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because
drained was false.
{code}
2023-07-03 12:15:39,938 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop,
ignoring any new events.
2023-07-03 12:15:39,938 INFO [CallbackExecutor] common.AsyncDispatcher
(AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType:
C_COMPLETED
2023-07-03 12:15:39,938 WARN [CallbackExecutor] common.AsyncDispatcher
(AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted (while
putting event): EventType: C_COMPLETED
java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
at
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
at
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
at
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
at
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
at
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "CallbackExecutor"
org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
java.lang.InterruptedException
at
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
at
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
at
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
at
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
at
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
at
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
... 6 more
2023-07-03 12:15:40,938 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:41,939 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:42,939 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:43,939 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:44,940 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:45,940 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:46,940 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:47,941 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:48,941 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:49,149 INFO [Time-limited test] service.AbstractService
(AbstractService.java:noteFailure(267)) - Service Central failed in state
STOPPED
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at
org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
{code}
what happened here is:
1. test is stopped (tezclient.stop)
2. eventQueue.put has been interrupted, but drained is already set to false
3. eventQueue.size == 0 and still looping, this is only possible if [drained is
false: it was set in
GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
was:
The relevant part from AsyncDispatcher is:
{code}
synchronized (waitForDrained) {
while (!drained && eventHandlingThread.isAlive() &&
System.currentTimeMillis() < endTime) {
waitForDrained.wait(1000);
LOG.info(
"Waiting for AsyncDispatcher to drain. Current queue size: {},
handler thread state: {}",
eventQueue.size(), eventHandlingThread.getState());
}
}
{code}
Here is a log snippet from TestLocalMode when looping for
TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because
drained was false.
{code}
2023-07-03 12:15:39,938 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to stop,
ignoring any new events.
2023-07-03 12:15:39,938 INFO [CallbackExecutor] common.AsyncDispatcher
(AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType:
C_COMPLETED
2023-07-03 12:15:39,938 WARN [CallbackExecutor] common.AsyncDispatcher
(AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted (while
putting event): EventType: C_COMPLETED
java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
at
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
at
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
at
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
at
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
at
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "CallbackExecutor"
org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
java.lang.InterruptedException
at
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
at
org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
at
org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
at
com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
at
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
at
org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
... 6 more
2023-07-03 12:15:40,938 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:41,939 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:42,939 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:43,939 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:44,940 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:45,940 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:46,940 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:47,941 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:48,941 INFO [Time-limited test] common.AsyncDispatcher
(AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to drain.
Current queue size: 0, handler thread state: WAITING
2023-07-03 12:15:49,149 INFO [Time-limited test] service.AbstractService
(AbstractService.java:noteFailure(267)) - Service Central failed in state
STOPPED
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at
org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
{code}
eventQueue.size == 0 and still looping, this is only possible if [drained is
false: it was set in
GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
> Fix TestLocalMode timeouts
> --------------------------
>
> Key: TEZ-4501
> URL: https://issues.apache.org/jira/browse/TEZ-4501
> Project: Apache Tez
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The relevant part from AsyncDispatcher is:
> {code}
> synchronized (waitForDrained) {
> while (!drained && eventHandlingThread.isAlive() &&
> System.currentTimeMillis() < endTime) {
> waitForDrained.wait(1000);
> LOG.info(
> "Waiting for AsyncDispatcher to drain. Current queue size: {},
> handler thread state: {}",
> eventQueue.size(), eventHandlingThread.getState());
> }
> }
> {code}
> Here is a log snippet from TestLocalMode when looping for
> TEZ_AM_DISPATCHER_DRAIN_EVENTS_TIMEOUT simply caused a test timeout because
> drained was false.
> {code}
> 2023-07-03 12:15:39,938 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(148)) - AsyncDispatcher is draining to
> stop, ignoring any new events.
> 2023-07-03 12:15:39,938 INFO [CallbackExecutor] common.AsyncDispatcher
> (AsyncDispatcher.java:handle(369)) - AsyncDispatcher put event: EventType:
> C_COMPLETED
> 2023-07-03 12:15:39,938 WARN [CallbackExecutor] common.AsyncDispatcher
> (AsyncDispatcher.java:handle(373)) - AsyncDispatcher thread interrupted
> (while putting event): EventType: C_COMPLETED
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
> at
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> at
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> at
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
> at
> org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
> at
> org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
> at
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "CallbackExecutor"
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
> java.lang.InterruptedException
> at
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:375)
> at
> org.apache.tez.dag.app.ContainerLauncherContextImpl.containerCompleted(ContainerLauncherContextImpl.java:97)
> at
> org.apache.tez.dag.app.launcher.LocalContainerLauncher$RunningTaskCallback.onFailure(LocalContainerLauncher.java:357)
> at
> com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1135)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1223)
> at
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> at
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:339)
> at
> org.apache.tez.common.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:370)
> ... 6 more
> 2023-07-03 12:15:40,938 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:41,939 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:42,939 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:43,939 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:44,940 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:45,940 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:46,940 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:47,941 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:48,941 INFO [Time-limited test] common.AsyncDispatcher
> (AsyncDispatcher.java:serviceStop(156)) - Waiting for AsyncDispatcher to
> drain. Current queue size: 0, handler thread state: WAITING
> 2023-07-03 12:15:49,149 INFO [Time-limited test] service.AbstractService
> (AbstractService.java:noteFailure(267)) - Service Central failed in state
> STOPPED
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at
> org.apache.tez.common.AsyncDispatcher.serviceStop(AsyncDispatcher.java:155)
> {code}
> what happened here is:
> 1. test is stopped (tezclient.stop)
> 2. eventQueue.put has been interrupted, but drained is already set to false
> 3. eventQueue.size == 0 and still looping, this is only possible if [drained
> is false: it was set in
> GenericEventHandler.handle|https://github.com/apache/tez/blob/50380751b7e7e628aeb40a985b94aba98d2a88db/tez-common/src/main/java/org/apache/tez/common/AsyncDispatcher.java#L338]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)