Peter Bacsko created YARN-10460:
-----------------------------------

Summary: Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
Key: YARN-10460
URL: https://issues.apache.org/jira/browse/YARN-10460
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager, test
Reporter: Peter Bacsko
Assignee: Peter Bacsko

In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater.

The problem is that timeout handling has changed in JUnit 4.13. See the difference between these two snippets:

4.12
{noformat}
@Override
public void evaluate() throws Throwable {
    CallableStatement callable = new CallableStatement();
    FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
    threadGroup = new ThreadGroup("FailOnTimeoutGroup");
    Thread thread = new Thread(threadGroup, task, "Time-limited test");
    thread.setDaemon(true);
    thread.start();
    callable.awaitStarted();
    Throwable throwable = getResult(task, thread);
    if (throwable != null) {
        throw throwable;
    }
}
{noformat}

4.13
{noformat}
@Override
public void evaluate() throws Throwable {
    CallableStatement callable = new CallableStatement();
    FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
    ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
    Thread thread = new Thread(threadGroup, task, "Time-limited test");
    try {
        thread.setDaemon(true);
        thread.start();
        callable.awaitStarted();
        Throwable throwable = getResult(task, thread);
        if (throwable != null) {
            throw throwable;
        }
    } finally {
        try {
            thread.join(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        try {
            threadGroup.destroy();  // <---- This
        } catch (IllegalThreadStateException e) {
            // If a thread from the group is still alive, the ThreadGroup cannot be destroyed.
            // Swallow the exception to keep the same behavior prior to this change.
        }
    }
}
{noformat}

The change comes from [https://github.com/junit-team/junit4/pull/1517].

Unfortunately, destroying the thread group causes an issue because the IPC layer caches all sorts of objects across tests.

The exception is:
{noformat}
java.lang.IllegalThreadStateException
    at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
    at java.lang.Thread.init(Thread.java:402)
    at java.lang.Thread.init(Thread.java:349)
    at java.lang.Thread.<init>(Thread.java:675)
    at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
    at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
    at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
    at org.apache.hadoop.ipc.Client.call(Client.java:1458)
    at org.apache.hadoop.ipc.Client.call(Client.java:1405)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
    at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
    at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
{noformat}

Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are kept around for as long as they're needed, so they outlive a single test. But since the thread group backing them was destroyed when the previous test finished, they can no longer create new threads.

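The failure mode can be reproduced without Hadoop or JUnit. The following is only a minimal sketch under stated assumptions: the class {{DestroyedGroupDemo}} and its members are made up for illustration, and a plain {{Executors.newCachedThreadPool()}} stands in for the cached {{clientExecutor}}. It relies on {{Executors.defaultThreadFactory()}} remembering the thread group of the thread that created the executor. On Java 8 through 18 the later submit is expected to fail with {{IllegalThreadStateException}}; from JDK 19 on, {{ThreadGroup.destroy()}} does nothing, so the submit should succeed there.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-alone reproduction; not Hadoop or JUnit code. */
final class DestroyedGroupDemo {

    /** Result of one run: was the group destroyed, and did a later submit work? */
    static final class Outcome {
        final boolean groupDestroyed;
        final boolean submitSucceeded;

        Outcome(boolean groupDestroyed, boolean submitSucceeded) {
            this.groupDestroyed = groupDestroyed;
            this.submitSucceeded = submitSucceeded;
        }
    }

    static Outcome run() {
        final AtomicReference<ExecutorService> cached = new AtomicReference<>();
        ThreadGroup group = new ThreadGroup("FailOnTimeoutGroup");

        // "First test": runs in the time-limited thread and creates an executor.
        // The default thread factory captures this thread's group; no worker
        // thread is started yet because nothing has been submitted.
        Thread first = new Thread(group,
            () -> cached.set(Executors.newCachedThreadPool()), "Time-limited test");
        first.start();
        try {
            first.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // What JUnit 4.13 does once the test has finished.
        try {
            group.destroy();
        } catch (IllegalThreadStateException e) {
            // A thread in the group is still alive; the group stays usable.
        }
        boolean destroyed = group.isDestroyed();

        // "Second test": reuses the cached executor, which must now create its
        // first worker thread in the group captured above.
        boolean submitted;
        try {
            cached.get().submit(() -> { }).get(5, TimeUnit.SECONDS);
            submitted = true;
        } catch (IllegalThreadStateException e) {
            submitted = false; // same exception type as in the stack trace
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            cached.get().shutdownNow();
        }
        return new Outcome(destroyed, submitted);
    }

    public static void main(String[] args) {
        Outcome o = run();
        System.out.println("group destroyed: " + o.groupDestroyed
            + ", submit succeeded: " + o.submitSucceeded);
    }
}
```

The sketch deliberately submits nothing during the "first test": if a worker thread of the executor were still alive, {{destroy()}} would throw, JUnit would swallow the exception, and the group would remain usable.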
A quick workaround is to stop the clients and completely clear the {{ClientCache}} in {{ProtobufRpcEngine}} before each test case. I tried this and it solves the problem, but it feels hacky. I'm not sure whether there is a better approach.
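
The idea behind the workaround can be sketched in isolation. This is not Hadoop's {{ClientCache}} API: {{CachedExecutorHolder}}, {{get()}}, {{reset()}}, {{ping()}} and {{runAsTimedTest()}} are hypothetical names, and a cached thread pool stands in for the cached IPC client. The point is only that dropping the cached object before a test makes the next lookup build a fresh one, whose thread factory is bound to the current (not yet destroyed) thread group.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for a process-wide client cache; not Hadoop code. */
final class CachedExecutorHolder {
    private static ExecutorService cached;

    /** Lazily creates the executor; its thread factory captures the caller's group. */
    static synchronized ExecutorService get() {
        if (cached == null) {
            cached = Executors.newCachedThreadPool();
        }
        return cached;
    }

    /** Stops and forgets the cached executor so the next get() builds a new one. */
    static synchronized void reset() {
        if (cached != null) {
            cached.shutdownNow();
            cached = null;
        }
    }

    /** Submits a trivial task through the cached executor and waits for it. */
    static int ping() {
        try {
            return get().submit(() -> 42).get(5, TimeUnit.SECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Runs the body in its own thread group and then tries to destroy the
     * group, roughly as JUnit 4.13 does for tests with a timeout.
     * Returns true if the body finished without a runtime exception.
     */
    static boolean runAsTimedTest(Runnable body) {
        ThreadGroup group = new ThreadGroup("FailOnTimeoutGroup");
        final boolean[] ok = {false};
        Thread t = new Thread(group, () -> {
            try {
                body.run();
                ok[0] = true;
            } catch (RuntimeException e) {
                // Leave ok[0] false; covers IllegalThreadStateException too.
            }
        }, "Time-limited test");
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        try {
            group.destroy();
        } catch (IllegalThreadStateException e) {
            // Threads of the group are still alive; nothing to do.
        }
        return ok[0];
    }

    public static void main(String[] args) {
        // First "test" only populates the cache; its group is destroyed afterwards.
        runAsTimedTest(() -> get());
        // Second "test" clears the cache first, then uses a fresh executor.
        boolean ok = runAsTimedTest(() -> { reset(); ping(); });
        reset(); // stop the worker thread so the JVM can exit promptly
        System.out.println("second test passed: " + ok);
    }
}
```

In a JUnit 4 test class the {{reset()}} call would typically live in a {{@Before}} method. Whether the real fix should clear the cache from the tests or handle this elsewhere is exactly the open question above.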