[ https://issues.apache.org/jira/browse/YARN-10460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko updated YARN-10460:
--------------------------------
    Description:
In our downstream build environment, we're using JUnit 4.13. Recently, we discovered a truly weird test failure in TestNodeStatusUpdater.

The problem is that timeout handling has changed in JUnit 4.13. See the difference between these two snippets:

4.12
{noformat}
@Override
public void evaluate() throws Throwable {
    CallableStatement callable = new CallableStatement();
    FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
    threadGroup = new ThreadGroup("FailOnTimeoutGroup");
    Thread thread = new Thread(threadGroup, task, "Time-limited test");
    thread.setDaemon(true);
    thread.start();
    callable.awaitStarted();
    Throwable throwable = getResult(task, thread);
    if (throwable != null) {
        throw throwable;
    }
}
{noformat}

4.13
{noformat}
@Override
public void evaluate() throws Throwable {
    CallableStatement callable = new CallableStatement();
    FutureTask<Throwable> task = new FutureTask<Throwable>(callable);
    ThreadGroup threadGroup = new ThreadGroup("FailOnTimeoutGroup");
    Thread thread = new Thread(threadGroup, task, "Time-limited test");
    try {
        thread.setDaemon(true);
        thread.start();
        callable.awaitStarted();
        Throwable throwable = getResult(task, thread);
        if (throwable != null) {
            throw throwable;
        }
    } finally {
        try {
            thread.join(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        try {
            threadGroup.destroy();   <---- This
        } catch (IllegalThreadStateException e) {
            // If a thread from the group is still alive, the ThreadGroup cannot be destroyed.
            // Swallow the exception to keep the same behavior prior to this change.
        }
    }
}
{noformat}

The change comes from [https://github.com/junit-team/junit4/pull/1517].

Unfortunately, destroying the thread group causes an issue because the IPC layer caches all sorts of objects.
The exception is:
{noformat}
java.lang.IllegalThreadStateException
    at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
    at java.lang.Thread.init(Thread.java:402)
    at java.lang.Thread.init(Thread.java:349)
    at java.lang.Thread.<init>(Thread.java:675)
    at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
    at com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:163)
    at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
    at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1136)
    at org.apache.hadoop.ipc.Client.call(Client.java:1458)
    at org.apache.hadoop.ipc.Client.call(Client.java:1405)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy81.startContainers(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
    at org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown.startContainer(TestNodeManagerShutdown.java:251)
    at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown(TestNodeStatusUpdater.java:1576)
{noformat}

Both the {{clientExecutor}} in {{org.apache.hadoop.ipc.Client}} and the client object in {{ProtobufRpcEngine}}/{{ProtobufRpcEngine2}} are stored as long as they're needed. But since the backing thread group is destroyed in the previous test, it's no longer possible to create new threads.
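The failure mode described above can probably be reproduced without Hadoop or JUnit. The following is only a minimal sketch under stated assumptions: the class {{DestroyedGroupDemo}} and its helper methods are hypothetical names made up for illustration, and a plain cached thread pool stands in for the {{clientExecutor}}. The pool is created on a thread inside a "FailOnTimeoutGroup", so its default thread factory keeps a reference to that group; the group is then destroyed the way JUnit 4.13 does it, and a later submit has to start a worker thread in the dead group.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class, not part of Hadoop or JUnit.
@SuppressWarnings({"deprecation", "removal"})
public class DestroyedGroupDemo {

  /** Creates a lazily populated pool from a thread that belongs to the given group. */
  static ExecutorService createPoolIn(ThreadGroup group) {
    AtomicReference<ExecutorService> holder = new AtomicReference<>();
    // The default thread factory remembers the thread group of the creating thread.
    Thread creator = new Thread(group,
        () -> holder.set(Executors.newCachedThreadPool()), "Time-limited test");
    creator.start();
    try {
      creator.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while creating the pool", e);
    }
    return holder.get();
  }

  /** Destroys the group and swallows the exception, like the 4.13 snippet. */
  static void destroyQuietly(ThreadGroup group) {
    try {
      group.destroy();
    } catch (IllegalThreadStateException e) {
      // A thread from the group is still alive, so the group stays usable.
    }
  }

  /** Returns true if the pool can still start a worker and run a task. */
  static boolean canSubmit(ExecutorService pool) {
    try {
      pool.submit(() -> { }).get();
      return true;
    } catch (IllegalThreadStateException e) {
      // The worker thread could not be added to the destroyed group.
      return false;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    ThreadGroup group = new ThreadGroup("FailOnTimeoutGroup");
    ExecutorService pool = createPoolIn(group); // cached across "tests"
    destroyQuietly(group);                      // end of the first "test"
    System.out.println("destroyed=" + group.isDestroyed()
        + ", canSubmit=" + canSubmit(pool));
    pool.shutdown();
  }
}
```

The result depends on the JDK. On Java 8, which the stack trace above points to, the submit should fail with the same {{IllegalThreadStateException}}. On newer JDKs where {{ThreadGroup.destroy()}} was reduced to a no-op (JDK 19 and later, as far as I know), the group is never marked destroyed and the submit goes through.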
A quick workaround is to stop the clients and completely clear the {{ClientCache}} in {{ProtobufRpcEngine}} before each testcase. I tried this and it solves the problem but it feels hacky. Not sure if there is a better approach.

> Upgrading to JUnit 4.13 causes tests in TestNodeStatusUpdater to fail
> ---------------------------------------------------------------------
>
>                 Key: YARN-10460
>                 URL: https://issues.apache.org/jira/browse/YARN-10460
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, test
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major