[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shilun Fan updated YARN-6409: - Target Version/s: 3.5.0 (was: 3.4.0) > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) > at >
[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-6409: --- Target Version/s: 3.4.0 (was: 3.3.0) Bulk update: moved all 3.3.0 non-blocker issues, please move back if it is a blocker. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at >
[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil Govindan updated YARN-6409: - Target Version/s: 3.3.0 (was: 3.2.0) Bulk update: moved all 3.2.0 non-blocker issues, please move back if it is a blocker. > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Kanwaljeet Sachdev >Priority: Major > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at >
[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-6409: - Attachment: YARN-6409.03.patch > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch, YARN-6409.03.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560)
[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-6409: - Attachment: YARN-6409.02.patch Updated the patch to only blacklist a node upon AM launch failure due to SocketTimeoutException (up to three levels down in the cause chain) > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch, > YARN-6409.02.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at >
[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-6409: - Attachment: YARN-6409.01.patch > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch, YARN-6409.01.patch > > > Currently, node blacklisting upon AM failures only handles failures that > happen after AM container is launched (see > RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch > can also fail if the NM, where the AM container is allocated, goes > unresponsive. Because it is not handled, scheduler may continue to allocate > AM containers on that same NM for the following app attempts. > {code} > Application application_1478721503753_0870 failed 2 times due to Error > launching appattempt_1478721503753_0870_02. Got exception: > java.io.IOException: Failed on local exception: java.io.IOException: > java.net.SocketTimeoutException: 6 millis timeout while waiting for > channel to be ready for read. ch : java.nio.channels.SocketChannel[connected > local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host > Details : local host is: "*.me.com/17.111.179.113"; destination host is: > "*.me.com":8041; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1408) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > > at com.sun.proxy.$Proxy86.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) > > at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > > at com.sun.proxy.$Proxy87.startContainers(Unknown Source) > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) > > at > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis > timeout while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) > > at > org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) > > at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) > at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) > at org.apache.hadoop.ipc.Client.call(Client.java:1447) > ... 15 more > Caused by: java.net.SocketTimeoutException: 6 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 > remote=*.me.com/17.111.178.125:8041] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560) > at
[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-6409: - Description: Currently, node blacklisting upon AM failures only handles failures that happen after AM container is launched (see RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch can also fail if the NM, where the AM container is allocated, goes unresponsive. Because it is not handled, scheduler may continue to allocate AM containers on that same NM for the following app attempts. {code} Application application_1478721503753_0870 failed 2 times due to Error launching appattempt_1478721503753_0870_02. Got exception: java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041]; Host Details : local host is: "*.me.com/17.111.179.113"; destination host is: "*.me.com":8041; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1475) at org.apache.hadoop.ipc.Client.call(Client.java:1408) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy86.startContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy87.startContainers(Unknown Source) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) at org.apache.hadoop.ipc.Client.call(Client.java:1447) ... 15 more Caused by: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560) at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:730) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:726) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725) ... 18 more . Failing the application. {code} was: Currently, node blacklisting upon AM failures only handles failures that happen after AM container is launched (see
[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-6409: - Description: Currently, node blacklisting upon AM failures only handles failures that happen after AM container is launched (see RMAppAttemptImpl.shouldCountTowardsNodeBlacklisting()). However, AM launch can also fail if the NM, where the AM container is allocated, goes unresponsive. Because it is not handled, scheduler may continue to allocate AM containers on that same NM for the following app attempts. {code} Application application_1478721503753_0870 failed 2 times due to Error launching appattempt_1478721503753_0870_02. Got exception: java.io.IOException: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=mr22p32ic-ztep01041101.me.com/17.111.178.125:8041]; Host Details : local host is: "*.me.com/17.111.179.113"; destination host is: "*.me.com":8041; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Client.call(Client.java:1475) at org.apache.hadoop.ipc.Client.call(Client.java:1408) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy86.startContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96) at sun.reflect.GeneratedMethodAccessor155.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy87.startContainers(Unknown Source) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:120) at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:256) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:687) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:650) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:738) at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1524) at org.apache.hadoop.ipc.Client.call(Client.java:1447) ... 15 more Caused by: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/17.111.179.113:46702 remote=*.me.com/17.111.178.125:8041] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:367) at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:560) at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:375) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:730) at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:726) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:725) ... 18 more . Failing the application. {code} > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 >
[jira] [Updated] (YARN-6409) RM does not blacklist node for AM launch failures
[ https://issues.apache.org/jira/browse/YARN-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-6409: - Attachment: YARN-6409.00.patch > RM does not blacklist node for AM launch failures > - > > Key: YARN-6409 > URL: https://issues.apache.org/jira/browse/YARN-6409 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.0.0-alpha2 >Reporter: Haibo Chen >Assignee: Haibo Chen > Attachments: YARN-6409.00.patch > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org