Summary: I am running Spark 1.5 on CDH5.5.1. Under extreme load intermittently I am getting this connection failure exception and later negative executor in the Spark UI.
Exception: TRACE: org.apache.hadoop.hbase.ipc.AbstractRpcClient - Call: Multi, callTime: 76ms INFO : org.apache.spark.network.client.TransportClientFactory - Found inactive connection to xxxx/xxx.xxx.xxx.xxxx, creating a new one. ERROR: org.apache.spark.network.shuffle.RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks (after 1 retries) java.io.IOException: Failed to connect to xxxx/xxx.xxx.xxx.xxxx at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.ConnectException: Connection refused: xxxx/xxx.xxx.xxx.xxxx at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ... 1 more Related Defects: https://issues.apache.org/jira/browse/SPARK-2319 https://issues.apache.org/jira/browse/SPARK-9591 [cid:image001.png@01D1F6EE.1CCFE110]