Hello, I get a lot of these exceptions on my Mesos cluster when running Spark jobs:

14/07/19 16:29:43 WARN spark.network.SendingConnection: Error finishing connection to prd-atl-mesos-slave-010/10.88.160.200:37586
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
    at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(prd-atl-mesos-slave-010,37586)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(prd-atl-mesos-slave-010,37586)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Notifying org.apache.spark.network.ConnectionManager$MessageStatus@4b0472b4
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Notifying org.apache.spark.network.ConnectionManager$MessageStatus@1106ade6
14/07/19 16:29:43 ERROR spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(prd-atl-mesos-slave-010,37586)
14/07/19 16:29:43 ERROR spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(prd-atl-mesos-slave-010,37586)
14/07/19 16:29:43 WARN spark.network.SendingConnection: Error finishing connection to prd-atl-mesos-slave-004/10.88.160.156:35446
java.net.ConnectException: Connection timed out
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:318)
    at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Handling connection error on connection to ConnectionManagerId(prd-atl-mesos-slave-004,35446)
14/07/19 16:29:43 INFO spark.network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(prd-atl-mesos-slave-004,35446)

I've tried bumping up the spark.akka.timeout, but it doesn't seem to have much of an effect.

Has anyone else seen these? Is there a spark configuration option that I should tune? Or perhaps some JVM properties that I should be setting on my executors?

TIA
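
For reference, a rough sketch of how a setting such as spark.akka.timeout can be raised through a Spark properties file (conf/spark-defaults.conf, assuming jobs are launched with spark-submit). The value 300 is purely illustrative; it is not the value used on this cluster and not a recommendation.

```
# conf/spark-defaults.conf
# Illustrative value only (seconds); an assumption, not a tested fix.
spark.akka.timeout    300
```

The same key can also be set programmatically on a SparkConf before the SparkContext is created.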