Hi there,

  In our experiments with Spark, we found that the same Spark application
shows large variance in execution time and sometimes fails completely. In
the logs, we see that this is usually caused by tasks being resubmitted
after a fetch failure, with output like the following:
     14/03/16 16:40:38 WARN TaskSetManager: Lost TID 6606 (task 2.0:452)
     14/03/16 16:40:38 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(139, Host YYY, 34619, 0)
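
One way to check whether the failures really are spread randomly is to tally these warnings per remote host in the driver log. The following is a minimal Python sketch, assuming each warning sits on a single line in the actual log file and that the BlockManagerId fields are comma-separated as in the excerpt; `count_fetch_failures` is a hypothetical helper, not part of Spark.

```python
import re
from collections import Counter

# Matches the driver-side warning, e.g.
#   ... Loss was due to fetch failure from BlockManagerId(139, Host YYY, 34619, 0)
# Assumption: the four BlockManagerId fields are separated by commas.
FETCH_FAILURE = re.compile(
    r"Loss was due to fetch failure from "
    r"BlockManagerId\(([^,]+),\s*([^,]+),\s*(\d+),\s*([^)]+)\)"
)


def count_fetch_failures(lines):
    """Count fetch-failure warnings per (host, port) of the remote block manager."""
    counts = Counter()
    for line in lines:
        match = FETCH_FAILURE.search(line)
        if match:
            host = match.group(2).strip()
            port = int(match.group(3))
            counts[(host, port)] += 1
    return counts
```

Passing the lines of the driver log to `count_fetch_failures` and inspecting `most_common(10)` on the result would show whether a few hosts account for most of the failures or whether they are evenly spread.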

On the worker containing the failed task set, we found log entries like this:

14/03/16 16:40:35 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(Host YYY,34619)
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.spark.network.SendingConnection.read(Connection.scala:398)
        at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

On the worker that reported the fetch failure, we see log entries like this:

14/03/16 16:40:40 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(Host XXX,45878)
14/03/16 16:40:40 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(Host XXX,45878)

We have tested the connection between the two workers by running scp in
each direction, and it works. We have also increased
/proc/sys/net/ipv4/tcp_max_syn_backlog to make sure there is enough room
for incoming TCP connections. We are using a Spark 0.9 standalone cluster
with hundreds of workers, and the problem seems to occur randomly.
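
Since scp only exercises the SSH port, a closer check would be to open a TCP connection to the port named in the failing ConnectionManagerId (for example 34619 in the log above) while the job is running. Below is a minimal Python sketch of such a probe; `can_connect` is a hypothetical helper, not a Spark API, and a successful connect only shows that the port accepts connections, not that a long-running transfer would survive.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, name lookup
        # failures, and timeouts (all are OSError subclasses).
        return False
```

Running this from the worker that reported the fetch failure against the host and port in the error message, repeatedly during a job, might show whether the connect step itself is timing out.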

Any ideas on how to debug this issue?

Thanks,
Jiacheng Guo
