Hi Ted,

Any thoughts on this?
I am getting the same kind of error when I kill a worker on one of the
machines. Even after killing the worker with kill -9, the executor still
shows up in the Spark UI with a negative active-task count, and all the
tasks on that worker start to fail with the following exception:

16/04/01 23:54:20 WARN TaskSetManager: Lost task 141.0 in stage 19859.0 (TID 190333, 192.168.33.96): java.io.IOException: Failed to connect to /192.168.33.97:63276
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /192.168.33.97:63276
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        ... 1 more
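Looking at the DiskBlockObjectWriter#revertPartialWritesAndClose() method you
pointed to below, the swallow-and-log pattern would explain why nothing
propagates. Here is a minimal, self-contained sketch of that pattern (just an
illustration, not the actual Spark source; the object name and file path are
made up):

import java.io.{File, FileOutputStream}

object RevertSketch {
  // Stand-in for revertPartialWritesAndClose(): any exception raised while
  // truncating the partially written file (e.g. "No space left on device")
  // is caught and logged, so the caller never sees the failure.
  def revertPartialWrites(file: File): Unit = {
    try {
      val truncateStream = new FileOutputStream(file, true)
      try {
        truncateStream.getChannel.truncate(0L) // throws IOException on a full disk
      } finally {
        truncateStream.close()
      }
    } catch {
      case e: Exception =>
        // Logged but not rethrown -- the disk-full condition stays invisible upstream.
        System.err.println(s"Uncaught exception while reverting partial writes to file $file: $e")
    }
  }

  def main(args: Array[String]): Unit = {
    revertPartialWrites(new File("/tmp/temp_shuffle_example"))
    println("caller carries on as if the revert had succeeded")
  }
}

In the meantime I am thinking of spreading the shuffle directories across more
than one volume so that a single full disk doesn't take the whole executor
down. Something along these lines (the paths are just examples from my setup,
and we run standalone mode, where SPARK_LOCAL_DIRS set on the worker takes
precedence over spark.local.dir):

import org.apache.spark.{SparkConf, SparkContext}

object MultiDiskLocalDirs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("multi-disk-local-dirs")
      // Comma-separated list: Spark rotates shuffle/spill files across these
      // directories, so one full volume no longer blocks every temp_shuffle_* write.
      .set("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp")
    val sc = new SparkContext(conf) // master comes from spark-submit
    // ... job code ...
    sc.stop()
  }
}

I am also going to try spark.worker.cleanup.enabled=true on the workers so
that directories left behind by finished applications get purged before the
disk fills up. Does that sound reasonable?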
Cheers !!
Abhi

On Fri, Apr 1, 2016 at 9:04 AM, Abhishek Anand <abhis.anan...@gmail.com> wrote:

> This is what I am getting in the executor logs:
>
> 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
> reverting partial writes to file
> /data/spark-e2fc248f-a212-4a99-9d6c-4e52d6a69070/executor-37679a6c-cb96-451e-a284-64d6b4fe9910/blockmgr-f8ca72f4-f329-468b-8e65-ef97f8fb285c/38/temp_shuffle_8f266d70-3fc6-41e5-bbaa-c413a7b08ea4
> java.io.IOException: No space left on device
>         at java.io.FileOutputStream.writeBytes(Native Method)
>         at java.io.FileOutputStream.write(FileOutputStream.java:315)
>         at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:274)
>
> It happens every time the disk is full.
>
> On Fri, Apr 1, 2016 at 2:18 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Can you show the stack trace?
>>
>> The log message came from DiskBlockObjectWriter#revertPartialWritesAndClose().
>> Unfortunately, the method doesn't throw the exception, making it a bit hard
>> for the caller to know about the disk-full condition.
>>
>> On Thu, Mar 31, 2016 at 11:32 AM, Abhishek Anand <abhis.anan...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Why is it that when the disk is full on one of the workers, the executor
>>> on that worker becomes unresponsive and the jobs on that worker fail with
>>> the following exception?
>>>
>>> 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
>>> reverting partial writes to file
>>> /data/spark-e2fc248f-a212-4a99-9d6c-4e52d6a69070/executor-37679a6c-cb96-451e-a284-64d6b4fe9910/blockmgr-f8ca72f4-f329-468b-8e65-ef97f8fb285c/38/temp_shuffle_8f266d70-3fc6-41e5-bbaa-c413a7b08ea4
>>> java.io.IOException: No space left on device
>>>
>>> This is leading to my job getting stuck.
>>>
>>> As a workaround I have to kill the executor and clear the space on disk;
>>> a new executor is then relaunched by the worker and the failed stages are
>>> recomputed.
>>>
>>> How can I get rid of this problem, i.e. why does my job get stuck on a
>>> disk-full issue on one of the workers?
>>>
>>> Cheers !!!
>>> Abhi