Hi Ted,

Any thoughts on this?
I am getting the same kind of error when I kill a worker on one of the
machines. Even after killing the worker with kill -9, the executor still
shows up in the Spark UI with a negative active-task count, and all the
tasks on that worker start to fail with the following exception:

16/04/01 23:54:20 WARN TaskSetManager: Lost task 141.0 in stage 19859.0 (TID 190333, 192.168.33.96): java.io.IOException: Failed to connect to /192.168.33.97:63276
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /192.168.33.97:63276
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        ... 1 more
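Looking at the DiskBlockObjectWriter#revertPartialWritesAndClose() method you
pointed to below, the swallow-and-log pattern would explain why nothing
propagates. Here is a minimal, self-contained sketch of that pattern (just an
illustration, not the actual Spark source; the object name and file path are
made up):

import java.io.{File, FileOutputStream}

object RevertSketch {
  // Stand-in for revertPartialWritesAndClose(): any exception raised while
  // truncating the partially written file (e.g. "No space left on device")
  // is caught and logged, so the caller never sees the failure.
  def revertPartialWrites(file: File): Unit = {
    try {
      val truncateStream = new FileOutputStream(file, true)
      try {
        truncateStream.getChannel.truncate(0L) // throws IOException on a full disk
      } finally {
        truncateStream.close()
      }
    } catch {
      case e: Exception =>
        // Logged but not rethrown -- the disk-full condition stays invisible upstream.
        System.err.println(s"Uncaught exception while reverting partial writes to file $file: $e")
    }
  }

  def main(args: Array[String]): Unit = {
    revertPartialWrites(new File("/tmp/temp_shuffle_example"))
    println("caller carries on as if the revert had succeeded")
  }
}

In the meantime I am thinking of spreading the shuffle directories across more
than one volume so that a single full disk doesn't take the whole executor
down. Something along these lines (the paths are just examples from my setup,
and we run standalone mode, where SPARK_LOCAL_DIRS set on the worker takes
precedence over spark.local.dir):

import org.apache.spark.{SparkConf, SparkContext}

object MultiDiskLocalDirs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("multi-disk-local-dirs")
      // Comma-separated list: Spark rotates shuffle/spill files across these
      // directories, so one full volume no longer blocks every temp_shuffle_* write.
      .set("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp")
    val sc = new SparkContext(conf) // master comes from spark-submit
    // ... job code ...
    sc.stop()
  }
}

I am also going to try spark.worker.cleanup.enabled=true on the workers so
that directories left behind by finished applications get purged before the
disk fills up. Does that sound reasonable?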
Cheers !!
Abhi

On Fri, Apr 1, 2016 at 9:04 AM, Abhishek Anand <abhis.anan...@gmail.com> wrote:

> This is what I am getting in the executor logs:
>
> 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
> reverting partial writes to file
> /data/spark-e2fc248f-a212-4a99-9d6c-4e52d6a69070/executor-37679a6c-cb96-451e-a284-64d6b4fe9910/blockmgr-f8ca72f4-f329-468b-8e65-ef97f8fb285c/38/temp_shuffle_8f266d70-3fc6-41e5-bbaa-c413a7b08ea4
> java.io.IOException: No space left on device
>         at java.io.FileOutputStream.writeBytes(Native Method)
>         at java.io.FileOutputStream.write(FileOutputStream.java:315)
>         at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:274)
>
> It happens every time the disk is full.
>
> On Fri, Apr 1, 2016 at 2:18 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Can you show the stack trace?
>>
>> The log message came from DiskBlockObjectWriter#revertPartialWritesAndClose().
>> Unfortunately, the method doesn't throw the exception, making it a bit hard
>> for the caller to know about the disk-full condition.
>>
>> On Thu, Mar 31, 2016 at 11:32 AM, Abhishek Anand <abhis.anan...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Why is it that when the disk is full on one of the workers, the executor
>>> on that worker becomes unresponsive and the jobs on that worker fail with
>>> the following exception?
>>>
>>> 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
>>> reverting partial writes to file
>>> /data/spark-e2fc248f-a212-4a99-9d6c-4e52d6a69070/executor-37679a6c-cb96-451e-a284-64d6b4fe9910/blockmgr-f8ca72f4-f329-468b-8e65-ef97f8fb285c/38/temp_shuffle_8f266d70-3fc6-41e5-bbaa-c413a7b08ea4
>>> java.io.IOException: No space left on device
>>>
>>> This is leading to my job getting stuck.
>>>
>>> As a workaround I have to kill the executor and clear the space on disk;
>>> a new executor is then relaunched by the worker and the failed stages are
>>> recomputed.
>>>
>>> How can I get rid of this problem, i.e. why does my job get stuck on a
>>> disk-full issue on one of the workers?
>>>
>>> Cheers !!!
>>> Abhi