Try these:

- Disable shuffle spill: spark.shuffle.spill=false (note: this may lead to OOM errors)
- Enable log rotation:

sparkConf.set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.size.maxBytes", "1024")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
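Putting those settings in context, a minimal sketch of the full setup (the app name is a placeholder; note that maxBytes is in bytes, so "1024" means each log rolls at 1 KB, which is only useful for demonstration — a realistic value would be much larger):

```scala
import org.apache.spark.SparkConf

// Roll executor logs by size and keep at most 3 rolled-over files,
// so old executor logs don't accumulate on local disk.
val sparkConf = new SparkConf()
  .setAppName("my-streaming-app") // placeholder name
  .set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.size.maxBytes", "1024") // bytes; tiny, for demo only
  .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
```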


Also, check what is actually filling up the disk.
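Something like the following can show the biggest space consumers (the /tmp path is only an example — point it at whatever spark.local.dir or the YARN local dirs are set to on the affected node):

```shell
# List the largest entries under a directory, biggest first.
# /tmp is a placeholder; substitute the node's Spark/YARN local dirs.
du -sh /tmp/* 2>/dev/null | sort -rh | head -n 5
```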

Thanks
Best Regards

On Sat, Mar 28, 2015 at 8:18 PM, Nathan Marin <[email protected]> wrote:

> Hi,
>
> I’ve been trying to use Spark Streaming for my real-time analysis
> application using the Kafka Stream API on a cluster (using the yarn
> version) of 6 executors with 4 dedicated cores and 8192mb of dedicated
> RAM.
>
> The thing is, my application should run 24/7 but the disk usage is
> leaking. This leads to some exceptions occurring when Spark tries to
> write on a file system where no space is left.
>
> Here are some graphs showing the disk space remaining on a node where
> my application is deployed:
> http://i.imgur.com/vdPXCP0.png
> The "drops" occurred on a 3 minute interval.
>
> The Disk Usage goes back to normal once I kill my application:
> http://i.imgur.com/ERZs2Cj.png
>
> The persistence level of my RDD is MEMORY_AND_DISK_SER_2, but even
> when I tried MEMORY_ONLY_SER_2 the same thing happened (this mode
> shouldn't even allow spark to write on disk, right?).
>
> My question is: How can I force Spark (Streaming?) to remove whatever
> it stores immediately after it has been processed? Obviously the disk
> isn't being cleaned up (even though the memory is), even though I call
> the rdd.unpersist() method for each RDD processed.
>
> Here’s a sample of my application code:
> http://pastebin.com/K86LE1J6
>
> Maybe something is wrong in my app too?
>
> Thanks for your help,
> NM
>
> ------------------------------
> View this message in context: [Spark Streaming] Disk not being cleaned up
> during runtime after RDD being processed
> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Disk-not-being-cleaned-up-during-runtime-after-RDD-being-processed-tp22271.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
