Hi, I’ve been trying to use Spark Streaming for my real-time analysis application, consuming data through the Kafka stream API, on a cluster running on YARN with 6 executors, each with 4 dedicated cores and 8192 MB of dedicated RAM.
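For reference, the job is submitted roughly like this (the class name and jar are placeholders, but the resource numbers match the setup above):

    spark-submit \
      --master yarn-cluster \
      --num-executors 6 \
      --executor-cores 4 \
      --executor-memory 8192m \
      --class com.example.StreamingJob \
      my-streaming-app.jar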
The thing is, my application should run 24/7, but disk usage keeps growing. This eventually leads to exceptions when Spark tries to write to a file system with no space left. Here is a graph showing the remaining disk space on a node where my application is deployed: http://i.imgur.com/vdPXCP0.png The "drops" occur at 3-minute intervals. Disk usage goes back to normal once I kill the application: http://i.imgur.com/ERZs2Cj.png

The persistence level of my RDDs is MEMORY_AND_DISK_SER_2, but the same thing happened even when I tried MEMORY_ONLY_SER_2.

My question is: how can I force Spark (Streaming?) to remove whatever it stores immediately after it has been processed? The disk clearly isn't being cleaned up (even though memory is), despite me calling the rdd.unpersist() method for each processed RDD.

Here is a sample of my application code: http://pastebin.com/K86LE1J6 Maybe something is wrong in my app too?

Thanks for your help,
NM
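P.S. In case the pastebin link goes down, here is a simplified sketch of the pattern I mean (topic names, group id, ZooKeeper address and batch interval are placeholders, and the real processing is more involved):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object StreamingJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("realtime-analysis")
        // batch interval is a placeholder here
        val ssc = new StreamingContext(conf, Seconds(10))

        // Receiver-based Kafka stream, persisted at MEMORY_AND_DISK_SER_2
        // (ZooKeeper quorum, consumer group and topic map are placeholders)
        val stream = KafkaUtils.createStream(
          ssc,
          "zk-host:2181",
          "my-consumer-group",
          Map("my-topic" -> 1),
          StorageLevel.MEMORY_AND_DISK_SER_2)

        stream.foreachRDD { rdd =>
          // ... the actual per-batch processing goes here ...
          rdd.count()

          // drop the RDD explicitly once the batch has been processed
          rdd.unpersist(blocking = false)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }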