Hi All!

I have a very weird memory issue (which is what a lot of people will most 
likely say ;-)) with Spark running in standalone mode inside a Docker 
container. Our setup is as follows: we have a Docker container in which a 
Spring Boot application runs Spark in standalone mode. This Spring Boot app 
also contains a few scheduled tasks that trigger Spark jobs. The Spark jobs 
read from a SQL database, shuffle the data a bit, and then write the results 
to a different SQL table (see the sketch below). Our current data set is very 
small; the largest table contains a few million rows.
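
To give an idea of the shape of these jobs, here is a minimal sketch (not our 
actual code; the JDBC URL, credentials, table names, and the groupBy are 
placeholders, and it assumes the Spark 2.x SparkSession API):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ExampleScheduledJob {
    public static void main(String[] args) {
        // "local[*]" stands in for our real master URL.
        SparkSession spark = SparkSession.builder()
                .appName("example-scheduled-job")
                .master("local[*]")
                .getOrCreate();

        // Read the source table over JDBC (all connection details are placeholders).
        Dataset<Row> source = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mysql://dbhost:3306/sourcedb")
                .option("dbtable", "source_table")
                .option("user", "user")
                .option("password", "password")
                .load();

        // A shuffle-inducing aggregation stands in for the real transformation.
        Dataset<Row> result = source.groupBy("some_key").count();

        // Write the result to a different table over JDBC.
        result.write()
                .format("jdbc")
                .option("url", "jdbc:mysql://dbhost:3306/targetdb")
                .option("dbtable", "result_table")
                .option("user", "user")
                .option("password", "password")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}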

The problem is that the Docker host (a CentOS VM) running the Docker 
container crashes after a while because its memory gets exhausted. I have 
currently limited Spark's memory usage to 512M (I have set both executor and 
driver memory), and in the Spark UI I can see that the largest job only takes 
about 10 MB of memory.
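
For reference, the 512M limit is applied roughly like this (a sketch with 
placeholder names; our real values come from configuration rather than being 
hard-coded):

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

SparkConf conf = new SparkConf()
        .setAppName("scheduled-jobs")
        .set("spark.executor.memory", "512m")
        .set("spark.driver.memory", "512m");

SparkSession spark = SparkSession.builder()
        .config(conf)
        .getOrCreate();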

After digging a bit further, I noticed that Spark eats up all the buffer/cache 
memory on the machine. When I clear this manually by forcing Linux to drop its 
caches (echo 2 > /proc/sys/vm/drop_caches, which clears the dentries and 
inodes), the cache usage drops considerably, but if I don't keep doing this 
regularly the cache usage slowly climbs again until all memory is used by 
buffer/cache.

Does anyone have an idea what I might be doing wrong / what is going on here?

Big thanks in advance for any help!

Regards,
Stein Welberg
