Loop of Spark jobs leads to increasing memory use on worker nodes and eventually failure

Joris Billen Wed, 30 Mar 2022 08:16:18 -0700

Hi,
I have a PySpark job, submitted through spark-submit, that does some heavy
processing for one day of data. It runs without errors. I have to process many
days, so I run this Spark job in a loop. I notice that after a couple of runs,
memory usage increases on all worker nodes, and eventually this leads to
failures. My job does some caching, but my understanding is that when the job
ends successfully, the SparkContext is destroyed and the cache should be
cleared. However, something seems to keep filling the memory a bit more after
each run. This is the memory behaviour over time, which eventually starts
leading to failures:
[image: memory usage on the worker nodes over time]


(Legend: green = physical memory used, green-blue = physical memory cached,
grey = memory capacity, a straight line at around 31 GB.)
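The graph separates memory held by processes ("used") from memory the OS keeps as cache ("cached"). To check on a worker which of the two grows from one run to the next, here is a minimal sketch. It assumes Linux workers and that the monitoring tool's "cached" roughly corresponds to the kernel's Buffers + Cached figures in /proc/meminfo. The post does not name the tool, so that mapping is an assumption, and the helper names are hypothetical.

```python
from typing import Dict, Tuple


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; most values are in kB."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        parts = rest.split()
        if parts:
            # First token is the number; the optional second token is the unit.
            values[key.strip()] = int(parts[0])
    return values


def used_and_cached_kb(info: Dict[str, int]) -> Tuple[int, int]:
    """Approximate (used, cached) in kB, in the style of older `free` output.

    Assumption: "cached" = Buffers + Cached, and "used" = everything that is
    neither free nor cached.
    """
    cached = info.get("Buffers", 0) + info.get("Cached", 0)
    used = info["MemTotal"] - info["MemFree"] - cached
    return used, cached


def read_meminfo(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read the live figures on a Linux worker node."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling `used_and_cached_kb(read_meminfo())` on a worker after each run gives two numbers to compare over time. Growth in "used" suggests something still holds memory after the application has finished, while growth in "cached" is mostly page cache, which the kernel can usually reclaim under memory pressure.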
This runs on a healthy Spark 2.4 cluster, and the job has already been tuned to
run stably in terms of spark-submit resource parameters (driver-memory,
num-executors, executor-memory, executor-cores, spark.locality.wait).
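For reference, the setup described above (one spark-submit per day, run in a loop, with the listed resource parameters) could look roughly like the sketch below. The script name `process_day.py`, its `--day` argument, and all resource values are hypothetical placeholders, since the post does not give them. Each iteration starts a separate Spark application whose driver process ends when spark-submit returns.

```python
import subprocess
from datetime import date, timedelta
from typing import Dict, List

# Placeholder resource settings: the real values are not given in the post.
RESOURCES: Dict[str, str] = {
    "--driver-memory": "4g",
    "--num-executors": "4",
    "--executor-memory": "8g",
    "--executor-cores": "4",
}
CONF: Dict[str, str] = {"spark.locality.wait": "3s"}


def build_command(day: date, script: str = "process_day.py") -> List[str]:
    """Build one spark-submit invocation that processes a single day."""
    cmd = ["spark-submit"]
    for flag, value in RESOURCES.items():
        cmd += [flag, value]
    for key, value in CONF.items():
        cmd += ["--conf", f"{key}={value}"]
    # Hypothetical application argument: the day to process, as YYYY-MM-DD.
    cmd += [script, "--day", day.isoformat()]
    return cmd


def run_days(start: date, end: date, dry_run: bool = True) -> List[List[str]]:
    """Submit one application per day from start to end (inclusive), sequentially."""
    commands: List[List[str]] = []
    day = start
    while day <= end:
        cmd = build_command(day)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if an application fails,
            # so the loop stops at the first failing day.
            subprocess.run(cmd, check=True)
        day += timedelta(days=1)
    return commands
```

With `dry_run=True` the function only returns the commands, which makes it easy to inspect what would be submitted before running the loop for real.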
Any clue how to “really” clear the memory between jobs? Currently I can loop
10 times, and then I need to restart the cluster so that all memory is cleared
completely.


Thanks for any info!
