Hi,

I have a PySpark job, submitted through spark-submit, that does some heavy processing for one day of data. It runs with no errors. I have to process many days, so I run this Spark job in a loop. After a couple of executions I notice that memory usage is increasing on all worker nodes, and eventually this leads to failures.

My job does some caching, but my understanding is that when the job ends successfully the SparkContext is destroyed and the cache should be cleared. However, it seems that something keeps filling the memory a bit more after each run. This is the memory behaviour over time, which in the end leads to failures: [inline image: memory usage over time]
(What the graph shows: green = physical memory used, green-blue = physical memory cached, grey = memory capacity, a straight line at around 31 GB.)

This runs on a healthy Spark 2.4 cluster, and the job was already tuned to be stable in terms of spark-submit resource parameters (driver-memory, num-executors, executor-memory, executor-cores, spark.locality.wait).

Any clue how to "really" clear the memory in between jobs? Currently I can loop 10 times and then need to restart my cluster so that all memory is cleared completely.

Thanks for any info!
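
For context, a minimal sketch of the kind of loop described above, with one spark-submit process per day, could look like the following. The script name `process_day.py`, the `--date` argument, the helper names, and all resource values are hypothetical placeholders, not the actual job or its settings. It only illustrates that each day runs as a separate application with its own SparkContext.

```python
import subprocess
from datetime import date, timedelta


def days_between(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def build_command(day, script="process_day.py"):
    """Build the spark-submit command line for a single day.

    The script name, the --date argument and all resource values are
    placeholders for illustration only.
    """
    return [
        "spark-submit",
        "--driver-memory", "4g",
        "--num-executors", "4",
        "--executor-memory", "8g",
        "--executor-cores", "2",
        "--conf", "spark.locality.wait=3s",
        script,
        "--date", day.isoformat(),
    ]


def run_all(start, end, dry_run=True):
    """Build one command per day; run them in order unless dry_run is set."""
    commands = [build_command(day) for day in days_between(start, end)]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so the loop stops at the first failing day.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of submitting them.
    for command in run_all(date(2019, 5, 1), date(2019, 5, 3)):
        print(" ".join(command))
```

With `dry_run=True` (the default here) nothing is submitted, which makes it easy to check the generated commands before running the real loop.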