Hi everyone,

I am trying to run some PySpark code on a number of data sets sequentially. For each data set the job basically does the following:
1. Read the data into a DataFrame.
2. Perform some joins/filters/aggregations.
3. Write the modified data in parquet format to a target location.

While running this code across *multiple independent data sets sequentially*, the memory used by the previous data set does not seem to get released/cleared, so Spark's memory consumption (the JVM's memory usage as seen in Task Manager) keeps increasing until the job fails on some data set.

Is there a way to clear/remove DataFrames that I know will not be used later? Basically, can I free some memory programmatically (from within the PySpark code) once the processing of a particular data set has finished? I am not caching any DataFrame at any point, so unpersist() is not a solution either.
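To make the workflow concrete, the per-data-set processing looks roughly like the sketch below; the paths, column names, and join key are placeholders for illustration, not my actual code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# One SparkSession on a local[*] master handles every data set (placeholder app name)
spark = (SparkSession.builder
         .master("local[*]")
         .appName("sequential-datasets")
         .getOrCreate())

# Placeholder input paths and a small lookup table used in the join
dataset_paths = ["/data/ds1", "/data/ds2", "/data/ds3"]
lookup_df = spark.read.parquet("/data/lookup")

for path in dataset_paths:
    # 1. Read the data set into a DataFrame
    df = spark.read.parquet(path)

    # 2. Join / filter / aggregate
    result = (df.join(lookup_df, on="id", how="inner")
                .filter(F.col("amount") > 0)
                .groupBy("category")
                .agg(F.sum("amount").alias("total_amount")))

    # 3. Write the modified data in parquet format to a target location
    result.write.mode("overwrite").parquet(path + "_out")

    # After this point df and result are no longer needed, yet the JVM's
    # memory usage keeps growing; nothing is ever cached or persisted.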
I am running Spark with local[*] as the master, and a single SparkSession does all the processing. If it is not possible to clear out memory, what would be a better approach to this problem? Could someone please help me with this and tell me if I am going wrong anywhere?

--
Thanks,
Shuporno Choudhury