Hi everyone,
I am trying to run a PySpark job on a number of data sets sequentially [basically:
1. read the data into a DataFrame, 2. perform some joins/filters/aggregations, 3.
write the modified data in Parquet format to a target location].
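
To make the structure concrete, each data set is processed roughly like this
(the input format, paths, and column/key names below are placeholders I've made
up for illustration, not my real job):

    from pyspark.sql import SparkSession, functions as F

    def process_one(spark: SparkSession, input_path: str, target_path: str) -> None:
        # 1. Read the data set into a DataFrame (input format is a placeholder).
        df = spark.read.parquet(input_path)

        # 2. Perform some join/filter/aggregation (simplified example).
        lookup = spark.read.parquet(input_path + "_lookup")
        result = (df.join(lookup, "id")
                    .filter(F.col("amount") > 0)
                    .groupBy("id")
                    .agg(F.sum("amount").alias("total")))

        # 3. Write the modified data in Parquet format to the target location.
        result.write.mode("overwrite").parquet(target_path)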
Now, while running this PySpark job across *multiple independent data sets
sequentially*, the memory used for the previous data set does not seem to get
released/cleared, so Spark's memory consumption (the JVM's memory usage as shown
in Task Manager) keeps increasing until the job fails on some data set.
So, is there a way to clear/remove DataFrames that I know are not going to be
used later?
Basically, can I free some memory programmatically (in the PySpark code) when
the processing for a particular data set ends (see the sketch below for the
kind of thing I mean)?
At no point am I caching any DataFrame, so unpersist() is not a solution either.
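
To be explicit, the kind of programmatic cleanup I have in mind between data
sets is something along these lines (using the names from the sketch above);
these calls exist, but I'm not sure any of them actually releases the JVM
memory in my case:

    import gc

    # Attempted cleanup after one data set has been written out:
    spark.catalog.clearCache()   # drops any cached tables/DataFrames (none in my case)
    del df, lookup, result       # drop the Python-side references
    gc.collect()                 # ask Python's garbage collector to run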

I am running Spark with local[*] as the master, and a single SparkSession does
all the processing.
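
For completeness, the session is created once, up front, roughly like this (the
app name is a placeholder) and then reused for every data set:

    from pyspark.sql import SparkSession

    # Single SparkSession, reused for all data sets, running in local mode.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("sequential-datasets")   # placeholder name
             .getOrCreate())

    # Each data set is then handled one after another, e.g. with process_one()
    # from the sketch above.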
If it is not possible to clear out memory this way, what would be a better
approach to this problem?

Can someone please help me with this and tell me where I might be going wrong?

--Thanks,
Shuporno Choudhury
