Disclaimer: I use Spark with Scala, not Python.

But I am guessing that Jörn's reference to modularization means doing the 
processing inside methods/functions and calling those methods sequentially.
I believe that as long as an RDD/Dataset variable is in scope, its memory may 
not be released.
By confining each dataset's processing to a function, the references go out of 
scope when the function returns, and their memory can be released.

This also assumes the variables are not daisy-chained/inter-related, as that 
would keep references alive and make releasing memory harder.
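For illustration, here is a Python-only sketch of why function scope matters 
(no Spark involved; `Frame` and the processing steps are hypothetical stand-ins 
for a DataFrame and its transformations):

```python
import gc
import weakref

class Frame:
    """Stand-in for a driver-side DataFrame/RDD handle."""
    pass

probes = []  # weak references let us observe when each handle is freed

def process_dataset(name):
    df = Frame()                    # handle exists only inside this function
    probes.append(weakref.ref(df))
    # ... join/filter/aggregate and write would happen here ...

for name in ["ds1", "ds2", "ds3"]:
    process_dataset(name)           # df goes out of scope on return
    gc.collect()                    # optional: also reclaim reference cycles

# Every handle is unreachable once its function returned.
print(all(ref() is None for ref in probes))  # True
```

The same pattern in PySpark would keep each dataset's DataFrame references 
local to one function call, so nothing on the driver pins them between datasets.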


From: Jay <jayadeep.jayara...@gmail.com>
Date: Monday, June 4, 2018 at 9:41 PM
To: Shuporno Choudhury <shuporno.choudh...@gmail.com>
Cc: "Jörn Franke [via Apache Spark User List]" 
<ml+s1001560n32458...@n3.nabble.com>, <user@spark.apache.org>
Subject: Re: [PySpark] Releasing memory after a spark job is finished

Can you tell us what version of Spark you are using and whether Dynamic 
Allocation is enabled?

Also, how are the files being read? Is it a single read of all files using a 
file-matching pattern, or are you running different threads in the same pyspark 
job?


On Mon 4 Jun, 2018, 1:27 PM Shuporno Choudhury, 
<shuporno.choudh...@gmail.com> wrote:
Thanks a lot for the insight.
Actually, I have the exact same transformations for all the datasets, hence 
only one Python script.
Now, do you suggest that I run a separate spark-submit for each dataset, given 
that the transformations are identical?
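For concreteness, such a setup might be a single generic driver script that 
takes the dataset path as an argument, launched once per dataset. Each 
spark-submit then runs in its own JVM, so all memory is released when that 
process exits. (The script name, bucket, and dataset names below are 
illustrative.)

```shell
#!/bin/sh
# build_cmd only assembles the command line; pipe its output to sh
# (or drop the echo) to actually launch the jobs.
build_cmd() {
    echo "spark-submit --master local[*] process_one.py s3://my-bucket/$1/"
}

for dataset in sales_2023 sales_2024 inventory; do
    build_cmd "$dataset"
done
```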

On Tue 5 Jun, 2018, 1:48 AM Jörn Franke [via Apache Spark User List], 
<ml+s1001560n32458...@n3.nabble.com> wrote:
Yes, if they are independent, with different transformations, then I would 
create a separate Python program for each. Especially with big data processing 
frameworks, one should avoid putting everything into one big monolithic 
application.


On 4. Jun 2018, at 22:02, Shuporno Choudhury wrote:
Hi,

Thanks for the input.
I was trying to get the functionality working first, hence local mode; I will 
definitely run on a cluster, but later.

Sorry for my naivety, but could you please elaborate on the modularity concept 
you mentioned and how it would change what I am already doing?
When you say 'an independent python program for each process', do you mean 
running a separate spark-submit for each dataset?

On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] wrote:
Why don’t you modularize your code and write an independent Python program for 
each process that is submitted via Spark?

I am not sure, though, that Spark in local mode makes sense. If you don’t have 
a cluster, a plain Python program can be much better.

On 4. Jun 2018, at 21:37, Shuporno Choudhury wrote:
Hi everyone,
I am trying to run a pyspark job over several data sets sequentially [basically: 
1. read the data into a dataframe, 2. perform some join/filter/aggregation, 
3. write the modified data in parquet format to a target location].
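In outline, the driver code looks something like this (heavily simplified; 
`load`/`transform`/`save` are stand-ins for the actual Spark read, 
join/filter/aggregate, and parquet-write calls):

```python
# Simplified shape of the driver loop; plain-Python stand-ins replace Spark.
def load(path):
    return {"path": path, "rows": list(range(1000))}

def transform(df):
    df["rows"] = [r for r in df["rows"] if r % 2 == 0]  # e.g. a filter
    return df

def save(df, target):
    return len(df["rows"])  # would write parquet; here, just report row count

written = []
for path in ["ds1", "ds2", "ds3"]:
    df = load(path)                           # 1. read
    df = transform(df)                        # 2. join/filter/aggregate
    written.append(save(df, path + ".out"))   # 3. write parquet
    # `df` is rebound on each iteration; the question is whether Spark
    # releases the previous dataset's memory at that point.

print(written)  # [500, 500, 500]
```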
Now, while running this pyspark job across multiple independent data sets 
sequentially, the memory used for the previous data set doesn't seem to get 
released/cleared, so Spark's memory consumption (JVM memory consumption as seen 
in Task Manager) keeps increasing until the job fails on some data set.
So, is there a way to clear/remove dataframes that I know will not be used 
later?
Basically, can I free some memory programmatically (in the pyspark code) when 
the processing of a particular data set ends?
At no point am I caching any dataframe, so unpersist() is not a solution either.

I am running Spark with local[*] as master. A single SparkSession does all the 
processing.
If it is not possible to clear out memory, what would be a better approach to 
this problem?

Can someone please help me with this and tell me if I am going wrong anywhere?

--Thanks,
Shuporno Choudhury



--
--Thanks,
Shuporno Choudhury

