Hi,

Thanks for the input.
I was trying to get the functionality working first, hence the local mode.
I will definitely be running this on a cluster, but later.

Sorry for my naivety, but could you please elaborate on the modularity
concept you mentioned and how it would affect what I am already doing?
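
For context, this is roughly what I am doing right now (a minimal sketch;
the paths, column names, and transformations are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # one SparkSession shared across every data set
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("sequential-etl")
             .getOrCreate())

    # hypothetical list of independent input/output locations
    datasets = [
        ("/data/in/ds1", "/data/out/ds1"),
        ("/data/in/ds2", "/data/out/ds2"),
    ]

    for src, dst in datasets:
        df = spark.read.parquet(src)                   # 1. read into a dataframe
        out = (df.filter(F.col("status") == "active")  # 2. join/filter/aggregation
                 .groupBy("key")
                 .agg(F.sum("value").alias("total")))
        out.write.mode("overwrite").parquet(dst)       # 3. write parquet to target
        # memory from this iteration does not seem to be released before the next

    spark.stop()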
When you say 'an independent python program for each process', do you mean
running a separate spark-submit for each data set? Something like the sketch
below?
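
(A hypothetical per-data-set driver script, launched once per data set;
the file name, paths, and transformations are again made up:)

    # process_one.py -- run once per data set, e.g.:
    #   spark-submit process_one.py /data/in/ds1 /data/out/ds1
    import sys

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    src, dst = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("etl-" + src).getOrCreate()

    df = spark.read.parquet(src)                   # 1. read into a dataframe
    out = (df.filter(F.col("status") == "active")  # 2. join/filter/aggregation
             .groupBy("key")
             .agg(F.sum("value").alias("total")))
    out.write.mode("overwrite").parquet(dst)       # 3. write parquet to target

    spark.stop()  # the JVM exits with the process, so its memory returns to the OS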

On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] wrote:

> Why don’t you modularize your code and write an independent Python
> program for each process that is submitted via Spark?
>
> I am not sure, though, that Spark in local mode makes sense. If you don’t
> have a cluster, then a normal Python program can be much better.
>
> On 4 Jun 2018, at 21:37, Shuporno Choudhury wrote:
>
> Hi everyone,
> I am trying to run a PySpark job on several data sets sequentially [basically:
> 1. Read the data into a dataframe, 2. Perform some join/filter/aggregation,
> 3. Write the modified data in parquet format to a target location].
> Now, while running this code across *multiple independent data sets
> sequentially*, the memory used for the previous data set doesn't seem to
> get released/cleared, so Spark's memory consumption (JVM memory as
> reported by Task Manager) keeps increasing until the job fails on some
> data set.
> So, is there a way to clear/remove dataframes that I know are not going to
> be used later?
> Basically, can I free some memory programmatically (in the PySpark code)
> when the processing of a particular data set ends?
> At no point am I caching any dataframe (so unpersist() is not a solution
> either).
>
> I am running Spark with local[*] as master. There is a single
> SparkSession that does all the processing.
> If it is not possible to clear out memory, what would be a better approach
> to this problem?
>
> Can someone please help me with this and tell me if I am going wrong
> anywhere?
>
> --Thanks,
> Shuporno Choudhury
>
>
>


-- 
--Thanks,
Shuporno Choudhury
