Yes, if they are independent and use different transformations, then I would create 
a separate Python program for each. Especially with big data processing frameworks, 
one should avoid putting everything into one big monolithic application.
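A rough sketch of what such an independent per-data-set program could look like (the file name, paths, and read options below are illustrative assumptions, not something from this thread):

```python
# process_dataset.py -- hypothetical skeleton: one independent PySpark
# program per data set, each submitted separately via spark-submit, so
# the driver JVM and its memory are released when the process exits.
import sys

def build_paths(dataset):
    # hypothetical layout: derive input/output locations from the data set name
    return (f"/data/in/{dataset}", f"/data/out/{dataset}.parquet")

def main(dataset):
    from pyspark.sql import SparkSession
    src, dst = build_paths(dataset)
    spark = (SparkSession.builder
             .appName(f"etl-{dataset}")
             .getOrCreate())
    df = spark.read.csv(src, header=True)
    # ... join / filter / aggregate here ...
    df.write.mode("overwrite").parquet(dst)
    spark.stop()  # ends this run's session; memory goes away with the process

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Each data set then gets its own submission, e.g. `spark-submit process_dataset.py sales_2017`, instead of one long-lived session looping over all of them.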


> On 4. Jun 2018, at 22:02, Shuporno Choudhury <shuporno.choudh...@gmail.com> 
> wrote:
> 
> Hi,
> 
> Thanks for the input.
> I was trying to get the functionality first, hence I was using local mode. I 
> will definitely be running on a cluster, but later.
> 
> Sorry for my naivety, but can you please elaborate on the modularity concept 
> that you mentioned and how it will affect whatever I am already doing?
> Do you mean running a different spark-submit for each different dataset when 
> you say 'an independent python program for each process'?
> 
>> On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] 
>> <ml+s1001560n32455...@n3.nabble.com> wrote:
>> Why don’t you modularize your code and write for each process an independent 
>> python program that is submitted via Spark?
>> 
>> Not sure, though, if Spark local makes sense. If you don’t have a cluster, then 
>> a normal python program can be much better.
>> 
>>> On 4. Jun 2018, at 21:37, Shuporno Choudhury <[hidden email]> wrote:
>>> 
>>> Hi everyone,
>>> I am trying to run a pyspark code on some data sets sequentially [basically: 
>>> 1. Read data into a dataframe, 2. Perform some join/filter/aggregation, 3. 
>>> Write modified data in parquet format to a target location].
>>> Now, while running this pyspark code across multiple independent data sets 
>>> sequentially, the memory used for the previous data set doesn't seem to 
>>> get released/cleared, and hence Spark's memory consumption (JVM memory 
>>> consumption as seen in Task Manager) keeps increasing until it fails on 
>>> some data set.
>>> So, is there a way to clear/remove dataframes that I know are not going to 
>>> be used later? 
>>> Basically, can I clear out some memory programmatically (in the pyspark 
>>> code) when processing for a particular data set ends?
>>> At no point am I caching any dataframe (so unpersist() is also not a 
>>> solution).
>>> 
>>> I am running spark using local[*] as master. There is a single SparkSession 
>>> that is doing all the processing.
>>> If it is not possible to clear out memory, what can be a better approach 
>>> for this problem?
>>> 
>>> Can someone please help me with this and tell me if I am going wrong 
>>> anywhere?
>>> 
>>> --Thanks,
>>> Shuporno Choudhury
>> 
>> 
> 
> 
> -- 
> --Thanks,
> Shuporno Choudhury
