Yes, if they are independent and have different transformations, then I would create a separate Python program for each. Especially with big data processing frameworks, one should avoid putting everything into one big monolithic application.
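A minimal sketch of what such an independent per-dataset job could look like (the paths, column names, and transformations below are placeholders for illustration only, not taken from your code):

```python
# process_one_dataset.py -- hypothetical standalone job, one data set per spark-submit
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def main(input_path: str, output_path: str) -> None:
    spark = SparkSession.builder.appName(f"process-{input_path}").getOrCreate()

    # 1. Read the data set into a DataFrame
    df = spark.read.parquet(input_path)

    # 2. Apply the transformations (filter/aggregate shown here as an example)
    result = (
        df.filter(F.col("status") == "active")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_amount"))
    )

    # 3. Write the result to the target location in Parquet format
    result.write.mode("overwrite").parquet(output_path)

    # Stopping the session shuts down the JVM for this run, so driver/executor
    # memory is released before the next data set is processed.
    spark.stop()


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Each data set would then get its own spark-submit invocation (e.g. `spark-submit process_one_dataset.py <input> <output>`), so memory used by one data set cannot accumulate into the next.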
> On 4. Jun 2018, at 22:02, Shuporno Choudhury <shuporno.choudh...@gmail.com> wrote:
>
> Hi,
>
> Thanks for the input.
> I was trying to get the functionality first, hence I was using local mode. I will definitely be running on a cluster, but later.
>
> Sorry for my naivety, but can you please elaborate on the modularity concept that you mentioned and how it will affect whatever I am already doing?
> Do you mean running a different spark-submit for each different dataset when you say 'an independent python program for each process'?
>
>> On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] <ml+s1001560n32455...@n3.nabble.com> wrote:
>>
>> Why don't you modularize your code and write an independent python program for each process that is submitted via Spark?
>>
>> Not sure, though, if Spark local makes sense. If you don't have a cluster, then a normal python program can be much better.
>>
>>> On 4. Jun 2018, at 21:37, Shuporno Choudhury <[hidden email]> wrote:
>>>
>>> Hi everyone,
>>> I am trying to run a pyspark code on some data sets sequentially [basically 1. Read data into a dataframe, 2. Perform some join/filter/aggregation, 3. Write modified data in parquet format to a target location].
>>> Now, while running this pyspark code across multiple independent data sets sequentially, the memory usage from the previous data set doesn't seem to get released/cleared, and hence Spark's memory consumption (JVM memory consumption from Task Manager) keeps on increasing until it fails at some data set.
>>> So, is there a way to clear/remove dataframes that I know are not going to be used later?
>>> Basically, can I clear out some memory programmatically (in the pyspark code) when processing for a particular data set ends?
>>> At no point am I caching any dataframe (so unpersist() is also not a solution).
>>>
>>> I am running spark using local[*] as master. There is a single SparkSession that is doing all the processing.
>>> If it is not possible to clear out memory, what can be a better approach for this problem?
>>>
>>> Can someone please help me with this and tell me if I am going wrong anywhere?
>>>
>>> --Thanks,
>>> Shuporno Choudhury
>
> --
> --Thanks,
> Shuporno Choudhury