Can you tell us what version of Spark you are using and whether Dynamic Allocation is enabled?
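For example, both can be checked from a running pyspark shell (this assumes the default spark session variable the shell creates):

    # Assumes the `spark` SparkSession created by the pyspark shell.
    print(spark.version)
    # The second argument is the default returned when the key is unset.
    print(spark.conf.get("spark.dynamicAllocation.enabled", "false"))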
Also, how are the files being read? Is it a single read of all the files using a file-matching regex, or are you running different threads in the same pyspark job?

On Mon 4 Jun, 2018, 1:27 PM Shuporno Choudhury, <shuporno.choudh...@gmail.com> wrote:

> Thanks a lot for the insight.
> Actually, I have the exact same transformations for all the datasets, hence only one python script.
> Now, do you suggest that I run a different spark-submit for each of the different datasets, given that I have the exact same transformations?
>
> On Tue 5 Jun, 2018, 1:48 AM Jörn Franke [via Apache Spark User List], <ml+s1001560n32458...@n3.nabble.com> wrote:
>
>> Yes, if they are independent with different transformations, then I would create a separate python program. Especially for big data processing frameworks, one should avoid putting everything into one big monolithic application.
>>
>> On 4. Jun 2018, at 22:02, Shuporno Choudhury <[hidden email]> wrote:
>>
>> Hi,
>>
>> Thanks for the input.
>> I was trying to get the functionality working first, hence I was using local mode. I will definitely be running on a cluster, but later.
>>
>> Sorry for my naivety, but can you please elaborate on the modularity concept you mentioned and how it will affect what I am already doing?
>> Do you mean running a different spark-submit for each different dataset when you say 'an independent python program for each process'?
>>
>> On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List] <[hidden email]> wrote:
>>
>>> Why don't you modularize your code and write an independent python program for each process that is submitted via Spark?
>>>
>>> Not sure, though, whether Spark local makes sense. If you don't have a cluster, then a normal python program can be much better.
>>>
>>> On 4. Jun 2018, at 21:37, Shuporno Choudhury <[hidden email]> wrote:
>>>
>>> Hi everyone,
>>> I am trying to run pyspark code on some data sets sequentially [basically:
>>> 1. Read data into a dataframe
>>> 2. Perform some join/filter/aggregation
>>> 3. Write the modified data in parquet format to a target location]
>>> Now, while running this pyspark code across *multiple independent data sets sequentially*, the memory usage from the previous data set doesn't seem to get released/cleared, and hence spark's memory consumption (JVM memory consumption, as seen in Task Manager) keeps increasing until it fails on some data set.
>>> So, is there a way to clear/remove dataframes that I know are not going to be used later?
>>> Basically, can I clear out some memory programmatically (in the pyspark code) when processing for a particular data set ends?
>>> At no point am I caching any dataframe (so unpersist() is also not a solution).
>>>
>>> I am running spark using local[*] as master. There is a single SparkSession that is doing all the processing.
>>> If it is not possible to clear out memory, what would be a better approach to this problem?
>>>
>>> Can someone please help me with this and tell me if I am going wrong anywhere?
>>>
>>> --Thanks,
>>> Shuporno Choudhury
>>
>> --
>> --Thanks,
>> Shuporno Choudhury
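To make the question concrete, a minimal sketch of the per-dataset loop described in the original question, with explicit cleanup between data sets. The paths, input format (parquet is assumed here), and transformation are placeholders; note that this only allows the JVM to reclaim memory, it does not force it:

    import gc
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Placeholder list of independent (input, output) locations.
    datasets = [("/data/in/ds1", "/data/out/ds1"),
                ("/data/in/ds2", "/data/out/ds2")]

    for src, dst in datasets:
        df = spark.read.parquet(src)                   # 1. read into a dataframe
        result = df.filter("value IS NOT NULL")        # 2. join/filter/aggregation (placeholder)
        result.write.mode("overwrite").parquet(dst)    # 3. write parquet to the target
        # Drop the Python references, clear any cached tables, and collect
        # on the Python side; the JVM may still hold memory until its GC runs.
        del df, result
        spark.catalog.clearCache()
        gc.collect()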
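And a sketch of the alternative suggested above: one independent python program per data set, each run through its own spark-submit, so memory is returned to the operating system when each JVM exits. The script name, dataset names, and paths are hypothetical:

    # process_one.py -- processes exactly one data set per invocation.
    import sys
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        src, dst = sys.argv[1], sys.argv[2]
        spark = SparkSession.builder.getOrCreate()
        df = spark.read.parquet(src)
        result = df.filter("value IS NOT NULL")        # placeholder transformation
        result.write.mode("overwrite").parquet(dst)
        spark.stop()                                   # memory goes away with the JVM

    # run_all.py -- submits process_one.py once per data set.
    import subprocess
    for name in ["ds1", "ds2", "ds3"]:
        subprocess.run(["spark-submit", "--master", "local[*]", "process_one.py",
                        "/data/in/" + name, "/data/out/" + name], check=True)

With this split, the operating system rather than the garbage collector is responsible for reclaiming memory between data sets.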