Can you tell us what version of Spark you are using and whether Dynamic
Allocation is enabled?

Also, how are the files being read? Is it a single read of all files using
a file-matching regex, or are you running different threads in the same
pyspark job?
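
For example, by "a single read using a file-matching regex" I mean something
like the following (the path and pattern are just placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # one read that picks up many files at once via a glob pattern
    df = spark.read.parquet("s3://bucket/data/2018/*/part-*.parquet")

as opposed to several reads issued from different threads against the same
SparkSession.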



On Mon 4 Jun, 2018, 1:27 PM Shuporno Choudhury, <
shuporno.choudh...@gmail.com> wrote:

> Thanks a lot for the insight.
> Actually, I have the exact same transformations for all the datasets, hence
> only one python script.
> Now, do you suggest that I run a separate spark-submit for each dataset,
> given that the transformations are exactly the same?
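>
> Concretely (script and file names here are just illustrative), I imagine one
> parameterized script:
>
>     import sys
>     from pyspark.sql import SparkSession
>
>     spark = SparkSession.builder.getOrCreate()
>     src, dest = sys.argv[1], sys.argv[2]
>
>     df = spark.read.parquet(src)
>     # ... the same join/filter/aggregation applied to every dataset ...
>     df.write.parquet(dest)
>
> launched once per dataset, e.g. 'spark-submit process.py <input> <output>'.
> Is that what you mean?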
>
> On Tue 5 Jun, 2018, 1:48 AM Jörn Franke [via Apache Spark User List], <
> ml+s1001560n32458...@n3.nabble.com> wrote:
>
>> Yes, if they are independent, with different transformations, then I would
>> create a separate python program for each. Especially with big data
>> processing frameworks, one should avoid putting everything into one big
>> monolithic application.
>>
>>
>> On 4. Jun 2018, at 22:02, Shuporno Choudhury <[hidden email]> wrote:
>>
>> Hi,
>>
>> Thanks for the input.
>> I was trying to get the functionality working first, hence I was using
>> local mode. I will definitely be running on a cluster later.
>>
>> Sorry for my naivety, but can you please elaborate on the modularity
>> concept that you mentioned and how it would affect what I am already
>> doing?
>> When you say 'an independent python program for each process', do you mean
>> running a different spark-submit for each dataset?
>>
>> On Tue, 5 Jun 2018 at 01:12, Jörn Franke [via Apache Spark User List]
>> <[hidden email]> wrote:
>>
>>> Why don’t you modularize your code and write an independent python
>>> program for each process that is submitted via Spark?
>>>
>>> Not sure though if Spark local mode makes sense. If you don’t have a
>>> cluster, then a normal python program can be much better.
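>>>
>>> For instance, if each dataset fits in memory, plain pandas would cover the
>>> same read/transform/write cycle without a JVM (snippet is purely
>>> illustrative; paths and column names are made up):
>>>
>>>     import pandas as pd
>>>
>>>     df = pd.read_parquet("input.parquet")
>>>     out = df[df["value"] > 0].groupby("key", as_index=False).sum()
>>>     out.to_parquet("output.parquet")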
>>>
>>> On 4. Jun 2018, at 21:37, Shuporno Choudhury <[hidden email]> wrote:
>>>
>>> Hi everyone,
>>> I am trying to run a pyspark job on several data sets sequentially
>>> [basically: 1. read data into a dataframe, 2. perform some
>>> join/filter/aggregation, 3. write the modified data in parquet format to a
>>> target location].
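>>> In outline, each pass looks like this (paths, keys and the actual
>>> transformations are simplified placeholders):
>>>
>>>     from pyspark.sql import SparkSession
>>>
>>>     spark = SparkSession.builder.master("local[*]").getOrCreate()
>>>
>>>     df = spark.read.parquet(input_path)
>>>     result = df.filter("value > 0").groupBy("key").agg({"value": "sum"})
>>>     result.write.parquet(output_path)
>>>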
>>> Now, while running this pyspark job across *multiple independent data
>>> sets sequentially*, the memory used for the previous data set doesn't
>>> seem to get released/cleared, so Spark's memory consumption (the JVM's
>>> memory usage as seen in Task Manager) keeps increasing until it fails on
>>> some data set.
>>> So, is there a way to clear/remove dataframes that I know are not going
>>> to be used later?
>>> Basically, can I clear out some memory programmatically (in the pyspark
>>> code) when processing for a particular data set ends?
>>> At no point am I caching any dataframe (so unpersist() is also not a
>>> solution).
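>>>
>>> The only programmatic cleanup hooks I have come across are the ones below,
>>> and none of them seems to apply here (snippet is illustrative):
>>>
>>>     import gc
>>>
>>>     df.unpersist()              # only helps if the dataframe was cached
>>>     spark.catalog.clearCache()  # likewise only drops cached data
>>>     del df; gc.collect()        # releases the python-side reference only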
>>>
>>> I am running spark using local[*] as master. There is a single
>>> SparkSession that is doing all the processing.
>>> If it is not possible to clear out memory, what would be a better
>>> approach to this problem?
>>>
>>> Can someone please help me with this and tell me if I am going wrong
>>> anywhere?
>>>
>>> --Thanks,
>>> Shuporno Choudhury
>>>
>> --
>> --Thanks,
>> Shuporno Choudhury