Thanks Sean

Andy
From: Sean Owen <sro...@gmail.com>
Date: Wednesday, January 5, 2022 at 3:38 PM
To: Andrew Davidson <aedav...@ucsc.edu>, Nicholas Gustafson <njgustaf...@gmail.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: Newbie pyspark memory mgmt question

There is no memory leak, no. You can .cache() or .persist() DataFrames, and that can use memory until you .unpersist(), but you're not doing that, and they are garbage collected anyway. It's hard to say what's running out of memory without knowing more about your data size, partitions, cluster size, etc.

On Wed, Jan 5, 2022 at 5:27 PM Andrew Davidson <aedav...@ucsc.edu.invalid> wrote:

Hi,

I am running into OOM problems. My cluster should be much bigger than I need, so I wonder if it has to do with the way I am writing my code. Below are three style cases. I wonder if they cause memory to be leaked?

Case 1:

    df1 = spark.read.load(csvFile)
    df1 = df1.someTransform()
    df1 = df1.someTransform()
    df1.write(csvFile)

I assume lazy evaluation. The first action is the write, so this does not leak memory.

Case 2: I added actions to make it easier to debug.

    df1 = spark.read.load(csvFile)
    print(df1.count())
    df1 = df1.someTransform()
    print(df1.count())
    df1 = df1.someTransform()
    print(df1.count())
    df1.write(csvFile)

Does this leak memory?

Case 3: If you remove the debug actions, you have the original version of my code.

    for i, f in enumerate(listOfFiles):
        df1 = spark.read.load(f)
        df1 = df1.select(["a", "b"])
        print(df1.count())
        df1.createOrReplaceTempView("df1")
        sqlStmt = 'from \n\
                     retDF as rc, \n\
                     sample \n\
                   where \n\
                     rc.Name == df1.Name \n'.format("a")
        if i == 0:
            retDF = df1
        else:
            retDF = self.spark.sql(sqlStmt)
        print(retDF.count())
        retDF.createOrReplaceTempView("retDF")

Does this leak memory? Is there some sort of destroy(), delete(), ??? function I should be calling? I wonder if I would be better off using the DataFrame version of join()?

Kind regards

Andy