Hi
I am running into OOM problems, even though my cluster should be much bigger than I
need. I wonder if it has to do with the way I am writing my code. Below are three
style cases; do any of them cause memory to be leaked?
Case 1:

df1 = spark.read.csv(csvFile)
df1 = df1.someTransform()
df1 = df1.anotherTransform()
df1.write.csv(outputFile)
I assume lazy evaluation: the first action is the write, so I do not think this leaks memory.
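I could sanity-check that with explain(), I think. A rough sketch, where csvFile,
outputFile, and the select() are just placeholders for my real code:

df1 = spark.read.csv(csvFile, header=True)
df1 = df1.select("a", "b")                    # stand-in for my real transforms
df1.explain()                                 # prints the lazy plan; no job has run yet
df1.write.mode("overwrite").csv(outputFile)   # the write is the only action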
Case 2:
I added count() actions to make it easier to debug:

df1 = spark.read.csv(csvFile)
print( df1.count() )
df1 = df1.someTransform()
print( df1.count() )
df1 = df1.anotherTransform()
print( df1.count() )
df1.write.csv(outputFile)
Does this leak memory?
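One thing I was not sure about: each count() is its own action, so I think the CSV
gets re-read each time. Would caching while I debug and then releasing it be the
right pattern? A rough sketch (csvFile, outputFile, and the column names are
placeholders):

rawDF = spark.read.csv(csvFile, header=True).cache()   # cache while debugging
print( rawDF.count() )                                 # first action materializes the cache
df1 = rawDF.select("a", "b")                           # stand-in for my real transforms
print( df1.count() )
df1.write.mode("overwrite").csv(outputFile)
rawDF.unpersist()                                      # explicitly drop the cached blocks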
Case 3:
If you remove the debug actions, you have the original version of my code:
for i, csvFile in enumerate(listOfFiles):
    df1 = spark.read.csv(csvFile)
    df1 = df1.select(["a", "b"])
    print( df1.count() )
    df1.createOrReplaceTempView("df1")

    sqlStmt = 'select {} \n\
               from \n\
               retDF as rc, \n\
               df1 \n\
               where \n\
               rc.Name == df1.Name \n'.format("a")

    if i == 0:
        retDF = df1
    else:
        retDF = spark.sql(sqlStmt)

    print( retDF.count() )
    retDF.createOrReplaceTempView("retDF")
Does this leak memory? Is there some sort of destroy(), delete(), or similar function
I should be calling?
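For example, is something like this the kind of cleanup I should be doing? This is
just a guess on my part (unpersist(), dropTempView(), clearCache()):

df1.unpersist()                       # only matters if df1 was cached/persisted
spark.catalog.dropTempView("df1")     # drop the per-file temp view at the end of each iteration
# and after the loop, once retDF has been written out:
spark.catalog.dropTempView("retDF")
spark.catalog.clearCache()            # drops everything that is still cached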
I wonder if I would be better off using the DataFrame version of join()?
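i.e. replacing the SQL with something like this. This is only a sketch; I am assuming
the SQL above amounts to an inner join on Name, and the column names are just the
ones from my example:

retDF = None
for csvFile in listOfFiles:
    df1 = spark.read.csv(csvFile).select("Name", "a", "b")
    if retDF is None:
        retDF = df1                          # first file seeds the result
    else:
        retDF = retDF.join(df1, on="Name")   # inner equi-join on Name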
Kind regards
Andy