Hi
I am running into OOM problems, even though my cluster should be much bigger than I
need. I wonder if it has to do with the way I am writing my code. Below are three
style cases; do any of them cause memory to be leaked?
Case 1:

df1 = spark.read.csv(csvFile)
df1 = df1.someTransform()
df1 = df1.anotherTransform()
df1.write.csv(outputFile)
I assume lazy evaluation: the first action is the write, so I do not think this leaks memory.
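I could sanity-check that with explain(), I think. A rough sketch, where csvFile,
outputFile, and the select() are just placeholders for my real code:

df1 = spark.read.csv(csvFile, header=True)
df1 = df1.select("a", "b")                    # stand-in for my real transforms
df1.explain()                                 # prints the lazy plan; no job has run yet
df1.write.mode("overwrite").csv(outputFile)   # the write is the only action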
Case 2:
I added count() actions to make it easier to debug:

df1 = spark.read.csv(csvFile)
print( df1.count() )
df1 = df1.someTransform()
print( df1.count() )
df1 = df1.anotherTransform()
print( df1.count() )
df1.write.csv(outputFile)
Does this leak memory?
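One thing I was not sure about: each count() is its own action, so I think the CSV
gets re-read each time. Would caching while I debug and then releasing it be the
right pattern? A rough sketch (csvFile, outputFile, and the column names are
placeholders):

rawDF = spark.read.csv(csvFile, header=True).cache()   # cache while debugging
print( rawDF.count() )                                 # first action materializes the cache
df1 = rawDF.select("a", "b")                           # stand-in for my real transforms
print( df1.count() )
df1.write.mode("overwrite").csv(outputFile)
rawDF.unpersist()                                      # explicitly drop the cached blocks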
Case 3:
If you remove the debug actions, you have the original version of my code:
for i, csvFile in enumerate(listOfFiles):
    df1 = spark.read.csv(csvFile)
    df1 = df1.select(["a", "b"])
    print( df1.count() )
    df1.createOrReplaceTempView("df1")

    sqlStmt = 'select {} \n\
               from \n\
               retDF as rc, \n\
               df1 \n\
               where \n\
               rc.Name == df1.Name \n'.format("a")

    if i == 0:
        retDF = df1
    else:
        retDF = spark.sql(sqlStmt)

    print( retDF.count() )
    retDF.createOrReplaceTempView("retDF")
Does this leak memory? Is there some sort of destroy(), delete(), or similar function
I should be calling?
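For example, is something like this the kind of cleanup I should be doing? This is
just a guess on my part (unpersist(), dropTempView(), clearCache()):

df1.unpersist()                       # only matters if df1 was cached/persisted
spark.catalog.dropTempView("df1")     # drop the per-file temp view at the end of each iteration
# and after the loop, once retDF has been written out:
spark.catalog.dropTempView("retDF")
spark.catalog.clearCache()            # drops everything that is still cached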
I wonder if I would be better off using the DataFrame version of join()?
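i.e. replacing the SQL with something like this. This is only a sketch; I am assuming
the SQL above amounts to an inner join on Name, and the column names are just the
ones from my example:

retDF = None
for csvFile in listOfFiles:
    df1 = spark.read.csv(csvFile).select("Name", "a", "b")
    if retDF is None:
        retDF = df1                          # first file seeds the result
    else:
        retDF = retDF.join(df1, on="Name")   # inner equi-join on Name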
Kind regards
Andy