Thanks Sean

Andy
From: Sean Owen <sro...@gmail.com>
Date: Wednesday, January 5, 2022 at 3:38 PM
To: Andrew Davidson <aedav...@ucsc.edu>, Nicholas Gustafson <njgustaf...@gmail.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: Newbie pyspark memory mgmt question

There is no memory leak, no. You can .cache() or .persist() DataFrames, and that can use memory until you .unpersist(), but you're not doing that, and they are garbage collected anyway. It's hard to say what's running out of memory without knowing more about your data size, partitions, cluster size, etc.

On Wed, Jan 5, 2022 at 5:27 PM Andrew Davidson <aedav...@ucsc.edu.invalid> wrote:

Hi,

I am running into OOM problems. My cluster should be much bigger than I need, so I wonder if it has to do with the way I am writing my code. Below are three style cases. I wonder if they cause memory to be leaked?

Case 1:

    df1 = spark.read.load(csvFile)
    df1 = df1.someTransform()
    df1 = df1.someTransform()
    df1.write(csvFile)

I assume lazy evaluation. The first action is the write, so this does not leak memory.

Case 2: I added actions to make it easier to debug.

    df1 = spark.read.load(csvFile)
    print(df1.count())
    df1 = df1.someTransform()
    print(df1.count())
    df1 = df1.someTransform()
    print(df1.count())
    df1.write(csvFile)

Does this leak memory?

Case 3: If you remove the debug actions, you have the original version of my code.

    for i, f in enumerate(listOfFiles):
        df1 = spark.read.load(f)
        df1 = df1.select(["a", "b"])
        print(df1.count())
        df1.createOrReplaceTempView("df1")
        sqlStmt = 'from \n\
                     retDF as rc, \n\
                     sample \n\
                   where \n\
                     rc.Name == df1.Name \n'.format("a")
        if i == 0:
            retDF = df1
        else:
            retDF = self.spark.sql(sqlStmt)
        print(retDF.count())
        retDF.createOrReplaceTempView("retDF")

Does this leak memory? Is there some sort of destroy(), delete(), ??? function I should be calling? I wonder if I would be better off using the DataFrame version of join()?

Kind regards

Andy