Hi,

Below is typical pseudocode I find myself writing over and over again. There is only a single action, at the very end of the program. The early narrow transformations potentially hold on to a lot of needless data. I have a for loop over join() (i.e. a wide transformation), followed by a bunch more narrow transformations. Will setting my lists to None improve performance?
What are the best practices?

Kind regards,
Andy

    def run():
        listOfDF = []
        for filePath in listOfFiles:
            df = spark.read.load(filePath, ...)
            listOfDF.append(df)

        list2OfDF = []
        for df in listOfDF:
            df2 = df.select(...)
            list2OfDF.append(df2)

        # will setting the list to None free the cache?
        # or just driver memory?
        listOfDF = None

        df3 = list2OfDF[0]
        for i in range(1, len(list2OfDF)):
            df = list2OfDF[i]
            df3 = df3.join(df, ...)

        # will setting the list to None free the cache?
        # or just driver memory?
        list2OfDF = None

        # lots of narrow transformations on df3
        return df3

    def main():
        df = run()
        df.write()
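As an aside, the manual index-based join loop can be written as a fold. This is a minimal sketch of the pattern using plain dicts as stand-ins for DataFrames (the actual join columns are elided in my pseudocode, so the merge logic here is a hypothetical inner join on a key); in real PySpark the lambda would call left.join(right, ...) instead:

    ```python
    from functools import reduce

    # Stand-ins for DataFrames: dicts mapping a join key to "columns".
    frames = [
        {1: {"a": 10}, 2: {"a": 20}},
        {1: {"b": 30}, 2: {"b": 40}},
        {1: {"c": 50}, 2: {"c": 60}},
    ]

    def inner_join(left, right):
        # Keep keys present on both sides, merging their columns --
        # a toy analogue of df_left.join(df_right, on=<key>).
        return {k: {**left[k], **right[k]} for k in left if k in right}

    # reduce() folds the pairwise join over the whole list, exactly like
    # the df3 = df3.join(df, ...) loop above, with no explicit index.
    joined = reduce(inner_join, frames)
    print(joined[1])  # {'a': 10, 'b': 30, 'c': 50}
    ```

With real DataFrames this would be reduce(lambda l, r: l.join(l_r_join_args), list2OfDF), assuming all frames share the join key.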
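For context on the "does listOfDF = None free anything" question: rebinding a Python name only drops the driver-side reference, and that is all it can ever do. A minimal sketch (plain CPython, no Spark involved; the Payload class is a hypothetical stand-in for a list of DataFrame handles) shows the object being collected once the last strong reference goes away:

    ```python
    import gc
    import weakref

    class Payload:
        """Stand-in for a driver-side object, e.g. a list of DataFrame handles."""
        pass

    obj = Payload()
    ref = weakref.ref(obj)   # observe the object without keeping it alive

    assert ref() is not None  # still alive: `obj` holds a strong reference

    obj = None                # rebind the name, dropping the last strong reference
    gc.collect()              # CPython frees it via refcounting even without this
    assert ref() is None      # the driver-side object is gone
    ```

Executor-side cached data is a separate matter: if a DataFrame was persisted, it is released explicitly with df.unpersist() (or eventually by Spark's context cleaner once no reference to it remains), not by setting a driver-side list to None.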