Hi,

Below is typical pseudocode I find myself writing over and over again. There is only a single action, at the very end of the program. The early narrow transformations potentially hold on to a lot of needless data. I have a for loop over join() (i.e. a wide transformation), followed by a bunch more narrow transformations. Will setting my lists to None improve performance?
What are the best practices?

Kind regards,
Andy

    def run():
        listOfDF = []
        for filePath in listOfFiles:
            df = spark.read.load(filePath, ...)
            listOfDF.append(df)

        list2OfDF = []
        for df in listOfDF:
            df2 = df.select(...)
            list2OfDF.append(df2)

        # will setting the list to None free the cache?
        # or just driver memory?
        listOfDF = None

        df3 = list2OfDF[0]
        for i in range(1, len(list2OfDF)):
            df = list2OfDF[i]
            df3 = df3.join(df, ...)

        # will setting the list to None free the cache?
        # or just driver memory?
        list2OfDF = None

        # lots of narrow transformations on df3
        return df3

    def main():
        df = run()
        df.write()
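As an aside, the manual index-based join loop can be written as a fold. This is a minimal sketch of the pattern using plain dicts as stand-ins for DataFrames (the actual join columns are elided in my pseudocode, so the merge logic here is a hypothetical inner join on a key); in real PySpark the lambda would call left.join(right, ...) instead:

    ```python
    from functools import reduce

    # Stand-ins for DataFrames: dicts mapping a join key to "columns".
    frames = [
        {1: {"a": 10}, 2: {"a": 20}},
        {1: {"b": 30}, 2: {"b": 40}},
        {1: {"c": 50}, 2: {"c": 60}},
    ]

    def inner_join(left, right):
        # Keep keys present on both sides, merging their columns --
        # a toy analogue of df_left.join(df_right, on=<key>).
        return {k: {**left[k], **right[k]} for k in left if k in right}

    # reduce() folds the pairwise join over the whole list, exactly like
    # the df3 = df3.join(df, ...) loop above, with no explicit index.
    joined = reduce(inner_join, frames)
    print(joined[1])  # {'a': 10, 'b': 30, 'c': 50}
    ```

With real DataFrames this would be reduce(lambda l, r: l.join(l_r_join_args), list2OfDF), assuming all frames share the join key.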
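For context on the "does listOfDF = None free anything" question: rebinding a Python name only drops the driver-side reference, and that is all it can ever do. A minimal sketch (plain CPython, no Spark involved; the Payload class is a hypothetical stand-in for a list of DataFrame handles) shows the object being collected once the last strong reference goes away:

    ```python
    import gc
    import weakref

    class Payload:
        """Stand-in for a driver-side object, e.g. a list of DataFrame handles."""
        pass

    obj = Payload()
    ref = weakref.ref(obj)   # observe the object without keeping it alive

    assert ref() is not None  # still alive: `obj` holds a strong reference

    obj = None                # rebind the name, dropping the last strong reference
    gc.collect()              # CPython frees it via refcounting even without this
    assert ref() is None      # the driver-side object is gone
    ```

Executor-side cached data is a separate matter: if a DataFrame was persisted, it is released explicitly with df.unpersist() (or eventually by Spark's context cleaner once no reference to it remains), not by setting a driver-side list to None.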