Hi
I am having trouble debugging my driver. It runs correctly on smaller data sets
but fails on large ones, and it is very hard to figure out what the bug is. I
suspect it may have something to do with the way Spark is installed and
configured. I am using PySpark on Google Cloud Platform Dataproc.
The log messages are not helpful. The error message will be something like
"User application exited with status 1"
and
jsonPayload: {
class: "server.TThreadPoolServer"
filename: "hive-server2.log"
message: "Error occurred during processing of message."
thread: "HiveServer2-Handler-Pool: Thread-40"
}
I am able to access the Spark history server; however, it does not capture
anything if the driver crashes. I have also been unable to figure out how to
access the Spark web UI.
My driver program looks something like the pseudocode below: a long list of
transformations with a single action (i.e. a write) at the end. Adding log
messages is not helpful because of lazy evaluation. I am tempted to add
something like
Logger.warn("DEBUG df.count(): {}".format(df.count()))
to try and inline some sort of diagnostic message.
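In full it might look like the helper below (just a sketch; I am assuming
Python's standard logging module is fine to use on Dataproc, and
debug_count / label are placeholder names I made up):

import logging

logger = logging.getLogger(__name__)

def debug_count(df, label):
    # count() is an action, so this forces evaluation of everything
    # computed so far; on large data the count itself may be expensive
    logger.warning("DEBUG %s df.count(): %d", label, df.count())

# usage inside run(), e.g.:
#     debug_count(df3, "after join")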
What do you think?
Is there a better way to debug this?
Kind regards
Andy
def run():
    # read each input file into its own DataFrame
    listOfDF = []
    for filePath in listOfFiles:
        df = spark.read.load(filePath, ...)
        listOfDF.append(df)

    # project just the columns we need from each DataFrame
    list2OfDF = []
    for df in listOfDF:
        df2 = df.select(...)
        list2OfDF.append(df2)

    # will setting the list to None free the cached data,
    # or just driver memory?
    listOfDF = None

    # join all the projected DataFrames together
    df3 = list2OfDF[0]
    for df in list2OfDF[1:]:
        df3 = df3.join(df, ...)

    # will setting the list to None free the cached data,
    # or just driver memory?
    list2OfDF = None

    # ... lots of narrow transformations on df3 ...
    return df3

def main():
    df = run()
    df.write.save(...)