Hi
I am having trouble debugging my driver. It runs correctly on smaller data sets
but fails on large ones, and it is very hard to figure out what the bug is. I
suspect it may have something to do with the way Spark is installed and
configured. I am using PySpark on Google Cloud Platform Dataproc.
The log messages are not helpful. The error message will be something like
"User application exited with status 1"
and
jsonPayload: {
class: "server.TThreadPoolServer"
filename: "hive-server2.log"
message: "Error occurred during processing of message."
thread: "HiveServer2-Handler-Pool: Thread-40"
}
I am able to access the Spark history server; however, it does not capture
anything if the driver crashes. I have also been unable to figure out how to
access the Spark web UI.
My driver program looks something like the pseudocode below: a long list of
transformations with a single action (i.e. a write) at the end. Adding log
messages is not helpful because of lazy evaluation. I am tempted to add
something like
Logger.warn("DEBUG df.count(): {}".format(df.count()))
to try and inline some sort of diagnostic message.
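In full it might look like the helper below (just a sketch; I am assuming
Python's standard logging module is fine to use on Dataproc, and
debug_count / label are placeholder names I made up):

import logging

logger = logging.getLogger(__name__)

def debug_count(df, label):
    # count() is an action, so this forces evaluation of everything
    # computed so far; on large data the count itself may be expensive
    logger.warning("DEBUG %s df.count(): %d", label, df.count())

# usage inside run(), e.g.:
#     debug_count(df3, "after join")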
What do you think?
Is there a better way to debug this?
Kind regards
Andy
def run():
    # read each input file into its own DataFrame
    listOfDF = []
    for filePath in listOfFiles:
        df = spark.read.load(filePath, ...)
        listOfDF.append(df)

    # project just the columns we need from each DataFrame
    list2OfDF = []
    for df in listOfDF:
        df2 = df.select(...)
        list2OfDF.append(df2)

    # will setting the list to None free the cached data,
    # or just driver memory?
    listOfDF = None

    # join all the projected DataFrames together
    df3 = list2OfDF[0]
    for df in list2OfDF[1:]:
        df3 = df3.join(df, ...)

    # will setting the list to None free the cached data,
    # or just driver memory?
    list2OfDF = None

    # ... lots of narrow transformations on df3 ...
    return df3

def main():
    df = run()
    df.write.save(...)