Hi Gourav,

I will give Databricks a try.
Each dataset gets loaded into a data frame. I select one column from the data frame, then join that column to the accumulated joins from the previous data frames in the list.

To debug, I think I am going to put an action and a log statement after each join. I do not think it will change the performance; I believe the physical plan will be the same. However, hopefully it will shed some light. At the very least I will know whether it is making progress, and hopefully where it is breaking.

Happy new year

Andy

On Tue, Dec 28, 2021 at 4:19 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Andrew,
>
> Any chance you might give Databricks a try in GCP?
>
> The above transformations look complicated to me. Why are you adding
> dataframes to a list?
>
> Regards,
> Gourav Sengupta
>
> On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson <aedav...@ucsc.edu.invalid> wrote:
>
>> Hi
>>
>> I am having trouble debugging my driver. It runs correctly on smaller
>> data sets but fails on large ones. It is very hard to figure out what the
>> bug is. I suspect it may have something to do with the way Spark is
>> installed and configured. I am using Google Cloud Platform Dataproc
>> PySpark.
>>
>> The log messages are not helpful. The error message will be something
>> like "User application exited with status 1"
>>
>> And
>>
>> jsonPayload: {
>>   class: "server.TThreadPoolServer"
>>   filename: "hive-server2.log"
>>   message: "Error occurred during processing of message."
>>   thread: "HiveServer2-Handler-Pool: Thread-40"
>> }
>>
>> I am able to access the Spark history server; however, it does not
>> capture anything if the driver crashes. I am unable to figure out how to
>> access the Spark web UI.
>>
>> My driver program looks something like the pseudo code below: a long
>> list of transforms with a single action (i.e. write) at the end. Adding
>> log messages is not helpful because of lazy evaluation.
>> I am tempted to add something like
>>
>> logger.warn("DEBUG df.count(): {}".format(df.count()))
>>
>> to try and inline some sort of diagnostic message.
>>
>> What do you think?
>>
>> Is there a better way to debug this?
>>
>> Kind regards
>>
>> Andy
>>
>> def run():
>>     listOfDF = []
>>     for filePath in listOfFiles:
>>         df = spark.read.load(filePath, ...)
>>         listOfDF.append(df)
>>
>>     list2OfDF = []
>>     for df in listOfDF:
>>         df2 = df.select(...)
>>         list2OfDF.append(df2)
>>
>>     # will setting the list to None free the cache?
>>     # or just driver memory?
>>     listOfDF = None
>>
>>     df3 = list2OfDF[0]
>>     for i in range(1, len(list2OfDF)):
>>         df = list2OfDF[i]
>>         df3 = df3.join(df, ...)
>>
>>     # will setting the list to None free the cache?
>>     # or just driver memory?
>>     list2OfDF = None
>>
>>     # lots of narrow transformations on df3
>>
>>     return df3
>>
>> def main():
>>     df = run()
>>     df.write()
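The action-plus-log probe Andy proposes can be sketched without a Spark cluster. This is a minimal illustration, not his actual driver: the dict-based `join_on_key` below is a hypothetical stand-in for PySpark's `DataFrame.join`, and the WARN-level count after each step mirrors the proposed `logger.warn(df.count())` probe. In a real PySpark driver the logged action would be `df3.count()`, which forces evaluation of the lineage so a hang or failure is pinned to a specific join step (and which re-runs the plan, so it is for debugging only).

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s")
log = logging.getLogger("join_debug")
log.setLevel(logging.WARNING)

def join_on_key(left, right):
    """Inner-join two {row_key: row_dict} tables on their shared keys.

    Hypothetical stand-in for df_left.join(df_right, on=key, how="inner").
    """
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}

def join_all_with_probes(tables):
    """Fold the tables into one accumulated join, logging after each step.

    In a Spark driver the log line would follow an action such as
    acc.count(); forcing evaluation here is what localizes the failure.
    """
    acc = tables[0]
    for i, t in enumerate(tables[1:], start=1):
        acc = join_on_key(acc, t)
        # the action-plus-log probe from the email, expressed locally
        log.warning("DEBUG after join %d: count=%d", i, len(acc))
    return acc

tables = [
    {1: {"a": 10}, 2: {"a": 20}},
    {1: {"b": 30}, 2: {"b": 40}},
    {1: {"c": 50}, 3: {"c": 60}},
]
result = join_all_with_probes(tables)
# rows 2 and 3 drop out of the inner joins; only key 1 survives all three
```

On the "will setting the list to None free cache?" comments in the pseudo code: rebinding `listOfDF = None` only drops the driver-side Python references. If the DataFrames were explicitly cached with `df.cache()` or `df.persist()`, it is `df.unpersist()` that asks the executors to release that storage.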