Hello Andy,

Are you sure you want to perform lots of join operations, and not simple unions? Are you doing inner joins or outer joins? Can you give us a rough idea of the length of your list and the size of each individual dataset?

Having a look at the execution plan would help. The large number of join operations may make the execution plan too complicated in the end; checkpointing could help there.
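
A minimal sketch of the checkpointing idea, assuming the list holds PySpark DataFrames that share a join key. The function name `join_all` and the parameters `on`, `how` and `checkpoint_every` are made up for illustration; only `DataFrame.join()` and `DataFrame.checkpoint()` are called on the frames.

```python
import logging

logger = logging.getLogger(__name__)


def join_all(frames, on, how="inner", checkpoint_every=10):
    """Join a list of DataFrames left to right, cutting the lineage periodically.

    join_all, on, how and checkpoint_every are hypothetical names for this
    sketch. The frames are assumed to be PySpark DataFrames.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    result = frames[0]
    for i, df in enumerate(frames[1:], start=1):
        result = result.join(df, on=on, how=how)
        if checkpoint_every and i % checkpoint_every == 0:
            # checkpoint() is eager by default: it runs the plan built so far,
            # saves the result and returns a DataFrame with a truncated plan.
            result = result.checkpoint()
            # Because the checkpoint is an action, this message is only
            # logged once the first i joins have actually been computed.
            logger.warning("checkpointed after %d joins", i)
    return result
```

A checkpoint directory has to be set first, for example with `spark.sparkContext.setCheckpointDir(...)`, and calling `explain()` on the result prints the plan so you can see how large it still is. Since each checkpoint triggers a job, the log lines also show how far the driver got before failing.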
Cheers,
David

On Thu, Dec 30, 2021 at 4:56 PM Andrew Davidson <aedav...@ucsc.edu.invalid> wrote:

> Hi Gourav
>
> I will give Databricks a try.
>
> Each data set gets loaded into a data frame.
> I select one column from the data frame.
> I join the column to the accumulated joins from previous data frames in
> the list.
>
> To debug, I think I am going to put an action and a log statement after each
> join. I do not think it will change the performance. I believe the physical
> plan will be the same; however, hopefully it will shed some light.
>
> At the very least I will know whether it is making progress or not, and
> hopefully where it is breaking.
>
> Happy new year
>
> Andy
>
> On Tue, Dec 28, 2021 at 4:19 AM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi Andrew,
>>
>> Any chance you might give Databricks a try in GCP?
>>
>> The above transformations look complicated to me. Why are you adding
>> dataframes to a list?
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson <aedav...@ucsc.edu.invalid>
>> wrote:
>>
>>> Hi
>>>
>>> I am having trouble debugging my driver. It runs correctly on smaller
>>> data sets but fails on large ones. It is very hard to figure out what the
>>> bug is. I suspect it may have something to do with the way Spark is installed
>>> and configured. I am using Google Cloud Platform Dataproc with PySpark.
>>>
>>> The log messages are not helpful. The error message will be something
>>> like
>>>
>>> "User application exited with status 1"
>>>
>>> and
>>>
>>> jsonPayload: {
>>>   class: "server.TThreadPoolServer"
>>>   filename: "hive-server2.log"
>>>   message: "Error occurred during processing of message."
>>>   thread: "HiveServer2-Handler-Pool: Thread-40"
>>> }
>>>
>>> I am able to access the Spark history server; however, it does not capture
>>> anything if the driver crashes. I am unable to figure out how to access
>>> the Spark web UI.
>>>
>>> My driver program looks something like the pseudocode below: a long
>>> list of transforms with a single action (i.e. write) at the end. Adding
>>> log messages is not helpful because of lazy evaluation. I am tempted to
>>> add something like
>>>
>>> Logger.warn( "DEBUG df.count():{}".format( df.count() ) )
>>>
>>> to try and inline some sort of diagnostic message.
>>>
>>> What do you think?
>>>
>>> Is there a better way to debug this?
>>>
>>> Kind regards
>>>
>>> Andy
>>>
>>> def run():
>>>     listOfDF = []
>>>     for filePath in listOfFiles:
>>>         df = spark.read.load( filePath, ...)
>>>         listOfDF.append(df)
>>>
>>>     list2OfDF = []
>>>     for df in listOfDF:
>>>         df2 = df.select( .... )
>>>         list2OfDF.append( df2 )
>>>
>>>     # will setting the list to None free cache?
>>>     # or just driver memory
>>>     listOfDF = None
>>>
>>>     df3 = list2OfDF[0]
>>>
>>>     for i in range( 1, len(list2OfDF) ):
>>>         df = list2OfDF[i]
>>>         df3 = df3.join(df ...)
>>>
>>>     # will setting the list to None free cache?
>>>     # or just driver memory
>>>     list2OfDF = None
>>>
>>>     lots of narrow transformations on df3
>>>
>>>     return df3
>>>
>>> def main():
>>>     df = run()
>>>     df.write()