Re: Joining many tables Re: Pyspark debugging best practices

2022-01-03 Thread Sonal Goyal
> # This should not change the physical plan, i.e. we still have the same
> # number of shuffles, which results in the same number of stages. We are
> # just not building up a plan with thousands of stages.

Joining many tables Re: Pyspark debugging best practices

2022-01-03 Thread Andrew Davidson
# rawCountsSDF.explain()
self.logger.info("END\n")
return retNumReadsDF

From: David Diebold
Date: Monday, January 3, 2022 at 12:39 AM
To: Andrew Davidson, "user @spark"
Subject: Re: Pyspark debugging best practices

Hello Andy, Are you sure you wa

Re: Pyspark debugging best practices

2022-01-03 Thread David Diebold
Hello Andy, Are you sure you want to perform lots of join operations, and not simple unions? Are you doing inner joins or outer joins? Can you give us a rough idea of your list size and the size of each individual dataset? Having a look at the execution plan would help; maybe the high amount of

Re: Pyspark debugging best practices

2021-12-30 Thread Andrew Davidson
Hi Gourav, I will give Databricks a try. Each dataset gets loaded into a data frame. I select one column from the data frame and join that column to the accumulated joins from the previous data frames in the list. To debug, I think I am going to put an action and a log statement after each join. I do not

Re: Pyspark debugging best practices

2021-12-28 Thread Gourav Sengupta
Hi Andrew, Any chance you might give Databricks a try in GCP? The above transformations look complicated to me; why are you adding dataframes to a list? Regards, Gourav Sengupta On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson wrote: > Hi > I am having trouble debugging my driver. It

Pyspark debugging best practices

2021-12-26 Thread Andrew Davidson
Hi, I am having trouble debugging my driver. It runs correctly on smaller data sets but fails on large ones. It is very hard to figure out what the bug is. I suspect it may have something to do with the way Spark is installed and configured. I am using Google Cloud Platform Dataproc PySpark. The