Hi Gourav,

I will give Databricks a try.
Each dataset gets loaded into a data frame. I select one column from the data frame, then join that column to the accumulated joins from the previous data frames in the list.

To debug, I think I am going to put an action and a log statement after each join. I do not think it will change the performance; I believe the physical plan will be the same. However, hopefully it will shed some light. At the very least I will know whether it is making progress, and hopefully where it is breaking.

Happy new year

Andy

On Tue, Dec 28, 2021 at 4:19 AM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi Andrew,
>
> Any chance you might give Databricks a try in GCP?
>
> The above transformations look complicated to me. Why are you adding
> dataframes to a list?
>
> Regards,
> Gourav Sengupta
>
> On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson <aedav...@ucsc.edu.invalid> wrote:
>
>> Hi
>>
>> I am having trouble debugging my driver. It runs correctly on smaller
>> data sets but fails on large ones. It is very hard to figure out what the
>> bug is. I suspect it may have something to do with the way Spark is
>> installed and configured. I am using Google Cloud Platform Dataproc
>> PySpark.
>>
>> The log messages are not helpful. The error message will be something
>> like "User application exited with status 1"
>>
>> And
>>
>> jsonPayload: {
>>   class: "server.TThreadPoolServer"
>>   filename: "hive-server2.log"
>>   message: "Error occurred during processing of message."
>>   thread: "HiveServer2-Handler-Pool: Thread-40"
>> }
>>
>> I am able to access the Spark history server; however, it does not
>> capture anything if the driver crashes. I am unable to figure out how to
>> access the Spark web UI.
>>
>> My driver program looks something like the pseudo code below: a long
>> list of transforms with a single action (i.e. write) at the end. Adding
>> log messages is not helpful because of lazy evaluation.
>> I am tempted to add something like
>>
>> logger.warn("DEBUG df.count(): {}".format(df.count()))
>>
>> to try and inline some sort of diagnostic message.
>>
>> What do you think?
>>
>> Is there a better way to debug this?
>>
>> Kind regards
>>
>> Andy
>>
>> def run():
>>     listOfDF = []
>>     for filePath in listOfFiles:
>>         df = spark.read.load(filePath, ...)
>>         listOfDF.append(df)
>>
>>     list2OfDF = []
>>     for df in listOfDF:
>>         df2 = df.select(...)
>>         list2OfDF.append(df2)
>>
>>     # will setting the list to None free the cache?
>>     # or just driver memory?
>>     listOfDF = None
>>
>>     df3 = list2OfDF[0]
>>     for i in range(1, len(list2OfDF)):
>>         df = list2OfDF[i]
>>         df3 = df3.join(df, ...)
>>
>>     # will setting the list to None free the cache?
>>     # or just driver memory?
>>     list2OfDF = None
>>
>>     # lots of narrow transformations on df3
>>
>>     return df3
>>
>> def main():
>>     df = run()
>>     df.write()
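The action-plus-log probe Andy proposes can be sketched without a Spark cluster. This is a minimal illustration, not his actual driver: the dict-based `join_on_key` below is a hypothetical stand-in for PySpark's `DataFrame.join`, and the WARN-level count after each step mirrors the proposed `logger.warn(df.count())` probe. In a real PySpark driver the logged action would be `df3.count()`, which forces evaluation of the lineage so a hang or failure is pinned to a specific join step (and which re-runs the plan, so it is for debugging only).

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s")
log = logging.getLogger("join_debug")
log.setLevel(logging.WARNING)

def join_on_key(left, right):
    """Inner-join two {row_key: row_dict} tables on their shared keys.

    Hypothetical stand-in for df_left.join(df_right, on=key, how="inner").
    """
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}

def join_all_with_probes(tables):
    """Fold the tables into one accumulated join, logging after each step.

    In a Spark driver the log line would follow an action such as
    acc.count(); forcing evaluation here is what localizes the failure.
    """
    acc = tables[0]
    for i, t in enumerate(tables[1:], start=1):
        acc = join_on_key(acc, t)
        # the action-plus-log probe from the email, expressed locally
        log.warning("DEBUG after join %d: count=%d", i, len(acc))
    return acc

tables = [
    {1: {"a": 10}, 2: {"a": 20}},
    {1: {"b": 30}, 2: {"b": 40}},
    {1: {"c": 50}, 3: {"c": 60}},
]
result = join_all_with_probes(tables)
# rows 2 and 3 drop out of the inner joins; only key 1 survives all three
```

On the "will setting the list to None free cache?" comments in the pseudo code: rebinding `listOfDF = None` only drops the driver-side Python references. If the DataFrames were explicitly cached with `df.cache()` or `df.persist()`, it is `df.unpersist()` that asks the executors to release that storage.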