Hello Andy,

Are you sure you want to perform lots of join operations rather than simple
unions?
Are you doing inner joins or outer joins?
Can you give us a rough idea of your list size and the size of each
individual dataset?
Having a look at the execution plan would help; maybe the high number of join
operations makes the execution plan too complicated at the end of the day.
Checkpointing could help there.
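
Something along these lines might be worth a try (just a rough sketch; df3
stands in for your accumulated join result, and the checkpoint directory path
is a placeholder):

    # Inspect the plan Spark has built up so far
    df3.explain(True)

    # Configure a checkpoint directory once, up front (placeholder path)
    spark.sparkContext.setCheckpointDir("gs://<your-bucket>/checkpoints")

    # Calling checkpoint() every few joins truncates the lineage,
    # which keeps the plan from growing without bound
    df3 = df3.checkpoint()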

Cheers,
David


On Thu, Dec 30, 2021 at 4:56 PM Andrew Davidson <aedav...@ucsc.edu.invalid>
wrote:

> Hi Gourav
>
> I will give databricks a try.
>
> Each dataset gets loaded into a data frame.
> I select one column from the data frame.
> I join that column to the accumulated joins from the previous data frames in
> the list.
>
> To debug, I think I am going to put an action and a log statement after each
> join. I do not think it will change the performance; I believe the physical
> plan will be the same, but hopefully it will shed some light.
>
> At the very least I will know whether it is making progress or not, and
> hopefully where it is breaking.
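>
> Something like this is what I have in mind (a rough sketch; the logger
> setup, join key and join type are placeholders):
>
>     import logging
>
>     logging.basicConfig(level=logging.INFO)
>     logger = logging.getLogger(__name__)
>
>     df3 = list2OfDF[0]
>     for i in range(1, len(list2OfDF)):
>         # join key and join type are placeholders
>         df3 = df3.join(list2OfDF[i], on="name", how="outer")
>         # count() forces an action so I can see how far the job gets
>         logger.info("join %d done, row count = %d", i, df3.count())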
>
> Happy new year
>
> Andy
>
> On Tue, Dec 28, 2021 at 4:19 AM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi Andrew,
>>
>> Any chance you might give Databricks a try in GCP?
>>
>> The above transformations look complicated to me, why are you adding
>> dataframes to a list?
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>> On Sun, Dec 26, 2021 at 7:00 PM Andrew Davidson <aedav...@ucsc.edu.invalid>
>> wrote:
>>
>>> Hi
>>>
>>>
>>>
>>> I am having trouble debugging my driver. It runs correctly on smaller
>>> data sets but fails on large ones. It is very hard to figure out what the
>>> bug is. I suspect it may have something to do with the way Spark is
>>> installed and configured. I am using PySpark on Google Cloud Platform
>>> Dataproc.
>>>
>>>
>>>
>>> The log messages are not helpful. The error message will be something
>>> like
>>> "User application exited with status 1"
>>>
>>>
>>>
>>> And
>>>
>>>
>>>
>>> jsonPayload: {
>>>   class: "server.TThreadPoolServer"
>>>   filename: "hive-server2.log"
>>>   message: "Error occurred during processing of message."
>>>   thread: "HiveServer2-Handler-Pool: Thread-40"
>>> }
>>>
>>>
>>>
>>> I am able to access the Spark history server; however, it does not capture
>>> anything if the driver crashes. I am unable to figure out how to access
>>> the Spark web UI.
>>>
>>>
>>>
>>> My driver program looks something like the pseudo code below: a long
>>> list of transformations with a single action (i.e. write) at the end.
>>> Adding log messages is not helpful because of lazy evaluation. I am
>>> tempted to add something like
>>>
>>>
>>>
>>> Logger.warn( "DEBUG df.count():{}".format( df.count() ) ) to try and
>>> inline some sort of diagnostic message.
>>>
>>>
>>>
>>> What do you think?
>>>
>>>
>>>
>>> Is there a better way to debug this?
>>>
>>>
>>>
>>> Kind regards
>>>
>>>
>>>
>>> Andy
>>>
>>>
>>>
>>> def run():
>>>
>>>     listOfDF = []
>>>
>>>     for filePath in listOfFiles:
>>>
>>>         df = spark.read.load( filePath, ...)
>>>
>>>         listOfDF.append(df)
>>>
>>>
>>>
>>>
>>>
>>>     list2OfDF = []
>>>
>>>     for df in listOfDF:
>>>
>>>         df2 = df.select( .... )
>>>
>>>         list2OfDF.append( df2 )
>>>
>>>
>>>
>>>     # will setting list to None free cache?
>>>
>>>     # or just driver memory
>>>
>>>     listOfDF = None
>>>
>>>
>>>
>>>
>>>
>>>     df3 = list2OfDF[0]
>>>
>>>
>>>
>>>     for i in range( 1, len(list2OfDF) ):
>>>
>>>         df = list2OfDF[i]
>>>
>>>         df3 = df3.join(df ...)
>>>
>>>
>>>
>>>     # will setting to list to None free cache?
>>>
>>>     # or just driver memory
>>>
>>>     list2OfDF = None
>>>
>>>
>>>
>>>
>>>
>>>     # ... lots of narrow transformations on df3 ...
>>>
>>>
>>>
>>>     return df3
>>>
>>>
>>>
>>> def main() :
>>>
>>>     df = run()
>>>
>>>     df.write.save( ... )
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
