Hi,

I agree with Gourav that just measuring execution time is a simplistic approach 
that may lead you to miss important details, in particular when running 
distributed computations.

The Spark Web UI, REST API, and metrics instrumentation can be quite useful for 
drilling down further. See https://spark.apache.org/docs/latest/monitoring.html
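For instance, the REST API exposes per-stage timings as JSON. A minimal sketch, 
assuming the driver UI is at the default localhost:4040 and the requests 
package is available:

import requests

base = "http://localhost:4040/api/v1"
# First (most recent) application known to this driver
app_id = requests.get(f"{base}/applications").json()[0]["id"]

# Per-stage executor run time (ms), aggregated across tasks
for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
    print(stage["stageId"], stage["name"], stage["executorRunTime"])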

You can also have a look at this tool, which automates collecting and 
aggregating executor task metrics: 
https://github.com/LucaCanali/sparkMeasure
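A minimal usage sketch from PySpark, adapted from the project's README (you 
also need the matching spark-measure package on the driver classpath, e.g. via 
--packages; see the repo for the exact coordinates for your Spark version):

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)  # 'spark' is your existing SparkSession

stagemetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stagemetrics.end()

# Prints aggregated task metrics: elapsed time, executor run and CPU time,
# shuffle and I/O metrics, GC time, and so on
stagemetrics.print_report()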

Best,

Luca

From: Gourav Sengupta <gourav.sengu...@gmail.com> 
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user <user@spark.apache.org>
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make any sense at all in distributed 
computation. Simply comparing an operation in the RDD and DataFrame APIs based 
on their start and stop times may not provide any valid information.

You will have to look into the details of the timing and the individual steps. 
For example, please look at the Spark UI to see how timings are calculated in 
distributed computing mode; there are several well-written papers on this.
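As a minimal PySpark sketch of the pitfall: transformations are lazy, so 
timing one by its start and stop time measures almost nothing, and only the 
action triggers the actual distributed work.

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10_000_000)

t0 = time.perf_counter()
doubled = df.selectExpr("id * 2 AS id2")  # transformation: lazy, returns at once
t1 = time.perf_counter()
doubled.count()                           # action: runs the real distributed job
t2 = time.perf_counter()

print(f"transformation: {t1 - t0:.4f}s, action: {t2 - t1:.4f}s")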

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM <bit...@bitfox.top> wrote:

hello community,

In PySpark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the DataFrame 
API, as in this blog post of mine:
https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time(), but it doesn't work.
Thank you.
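[Note: spark.time() is defined only on the Scala SparkSession; PySpark has no 
equivalent. The usual fallback is a plain wall-clock wrapper; a sketch, where 
timed() is a hypothetical helper and 'spark' is an existing SparkSession:]

import time

def timed(label, thunk):
    # Hypothetical helper: run a zero-argument callable, print wall-clock time
    start = time.perf_counter()
    result = thunk()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result

# Compare an RDD count with a DataFrame count over the same data size
timed("RDD count", lambda: spark.sparkContext.parallelize(range(10**6)).count())
timed("DataFrame count", lambda: spark.range(10**6).count())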

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
