But I already installed it:

Requirement already satisfied: sparkmeasure in /usr/local/lib/python2.7/dist-packages

So why can't it be imported? Thank you.
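
One thing I can check, in case the python2.7 path above means pip
installed it for a different interpreter than the one pyspark runs on
(just a guess on my side):

# inside the pyspark shell: which Python is the driver using?
import sys
print(sys.executable)

# sparkmeasure has to be installed for that same interpreter, e.g.
# (hypothetical path, adjust to whatever the line above prints):
#   /usr/bin/python3 -m pip install sparkmeasure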

On 2021-12-24 18:15, Hollis wrote:
Hi bitfox,

you need to pip install sparkmeasure first. Then you can launch it in pyspark.

from sparkmeasure import StageMetrics
stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(locals(),
    'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+
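
If I remember the sparkmeasure Python API correctly (please double-check
against the project README), you can also wrap the code explicitly
instead of passing it as a string, roughly like this:

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)

stagemetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()
stagemetrics.end()

# aggregated stage metrics: elapsed time, executor run time, shuffle, etc.
stagemetrics.print_report()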

Regards,
Hollis

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
Hello list,

I run with Spark 3.2.0

After I started pyspark with:
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

I can't import the sparkmeasure module:

from sparkmeasure import StageMetrics
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'sparkmeasure'

Do you know why? @Luca thanks.


On 2021-12-24 04:20, bit...@bitfox.top wrote:
Thanks Gourav and Luca. I will try the tools you pointed to on GitHub.

On 2021-12-23 23:40, Luca Canali wrote:
Hi,

I agree with Gourav that just measuring execution time is a simplistic
approach that may lead you to miss important details, in particular when
running distributed computations.

The WebUI, REST API, and metrics instrumentation in Spark can be quite
useful for further drill-down. See
https://spark.apache.org/docs/latest/monitoring.html
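
For example, a rough sketch of pulling per-stage timings from the REST
API (just an illustration, assuming the default application UI port 4040
and the requests library; see the docs page above for the exact
endpoints):

import requests

# the monitoring REST API is served by the running application's UI
base = "http://localhost:4040/api/v1"

apps = requests.get(base + "/applications").json()
app_id = apps[0]["id"]

# per-stage metrics, e.g. executor run time in milliseconds
for stage in requests.get(base + "/applications/" + app_id + "/stages").json():
    print(stage["stageId"], stage["status"], stage["executorRunTime"], stage["name"])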

You can also have a look at this tool, which automates collecting and
aggregating executor task metrics:
https://github.com/LucaCanali/sparkMeasure

Best,

Luca

From: Gourav Sengupta <gourav.sengu...@gmail.com>
Sent: Thursday, December 23, 2021 14:23
To: bit...@bitfox.top
Cc: user <user@spark.apache.org>
Subject: Re: measure running time

Hi,

I do not think that such time comparisons make much sense in distributed
computation. Simply comparing an RDD operation and a DataFrame operation
by their start and stop times may not provide any valid information.

You will have to look into the details of the timings and the individual
steps. For example, look at the Spark UI to see how timings are calculated
in distributed mode; there are several well-written papers on this.

Thanks and Regards,

Gourav Sengupta

On Thu, Dec 23, 2021 at 10:57 AM <bit...@bitfox.top> wrote:

Hello community,

In pyspark, how can I measure the running time of a command?
I just want to compare the running time of the RDD API and the
DataFrame API, as in this blog post of mine:

https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/

I tried spark.time(), but it doesn't work.
Thank you.
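
As far as I can tell, spark.time is only in the Scala SparkSession API and
is not exposed in pyspark, so a crude workaround is plain wall-clock timing
in Python, e.g.:

import time

start = time.time()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
# wall-clock time of the whole action, measured on the driver
print("elapsed: %.3f s" % (time.time() - start))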



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

