Hi bitfox,

You need to pip install sparkmeasure first; the --packages option only pulls in the jar for the JVM side, while the Python module has to be installed separately (which is why you see ModuleNotFoundError). Then you can launch pyspark and use it:
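For example (reusing the --packages coordinates from your own command; just a sketch, not the only way to set this up):

$ pip install sparkmeasure
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17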


>>> from sparkmeasure import StageMetrics
>>> stagemetrics = StageMetrics(spark)
>>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+



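After the run you can also print the aggregated stage metrics; a minimal sketch, assuming the begin()/end()/print_report() methods described in the sparkMeasure README:

>>> # print the aggregated stage metrics collected by runandmeasure
>>> stagemetrics.print_report()
>>>
>>> # equivalent explicit instrumentation of a block of code
>>> stagemetrics.begin()
>>> spark.sql("select count(*) from range(1000) cross join range(1000)").show()
>>> stagemetrics.end()
>>> stagemetrics.print_report()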


Regards,
Hollis






At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
>Hello list,
>
>I run with Spark 3.2.0
>
>After I started pyspark with:
>$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
>
>I can't import from the sparkmeasure module:
>
>>>> from sparkmeasure import StageMetrics
>Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>ModuleNotFoundError: No module named 'sparkmeasure'
>
>Do you know why? @Luca thanks.
>
>
>On 2021-12-24 04:20, bit...@bitfox.top wrote:
>> Thanks Gourav and Luca. I will try the tools you provided on GitHub.
>> 
>> On 2021-12-23 23:40, Luca Canali wrote:
>>> Hi,
>>> 
>>> I agree with Gourav that just measuring execution time is a simplistic
>>> approach that may lead you to miss important details, in particular
>>> when running distributed computations.
>>> 
>>> WebUI, REST API, and metrics instrumentation in Spark can be quite
>>> useful for further drill down. See
>>> https://spark.apache.org/docs/latest/monitoring.html
>>> 
>>> You can also have a look at this tool that takes care of automating
>>> collecting and aggregating some executor task metrics:
>>> https://github.com/LucaCanali/sparkMeasure
>>> 
>>> Best,
>>> 
>>> Luca
>>> 
>>> From: Gourav Sengupta <gourav.sengu...@gmail.com>
>>> Sent: Thursday, December 23, 2021 14:23
>>> To: bit...@bitfox.top
>>> Cc: user <user@spark.apache.org>
>>> Subject: Re: measure running time
>>> 
>>> Hi,
>>> 
>>> I do not think that such time comparisons make any sense at all in
>>> distributed computation. Just saying that an operation in RDD and
>>> Dataframe can be compared based on their start and stop time may not
>>> provide any valid information.
>>> 
>>> You will have to look into the details of the timing and the steps. For
>>> example, please look at the Spark UI to see how timings are calculated
>>> in distributed computing mode; there are several well-written papers
>>> on this.
>>> 
>>> Thanks and Regards,
>>> 
>>> Gourav Sengupta
>>> 
>>> On Thu, Dec 23, 2021 at 10:57 AM <bit...@bitfox.top> wrote:
>>> 
>>>> hello community,
>>>> 
>>>> In pyspark, how can I measure the running time of a command?
>>>> I just want to compare the running time of the RDD API and the DataFrame
>>>> API, as in this blog post of mine:
>>>> 
>>>> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>>>> 
>>>> I tried spark.time() and it doesn't work.
>>>> Thank you.
>>>> 
>>>> 
