Hi bitfox,
you need to pip install sparkmeasure first, then you can launch it in pyspark:

>>> from sparkmeasure import StageMetrics
>>> stagemetrics = StageMetrics(spark)
>>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
+---------+
| count(1)|
+---------+
|100000000|
+---------+

A slightly longer sketch, with plain wall-clock timing of the RDD API vs the DataFrame API, is appended at the bottom of this mail, after the quoted thread.

Regards,
Hollis

At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
>Hello list,
>
>I run with Spark 3.2.0
>
>After I started pyspark with:
>$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
>
>I can't load from the module sparkmeasure:
>
>>>> from sparkmeasure import StageMetrics
>Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>ModuleNotFoundError: No module named 'sparkmeasure'
>
>Do you know why? @Luca thanks.
>
>
>On 2021-12-24 04:20, bit...@bitfox.top wrote:
>> Thanks Gourav and Luca. I will try with the tools you provide in the
>> Github.
>>
>> On 2021-12-23 23:40, Luca Canali wrote:
>>> Hi,
>>>
>>> I agree with Gourav that just measuring execution time is a simplistic
>>> approach that may lead you to miss important details, in particular
>>> when running distributed computations.
>>>
>>> WebUI, REST API, and metrics instrumentation in Spark can be quite
>>> useful for further drill down. See
>>> https://spark.apache.org/docs/latest/monitoring.html
>>>
>>> You can also have a look at this tool that takes care of automating
>>> collecting and aggregating some executor task metrics:
>>> https://github.com/LucaCanali/sparkMeasure
>>>
>>> Best,
>>>
>>> Luca
>>>
>>> From: Gourav Sengupta <gourav.sengu...@gmail.com>
>>> Sent: Thursday, December 23, 2021 14:23
>>> To: bit...@bitfox.top
>>> Cc: user <user@spark.apache.org>
>>> Subject: Re: measure running time
>>>
>>> Hi,
>>>
>>> I do not think that such time comparisons make any sense at all in
>>> distributed computation. Just saying that an operation in RDD and
>>> Dataframe can be compared based on their start and stop time may not
>>> provide any valid information.
>>>
>>> You will have to look into the details of timing and the steps. For
>>> example, please look at the SPARK UI to see how timings are calculated
>>> in distributed computing mode, there are several well written papers
>>> on this.
>>>
>>> Thanks and Regards,
>>>
>>> Gourav Sengupta
>>>
>>> On Thu, Dec 23, 2021 at 10:57 AM <bit...@bitfox.top> wrote:
>>>
>>>> hello community,
>>>>
>>>> In pyspark how can I measure the running time to the command?
>>>> I just want to compare the running time of the RDD API and dataframe
>>>> API, in my this blog:
>>>>
>>>> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
>>>>
>>>> I tried spark.time() it doesn't work.
>>>> Thank you.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>---------------------------------------------------------------------
>To unsubscribe e-mail: user-unsubscr...@spark.apache.org
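P.S. Here is a minimal, untested sketch expanding on the example above, in case it helps with the original question of timing the RDD API against the DataFrame API. It assumes a running pyspark shell (so a SparkSession named `spark` already exists) and that `pip install sparkmeasure` has been done. The begin()/end()/print_report() calls should be the step-by-step form of runandmeasure() from the sparkMeasure docs, and the time.time() part is just plain wall-clock timing, since spark.time() is, as far as I know, only available in the Scala API. As Gourav and Luca point out, wall-clock numbers alone can be misleading for distributed jobs, so treat them as a rough comparison only.

import time
from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)   # `spark` is the active SparkSession

# Step-by-step form of runandmeasure(): collect stage metrics around one job.
stagemetrics.begin()
spark.sql("select count(*) from range(1000) cross join range(1000)").show()
stagemetrics.end()
stagemetrics.print_report()          # aggregated stage metrics, incl. elapsed time

# Plain wall-clock timing of a similar count, RDD API vs DataFrame API.
t0 = time.time()
spark.sparkContext.parallelize(range(1000000)).map(lambda x: x % 10).countByValue()
print("RDD API:       %.2f s" % (time.time() - t0))

t0 = time.time()
spark.range(1000000).selectExpr("id % 10 as k").groupBy("k").count().collect()
print("DataFrame API: %.2f s" % (time.time() - t0))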