You probably did not install it on your cluster, nor include the Python package with your app.
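A quick way to check is from inside the pyspark shell itself: the --packages flag only pulls the sparkMeasure jar from Maven, while the Python wrapper still has to be importable by the interpreter the driver runs. A minimal sketch (standard library only; the python2.7 path in the comment comes from the pip output quoted further down in this thread):

    import sys
    print(sys.executable)   # pip must target this interpreter

    try:
        import sparkmeasure
        print("sparkmeasure importable from", sparkmeasure.__file__)
    except ModuleNotFoundError:
        # A copy under /usr/local/lib/python2.7/dist-packages belongs to Python 2.7,
        # so a Python 3 driver will not see it. Install it for this interpreter
        # (e.g. python3 -m pip install sparkmeasure) or ship it with --py-files.
        print("sparkmeasure not importable by", sys.executable)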
On Fri, Dec 24, 2021, 4:35 AM <bit...@bitfox.top> wrote:

> But I already installed it:
>
>   Requirement already satisfied: sparkmeasure in /usr/local/lib/python2.7/dist-packages
>
> So how? Thank you.
>
> On 2021-12-24 18:15, Hollis wrote:
> > Hi bitfox,
> >
> > You need to pip install sparkmeasure first; then it can be launched in pyspark.
> >
> > >>> from sparkmeasure import StageMetrics
> > >>> stagemetrics = StageMetrics(spark)
> > >>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from range(1000) cross join range(1000) cross join range(100)").show()')
> > +---------+
> > | count(1)|
> > +---------+
> > |100000000|
> > +---------+
> >
> > Regards,
> > Hollis
> >
> > At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
> >> Hello list,
> >>
> >> I run with Spark 3.2.0.
> >>
> >> After I started pyspark with:
> >>   $ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
> >>
> >> I can't load the module sparkmeasure:
> >>
> >> >>> from sparkmeasure import StageMetrics
> >> Traceback (most recent call last):
> >>   File "<stdin>", line 1, in <module>
> >> ModuleNotFoundError: No module named 'sparkmeasure'
> >>
> >> Do you know why? @Luca thanks.
> >>
> >> On 2021-12-24 04:20, bit...@bitfox.top wrote:
> >>> Thanks Gourav and Luca. I will try the tools you pointed to on GitHub.
> >>>
> >>> On 2021-12-23 23:40, Luca Canali wrote:
> >>>> Hi,
> >>>>
> >>>> I agree with Gourav that just measuring execution time is a simplistic approach that may lead you to miss important details, in particular when running distributed computations.
> >>>>
> >>>> The WebUI, REST API, and metrics instrumentation in Spark can be quite useful for further drill-down. See https://spark.apache.org/docs/latest/monitoring.html
> >>>>
> >>>> You can also have a look at this tool, which automates collecting and aggregating some executor task metrics: https://github.com/LucaCanali/sparkMeasure
> >>>>
> >>>> Best,
> >>>> Luca
> >>>>
> >>>> From: Gourav Sengupta <gourav.sengu...@gmail.com>
> >>>> Sent: Thursday, December 23, 2021 14:23
> >>>> To: bit...@bitfox.top
> >>>> Cc: user <user@spark.apache.org>
> >>>> Subject: Re: measure running time
> >>>>
> >>>> Hi,
> >>>>
> >>>> I do not think that such time comparisons make any sense at all in distributed computation. Just saying that an operation on an RDD and a DataFrame can be compared based on their start and stop times may not provide any valid information.
> >>>>
> >>>> You will have to look into the details of the timing and the steps. For example, please look at the Spark UI to see how timings are calculated in distributed computing mode; there are several well-written papers on this.
> >>>>
> >>>> Thanks and Regards,
> >>>> Gourav Sengupta
> >>>>
> >>>> On Thu, Dec 23, 2021 at 10:57 AM <bit...@bitfox.top> wrote:
> >>>>
> >>>>> Hello community,
> >>>>>
> >>>>> In pyspark, how can I measure the running time of a command? I just want to compare the running time of the RDD API and the DataFrame API, as in this blog post of mine:
> >>>>> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
> >>>>>
> >>>>> I tried spark.time() but it doesn't work.
> >>>>> Thank you.
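As for the original timing question: as far as I can tell, spark.time() is a helper on the Scala SparkSession and is not exposed in PySpark. A rough, driver-side alternative is plain wall-clock timing with the standard library. A minimal sketch, keeping in mind Gourav's and Luca's caveat that this is simplistic, and that transformations are lazy, so you must time an action (count, show, collect):

    import time

    start = time.perf_counter()
    spark.sql("select count(*) from range(1000) cross join range(1000)").show()
    print(f"elapsed: {time.perf_counter() - start:.3f} s")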
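And following up on Luca's pointer to the monitoring REST API, a small sketch for pulling per-stage metrics (assuming a locally running application with the UI on the default port 4040; host, port, and application id will differ on a real cluster):

    import json
    from urllib.request import urlopen

    base = "http://localhost:4040/api/v1"
    apps = json.load(urlopen(f"{base}/applications"))
    app_id = apps[0]["id"]

    # Per-stage data for this application; executorRunTime is in milliseconds.
    for stage in json.load(urlopen(f"{base}/applications/{app_id}/stages")):
        print(stage["stageId"], stage["name"], stage["executorRunTime"])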