You probably did not install it on your cluster, nor did you include the Python
package with your app.
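
Note that the pip output you quoted shows sparkmeasure under
/usr/local/lib/python2.7/dist-packages, while Spark 3.2 runs on Python 3, so the
interpreter PySpark actually uses may not see that install at all. A minimal
sketch of one way to line things up (paths and versions here are assumptions,
adjust for your environment):

$ python3 -m pip install sparkmeasure        # same interpreter PySpark uses
$ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17

>>> from sparkmeasure import StageMetrics    # should now import cleanly

On a cluster the Python package also has to be visible to the executors, e.g.
installed on every node or shipped with --py-files / a packed environment.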

On Fri, Dec 24, 2021, 4:35 AM <bit...@bitfox.top> wrote:

> but I already installed it:
>
> Requirement already satisfied: sparkmeasure in
> /usr/local/lib/python2.7/dist-packages
>
> so why is it still not found? Thank you.
>
> On 2021-12-24 18:15, Hollis wrote:
> > Hi bitfox,
> >
> > you need to pip install sparkmeasure first. Then you can launch it in pyspark.
> >
> >>>> from sparkmeasure import StageMetrics
> >>>> stagemetrics = StageMetrics(spark)
> >>>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*)
> >>>>     from range(1000) cross join range(1000) cross join
> >>>>     range(100)").show()')
> > +---------+
> > | count(1)|
> > +---------+
> > |100000000|
> > +---------+
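> >
> > If you prefer not to pass the code as a string, sparkmeasure also exposes
> > an explicit begin/end style, roughly like the sketch below (method names
> > from memory, please double check the sparkMeasure docs):
> >
> >>>> stagemetrics.begin()
> >>>> spark.range(1000).crossJoin(spark.range(1000)).count()
> >>>> stagemetrics.end()
> >>>> stagemetrics.print_report()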
> >
> > Regards,
> > Hollis
> >
> > At 2021-12-24 09:18:19, bit...@bitfox.top wrote:
> >> Hello list,
> >>
> >> I run with Spark 3.2.0
> >>
> >> After I started pyspark with:
> >> $ pyspark --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17
> >>
> >> I can't import from the sparkmeasure module:
> >>
> >>>>> from sparkmeasure import StageMetrics
> >> Traceback (most recent call last):
> >>   File "<stdin>", line 1, in <module>
> >> ModuleNotFoundError: No module named 'sparkmeasure'
> >>
> >> Do you know why? @Luca thanks.
> >>
> >>
> >> On 2021-12-24 04:20, bit...@bitfox.top wrote:
> >>> Thanks Gourav and Luca. I will try the tools you provided on GitHub.
> >>>
> >>> On 2021-12-23 23:40, Luca Canali wrote:
> >>>> Hi,
> >>>>
> >>>> I agree with Gourav that just measuring execution time is a simplistic
> >>>> approach that may lead you to miss important details, in particular
> >>>> when running distributed computations.
> >>>>
> >>>> WebUI, REST API, and metrics instrumentation in Spark can be quite
> >>>> useful for further drill down. See
> >>>> https://spark.apache.org/docs/latest/monitoring.html
> >>>>
> >>>> You can also have a look at this tool, which takes care of automating
> >>>> the collection and aggregation of some executor task metrics:
> >>>> https://github.com/LucaCanali/sparkMeasure
> >>>>
> >>>> Best,
> >>>>
> >>>> Luca
> >>>>
> >>>> From: Gourav Sengupta <gourav.sengu...@gmail.com>
> >>>> Sent: Thursday, December 23, 2021 14:23
> >>>> To: bit...@bitfox.top
> >>>> Cc: user <user@spark.apache.org>
> >>>> Subject: Re: measure running time
> >>>>
> >>>> Hi,
> >>>>
> >>>> I do not think that such time comparisons make any sense at all in
> >>>> distributed computation. Simply comparing an RDD operation and a
> >>>> DataFrame operation by their start and stop times may not provide
> >>>> any valid information.
> >>>>
> >>>> You will have to look into the details of the timing and the steps.
> >>>> For example, please look at the Spark UI to see how timings are
> >>>> calculated in distributed computing mode; there are several well
> >>>> written papers on this.
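> >>>>
> >>>> That said, if you still want a rough wall-clock number in PySpark
> >>>> (spark.time() is, as far as I know, a Scala-shell helper and not
> >>>> exposed in PySpark), a minimal sketch is to time the action yourself;
> >>>> here "df" is a stand-in for whatever DataFrame or RDD you are testing:
> >>>>
> >>>> import time
> >>>> start = time.time()
> >>>> df.count()  # the action that actually triggers the computation
> >>>> print("elapsed: %.3f s" % (time.time() - start))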
> >>>>
> >>>> Thanks and Regards,
> >>>>
> >>>> Gourav Sengupta
> >>>>
> >>>> On Thu, Dec 23, 2021 at 10:57 AM <bit...@bitfox.top> wrote:
> >>>>
> >>>>> hello community,
> >>>>>
> >>>>> In pyspark, how can I measure the running time of a command?
> >>>>> I just want to compare the running time of the RDD API and the
> >>>>> DataFrame API, as in this blog of mine:
> >>>>>
> >>>>> https://bitfoxtop.wordpress.com/2021/12/23/count-email-addresses-using-sparks-rdd-and-dataframe/
> >>>>>
> >>>>> I tried spark.time() but it doesn't work.
> >>>>> Thank you.
