In these scenarios it's fairly standard to report the metrics, either
directly or through accumulators (
http://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka),
to a time-series database such as Graphite (http://graphite.wikidot.com/)
or OpenTSDB (http://opentsdb.net/), and to monitor progress through the UI
provided by the DB.
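
For instance, here is a rough sketch of the direct-reporting variant:
push a per-partition counter to Graphite's plaintext port from inside the
tasks. "graphite-host", the metric path, and processAccount are all
placeholders for your own setup:

  import java.io.PrintWriter
  import java.net.Socket

  val results = accounts.mapPartitionsWithIndex { (pid, iter) =>
    var done = 0L
    iter.map { account =>
      val r = processAccount(account)  // the real per-account work
      done += 1
      if (done % 100 == 0) {
        // Graphite plaintext protocol: "<path> <value> <epoch-secs>" on port 2003
        val sock = new Socket("graphite-host", 2003)
        val out = new PrintWriter(sock.getOutputStream, true)
        out.println(s"spark.myjob.partition.${pid}.processed $done ${System.currentTimeMillis / 1000}")
        out.close()
        sock.close()
      }
      r
    }
  }

(In practice you'd keep one connection or metrics client per partition
rather than opening a socket per report.)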

Alex Rovner
Director, Data Engineering
o: 646.759.0052
http://www.magnetic.com/

On Mon, Nov 30, 2015 at 1:43 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> My limited understanding of Spark tells me that a task is the smallest
> unit of work, and Spark itself won't give you much more. I wouldn't
> expect it to, since an "account" is a business entity, not a Spark
> one.
>
> What about using mapPartitions* to get at each partition's contents and
> do whatever you want there (log to stdout or whatever)? Just a thought.
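>
> Something along these lines, as an untested sketch (processAccount here
> stands in for the real per-account work):
>
>   val results = accounts.mapPartitionsWithIndex { (pid, iter) =>
>     var n = 0
>     iter.map { account =>
>       n += 1
>       if (n % 100 == 0) {
>         // lands in the executor's stdout log, viewable from the web UI
>         println(s"partition $pid: processed $n accounts")
>       }
>       processAccount(account)
>     }
>   }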
>
> Regards,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
> http://blog.jaceklaskowski.pl
> Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Sun, Nov 29, 2015 at 3:12 PM, Yuhao Yang <hhb...@gmail.com> wrote:
> > Hi all,
> >
> > I got a simple processing job for 20000 accounts on 8 partitions, so
> > roughly 2500 accounts per partition. Each account takes about 1s to
> > compute, which means each partition will take about 2500 seconds to
> > finish the batch.
> >
> > My question is how I can get detailed progress on how many accounts have
> > been processed in each partition during the computation. An ideal
> > solution would let me know how many accounts have been processed
> > periodically (say, every minute) so I can monitor and take action to
> > save some time. Right now the UI only tells me that the task is running.
> >
> > I know one solution is to split the data horizontally on the driver and
> > submit it to Spark in mini batches, yet I think that would waste some
> > cluster resources and create extra complexity for result handling.
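> >
> > (Concretely, something like this sketch, with processAccount standing
> > in for the per-account work:
> >
> >   val results = allAccounts.grouped(2000).toSeq.flatMap { batch =>
> >     sc.parallelize(batch, 8).map(processAccount).collect()
> >   }
> >
> > though each mini batch blocks on collect() before the next one starts.)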
> >
> > Any experience or best practice is welcome. Thanks a lot.
> >
> > Regards,
> > Yuhao
>
