In these scenarios it's fairly standard to report the metrics, either directly or through accumulators (http://spark.apache.org/docs/latest/programming-guide.html#accumulators), to a time-series database such as Graphite (http://graphite.wikidot.com/) or OpenTSDB (http://opentsdb.net/) and monitor progress through the UI the database provides.
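
A minimal sketch of that approach, assuming the Spark 1.x accumulator API (current when this thread was written) and a hypothetical Graphite endpoint; process() stands in for the real per-account work. One caveat: the driver-visible accumulator value generally only advances as tasks complete, so with just 8 long-running tasks the metric is coarse; more, smaller partitions (or the mapPartitions logging discussed below) give a finer signal.

import java.io.PrintWriter
import java.net.Socket

import org.apache.spark.{SparkConf, SparkContext}

object AccountProgress {
  // Stand-in for the ~1 s of real work per account described in the thread.
  private def process(account: Int): Unit = Thread.sleep(1000)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("account-progress"))

    // Named accumulator: each executor adds 1 per finished account, and the
    // name also makes it show up on the stage page of the web UI.
    val processed = sc.accumulator(0L, "accountsProcessed")

    // Driver-side daemon that samples the accumulator once a minute and
    // pushes it over Graphite's plaintext protocol. Host, port, and the
    // metric path are placeholders, not from the original thread.
    val reporter = new Thread {
      override def run(): Unit = while (true) {
        val socket = new Socket("graphite.example.com", 2003)
        val out = new PrintWriter(socket.getOutputStream, true)
        out.println(s"spark.jobs.accounts.processed ${processed.value} ${System.currentTimeMillis / 1000}")
        out.close(); socket.close()
        Thread.sleep(60000)
      }
    }
    reporter.setDaemon(true)
    reporter.start()

    val accounts = sc.parallelize(1 to 20000, 8)  // 8 partitions, as in the question
    accounts.foreach { account =>
      process(account)
      processed += 1L
    }
    sc.stop()
  }
}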
Alex Rovner
Director, Data Engineering
o: 646.759.0052
http://www.magnetic.com/

On Mon, Nov 30, 2015 at 1:43 PM, Jacek Laskowski <[email protected]> wrote:
> Hi,
>
> My limited understanding of Spark tells me that a task is the smallest
> possible unit of work, and Spark itself won't give you much below that.
> I wouldn't expect it to, since an "account" is a business entity, not a
> Spark one.
>
> What about using mapPartitions* to know the details of partitions and
> do whatever you want (log to stdout or whatever)? Just a thought.
>
> Regards,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ | http://blog.jaceklaskowski.pl
> Mastering Apache Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
> On Sun, Nov 29, 2015 at 3:12 PM, Yuhao Yang <[email protected]> wrote:
> > Hi all,
> >
> > I have a simple processing job for 20000 accounts on 8 partitions, so
> > roughly 2500 accounts per partition. Each account takes about 1 s to
> > compute, which means each partition needs about 2500 seconds to finish
> > its batch.
> >
> > My question is how to get detailed progress on how many accounts have
> > been processed in each partition during the computation. An ideal
> > solution would let me check the counts periodically (say, every
> > minute) so I can monitor the job and take action to save some time.
> > Right now the UI only tells me that a task is running.
> >
> > One solution I know of is to split the data horizontally on the driver
> > and submit it to Spark in mini-batches, but I think that would waste
> > cluster resources and create extra complexity for result handling.
> >
> > Any experience or best practice is welcome. Thanks a lot.
> >
> > Regards,
> > Yuhao
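
To make Jacek's mapPartitions suggestion concrete: a hedged sketch in which each executor wraps its partition's iterator and logs its own running count. TaskContext.get.partitionId identifies the partition; the reporting interval, process(), and logging to stdout (visible in the executor logs via the web UI) are illustrative choices, not from the thread.

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object PartitionProgress {
  // Stand-in for the ~1 s of real work per account.
  private def process(account: Int): Int = { Thread.sleep(1000); account }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-progress"))
    val accounts = sc.parallelize(1 to 20000, 8)

    val results = accounts.mapPartitions { iter =>
      val partition = TaskContext.get.partitionId
      var done = 0
      iter.map { account =>
        val out = process(account)
        done += 1
        // Every 100 accounts (~100 s at 1 s each), report to executor stdout;
        // a log4j logger or a StatsD/Graphite client would slot in here too.
        if (done % 100 == 0) println(s"partition $partition: $done accounts processed")
        out
      }
    }

    results.count()  // force evaluation; mapPartitions alone is lazy
    sc.stop()
  }
}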

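For completeness, a sketch of the mini-batch alternative Yuhao mentions and rejects: chunk the accounts on the driver and run one Spark job per chunk, so every finished job is a visible progress tick. The chunk size is arbitrary, and the per-job scheduling overhead plus the result stitching at the end are exactly the waste and complexity he is worried about.

import org.apache.spark.{SparkConf, SparkContext}

object MiniBatchProgress {
  // Stand-in for the ~1 s of real work per account.
  private def process(account: Int): Int = { Thread.sleep(1000); account }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mini-batch-progress"))
    val chunkSize = 800  // 100 accounts per partition per job; illustrative

    val results = (1 to 20000).grouped(chunkSize).zipWithIndex.flatMap {
      case (chunk, i) =>
        val out = sc.parallelize(chunk, 8).map(process).collect()  // one job per chunk
        println(s"mini-batch ${i + 1} done: ${(i + 1) * chunkSize} accounts submitted")
        out
    }.toVector

    println(s"processed ${results.size} accounts in total")
    sc.stop()
  }
}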