Yes, I am looking for a programmatic way of tracking progress.
SparkListener.scala does not track at the RDD item level, so it will not tell
how many items have been processed.

I wonder if there is any way to track the accumulator value, as it reflects the
correct number of items processed so far?

Ningjun

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, February 23, 2016 2:30 PM
To: Kevin Mellott
Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: How to get progress information of an RDD operation

I think Ningjun was looking for a programmatic way of tracking progress.

I took a look at:
./core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala

but there don't seem to be fine-grained events directly reflecting what
Ningjun is looking for.
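For reference, task-level events are about as fine-grained as the listener API goes. A minimal sketch of a listener that counts completed tasks (class and field names are my own, not from Spark):

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Counts finished tasks, not individual records -- the finest
// granularity the SparkListener events expose.
class TaskProgressListener extends SparkListener {
  val tasksCompleted = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val done = tasksCompleted.incrementAndGet()
    println(s"Tasks completed: $done")
  }
}

// Registered on the driver before running the job:
// sc.addSparkListener(new TaskProgressListener)
```

This only tells you how many tasks (partitions) have finished, not how many items, so it is coarser than what Ningjun wants.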

On Tue, Feb 23, 2016 at 11:24 AM, Kevin Mellott 
<kevin.r.mell...@gmail.com<mailto:kevin.r.mell...@gmail.com>> wrote:
Have you considered using the Spark Web UI to view progress on your job? It
does a very good job of showing the progress of the overall job, and also lets
you drill into the individual tasks and server activity.

On Tue, Feb 23, 2016 at 12:53 PM, Wang, Ningjun (LNG-NPV) 
<ningjun.w...@lexisnexis.com<mailto:ningjun.w...@lexisnexis.com>> wrote:
How can I get progress information of an RDD operation? For example

val lines = sc.textFile("c:/temp/input.txt")  // an RDD of millions of lines
lines.foreach(line => {
    handleLine(line)
})

The input.txt contains millions of lines. The entire operation takes 6 hours. I
want to print out how many lines have been processed every minute so users know
the progress. How can I do that?

One way I am thinking of is to use an accumulator, e.g.

val lines = sc.textFile("c:/temp/input.txt")
val acCount = sc.accumulator(0L)
lines.foreach(line => {
    handleLine(line)
    acCount += 1
})

However, how can I print out acCount every minute?
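One possible way (a sketch, not tested on a cluster; `handleLine` is assumed to be your own function and `sc` an existing SparkContext) is a driver-side scheduled thread that reads `acCount.value` once a minute while the job runs. Note that accumulator updates are only merged into the driver-side value as tasks complete, so the mid-job count is an approximation:

```scala
import java.util.concurrent.{Executors, TimeUnit}

val lines = sc.textFile("c:/temp/input.txt")
val acCount = sc.accumulator(0L)

// Driver-side reporter thread: prints the accumulator value once a minute.
// Updates from in-flight tasks only arrive as tasks finish, so the
// printed count lags slightly behind the true progress.
val reporter = Executors.newSingleThreadScheduledExecutor()
reporter.scheduleAtFixedRate(new Runnable {
  def run(): Unit = println(s"Processed so far: ${acCount.value} lines")
}, 1, 1, TimeUnit.MINUTES)

lines.foreach(line => {
  handleLine(line)
  acCount += 1
})

reporter.shutdownNow()
println(s"Done: ${acCount.value} lines")
```

The foreach call blocks on the driver until the job finishes, which is why the reporter has to run on its own thread.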


Ningjun


