Hi TD,
Do you know how to cancel the rdd.countApprox(5000) tasks after the timeout? 
Otherwise they keep running until completion, producing results that are never 
used while still consuming resources.
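
One possible approach, sketched here under assumptions rather than confirmed as the supported way: tag the approximate-count job with a job group and cancel that group once countApprox has returned. SparkContext.setJobGroup, cancelJobGroup and clearJobGroup are existing Spark calls; the object name, the countThenCancel helper and the groupId value are hypothetical, and whether interruptOnCancel = true is safe depends on the cluster and storage setup.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.partial.BoundedDouble
import org.apache.spark.rdd.RDD

object CancelAfterTimeoutSketch {
  // Hypothetical helper: returns the estimate available at the timeout,
  // then asks Spark to cancel whatever is still running in the job group.
  def countThenCancel(sc: SparkContext, rdd: RDD[_], timeoutMs: Long,
                      groupId: String): BoundedDouble = {
    // The job group is a thread-local property, so it must be set on the
    // same thread that calls countApprox.
    sc.setJobGroup(groupId, "approximate count", interruptOnCancel = true)
    try {
      // Returns after at most timeoutMs (or earlier if the job finishes).
      val estimate = rdd.countApprox(timeoutMs).initialValue
      // Cancels any active jobs in the group; nothing is left to cancel
      // if the job has already completed.
      sc.cancelJobGroup(groupId)
      estimate
    } finally {
      sc.clearJobGroup()
    }
  }
}
```

Called as CancelAfterTimeoutSketch.countThenCancel(sc, rdd, 5000L, "approx-count"), this would replace the plain rdd.countApprox(5000).initialValue call. In a streaming app each batch would probably need its own group id so that one batch does not cancel another's job.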

On Wednesday, May 13, 2015 10:33 AM, Du Li <l...@yahoo-inc.com.INVALID> wrote:

Hi TD,
Thanks a lot. rdd.countApprox(5000).initialValue worked! Now my streaming app 
stands a much better chance of processing each batch within the batch 
interval.

On Tuesday, May 12, 2015 10:31 PM, Tathagata Das <t...@databricks.com> wrote:

From the code it seems that as soon as "rdd.countApprox(5000)" returns, 
you can call "pResult.initialValue" to get the approximate count at that 
point in time (that is, after the timeout). Calling "pResult.getFinalValue()" 
will block further until the job is over, and give the final, exact value that 
you would have received from "rdd.count()".
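
The difference between the two accessors can be sketched as follows. This is a minimal illustration, assuming a local master and a made-up input RDD; the object name and the quickCount helper are hypothetical, while countApprox, initialValue, getFinalValue() and the BoundedDouble fields come from Spark's API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.partial.{BoundedDouble, PartialResult}
import org.apache.spark.rdd.RDD

object CountApproxSketch {
  // Hypothetical helper: the estimate available when the timeout expires,
  // without waiting for the remaining tasks.
  def quickCount(rdd: RDD[_], timeoutMs: Long): BoundedDouble = {
    val pResult: PartialResult[BoundedDouble] = rdd.countApprox(timeoutMs)
    pResult.initialValue
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val conf = new SparkConf().setAppName("countApprox-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000000, 8)

      // Returns after at most 5 seconds with whatever has been counted so far.
      val approx = quickCount(rdd, 5000L)
      println(s"approx: low=${approx.low.toLong}, mean=${approx.mean.toLong}, " +
        s"high=${approx.high.toLong}")

      // Blocks until the whole job is over, like rdd.count().
      val exact = rdd.countApprox(5000L).getFinalValue()
      println(s"final: mean=${exact.mean.toLong}")
    } finally {
      sc.stop()
    }
  }
}
```

If the job finishes before the timeout, initialValue already holds the final result, so low, mean and high coincide.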
On Tue, May 12, 2015 at 5:03 PM, Du Li <l...@yahoo-inc.com.invalid> wrote:

I tested the following in my streaming app, hoping to get an approximate 
count within 5 seconds. However, rdd.countApprox(5000).getFinalValue() seemed 
to always return only after the job had finished completely, just like 
rdd.count(), which often took more than 5 seconds. The values for low, mean, 
and high were the same.

    val pResult = rdd.countApprox(5000)
    val bDouble = 
    low=${bDouble.low.toLong}, mean=${bDouble.mean.toLong}, 

Can any expert here help explain the right way to use it?


On Wednesday, May 6, 2015 7:55 AM, Du Li <l...@yahoo-inc.com.INVALID> wrote:

I have to count RDDs in a Spark Streaming app. When the data gets large, 
count() becomes expensive. Does anybody have experience using countApprox()? 
How accurate/reliable is it? 
The documentation is pretty sparse. I suppose the timeout parameter is in 
milliseconds. Can I retrieve the count value by calling getFinalValue()? Does 
it block and return only after the timeout? Or do I need to define 
onComplete/onFail handlers to extract the count value from the partial results?
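
For the handler route asked about above, a rough sketch follows. It assumes PartialResult.onComplete and PartialResult.onFail behave as their names suggest (one completion handler per PartialResult, called once the job is over); the object name and the countWithHandlers helper are hypothetical, and bridging to a Scala Future is only one way to consume the callbacks.

```scala
import scala.concurrent.{Future, Promise}
import org.apache.spark.partial.BoundedDouble
import org.apache.spark.rdd.RDD

object CountApproxHandlersSketch {
  // Hypothetical helper: returns the estimate available at the timeout,
  // plus a Future completed by the onComplete/onFail handlers.
  def countWithHandlers(rdd: RDD[_], timeoutMs: Long): (BoundedDouble, Future[BoundedDouble]) = {
    val done = Promise[BoundedDouble]()
    val pResult = rdd.countApprox(timeoutMs)
    // Called with the final value once the job is over.
    pResult.onComplete { v => done.trySuccess(v); () }
    // Called if the job fails after countApprox has already returned.
    pResult.onFail { e => done.tryFailure(e); () }
    (pResult.initialValue, done.future)
  }
}
```

The first element is usable right away; the Future only matters if the exact count is still wanted later, so the handlers are not needed just to read the approximate value.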



