Hi,
I tested the following in my streaming app, hoping to get an approximate count 
within 5 seconds. However, rdd.countApprox(5000).getFinalValue() seemed to 
always return only after the job finished completely, just like rdd.count(), 
which often took longer than 5 seconds. The low, mean, and high values were 
all identical.
val pResult = rdd.countApprox(5000)
val bDouble = pResult.getFinalValue()
logInfo(s"countApprox().getFinalValue(): low=${bDouble.low.toLong}, mean=${bDouble.mean.toLong}, high=${bDouble.high.toLong}")
Can any expert here explain the right way to use it?
Thanks,
Du

     On Wednesday, May 6, 2015 7:55 AM, Du Li <l...@yahoo-inc.com.INVALID> 
wrote:

I have to count RDDs in a Spark streaming app. When the data grows large, 
count() becomes expensive. Does anybody have experience using countApprox()? 
How accurate/reliable is it?
The documentation is pretty sparse. I assume the timeout parameter is in 
milliseconds. Can I retrieve the count by calling getFinalValue()? Does it 
block and return only after the timeout? Or do I need to define 
onComplete/onFail handlers to extract the count from the partial results?
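On the handler question: PartialResult does expose onComplete and onFail, so 
a non-blocking style is possible. A sketch, assuming a local SparkContext 
(dataset and app name are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("countApprox-handlers").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000000, numSlices = 100)

// Register handlers on the PartialResult instead of blocking on it.
val pResult = rdd.countApprox(timeout = 5000)
pResult.onComplete { exact =>
  // Runs once the full job has finished; the bounds have collapsed by then.
  println(s"final count: ${exact.mean.toLong}")
}
pResult.onFail { e =>
  println(s"countApprox failed: ${e.getMessage}")
}

// Meanwhile the caller can still take the partial answer within the timeout.
val partial = pResult.initialValue
println(s"partial: low=${partial.low.toLong}, high=${partial.high.toLong}")

sc.stop()
```

So the two styles are complementary: initialValue for a bounded-time partial 
answer on the calling thread, onComplete/onFail for the eventual exact result.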
Thanks,
Du
