Late to the thread, but why is counting distinct elements over a
24-hour window not possible? You can certainly do it now, and I'd
presume it's possible with Structured Streaming with a window.
countByValueAndWindow should do it, right? The keys (with non-zero
counts, I suppose) in a window are the distinct values.
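(Not part of the original reply: a minimal, hedged sketch of that approach with the DStream API. The socket source, batch interval, and checkpoint path are my assumptions; countByValueAndWindow needs checkpointing because it is built on an inverse reduce.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("DistinctPricesOver24h")
val ssc = new StreamingContext(conf, Minutes(5))
ssc.checkpoint("/tmp/spark-checkpoint")   // required by the windowed count below

// assumed source: one price per line on a socket
val price = ssc.socketTextStream("localhost", 9999).map(_.trim.toDouble)

// (value, count) pairs over the last 24 hours, sliding every 5 minutes
val countsByValue = price.filter(_ > 95.0)
  .countByValueAndWindow(Minutes(24 * 60), Minutes(5))

// each key with a non-zero count is one distinct value seen in the window
val distinctInWindow = countsByValue.filter(_._2 > 0).count()
distinctInWindow.print()

ssc.start()
ssc.awaitTermination()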
Thank you guys for the help. I will try.
At 2016-05-18 07:17:08, "Mich Talebzadeh" wrote:
Thanks Chris,
In a nutshell, I don't think one can do that.
So let us see. Here is my program that is looking for share prices > 95.9.
It does work. It is pretty simple:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.Row
import
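(The rest of that program was cut off above. A rough reconstruction of what a simple "prices > 95.9" streaming filter might look like; the socket source, host/port, and batch interval are my assumptions, not the original code.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("HighPriceFilter")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// assumed source: one price per line on a socket
val price = ssc.socketTextStream("localhost", 9999).map(_.trim.toDouble)

// keep only prices above the 95.9 threshold and print each batch
price.filter(_ > 95.9).print()

ssc.start()
ssc.awaitTermination()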
You can use HyperLogLog with Spark Streaming to accomplish this.
Here is an example from my fluxcapacitor GitHub repo:
https://github.com/fluxcapacitor/pipeline/tree/master/myapps/spark/streaming/src/main/scala/com/advancedspark/streaming/rating/approx
Here's an accompanying SlideShare.
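(Not the code from the repo above, just a simplified sketch of the HyperLogLog idea using Algebird's HyperLogLogMonoid on a DStream. It assumes a DStream[Double] called price, as in the program earlier in the thread, and an Algebird version that provides HyperLogLogMonoid.create and HLL.estimatedSize.)

import com.twitter.algebird.HyperLogLogMonoid
import org.apache.spark.streaming.Minutes

// 14 bits of precision gives roughly a 1% standard error
val hll = new HyperLogLogMonoid(14)

// fold each price into a small HLL sketch, merge sketches over a 24-hour window,
// then read off the approximate number of distinct prices in that window
val approxDistinct = price.filter(_ > 95.0)
  .map(p => hll.create(p.toString.getBytes("UTF-8")))
  .reduceByWindow(_ + _, Minutes(24 * 60), Minutes(5))
  .map(_.estimatedSize.toLong)

approxDistinct.print()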
OK, but how about something similar to
val countByValueAndWindow = price.filter(_ > 95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
using a new count => countDistinctByValueAndWindow?
val countDistinctByValueAndWindow = price.filter(_ > 95.0).countDistinctByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
In 2.0 you won't be able to do this. The long-term vision would be to make
this possible, but a window will be required (like the 24 hours you
suggest).
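(Editorial aside, not part of the reply: in later Spark releases, roughly 2.2 and up, a windowed, approximate distinct count can be written in Structured Streaming along these lines. The prices DataFrame, its eventTime and price columns, and the watermark are assumptions.)

import org.apache.spark.sql.functions.{approx_count_distinct, col, window}

// group the stream into 24-hour event-time windows and approximate the
// number of distinct prices within each window
val distinctPerDay = prices
  .withWatermark("eventTime", "1 hour")
  .groupBy(window(col("eventTime"), "24 hours"))
  .agg(approx_count_distinct(col("price")).as("approxDistinctPrices"))

val query = distinctPerDay.writeStream
  .outputMode("update")
  .format("console")
  .start()

query.awaitTermination()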
On Tue, May 17, 2016 at 1:36 AM, Todd wrote:
> Hi,
> We have a requirement to do count(distinct) in a processing batch
Hi,
We have a requirement to do count(distinct) in a processing batch against all
the streaming data (e.g., the last 24 hours' data); that is, when we do
count(distinct), we actually want to compute the distinct count over the last
24 hours of data.
Does Structured Streaming support this scenario? Thanks!