Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-18 Thread Sean Owen
Late to the thread, but why is counting distinct elements over a 24-hour window not possible? You can certainly do it now, and I'd presume it's possible with Structured Streaming with a window. countByValueAndWindow should do it, right? The keys (with non-zero counts, I suppose) in a window are
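The windowed approach Sean describes (count the keys that have a non-zero count inside a sliding window) can be sketched outside Spark as an exact counter. This is an illustrative Python sketch, not Spark's API, and the class name is invented for the example:

```python
from collections import deque


class WindowedDistinctCounter:
    """Exact distinct count over a sliding time window (illustrative sketch)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, value) in arrival order
        self.counts = {}        # value -> occurrences still inside the window

    def add(self, timestamp, value):
        self.events.append((timestamp, value))
        self.counts[value] = self.counts.get(value, 0) + 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window and decrement
        # their per-value counts; a value with count 0 is no longer distinct.
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def distinct(self):
        return len(self.counts)
```

The memory cost is proportional to the number of events in the window, which is exactly why the thread turns to approximate structures such as HyperLogLog for large windows.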

Re: Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Todd
Thank you guys for the help. I will try. At 2016-05-18 07:17:08, "Mich Talebzadeh" wrote: Thanks Chris, In a nutshell I don't think one can do that. So let us see. Here is my program that is looking for share prices > 95.9. It does work. It is pretty simple

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Mich Talebzadeh
Thanks Chris. In a nutshell, I don't think one can do that. So let us see. Here is my program that is looking for share prices > 95.9. It does work. It is pretty simple: import org.apache.spark.SparkContext import org.apache.spark.SparkConf import org.apache.spark.sql.Row import

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Chris Fregly
You can use HyperLogLog with Spark Streaming to accomplish this. Here is an example from my fluxcapacitor GitHub repo: https://github.com/fluxcapacitor/pipeline/tree/master/myapps/spark/streaming/src/main/scala/com/advancedspark/streaming/rating/approx Here's an accompanying SlideShare
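For readers unfamiliar with the technique Chris mentions: HyperLogLog estimates the number of distinct items in fixed memory by hashing each item, routing it to one of m registers, and tracking per register the longest run of leading zero bits observed. A minimal, self-contained Python sketch of the idea (this is not the code in the linked repo):

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counting in fixed memory (illustrative sketch)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def add(self, item):
        # 64-bit hash: first p bits choose a register, the rest feed the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero run + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

Because the state is just the m registers (a few KB), it merges cheaply across batches and windows, which is what makes it a good fit for the streaming case discussed here; the price is a small relative error (roughly 1% at p=14).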

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Mich Talebzadeh
OK, but how about something similar to val countByValueAndWindow = price.filter(_ > 95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval)) using a new count, countDistinctByValueAndWindow? val countDistinctByValueAndWindow = price.filter(_ >
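Mich's hypothetical countDistinctByValueAndWindow is derivable from what countByValueAndWindow already returns: the distinct count over a window is just the number of keys whose windowed count is non-zero. A sketch with invented sample data standing in for the (value, count) pairs a window would emit (this is Python pseudo-data, not Spark code):

```python
# Hypothetical per-window output of countByValueAndWindow: value -> count.
# The prices and counts are made up for illustration.
window_counts = {96.1: 3, 97.5: 1, 95.2: 0, 98.0: 2}

# The distinct count over the window is the number of non-zero keys.
distinct_in_window = sum(1 for c in window_counts.values() if c > 0)
print(distinct_in_window)  # 3
```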

Re: Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Michael Armbrust
In 2.0 you won't be able to do this. The long-term vision would be to make this possible, but a window will be required (like the 24 hours you suggest). On Tue, May 17, 2016 at 1:36 AM, Todd wrote: > Hi, > We have a requirement to do count(distinct) in a processing batch

Does Structured Streaming support count(distinct) over all the streaming data?

2016-05-17 Thread Todd
Hi, We have a requirement to do count(distinct) in a processing batch against all the streaming data (e.g., the last 24 hours' data); that is, when we do count(distinct), we actually want to compute the distinct count over the last 24 hours' data. Does Structured Streaming support this scenario? Thanks!