Hello -

I am trying to implement an outlier detection application on streaming data.
I am new to Spark, so I would like some advice on a few points I am
confused about ..

I am thinking of using StreamingKMeans - is this a good choice? I have a
single stream of data and I need an online algorithm. But here are some
questions that immediately come to mind ..
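
For context, here is roughly the setup I have in mind. The Kafka DStream
(lines), the feature dimension (numFeatures), and all parameter values are
placeholders:

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

// lines is the DStream[String] coming from Kafka (placeholder)
val trainingData = lines.map(s => Vectors.dense(s.split(',').map(_.toDouble)))

val numFeatures = 10  // feature dimension (placeholder)

val model = new StreamingKMeans()
  .setK(5)                             // picked arbitrarily for now (see question 2)
  .setDecayFactor(1.0)                 // 1.0 = weight all past batches equally
  .setRandomCenters(numFeatures, 0.0)  // random initial centers with zero weight

model.trainOn(trainingData)  // the model is updated on every microbatch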

1. I cannot do separate training, cross-validation, etc. Is it a good idea
to do training and prediction online?

2. The data will be read from a Kafka stream in microbatches of (say) 3
seconds. I get a DStream, on which I train and get the clusters. How can I
decide on the number of clusters? With StreamingKMeans, is there any way to
iterate over the microbatches with different values of k to find the
optimal one? The sketch below shows what I mean.
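
What I am imagining is something like the following: train one model per
candidate k in parallel (reusing trainingData and numFeatures from the
sketch above; the candidate values are made up), then compare their
computeCost on every microbatch as in the snippet under question 3. I have
no idea whether this is sensible:

val candidateKs = Seq(2, 4, 8, 16)  // made-up candidates
val models = candidateKs.map { k =>
  new StreamingKMeans()
    .setK(k)
    .setDecayFactor(1.0)
    .setRandomCenters(numFeatures, 0.0)
}
models.foreach(_.trainOn(trainingData))  // every model sees every microbatch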

3. Even if I fix k, after training on every microbatch I still have a
DStream. How can I compute things like a clustering score on the DStream?
StreamingKMeansModel has a computeCost function, but it takes an RDD. I can
use dstream.foreachRDD { // process the RDD for the microbatch here } as in
the snippet below - is this the idiomatic way?
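
Concretely, something like this (using model from the first sketch;
latestModel() returns the current StreamingKMeansModel, and computeCost
gives the within-set sum of squared errors):

trainingData.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {  // skip empty microbatches (see question 4)
    val cost = model.latestModel().computeCost(rdd)
    println(s"WSSSE for this microbatch: $cost")
  }
}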

4. If I use dstream.foreachRDD { .. } and call functions like new
StandardScaler().fit(rdd) to do feature normalization, it works as long as
there is data in the stream. But when a microbatch is empty (say no data
arrives for a while), the fit method throws an exception because it gets an
empty collection. Things start working again once data comes back. Guarding
against empty RDDs, as in the snippet below, makes it work - but is this
the way to go?
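
This is the guard I have right now (same trainingData as above); the
rdd.isEmpty() check is what keeps fit() from blowing up, but it feels
clumsy:

import org.apache.spark.mllib.feature.StandardScaler

trainingData.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {  // without this guard, fit() throws on an empty microbatch
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(rdd)
    val scaled = scaler.transform(rdd)  // normalized features for this microbatch
    // ... train / score on scaled here
  }
}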

Any suggestions are welcome.

regards.


