Thanks, TD. Happy to share my experience with MLlib + Spark Streaming 
integration.

Here's a gist with two examples I have working, one for 
StreamingLinearRegression and another for StreamingKMeans.

https://gist.github.com/freeman-lab/9672685

The goal in each case was to implement a streaming version of the algorithm, 
using as much as possible directly from MLlib. For linear regression this was 
straightforward, because the MLlib version already uses a (stochastic) update 
rule, which I apply to each new batch of data inside a foreachRDD() to update 
the model. For KMeans, I used the model class from MLlib, but extended it to 
keep a running count for each cluster. I also had to re-implement a chunk of 
the core algorithm in the form of an update rule. Tighter integration in this 
case would, I think, require refactoring some of MLlib (e.g. to use something 
like this update function), but the current approach works fine.
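
As a rough illustration of the two update rules described above (a sketch 
only, not the code in the gist): plain Python lists stand in for the RDDs of 
a DStream, and the function names sgd_update and kmeans_update, the step_size 
parameter, and the squared-error loss are all assumptions made for this 
example. MLlib's own update rule may differ in details such as step-size 
decay and regularization.

```python
def sgd_update(weights, batch, step_size=0.1):
    """Take one gradient step on the squared error of a single batch.

    batch is a list of (features, label) pairs; weights is a list of floats.
    """
    n = len(batch)
    if n == 0:
        return list(weights)  # empty batch: model is unchanged
    grad = [0.0] * len(weights)
    for x, y in batch:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / n
    return [w - step_size * g for w, g in zip(weights, grad)]


def kmeans_update(centers, counts, batch):
    """Fold one batch of points into the centers, weighted by running counts.

    counts[k] is the number of points assigned to cluster k so far.
    """
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    batch_counts = [0] * len(centers)
    for p in batch:
        # Assign the point to its nearest center (squared Euclidean distance).
        k = min(
            range(len(centers)),
            key=lambda i: sum((c - v) ** 2 for c, v in zip(centers[i], p)),
        )
        batch_counts[k] += 1
        for j in range(dim):
            sums[k][j] += p[j]

    new_centers, new_counts = [], []
    for c, n_old, s, n_new in zip(centers, counts, sums, batch_counts):
        total = n_old + n_new
        if n_new == 0:
            new_centers.append(list(c))  # no new points: center stays put
        else:
            # Running mean: old center weighted by its count, plus new points.
            new_centers.append(
                [(cj * n_old + sj) / total for cj, sj in zip(c, s)]
            )
        new_counts.append(total)
    return new_centers, new_counts
```

In a streaming job, each function would be called once per batch (for 
example from inside foreachRDD()), with the returned model carried over to 
the next batch.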

One unresolved issue: for these kinds of algorithms, the dimensionality of the 
data must be known in advance. Would be cool to automatically detect it based 
on the first record.
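
One possible way to do that detection, sketched in plain Python under 
assumptions: the class name LazyLinearModel and its update method are 
hypothetical and not part of MLlib or the gist. The weight vector is left 
unset until the first non-empty batch arrives, then sized from the first 
record; later records with a different length are rejected.

```python
class LazyLinearModel:
    """Hypothetical linear model whose dimension is set by the first record."""

    def __init__(self):
        self.weights = None  # dimensionality is unknown until data arrives

    def update(self, batch, step_size=0.1):
        """Apply one squared-error gradient step using (features, label) pairs."""
        if not batch:
            return  # nothing to learn from; dimension stays unknown
        if self.weights is None:
            first_features, _ = batch[0]
            self.weights = [0.0] * len(first_features)

        dim = len(self.weights)
        for x, _ in batch:
            if len(x) != dim:
                raise ValueError(
                    "expected %d features, got %d" % (dim, len(x))
                )

        n = len(batch)
        grad = [0.0] * dim
        for x, y in batch:
            err = sum(w * xi for w, xi in zip(self.weights, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / n
        self.weights = [w - step_size * g for w, g in zip(self.weights, grad)]
```

The check on every later record is a design choice for this sketch: once 
the dimension is inferred, a mismatched record is treated as an error 
rather than silently padded or truncated.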

-- Jeremy

On Mar 19, 2014, at 9:03 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> Yes, of course you can conceptually apply machine learning algorithms to 
> Spark Streaming. The current MLlib does not yet have direct support for 
> Spark Streaming's DStream. However, since DStreams are essentially a sequence 
> of RDDs, you can apply MLlib algorithms to those RDDs. Take a look at the 
> DStream.transform() and DStream.foreachRDD() operations, which allow you to 
> access the RDDs in a DStream. You can apply MLlib functions to them.
> 
> Some people have attempted to make a tighter integration between MLlib and 
> Spark Streaming. Jeremy (cc'ed) can say more about his adventures. 
> 
> TD
> 
> 
> On Sun, Mar 16, 2014 at 5:56 PM, Nasir Khan <nasirkhan.onl...@gmail.com> 
> wrote:
> Hi, I'm working on a project in which I have to take streaming URLs, filter
> them, and classify them as benign or suspicious. Machine learning and
> streaming are two separate things in Apache Spark (AFAIK). My question is:
> can we apply online machine learning algorithms to streams?
> 
> I am at a beginner level. Kindly explain in a bit of detail, and if someone
> can direct me to some good material, that would be great.
> 
> Thanks
> Nasir Khan.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Machine-Learning-on-streaming-data-tp2732.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
