Thanks TD, happy to share my experience with MLLib + Spark Streaming integration.

Here's a gist with two examples I have working, one for StreamingLinearRegression and another for StreamingKMeans:

https://gist.github.com/freeman-lab/9672685

The goal in each case was to implement a streaming version of the algorithm, using as much as possible directly from MLLib. For Linear Regression this was straightforward, because the MLLib version already uses a (stochastic) update rule, which I just use to update the model inside a foreachRDD(), using each new batch of data.

For KMeans, I used the model class from MLLib, but extended it to keep a running count for each cluster. I also had to re-implement a chunk of the core algorithm in the form of an update rule. Tighter integration in this case would, I think, require refactoring some of MLLib (e.g. to use something like this update function), but this works fine.

One unresolved issue: for these kinds of algorithms, the dimensionality of the data must be known in advance. It would be cool to detect it automatically from the first record.

-- Jeremy

On Mar 19, 2014, at 9:03 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> Yes, of course you can conceptually apply machine learning algorithms on Spark
> Streaming. However, the current MLLib does not yet have direct support for
> Spark Streaming's DStream. Still, since DStreams are essentially a sequence
> of RDDs, you can apply MLLib algorithms on those RDDs. Take a look at the
> DStream.transform() and DStream.foreachRDD() operations, which allow you to
> access the RDDs in a DStream. You can apply MLLib functions on them.
>
> Some people have attempted to make a tighter integration between MLLib and
> Spark Streaming. Jeremy (cc'ed) can say more about his adventures.
>
> TD
>
>
> On Sun, Mar 16, 2014 at 5:56 PM, Nasir Khan <nasirkhan.onl...@gmail.com>
> wrote:
> Hi, I'm working on a project in which I have to take streaming URLs, filter
> them, and classify them as benign or suspicious.
> Now, machine learning and streaming
> are two separate things in Apache Spark (AFAIK). My question is: can we apply
> online machine learning algorithms on streams?
>
> I am at a beginner level. Kindly explain in a bit of detail, and if someone can
> direct me to some good material, that would be great.
>
> Thanks
> Nasir Khan.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Machine-Learning-on-streaming-data-tp2732.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
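
For readers following the StreamingKMeans description in Jeremy's reply at the top of this thread, here is a minimal sketch of what a per-batch update rule with a running count per cluster could look like. It is plain Python, not Spark or MLLib code, and it is not the code from the gist. The function names (squared_distance, closest_center, update_centers) are made up for illustration, and the weighting scheme (each center is the running mean of every point ever assigned to it) is an assumption about one reasonable update rule.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_center(point: Sequence[float], centers: Sequence[Vector]) -> int:
    """Index of the center nearest to `point` (ties go to the lowest index)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def update_centers(centers: Sequence[Vector],
                   counts: Sequence[int],
                   batch: Sequence[Vector]) -> Tuple[List[Vector], List[int]]:
    """Fold one batch of points into the model (hypothetical update rule).

    `counts[i]` is the number of points assigned to center i so far.
    Each center becomes the count-weighted mean of its old position and
    the points from this batch assigned to it. Inputs are not mutated.
    """
    k = len(centers)
    dim = len(centers[0])  # dimensionality must be known up front

    # Per-cluster sums and counts for this batch only.
    sums = [[0.0] * dim for _ in range(k)]
    batch_counts = [0] * k
    for point in batch:
        i = closest_center(point, centers)
        batch_counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]

    new_centers: List[Vector] = []
    new_counts: List[int] = []
    for i in range(k):
        total = counts[i] + batch_counts[i]
        if batch_counts[i] == 0:
            # No new points for this cluster: keep the center as it is.
            new_centers.append(list(centers[i]))
        else:
            new_centers.append([
                (centers[i][d] * counts[i] + sums[i][d]) / total
                for d in range(dim)
            ])
        new_counts.append(total)
    return new_centers, new_counts
```

In a streaming job, something like update_centers would presumably be called once per batch (for example from inside a foreachRDD()), carrying the centers and counts forward between batches. Note that starting with all counts at 0 means the initial centers only decide the first assignment and are then fully replaced by the data; starting with a positive count would give them some weight.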