MLlib supports streaming linear models: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression and k-means: http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
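A minimal sketch of what training one of those streaming models looks like; the directory path, batch interval, feature count, and step size below are all placeholders, and the input format is whatever `LabeledPoint.parse` expects:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

val conf = new SparkConf().setAppName("StreamingSGDSketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical directory watched for new training files
val training = ssc.textFileStream("/path/to/training").map(LabeledPoint.parse)

val numFeatures = 3 // placeholder dimensionality
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
  .setNumIterations(1) // one pass per batch, i.e. mini-batch SGD
  .setStepSize(0.1)

model.trainOn(training) // weights are updated after every streaming batch
ssc.start()
ssc.awaitTermination()
```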
With an iteration parameter of 1, this amounts to mini-batch SGD where the mini-batch is the Spark Streaming batch.

On Mon, Mar 16, 2015 at 2:57 PM, Alex Minnaar <aminn...@verticalscope.com> wrote:
> I wanted to ask a basic question about the types of algorithms that are
> possible to apply to a DStream with Spark Streaming. With Spark it is
> possible to perform iterative computations on RDDs, as in the gradient
> descent example:
>
> val points = spark.textFile(...).map(parsePoint).cache()
> var w = Vector.random(D) // current separating plane
> for (i <- 1 to ITERATIONS) {
>   val gradient = points.map(p =>
>     (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
>   ).reduce(_ + _)
>   w -= gradient
> }
>
> which has a global state w that is updated after each iteration, and the
> updated value is then used in the next iteration. My question is whether
> this type of algorithm is possible if the points variable were a DStream
> instead of an RDD. It seems like you could perform the same map as above,
> which would create a gradient DStream, and also use updateStateByKey to
> create a DStream for the w variable. But the problem is that there doesn't
> seem to be a way to reuse the w DStream inside the map. I don't think it is
> possible for DStreams to communicate this way. Am I correct that this is
> not possible with DStreams, or am I missing something?
>
> Note: the reason I ask is that many machine learning algorithms are
> trained by stochastic gradient descent. SGD is similar to the gradient
> descent algorithm above, except that each iteration runs on a new
> "mini-batch" of data points rather than on the same data points every
> iteration. Spark Streaming seems to provide a natural way to stream in
> these mini-batches (as RDDs), but if it cannot keep track of an updating
> global state variable, then I don't think Spark Streaming can be used
> for SGD.
>
> Thanks,
>
> Alex
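For the general pattern the question describes, the usual workaround is not to model w as a second DStream at all, but to keep it as driver-side state and fold each batch's gradient into it inside foreachRDD. A sketch, assuming points arrive as a DStream[LabeledPoint]; the function name and parameters are illustrative, and the gradient matches the logistic-loss example quoted above:

```scala
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.mllib.regression.LabeledPoint

def trainOn(points: DStream[LabeledPoint], numFeatures: Int, stepSize: Double): Unit = {
  var w = Array.fill(numFeatures)(0.0) // driver-side global state

  points.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // Ship the current weights to the executors for this batch only
      val wB = rdd.sparkContext.broadcast(w)
      val gradient = rdd
        .map { p =>
          val x = p.features.toArray
          val margin = p.label * wB.value.zip(x).map { case (a, b) => a * b }.sum
          val coeff = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * p.label
          x.map(_ * coeff)
        }
        .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      // Gradient step on the driver; the updated w is re-broadcast next batch
      w = w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi }
    }
  }
}
```

This is essentially what StreamingLinearRegressionWithSGD does internally: the model weights live outside the DStream, and each streaming batch becomes one mini-batch update.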