MLlib supports streaming linear models:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
and streaming k-means:
http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means

With an iteration parameter of 1, this amounts to mini-batch SGD where the
mini-batch is the Spark Streaming batch.
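
For example, here is a minimal sketch of the streaming linear regression
API set up that way (the StreamingContext ssc, the training path, and
numFeatures are illustrative, not taken from the docs above):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

    // Each line of text is parsed into a LabeledPoint; each streaming
    // batch then becomes one mini-batch for SGD.
    val trainingStream = ssc.textFileStream("/path/to/training").map(LabeledPoint.parse)

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures))  // starting weight vector
      .setStepSize(0.1)
      .setNumIterations(1)  // one SGD step per streaming batch

    model.trainOn(trainingStream)  // weights are updated as each batch arrives
    ssc.start()
    ssc.awaitTermination()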

On Mon, Mar 16, 2015 at 2:57 PM, Alex Minnaar <aminn...@verticalscope.com>
wrote:

>  I wanted to ask a basic question about the types of algorithms that can
> be applied to a DStream with Spark Streaming.  With Spark it is possible
> to perform iterative computations on RDDs, as in the gradient descent
> example
>
>
>    val points = spark.textFile(...).map(parsePoint).cache()
>    var w = Vector.random(D) // current separating plane
>    for (i <- 1 to ITERATIONS) {
>      val gradient = points.map(p =>
>        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
>      ).reduce(_ + _)
>      w -= gradient
>    }
>
>
>  which has a global state w that is updated after each iteration, and the
> updated value is then used in the next iteration.  My question is whether
> this type of algorithm would be possible if the points variable were a
> DStream instead of an RDD.  It seems like you could perform the same map
> as above, which would create a gradient DStream, and also use
> updateStateByKey to create a DStream for the w variable.  But the problem
> is that there doesn't seem to be a way to reuse the w DStream inside the
> map.  I don't think it is possible for DStreams to communicate this way.
> Am I correct that this is not possible with DStreams, or am I missing
> something?
>
>
>  Note:  The reason I ask this question is that many machine learning
> algorithms are trained by stochastic gradient descent (SGD).  SGD is
> similar to the above gradient descent algorithm except that each iteration
> is on a new "mini-batch" of data points rather than the same data points
> for every iteration.  It seems like Spark Streaming provides a natural way
> to stream in these mini-batches (as RDDs), but if it is not able to keep
> track of an updating global state variable then I don't think Spark
> Streaming can be used for SGD.
>
>
>  Thanks,
>
>
>  Alex
>
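
On the global-state question in the quoted message: rather than trying to
reference one DStream from inside another DStream's map, you can keep w as
a plain variable on the driver and update it once per batch inside
foreachRDD, which is roughly what the streaming models above do for you.
A rough sketch along the lines of the quoted example (the Point case class,
the streamingSgd helper, and the use of Breeze vectors are my own
illustration, not part of the original code):

    import breeze.linalg.DenseVector
    import org.apache.spark.streaming.dstream.DStream

    case class Point(x: DenseVector[Double], y: Double)

    // points: a DStream of parsed data points; D: number of features.
    def streamingSgd(points: DStream[Point], D: Int): Unit = {
      var w = DenseVector.rand(D)       // driver-side state, carried across batches
      points.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          val wLocal = w                // snapshot of the current weights for this batch
          val gradient = rdd.map { p =>
            p.x * ((1.0 / (1.0 + math.exp(-p.y * (wLocal dot p.x))) - 1.0) * p.y)
          }.reduce(_ + _)
          w -= gradient                 // one gradient step per streaming mini-batch
        }
      }
    }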
