I wanted to ask a basic question about the kinds of algorithms that can be applied to a DStream with Spark Streaming.  With plain Spark it is possible to perform iterative computations on RDDs, as in the following gradient descent example:


  val points = spark.textFile(...).map(parsePoint).cache()
  var w = Vector.random(D) // current separating plane
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }


which has a global state w that is updated after each iteration, with the 
updated value then used in the next iteration.  My question is whether this 
type of algorithm would still be possible if the points variable were a 
DStream instead of an RDD.  It seems like you could perform the same map as 
above, producing a gradient DStream, and also use updateStateByKey to create 
a DStream for the w variable.  The problem is that there doesn't seem to be a 
way to reference the w DStream inside the map over points; as far as I can 
tell, DStreams cannot communicate with each other this way.  Am I correct 
that this is not possible with DStreams, or am I missing something?
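
The closest workaround I could come up with is to keep w as a plain var on 
the driver and update it inside foreachRDD, which (as I understand it) runs 
its function on the driver once per batch interval.  Here is a rough sketch, 
reusing the Vector, parsePoint, and D names from the example above plus a 
hypothetical ssc StreamingContext and socket source:

  val points = ssc.socketTextStream("localhost", 9999).map(parsePoint)

  var w = Vector.random(D) // lives on the driver, outside any DStream

  points.foreachRDD { batch =>
    // this function runs on the driver once per batch, so it can read
    // the current w and assign the updated value back to the var
    if (batch.count() > 0) {
      val gradient = batch.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
  }

If I understand closures correctly, each batch ships the then-current value 
of w to the executors inside the map closure, so tasks see a per-batch 
snapshot of w while the mutation itself happens only on the driver.  But I'm 
not sure whether this is the intended pattern.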


Note:  The reason I ask is that many machine learning algorithms are trained 
by stochastic gradient descent (SGD).  SGD is similar to the gradient descent 
algorithm above, except that each iteration runs on a new "minibatch" of data 
points rather than on the same data points every iteration.  Spark Streaming 
seems to provide a natural way to stream in these minibatches (as RDDs), but 
if it cannot keep track of an updating global state variable then I don't 
think Spark Streaming can be used for SGD.
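
Concretely, what I would want is one SGD step per arriving RDD, scaled by a 
step size; in the same pseudocode style as above (alpha is a hypothetical 
fixed learning rate, and minibatches is the DStream of parsed points):

  val alpha = 0.1 // hypothetical fixed learning rate

  minibatches.foreachRDD { batch =>
    val n = batch.count()
    if (n > 0) {
      // one SGD step on this minibatch, averaging the gradient over it
      val gradient = batch.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient * (alpha / n)
    }
  }

This is just the foreachRDD sketch above with a learning rate and 
per-minibatch averaging, so it stands or falls with the same 
driver-side-state approach.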


Thanks,


Alex
