Hi,

Spark Streaming has a `mapWithState` API to run a map over a stream while maintaining state as elements are read.
The core RDD API does not seem to have anything similar. Given an RDD of elements of type T, an initial state of type S, and a map function (S, T) -> (S, T), it would return an RDD of Ts obtained by applying the map function in sequence, updating the state as each element is mapped.

There seems to be some interest in this on the user mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-td10968.html
The solution suggested there is to use mapPartitions, but that does not make it possible to carry the state from one partition to the next.

I am thinking a proper mapWithState could be implemented with a dedicated RDD, along the lines of what zipWithIndex does: when the RDD is created, run a job to compute the state at each partition boundary, then store those states in the resulting partitions. That lets each partition be iterated independently, starting from its stored state.

Obviously I can do this in my own project with a custom RDD. Would there be appetite to have this in Spark itself?

Cheers,
Antonin

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
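To make the two-pass idea concrete, here is a minimal sketch using plain Scala collections in place of RDD partitions. The object and method names are hypothetical, not Spark API; a real implementation would run pass 1 as a Spark job over the partitions, the way zipWithIndex computes per-partition start indices.

```scala
// Sketch of mapWithState over pre-partitioned data, threading a state S
// through all elements in partition order. Seq[Seq[T]] stands in for the
// partitions of an RDD[T].
object MapWithStateSketch {
  def mapWithState[S, T](partitions: Seq[Seq[T]], init: S)(
      f: (S, T) => (S, T)): Seq[Seq[T]] = {
    // Pass 1 (the zipWithIndex-style job): fold f over each partition to
    // compute the state at every partition boundary. scanLeft also yields
    // the final state, which zip below simply drops.
    val boundaryStates: Seq[S] = partitions.scanLeft(init) { (s, part) =>
      part.foldLeft(s)((acc, t) => f(acc, t)._1)
    }
    // Pass 2: each partition is mapped independently, starting from its
    // stored boundary state.
    partitions.zip(boundaryStates).map { case (part, s0) =>
      var s = s0
      part.map { t =>
        val (s1, t1) = f(s, t)
        s = s1
        t1
      }
    }
  }
}
```

For example, a running sum with `f = (s, t) => (s + t, s + t)` over partitions `Seq(Seq(1, 2), Seq(3, 4))` and `init = 0` yields `Seq(Seq(1, 3), Seq(6, 10))`; note pass 2 of the second partition starts from the stored boundary state 3 without re-reading the first partition.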