Hi,

Spark Streaming has a `mapWithState` API to run a map on a stream while
maintaining a state as elements are read.

The core RDD API does not seem to have anything similar. Given an RDD of
elements of type T, an initial state of type S, and a map function (S, T)
-> (S, T), return an RDD of Ts obtained by applying the map function in
sequence, updating the state as each element is mapped.

There seems to be some interest on the user mailing list for this:
http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-td10968.html
The solution suggested there is to use mapPartitions, but that does not
make it possible to share the state from one partition to another.

I am thinking a proper mapWithState could be implemented with a
dedicated RDD, along the lines of what zipWithIndex does: when the RDD
is created, run a job to compute the state at each partition boundary,
then store those boundary states in the returned partitions, so that
each partition can be iterated independently, starting from its stored
state.

Obviously I can do this in my own project with a custom RDD. Would there
be appetite to have this in Spark itself?

Cheers,
Antonin

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
