Yes, this is what we've done as of now (if you read the earlier threads). And we were saying that we'd prefer it if Spark supported persistent worker memory management in a slightly less hacky way ;)
On Mon, Apr 28, 2014 at 8:44 AM, Ian O'Connell <i...@ianoconnell.com> wrote:

> A mutable map in an object should do what you're looking for then, I
> believe. You just reference the object as an object in your closure, so it
> won't be swept up when your closure is serialized, and you can reference
> variables of the object on the remote host. e.g.:
>
> object MyObject {
>   val mmap = scala.collection.mutable.Map[Long, Long]()
> }
>
> rdd.map { ele =>
>   MyObject.mmap.getOrElseUpdate(ele, 1L)
>   ...
> }.map { ele =>
>   require(MyObject.mmap(ele) == 1L)
> }.count
>
> Along with the data loss, just be careful with thread safety and multiple
> threads/partitions on one host, so the map should be viewed as shared
> amongst a larger space.
>
> Also, with your exact description it sounds like your data should be
> encoded into the RDD if it's per-record/per-row: RDD[(MyBaseData,
> LastIterationSideValues)]
>
> On Mon, Apr 28, 2014 at 1:51 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:
>
>> In our case, we'd like to keep memory content from one iteration to the
>> next, and not just during a single mapPartition call, because then we can
>> do more efficient computations using the values from the previous
>> iteration.
>>
>> So essentially, we need to declare objects outside the scope of the
>> map/reduce calls (but residing in individual workers); then those can be
>> accessed from the map/reduce calls.
>>
>> We'd be making some assumptions, as you said, such as: an RDD partition is
>> statically located and can't move from one worker to another unless the
>> worker crashes.
>>
>> On Mon, Apr 28, 2014 at 1:35 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> On Mon, Apr 28, 2014 at 9:30 AM, Sung Hwan Chung <coded...@cs.stanford.edu> wrote:
>>>
>>>> Actually, I do not know how to do something like this or whether this
>>>> is possible - thus my suggestive statement.
>>>>
>>>> Can you already declare persistent memory objects per worker? I tried
>>>> something like constructing a singleton object within map functions, but
>>>> that didn't work, as it seemed to actually serialize singletons and pass
>>>> them back and forth in a weird manner.
>>>>
>>> Does it need to be persistent across operations, or just persist for the
>>> lifetime of processing of one partition in one mapPartition? The latter is
>>> quite easy and might give most of the speedup.
>>>
>>> Maybe that's 'enough', even if it means you re-cache values several
>>> times in a repeated iterative computation. It would certainly avoid
>>> managing a lot of complexity in trying to keep that state alive remotely
>>> across operations. I'd also be interested if there is any reliable way to
>>> do that, though it seems hard, since it means you embed assumptions about
>>> where particular data is going to be processed.
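
For reference, here is a minimal sketch of the per-executor singleton pattern discussed above, using a thread-safe TrieMap to address the thread-safety caveat Ian raised. The names (WorkerState, iterate) and the update rule are purely illustrative, and the sketch assumes what the thread assumes: the executor JVM stays alive across iterations, and partitions keep running on the same host unless a worker fails.

    import scala.collection.concurrent.TrieMap
    import org.apache.spark.rdd.RDD

    // Hypothetical per-executor cache. Because this is a singleton object,
    // it is initialized once per executor JVM and is NOT serialized with the
    // closure, so its contents survive from one iteration to the next as
    // long as the executor stays alive and the partition stays on that host.
    object WorkerState {
      // TrieMap is thread-safe, so concurrent tasks on one executor can share it.
      val cache = TrieMap.empty[Long, Double]
    }

    // Hypothetical iterative step: reuse last iteration's value when present.
    def iterate(data: RDD[(Long, Double)]): RDD[(Long, Double)] =
      data.mapPartitions { iter =>
        iter.map { case (key, value) =>
          val previous = WorkerState.cache.getOrElse(key, 0.0)
          val updated  = 0.5 * previous + 0.5 * value   // toy update rule
          WorkerState.cache.put(key, updated)
          (key, updated)
        }
      }

If a partition does move (e.g. after a worker crash), the cache entry is simply missing and the code falls back to the default value, which matches the "assumptions about where particular data is going to be processed" caveat in Sean's reply.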