Hey Todd,

This approach violates the normal semantics of RDD transformations, as you
point out. You already identified some issues, and there are others. For
instance, say you cache originalRDD and some of its partitions end up in
memory while others end up on disk. The partitions in memory will be mutated
in place when you create transformedRDD, but the partitions serialized to
disk won't actually be changed (because they are deserialized into fresh
copies in memory from the on-disk data). So you could end up with
originalRDD being only partially mutated.
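
To make that first hazard concrete, here is a rough sketch (assuming a
spark-shell session so that sc is in scope; the Rec class and the
MEMORY_AND_DISK storage level are assumptions on my part, while prop1 is
borrowed from your example):

import org.apache.spark.storage.StorageLevel

// Mutable record, purely for illustration.
case class Rec(var prop1: Int)

val originalRDD = sc.parallelize(1 to 1000000).map(_ => Rec(0))

// Partitions that don't fit in memory spill to disk in serialized form.
originalRDD.persist(StorageLevel.MEMORY_AND_DISK)
originalRDD.count()   // materialize the cache

// Mutating map: objects in in-memory partitions are changed in place, while
// disk-backed partitions are deserialized into fresh copies before the map
// sees them.
val transformedRDD = originalRDD.map { r => r.prop1 = 42; r }
transformedRDD.count()

// Depending on where each partition was cached, originalRDD can now contain
// a mix of prop1 = 0 and prop1 = 42.
originalRDD.map(_.prop1).distinct().collect()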

Also, in the case of failures your map might run twice (e.g. run partially
once, fail, then get re-run and succeed). So if your mutation relied on, say,
the current state of the object, it could end up having unexpected behavior.
We'll probably never "disallow" this in Spark because we can't really control
what you do inside of the function. But I'd be careful using this approach...

- Patrick

On Sat, Apr 26, 2014 at 5:59 AM, Lisonbee, Todd <todd.lison...@intel.com> wrote:

> For example,
>
> val originalRDD: RDD[SomeCaseClass] = ...
>
> // Option 1: objects are copied, setting prop1 in the process
> val transformedRDD = originalRDD.map( item => item.copy(prop1 = calculation()) )
>
> // Option 2: objects are re-used and modified in place
> val transformedRDD = originalRDD.map( item => { item.prop1 = calculation(); item } )
>
> I did a couple of small tests with option 2 and noticed less time was
> spent in garbage collection. It didn't add up to much, but with a large
> enough data set it would make a difference. It also seems that less
> memory would be used.
>
> Potential gotchas:
>
> - Objects in originalRDD are being modified, so you can't expect them to
> remain unchanged
> - You also can't rely on objects in originalRDD having the new value,
> because originalRDD might be re-calculated
> - If originalRDD were a PairRDD and you modified the keys, it could cause
> issues
> - more?
>
> Other than the potential gotchas, is there any reason not to reuse objects
> across RDDs? Is it a recommended practice for reducing memory usage and
> garbage collection, or not?
>
> Is it safe to do this in code you expect to work on future versions of
> Spark?
>
> Thanks in advance,
>
> Todd
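
To illustrate the failure/re-run point with code, here is a similar sketch
(again assuming sc is in scope; the Counter class is made up, and prop1 is
borrowed from the example above):

// Mutable record, purely for illustration.
case class Counter(var prop1: Int)

val originalRDD = sc.parallelize(1 to 100).map(_ => Counter(0)).cache()
originalRDD.count()   // materialize the cache

// The new value depends on the current value, so applying the mutation twice
// does not give the same result as applying it once.
val transformedRDD = originalRDD.map { c => c.prop1 = c.prop1 + 1; c }
transformedRDD.count()

// If a task mutates part of a cached in-memory partition, fails, and is
// re-run, the re-run sees objects that already have prop1 = 1 and bumps them
// to 2. The copy-based Option 1 (c.copy(prop1 = c.prop1 + 1)) leaves the
// cached objects untouched, so a re-run stays idempotent.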