Hey Todd,

This approach violates the normal semantics of RDD transformations, as you
point out. You identified some of the issues already, and there are others.
For instance, say you cache originalRDD and some of the partitions end up
in memory while others end up on disk. The ones that end up in memory will
be mutated in place when you create transformedRDD; the ones serialized to
disk won't actually be changed (because they get copied into memory from
the serialized on-disk data). So you could end up with originalRDD only
partially mutated.
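
Rough sketch of that scenario (assuming a shell-style SparkContext sc; the
case class, field, and storage level are just for illustration):

  import org.apache.spark.storage.StorageLevel

  case class SomeCaseClass(var prop1: Int = 0)

  val originalRDD = sc.parallelize(1 to 1000000)
    .map(_ => SomeCaseClass())
    .persist(StorageLevel.MEMORY_AND_DISK)
  originalRDD.count()  // materialize the cache; partitions may spill to disk

  // in-place mutation, as in your Option 2
  val transformedRDD = originalRDD.map { item => item.prop1 = 42; item }
  transformedRDD.count()

  // In-memory partitions of originalRDD now see prop1 == 42, while
  // partitions read back from disk still see prop1 == 0.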

Also, in the case of failures your map might run twice (e.g. run partially
once, fail, then get re-run and succeed). So if your mutation relied on the
current state of the object, for example, you could end up with unexpected
behavior.
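
For instance, with the same hypothetical class as above, a state-dependent
mutation like this could get applied twice to some objects if a task is
retried over a cached in-memory partition:

  val incrementedRDD = originalRDD.map { item =>
    item.prop1 += 1  // depends on the object's current state
    item
  }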

We'll probably never "disallow" this in Spark because we can't really
control what you do inside the function. But I'd be careful using this
approach...

- Patrick


On Sat, Apr 26, 2014 at 5:59 AM, Lisonbee, Todd <todd.lison...@intel.com> wrote:

> For example,
>
> val originalRDD: RDD[SomeCaseClass] = ...
>
> // Option 1: objects are copied, setting prop1 in the process
> val transformedRDD = originalRDD.map( item => item.copy(prop1 =
> calculation()) )
>
> // Option 2: objects are re-used and modified
> val transformedRDD =
>   originalRDD.map { item => item.prop1 = calculation(); item }
>
> I did a couple of small tests with option 2 and noticed less time was
> spent in garbage collection.  It didn't add up to much, but with a large
> enough data set it could make a difference.  Also, it seems that less
> memory would be used.
>
> Potential gotchas:
>
> - Objects in originalRDD are being modified, so you can't expect them to
> remain unchanged
> - You also can't rely on objects in originalRDD having the new value
> because originalRDD might be re-calculated
> - If originalRDD was a PairRDD, and you modified the keys, it could cause
> issues
> - more?
>
> Other than the potential gotchas, is there any reason not to reuse objects
> across RDDs?  Is it a recommended practice for reducing memory usage and
> garbage collection overhead, or not?
>
> Is it safe to do this in code you expect to work on future versions of
> Spark?
>
> Thanks in advance,
>
> Todd
>
