Hi all,
I am trying to persist a Spark RDD in which the elements of each partition
all share access to a single, large object. However, this object seems to get
stored in memory several times. Reducing my problem to the toy case of
just a single partition with only 200 elements:

val nElements = 200
class Elem(val s: Array[Int])
val rdd = sc.parallelize(Seq(1)).mapPartitions( _ => {
    val sharedArray = Array.ofDim[Int](10000000)   // should require ~40 MB
    (1 to nElements).toIterator.map(i => new Elem(sharedArray))
}).cache()
rdd.count()   // force computation

This consumes the expected amount of memory, as seen in the logs:
storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.2 MB, free 5.7 GB)

However, 200 is the maximum number of elements for which this is so. Setting
nElements=201 yields:
storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 76.7 MB, free 5.7 GB)
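
As far as I can tell, the sharing itself is set up correctly: every Elem in the
partition is constructed with a reference to the same array instance. A quick
sanity check along these lines (an untested sketch) should confirm that:

val allShared = rdd.mapPartitions { it =>
    val elems = it.toArray
    Iterator.single(elems.forall(_.s eq elems.head.s))   // reference equality
}.collect().forall(identity)
// expected: true, i.e. all elements point at one array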

What causes this? Where does this magic number 200 come from, and how can I
increase it?
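
For what it's worth, here is a rough, untested sketch of the workaround I am
considering, where the large array lives in a broadcast variable rather than
inside the cached elements (Elem2, bigArray and sharedBroadcast are just names
for this sketch, and I have not verified that it avoids the duplication). I
would still like to understand the behaviour above, though:

val bigArray = Array.ofDim[Int](10000000)     // built once on the driver
val sharedBroadcast = sc.broadcast(bigArray)  // shipped once per executor

class Elem2(val id: Int)                      // no per-element copy of the array

val rdd2 = sc.parallelize(Seq(1)).mapPartitions( _ => {
    (1 to nElements).toIterator.map(i => new Elem2(i))
}).cache()
rdd2.count()
// code that needs the big array would read it via sharedBroadcast.value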

Thanks for your help!
- Luke


