Hi all,

I am trying to persist a Spark RDD in which the elements of each partition all share access to a single, large object. However, this object seems to get stored in memory several times. Reducing my problem down to the toy case of just a single partition with only 200 elements:
    val nElements = 200
    class Elem(val s: Array[Int])

    val rdd = sc.parallelize(Seq(1)).mapPartitions( _ => {
      val sharedArray = Array.ofDim[Int](10000000) // Should require ~40MB
      (1 to nElements).toIterator.map(i => new Elem(sharedArray))
    }).cache()

    rdd.count() // force computation

This consumes the expected amount of memory, as seen in the logs:

    storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 38.2 MB, free 5.7 GB)

However, 200 is the maximum number of elements for which this is so. Setting nElements = 201 yields:

    storage.MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 76.7 MB, free 5.7 GB)

What causes this? Where does this magic number 200 come from, and how can I increase it?

Thanks for your help!

- Luke
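P.S. In case it is useful, here is a minimal standalone sketch (plain JVM, no Spark; the object and helper names are just for illustration) confirming that wrappers like the ones above genuinely share one ~40 MB array on the heap, whatever nElements is:

    // Hypothetical standalone check: build 201 wrappers around one shared
    // array and measure how much the heap actually grows.
    object SharedArrayCheck {
      class Elem(val s: Array[Int])

      // Rough used-heap reading; GC first so the number is meaningful.
      def usedHeap(): Long = {
        System.gc()
        val rt = Runtime.getRuntime
        rt.totalMemory() - rt.freeMemory()
      }

      def main(args: Array[String]): Unit = {
        val before = usedHeap()
        val sharedArray = Array.ofDim[Int](10000000) // ~40 MB of Ints
        val elems = (1 to 201).map(_ => new Elem(sharedArray)).toArray
        val after = usedHeap()
        // Every wrapper references the same array object.
        println(elems.forall(_.s eq sharedArray))               // true
        println(f"Heap growth: ${(after - before) / 1e6}%.1f MB") // ~40 MB, not ~80 MB
      }
    }

Since the real heap cost stays around 40 MB either way, the jump at 201 elements presumably comes from how the block size is estimated rather than from an actual second copy of the array.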