I am currently working on a machine learning project which requires the RDDs' content to be (mostly partially) updated during each iteration. Because the program will be converted directly from "traditional" Python object-oriented code, the content of the RDD will be modified in the mapping function. To test the functionality and memory usage, I wrote a test program:
    class TestClass(object):
        def __init__(self):
            self.data = []

        def setup(self):
            self.data = range(20000)
            return self

        def addNumber(self, number):
            length = len(self.data)
            for i in range(length):
                self.data[i] += number
            return self

        def sumUp(self):
            total = 0
            for n in self.data:
                total += n
            return total

and the Spark main:

    origData = []
    for i in range(50):
        origData.append((i, TestClass()))

    # create the RDD and cache it
    rddData = sc.parallelize(origData).mapValues(lambda v: v.setup())
    rddData.cache()

    # modify the content of the RDD in the map function
    scD = rddData
    for i in range(100):
        scD = scD.mapValues(lambda v: v.addNumber(10))

    scDSum = scD.map(lambda (k, v): v.sumUp())
    v = scDSum.reduce(lambda a, b: a + b)
    print " ------ after the transformation, the value is ", v

    scDSum = rddData.map(lambda (k, v): v.sumUp())
    v = scDSum.reduce(lambda a, b: a + b)
    print " ------ after the transformation, the cached value is ", v

Judging by the results:

- When the RDD is cached, the direct modification does not seem to affect its content.
- Monitoring the memory usage, it seems that memory is not duplicated during each (narrow-dependency) RDD transformation (or I am wrong).

Therefore, my questions are:

- How does caching work? Does it keep a separate copy of the data?
- How is memory managed in the map function (with a narrow dependency)? Is the entire RDD first duplicated, modified, and then assigned to the new RDD, with the old RDD afterwards deleted from memory? Or does the new RDD reuse the memory of the old RDD, without any duplication/copying?
- If the newly generated RDD directly reuses the memory of the old RDD (narrow dependency), why does the cached RDD still retain the old content? Are cached RDDs treated differently from uncached RDDs in memory management?
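As a possible explanation for the first observation, here is a minimal local sketch (no Spark needed) of one plausible mechanism: PySpark ships partition data between the JVM and Python workers as pickled bytes, so a map function mutates a deserialized copy of each object rather than the cached data itself. This is an illustration under that assumption, not a definitive account of Spark's internals; the class mirrors the `TestClass` above (with `list(range(...))` so it also runs on Python 3).

```python
import pickle

class TestClass(object):
    def __init__(self):
        self.data = []
    def setup(self):
        self.data = list(range(20000))
        return self
    def addNumber(self, number):
        for i in range(len(self.data)):
            self.data[i] += number
        return self
    def sumUp(self):
        return sum(self.data)

# "cache": store the serialized form, analogous to cached RDD blocks
cached_bytes = pickle.dumps(TestClass().setup())

# "map task": deserialize a copy, mutate it, produce a new value
obj = pickle.loads(cached_bytes)
obj.addNumber(10)

# The mutated copy reflects the change...
print(obj.sumUp())

# ...but re-reading the "cache" yields the original, untouched content,
# which would match the behavior seen in the test program above.
print(pickle.loads(cached_bytes).sumUp())
```

If this model holds, the cached RDD is not "protected" in any special way; the mutation simply never reaches the stored bytes, because each task works on its own deserialized copy.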
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-memory-questions-tp13805.html Sent from the Apache Spark User List mailing list archive at Nabble.com.