I am currently working on a machine learning project which requires the RDDs'
content to be (usually only partially) updated during each iteration. Because
the program will be converted directly from "traditional" object-oriented
Python code, the content of the RDD is modified inside the mapping function.
To test the functionality and memory behaviour, I wrote a small test program:

    class TestClass(object):
        def __init__(self):
            self.data = []

        def setup(self):
            self.data = range(20000)
            return self

        def addNumber(self, number):
            length = len(self.data)
            for i in range(length):
                self.data[i] += number
            return self

        def sumUp(self):
            total = 0
            for n in self.data:
                total += n
            return total

and the Spark driver code:

    # sc is an existing SparkContext (e.g. from the PySpark shell)
    origData = []
    for i in range(50):
        origData.append((i, TestClass()))

    # create the RDD and cache it
    rddData = sc.parallelize(origData).mapValues(lambda v: v.setup())
    rddData.cache()

    # modify the content of the RDD inside the map function
    scD = rddData
    for i in range(100):
        scD = scD.mapValues(lambda v: v.addNumber(10))

    scDSum = scD.map(lambda (k, v): v.sumUp())
    v = scDSum.reduce(lambda a, b: a + b)

    print " ------ after the transformation, the value is ", v

    scDSum = rddData.map(lambda (k, v): v.sumUp())
    v = scDSum.reduce(lambda a, b: a + b)

    print " ------ after the transformation, the cached value is ", v

  - Judging from the results, it seems that when the RDD is cached, the
in-place modification does not affect its content (see the extra check
sketched below).
  - Monitoring the memory usage, it seems that the memory is not duplicated
during each narrow-dependency RDD transformation (or I may be wrong).
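
One thing I was not certain about is whether the cached RDD was actually
materialized before the chain of mapValues calls, since cache() is lazy. This
is a minimal sketch of the extra check I had in mind (assuming the same sc and
rddData as above; is_cached and getStorageLevel() may not be available in
every PySpark version):

    # cache() only marks the RDD; nothing is stored until an action runs.
    # Forcing a count() makes sure the cached partitions exist before the
    # mapValues chain, so the later sumUp() on rddData really reads cached data.
    rddData.count()

    # ask Spark what it reports about the cached RDD
    print " ------ is_cached: ", rddData.is_cached
    print " ------ storage level: ", rddData.getStorageLevel()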

Therefore, my questions are:
  - How does the cache work? Does it keep a separate copy of the data?
  - How is memory managed in the map function (with a narrow dependency)?
Is the entire RDD first duplicated, modified and then assigned to the new
RDD, with the old RDD deleted from memory afterwards? Or does the new RDD
reuse the memory of the old RDD, without any duplication/copy?
  - If the newly generated RDD directly reuses the memory of the old RDD (in
a narrow dependency), why does the cached RDD still keep the old content?
Are cached RDDs treated differently from uncached RDDs in memory management?
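
In case it helps with answering: if modifying the objects in place turns out
to be unsafe, the alternative I am considering is to copy each object inside
mapValues instead of mutating the (possibly cached) original. This is only a
minimal sketch of that variant, using copy.deepcopy and not yet tested for
memory usage:

    import copy

    def addNumberToCopy(v, number):
        # work on a fresh copy so the original (cached) object is never
        # modified in place
        w = copy.deepcopy(v)
        return w.addNumber(number)

    scD = rddData
    for i in range(100):
        scD = scD.mapValues(lambda v: addNumberToCopy(v, 10))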





