On Tue, Sep 9, 2014 at 10:07 AM, Boxian Dong <box...@indoo.rs> wrote:
> I am currently working on a machine learning project which requires the RDDs'
> content to be updated (mostly only partially) during each iteration. Because the
> program will be converted directly from "traditional" object-oriented Python
> code, the content of the RDD will be modified in the mapping function. To test
> the functionality and memory usage, I wrote a test program:
>
>     class TestClass(object):
>         def __init__(self):
>             self.data = []
>
>         def setup(self):
>             self.data = range(20000)
>             return self
>
>         def addNumber(self, number):
>             length = len(self.data)
>             for i in range(length):
>                 self.data[i] += number
>             return self
>
>         def sumUp(self):
>             total = 0
>             for n in self.data:
>                 total += n
>             return total
>
> and the Spark main program:
>
>     origData = []
>     for i in range(50):
>         origData.append((i, TestClass()))
>     # create the RDD and cache it
>     rddData = sc.parallelize(origData).mapValues(lambda v: v.setup())
>     rddData.cache()
>
>     # modify the content of the RDD in the map function
>     scD = rddData
>     for i in range(100):
>         scD = scD.mapValues(lambda v: v.addNumber(10))
>
>     scDSum = scD.map(lambda (k, v): v.sumUp())
>     v = scDSum.reduce(lambda a, b: a + b)
>
>     print " ------ after the transformation, the value is ", v
>
>     scDSum = rddData.map(lambda (k, v): v.sumUp())
>     v = scDSum.reduce(lambda a, b: a + b)
>
>     print " ------ after the transformation, the cached value is ", v
>
>   - Judging from the results, it seems that when the RDD is cached, direct
> modification does not affect its content.
>   - Monitoring the memory usage, it seems that the memory is not duplicated
> during each (narrow-dependency) RDD transformation (or maybe I am wrong).
>
> Therefore, my questions are:
>   - How does the cache work? Does it make a separate copy of the data?
>   - How is memory managed in the map function (with narrow dependencies)? Is
> the entire RDD first duplicated, modified, and then assigned to the new RDD,
> with the old RDD deleted from memory afterward? Or does the new RDD reuse the
> memory of the old RDD without duplicating/copying it?

I'll try to answer some of your questions:

The RDD is cached in the JVM (after being serialized by pickle). The Python
worker reads the serialized data from a socket, deserializes it into Python
objects, calls the mapper or reducer on them, and finally sends the results
back to the JVM via the socket. The Python process only holds a batch of
objects at a time, so its memory usage stays smaller than that of the whole
partition.
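
As a rough illustration of why in-place changes cannot reach the cache (this is
plain pickle, not the actual PySpark internals): every read from the cache
produces a fresh deserialized copy, so mutating that copy never touches the
cached bytes:

    import pickle

    # hypothetical stand-in for one cached record; not Spark code
    obj = {"data": range(5)}
    cached_bytes = pickle.dumps(obj)    # roughly what the JVM keeps in the cache

    copy1 = pickle.loads(cached_bytes)  # what a Python worker sees
    copy1["data"][0] += 10              # "in-place" modification

    copy2 = pickle.loads(cached_bytes)  # the next read from the cache
    print copy2["data"][0]              # still 0 -- the mutation is not visible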

The JVM cache is created the first time the data is computed, i.e. after the
first action that touches it. When you process the data a second time (or
more), it is read back from the JVM cache.
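
For example (only a sketch, reusing the rddData from your code above), cache()
itself is lazy; the data actually lands in the cache the first time an action
computes it:

    rddData = sc.parallelize(origData).mapValues(lambda v: v.setup())
    rddData.cache()    # only marks the RDD as cacheable, nothing is stored yet
    rddData.count()    # first action: partitions are computed and cached in the JVM
    rddData.count()    # later actions read the serialized data back from the cache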

RDDs are read-only; you cannot modify them, and each transformation creates a
new RDD. Inside a map function, the objects from the RDD are thrown away once
they have been processed, so any in-place modification of them will be lost.
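
If you want the updates to stick, have the mapper return new values instead of
mutating self.data. A minimal sketch that keeps the shape of your test program
(plain lists stand in for the TestClass objects):

    # (i, list) pairs stand in for the (i, TestClass()) pairs of the original test
    origData = [(i, range(20000)) for i in range(50)]
    rddData = sc.parallelize(origData).cache()

    scD = rddData
    for i in range(100):
        # build new lists instead of modifying the old ones in place
        scD = scD.mapValues(lambda data: [x + 10 for x in data])

    v = scD.values().map(sum).reduce(lambda a, b: a + b)
    print " ------ after the transformations, the value is ", v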

>   - If the newly generated RDDs directly reuse the memory of the old RDDs (with
> narrow dependencies), why do the cached RDDs still keep the old content? Are
> cached RDDs treated differently from uncached RDDs in memory management?

No two RDDs ever share memory; they are completely separate.
