Re: RDD memory questions

2014-09-12 Thread Boxian Dong
Thank you very much for your help :)






Re: RDD memory questions

2014-09-10 Thread Boxian Dong
Thank you very much for your kind help. I have a few more questions:

   - If the RDD is stored in serialized format, does that mean that whenever
the RDD is processed, it is unpacked from and packed back into the JVM, even
when they are located on the same machine?
http://apache-spark-user-list.1001560.n3.nabble.com/file/n13862/rdd_img.png 

   - Can the RDD be partially unpacked from its serialized state, or must it
be fully unpacked whenever it is touched (and, of course, packed again
afterward)?

  - When an RDD is cached, is it saved in an unserialized or a serialized
format? If it is saved in an unserialized format, is partial reading of the
RDD from the JVM into the Python runtime possible?

Thank you very much






Re: RDD memory questions

2014-09-10 Thread Davies Liu
On Wed, Sep 10, 2014 at 1:05 AM, Boxian Dong box...@indoo.rs wrote:
> Thank you very much for your kind help. I have a few more questions:
>
> - If the RDD is stored in serialized format, does that mean that whenever
> the RDD is processed, it is unpacked from and packed back into the JVM,
> even when they are located on the same machine?
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n13862/rdd_img.png

In PySpark, yes. But in Spark generally, no: in Scala you have several
choices for how to cache an RDD, serialized or not.
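
For example, a minimal sketch of that choice (assuming sc is an existing
SparkContext; the storage-level names are the Spark 1.x ones, and in PySpark
the cached blocks are pickled either way, so the distinction mainly matters
for JVM-side RDDs):

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))

# Keep the cached blocks as deserialized objects on the JVM side.
rdd.persist(StorageLevel.MEMORY_ONLY)

# Or keep them in a serialized form: more compact in memory, but it costs
# CPU on every access. An RDD can only have one storage level, so pick one:
# rdd.persist(StorageLevel.MEMORY_ONLY_SER)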

> - Can the RDD be partially unpacked from its serialized state, or must it
> be fully unpacked whenever it is touched (and, of course, packed again
> afterward)?

The items in an RDD are deserialized batch by batch, so if you call
rdd.take(), only the first small batch of items is deserialized.

The cache of an RDD is kept in the JVM; you do not need to pack the items
again after visiting them.
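
A small sketch of that behavior (assuming sc is an existing SparkContext):

rdd = sc.parallelize(range(1000000), 8).cache()
rdd.count()           # materializes the cache in the JVM

first = rdd.take(5)   # only the first batch of the first partition is
                      # shipped to Python and unpickled, not all 1,000,000 items
total = rdd.sum()     # later actions stream through the data again, batch by batch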

> - When an RDD is cached, is it saved in an unserialized or a serialized
> format? If it is saved in an unserialized format, is partial reading of the
> RDD from the JVM into the Python runtime possible?

For PySpark, they are all saved in serialized format. During a transformation
of an RDD, you can only see the current partition; you cannot access other
partitions or other RDDs.

RDDs are always read-only, so you cannot modify them at any time (all
modifications will be dropped).
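
In other words, a mapper only ever sees its own partition. A sketch
(assuming sc is an existing SparkContext):

rdd = sc.parallelize(range(10), 2)

def tag_partition(index, iterator):
    # 'iterator' yields only this partition's elements; there is no way to
    # reach into other partitions or other RDDs from here.
    return [(index, x) for x in iterator]

print(rdd.mapPartitionsWithIndex(tag_partition).collect())
# [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6), (1, 7), (1, 8), (1, 9)]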

> Thank you very much








Re: RDD memory questions

2014-09-09 Thread Davies Liu
On Tue, Sep 9, 2014 at 10:07 AM, Boxian Dong box...@indoo.rs wrote:
> I am currently working on a machine learning project which requires the
> RDDs' content to be (mostly partially) updated during each iteration.
> Because the program will be converted directly from traditional Python
> object-oriented code, the content of the RDD will be modified in the
> mapping function. To test the functionality and memory usage, I wrote a
> testing program:

> class TestClass(object):
>     def __init__(self):
>         self.data = []
>
>     def setup(self):
>         self.data = range(2)
>         return self
>
>     def addNumber(self, number):
>         length = len(self.data)
>         for i in range(length):
>             self.data[i] += number
>         return self
>
>     def sumUp(self):
>         total = 0
>         for n in self.data:
>             total += n
>         return total

> and the Spark main:

> origData = []
> for i in range(50):
>     origData.append((i, TestClass()))
> # create the RDD and cache it
> rddData = sc.parallelize(origData).mapValues(lambda v: v.setup())
> rddData.cache()
>
> # modify the content of the RDD in a map function
> scD = rddData
> for i in range(100):
>     scD = scD.mapValues(lambda v: v.addNumber(10))
>
> scDSum = scD.map(lambda (k, v): v.sumUp())
> v = scDSum.reduce(lambda a, b: a + b)
>
> print " -- after the transformation, the value is ", v
>
> scDSum = rddData.map(lambda (k, v): v.sumUp())
> v = scDSum.reduce(lambda a, b: a + b)
>
> print " -- after the transformation, the cached value is ", v

> - Judging by the results, it seems to me that when the RDD is cached, the
> direct modification doesn't affect its content.
> - By monitoring the memory usage, it seems to me that the memory is not
> duplicated during each (narrow-dependency) RDD transformation (or I am
> wrong).
>
> Therefore, my questions are:
> - How does the cache work? Does it make a separate copy of the data?
> - How is the memory managed in the MAP function (in a narrow dependency)?
> Are the entire RDDs first duplicated, modified, and then assigned to the
> new RDDs, with the old RDDs afterwards deleted from memory? Or do the new
> RDDs reuse the same memory as the old RDDs, without duplicating/copying
> the memory?

I'll try to answer some of your questions:

The RDD is cached in the JVM (after being serialized by pickle). On the
Python side, the worker reads the serialized data from a socket, deserializes
it into Python objects, calls the mapper or reducer on them, and finally
sends the results back to the JVM via the socket. The Python process only
holds a batch of objects at a time, so its memory usage will be smaller than
the whole partition.
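
One practical consequence: if the function you pass to mapPartitions is
itself lazy (a generator), the Python worker never has to materialize a whole
partition, only the current batch. A rough sketch (assuming sc is an existing
SparkContext):

def scale(iterator):
    # A generator: elements are pulled, transformed and sent back in
    # batches, so only a small number of objects live in Python at once.
    for x in iterator:
        yield x * 2

rdd = sc.parallelize(range(1000000), 16)
result = rdd.mapPartitions(scale).sum()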

The cache in the JVM is created after the first iteration over the data. So
when you process it a second time (or more), it will be read from the cache
in the JVM.

RDDs are read-only; you cannot modify them, and each transformation creates a
new RDD. During a MAP function, the objects in the RDD are thrown away after
being accessed, so any modification to them will be lost.
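
So in the test program above, mutating v inside mapValues does not touch the
cached rddData; only what the mapper returns becomes part of the new RDD. If
you want the updated state, keep (and cache) the RDD returned by the
transformation instead of relying on in-place updates. A rough, self-contained
sketch of that pattern (sc is an existing SparkContext; the names bump, orig
and updated are just for illustration):

import copy

def bump(v, n):
    # Return a new object rather than mutating the input in place.
    v2 = copy.deepcopy(v)
    v2["data"] = [x + n for x in v2["data"]]
    return v2

orig = sc.parallelize([(i, {"data": [0, 1]}) for i in range(50)]).cache()

updated = orig
for _ in range(100):
    updated = updated.mapValues(lambda v: bump(v, 10))
updated.cache()   # cache the new lineage if it will be reused

print(updated.map(lambda kv: sum(kv[1]["data"])).reduce(lambda a, b: a + b))
# reflects all of the updates
print(orig.map(lambda kv: sum(kv[1]["data"])).reduce(lambda a, b: a + b))
# the originally cached RDD is unchanged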

> - If the newly generated RDDs directly use the memory of the old RDDs (in a
> narrow dependency), why do the cached RDDs still retain the old content? Are
> cached RDDs treated differently from uncached RDDs in memory management?

No two RDDs share any memory; they are completely separate.







