Re: Model parallelism with RDD

2015-07-17 Thread Shivaram Venkataraman
*To:* Ulanov, Alexander *Cc:* shiva...@eecs.berkeley.edu; dev@spark.apache.org *Subject:* Re: Model parallelism with RDD Yeah I can see that being the case -- caching implies creating objects that will be stored in memory. So there is a trade-off between storing data in memory but having to garbage collect it later vs. recomputing the data

RE: Model parallelism with RDD

2015-07-17 Thread Ulanov, Alexander
Cc: shiva...@eecs.berkeley.edu; dev@spark.apache.org Subject: Re: Model parallelism with RDD You can also use checkpoint to truncate the lineage, and the data can be persisted to HDFS. Fundamentally, the state of the RDD needs to be saved to memory or disk if you don't want to repeat the computation
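A minimal sketch of the checkpointing pattern described above, assuming `sc` is an existing SparkContext; the `step` update function, the sizes, and the HDFS path are hypothetical stand-ins:

```scala
import org.apache.spark.rdd.RDD

// Checkpointing truncates the lineage: the RDD's contents are written to
// the checkpoint directory (typically on HDFS), and later reads come from
// there instead of replaying the whole chain of transformations.
sc.setCheckpointDir("hdfs:///tmp/rdd-checkpoints")  // assumed path

var weights: RDD[Double] = sc.parallelize(Seq.fill(1000)(0.0), numSlices = 8)
val step = (w: RDD[Double]) => w.map(_ + 0.1)       // stand-in for a real update

for (i <- 1 to 100) {
  weights = step(weights)
  if (i % 10 == 0) {
    weights.cache()
    weights.checkpoint()  // marks the RDD for checkpointing...
    weights.count()       // ...which actually happens on the next action
  }
}
```

Without the periodic checkpoint, the lineage grows by one `map` per iteration, and any recomputation replays the full chain.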

RE: Model parallelism with RDD

2015-07-16 Thread Ulanov, Alexander
Does spark.sql.unsafe.enabled=true remove the GC overhead when persisting/unpersisting the DataFrame? Best regards, Alexander From: Ulanov, Alexander Sent: Monday, July 13, 2015 11:15 AM To: shiva...@eecs.berkeley.edu Cc: dev@spark.apache.org Subject: RE: Model parallelism with RDD Below are the average
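For reference, a sketch of how that flag could be set when building the context; the flag name comes from the message above, and the rest is an assumed Spark 1.4-era setup:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Tungsten's "unsafe" mode keeps DataFrame rows in a compact binary
// format managed by Spark rather than as many small JVM objects, which
// is why it can reduce GC pressure when caching/uncaching DataFrames.
val conf = new SparkConf()
  .setAppName("unsafe-cache-test")
  .set("spark.sql.unsafe.enabled", "true")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
```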

RE: Model parallelism with RDD

2015-07-13 Thread Ulanov, Alexander
} println("Avg iteration time: " + avgTime / numIterations) Best regards, Alexander From: Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] Sent: Friday, July 10, 2015 10:04 PM To: Ulanov, Alexander Cc: shiva...@eecs.berkeley.edu; dev@spark.apache.org Subject: Re: Model parallelism with RDD

Model parallelism with RDD

2015-07-10 Thread Ulanov, Alexander
Hi, I am interested in how scalable model parallelism can be within Spark. Suppose the model contains N weights of type Double, and N is so large that it does not fit into the memory of a single node. So we can store the model in an RDD[Double] across several nodes. To train the model, one needs
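The setup described above could be sketched as follows; only the idea of keeping the weights in a partitioned RDD[Double] comes from the message, while the names, sizes, and the gradient placeholder are hypothetical:

```scala
import org.apache.spark.rdd.RDD

// Weights spread over many partitions because N doubles exceed one
// node's memory; `sc` is an existing SparkContext.
val n = 1000000000L
var weights: RDD[Double] =
  sc.range(0L, n, step = 1L, numSlices = 512).map(_ => 0.0)

// Stand-in gradient: a real implementation would combine the weights
// with (broadcast or co-partitioned) training data.
def gradients(w: RDD[Double]): RDD[Double] = w.map(wi => 0.1 * wi - 0.01)

// One update step: zip each weight with its gradient, element-wise.
val lr = 0.5
weights = weights.zip(gradients(weights)).map { case (w, g) => w - lr * g }
```

Note that `zip` requires both RDDs to have the same number of elements per partition, which holds here because the gradients are derived from `weights` by a `map` that preserves partitioning.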

Re: Model parallelism with RDD

2015-07-10 Thread Shivaram Venkataraman
I think you need to do `newRDD.cache()` and `newRDD.count` before you do oldRDD.unpersist(true) -- Otherwise it might be recomputing all the previous iterations each time. Thanks Shivaram On Fri, Jul 10, 2015 at 7:44 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, I am interested
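A sketch of the loop being suggested here, with `step` as a hypothetical per-iteration transformation and `sc` an existing SparkContext:

```scala
import org.apache.spark.rdd.RDD

var oldRDD: RDD[Double] = sc.parallelize(Seq.fill(1000)(0.0), 8).cache()
val step = (w: RDD[Double]) => w.map(_ * 0.9)  // stand-in update
val numIterations = 50

for (_ <- 1 to numIterations) {
  val newRDD = step(oldRDD)
  newRDD.cache()
  newRDD.count()                     // force materialization while oldRDD is still cached
  oldRDD.unpersist(blocking = true)  // safe now: later reads of newRDD hit its cache
  oldRDD = newRDD
}
```

The `count()` is the important part: unpersisting the old RDD before the new one is materialized would force each later action to recompute the entire chain of iterations from scratch.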

Re: Model parallelism with RDD

2015-07-10 Thread Shivaram Venkataraman
Yeah I can see that being the case -- caching implies creating objects that will be stored in memory. So there is a trade-off between storing data in memory but having to garbage collect it later vs. recomputing the data. Shivaram On Fri, Jul 10, 2015 at 9:49 PM, Ulanov, Alexander
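One standard way to lean this trade-off toward caching while softening the GC cost (a general Spark option, not something proposed in this thread) is serialized storage; `weights` here is a hypothetical cached RDD:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val weights: RDD[Double] = sc.parallelize(Seq.fill(1000)(0.0), 8)

// MEMORY_ONLY_SER stores each cached partition as one serialized byte
// buffer instead of many small Java objects, so the garbage collector
// has far fewer objects to trace, at the price of deserializing on read.
weights.persist(StorageLevel.MEMORY_ONLY_SER)
```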