Re: Iterating on RDDs

2015-02-27 Thread Vijayasarathy Kannan
As you suggested, I tried to save the grouped RDD and persisted it in
memory before the iterations begin. The performance seems to be much better
now.

My previous comment that the run times doubled was from a wrong observation.

Thanks.


On Fri, Feb 27, 2015 at 10:27 AM, Vijayasarathy Kannan kvi...@vt.edu
wrote:

 Thanks.

 I tried persist() on the RDD. The runtimes appear to have doubled now
 (without persist() it was ~7s per iteration and now its ~15s). I am running
 standalone Spark on a 8-core machine.
 Any thoughts on why the increase in runtime?

 On Thu, Feb 26, 2015 at 4:27 PM, Imran Rashid iras...@cloudera.com
 wrote:


 val grouped = R.groupBy[VertexId](G).persist(StorageLeve.MEMORY_ONLY_SER)
 // or whatever persistence makes more sense for you ...
 while(true) {
   val res = grouped.flatMap(F)
   res.collect.foreach(func)
   if(criteria)
  break
 }

 On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan kvi...@vt.edu
 wrote:

 Hi,

 I have the following use case.

 (1) I have an RDD of edges of a graph (say R).
 (2) do a groupBy on R (by say source vertex) and call a function F on
 each group.
 (3) collect the results from Fs and do some computation
 (4) repeat the above steps until some criteria is met

 In (2), the groups are always going to be the same (since R is grouped
 by source vertex).

 Question:
 Is R distributed every iteration (when in (2)) or is it distributed only
 once when it is created?

 A sample code snippet is below.

 while(true) {
   val res = R.groupBy[VertexId](G).flatMap(F)
   res.collect.foreach(func)
   if(criteria)
  break
 }

 Since the groups remain the same, what is the best way to go about
 implementing the above logic?






Re: Iterating on RDDs

2015-02-27 Thread Vijayasarathy Kannan
Thanks.

I tried persist() on the RDD. The runtimes appear to have doubled now
(without persist() it was ~7s per iteration and now its ~15s). I am running
standalone Spark on a 8-core machine.
Any thoughts on why the increase in runtime?

On Thu, Feb 26, 2015 at 4:27 PM, Imran Rashid iras...@cloudera.com wrote:


 val grouped = R.groupBy[VertexId](G).persist(StorageLeve.MEMORY_ONLY_SER)
 // or whatever persistence makes more sense for you ...
 while(true) {
   val res = grouped.flatMap(F)
   res.collect.foreach(func)
   if(criteria)
  break
 }

 On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan kvi...@vt.edu
 wrote:

 Hi,

 I have the following use case.

 (1) I have an RDD of edges of a graph (say R).
 (2) do a groupBy on R (by say source vertex) and call a function F on
 each group.
 (3) collect the results from Fs and do some computation
 (4) repeat the above steps until some criteria is met

 In (2), the groups are always going to be the same (since R is grouped by
 source vertex).

 Question:
 Is R distributed every iteration (when in (2)) or is it distributed only
 once when it is created?

 A sample code snippet is below.

 while(true) {
   val res = R.groupBy[VertexId](G).flatMap(F)
   res.collect.foreach(func)
   if(criteria)
  break
 }

 Since the groups remain the same, what is the best way to go about
 implementing the above logic?





Re: Iterating on RDDs

2015-02-26 Thread Imran Rashid
val grouped = R.groupBy[VertexId](G).persist(StorageLeve.MEMORY_ONLY_SER)
// or whatever persistence makes more sense for you ...
while(true) {
  val res = grouped.flatMap(F)
  res.collect.foreach(func)
  if(criteria)
 break
}

On Thu, Feb 26, 2015 at 10:56 AM, Vijayasarathy Kannan kvi...@vt.edu
wrote:

 Hi,

 I have the following use case.

 (1) I have an RDD of edges of a graph (say R).
 (2) do a groupBy on R (by say source vertex) and call a function F on each
 group.
 (3) collect the results from Fs and do some computation
 (4) repeat the above steps until some criteria is met

 In (2), the groups are always going to be the same (since R is grouped by
 source vertex).

 Question:
 Is R distributed every iteration (when in (2)) or is it distributed only
 once when it is created?

 A sample code snippet is below.

 while(true) {
   val res = R.groupBy[VertexId](G).flatMap(F)
   res.collect.foreach(func)
   if(criteria)
  break
 }

 Since the groups remain the same, what is the best way to go about
 implementing the above logic?