Re: problem about broadcast variable in iteration

2014-05-29 Thread randylu
Hi Andrew, thanks for your reply.
In fact, I am already calling unpersist(), but it has no effect.
One reason I chose version 1.0.0 is precisely that it provides the
unpersist() interface.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p6519.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: problem about broadcast variable in iteration

2014-05-25 Thread Andrew Ash
Hi Randy,

In Spark 1.0 there was a lot of work done to allow unpersisting data that's
no longer needed.  See the below pull request.

Try running kvGlobal.unpersist() on line 11 before the re-broadcast of the
next variable to see if you can cut the dependency there.

https://github.com/apache/spark/pull/126
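
The unpersist() suggestion above could look roughly like the sketch below. This is a minimal illustration, not Randy's actual job: the sample data, the bodies of doSomething and updateKV, the reduce function, and the local[2] master are all assumptions made up so that the loop compiles and runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.0)
import org.apache.spark.rdd.RDD

object UnpersistSketch {
  // Hypothetical stand-in for doSomething: add the looked-up value to each record.
  def doSomething(t: (String, Int), kv: Map[String, Int]): (String, Int) =
    (t._1, t._2 + kv.getOrElse(t._1, 0))

  // Hypothetical stand-in for updateKV: the reduced pairs become the new map.
  def updateKV(tmp: Array[(String, Int)]): Map[String, Int] = tmp.toMap

  def run(sc: SparkContext, n: Int): Map[String, Int] = {
    // Assumed sample data; the thread elides the real inputs.
    var rdd2: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    var kv = Map("a" -> 1, "b" -> 1)
    var kvGlobal = sc.broadcast(kv)
    for (i <- 0 until n) {
      val current = kvGlobal // immutable handle for the closure to capture
      val rdd1 = rdd2.map(t => doSomething(t, current.value)).cache()
      val tmp = rdd1.reduceByKey(_ + _).collect() // action: computes and caches rdd1
      kv = updateKV(tmp)
      kvGlobal.unpersist()        // drop the executors' copies of the old broadcast
      kvGlobal = sc.broadcast(kv) // re-broadcast the updated kv
      rdd2 = rdd1
    }
    kv
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("unpersist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc, 3)) finally sc.stop()
  }
}
```

unpersist() only removes the copies held on the executors, so if a cached partition is lost and has to be recomputed, the old value is expected to be sent again from the driver.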

Alternatively, it sounds like your algorithm needs some additional state to
join against to produce each successive iteration of RDD.  Have you
considered storing that data in an RDD rather than a broadcast variable?
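
A rough sketch of that RDD-based alternative, under made-up assumptions: the per-key state lives in a pair RDD called state (a hypothetical name), the update is a simple sum, and the sample data and local[2] master are placeholders. The join takes the place of the lookup in kvGlobal.value.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.0)
import org.apache.spark.rdd.RDD

object JoinStateSketch {
  def run(sc: SparkContext, n: Int): Array[(String, Int)] = {
    // Assumed sample data and initial state, keyed the same way.
    var data: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    var state: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 1)))
    for (i <- 0 until n) {
      // Inner join: records whose key is missing from state are dropped.
      val next = data.join(state).mapValues { case (v, s) => v + s }.cache()
      state = next.reduceByKey(_ + _) // new per-key state, kept as an RDD
      data = next
    }
    data.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("join-state-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try run(sc, 3).foreach(println) finally sc.stop()
  }
}
```

Nothing is collected to the driver or broadcast between iterations here, at the cost of a shuffle for each join.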

Andrew


On Wed, May 7, 2014 at 10:02 PM, randylu  wrote:

> But when I move the broadcast variable outside the for loop, it works well
> (setting aside the memory issue you pointed out):
>  1  var rdd1 = ...
>  2  var rdd2 = ...
>  3  var kv = ...
>  4  var kvGlobal = sc.broadcast(kv)   // broadcast kv
>  5  for (i <- 0 until n) {
>  6    rdd1 = rdd2.map {
>  7      case t => doSomething(t, kvGlobal.value)
>  8    }.cache()
>  9    var tmp = rdd1.reduceByKey().collect()
> 10    kv = updateKV(tmp)              // update kv for each iteration
> 11    kvGlobal = sc.broadcast(kv)     // broadcast kv
> 12    rdd2 = rdd1
> 13  }
> 14  rdd2.saveAsTextFile()


Re: problem about broadcast variable in iteration

2014-05-15 Thread randylu
But when I move the broadcast variable outside the for loop, it works well
(setting aside the memory issue you pointed out):
 1  var rdd1 = ...
 2  var rdd2 = ...
 3  var kv = ...
 4  var kvGlobal = sc.broadcast(kv)   // broadcast kv
 5  for (i <- 0 until n) {
 6    rdd1 = rdd2.map {
 7      case t => doSomething(t, kvGlobal.value)
 8    }.cache()
 9    var tmp = rdd1.reduceByKey().collect()
10    kv = updateKV(tmp)              // update kv for each iteration
11    kvGlobal = sc.broadcast(kv)     // broadcast kv
12    rdd2 = rdd1
13  }
14  rdd2.saveAsTextFile()





Re: problem about broadcast variable in iteration

2014-05-15 Thread randylu
rdd1 is cached, but that has no effect:
 1  var rdd1 = ...
 2  var rdd2 = ...
 3  var kv = ...
 4  for (i <- 0 until n) {
 5    var kvGlobal = sc.broadcast(kv)   // broadcast kv
 6    rdd1 = rdd2.map {
 7      case t => doSomething(t, kvGlobal.value)
 8    }.cache()
 9    var tmp = rdd1.reduceByKey().collect()
10    kv = updateKV(tmp)                // update kv for each iteration
11    rdd2 = rdd1
12  }
13  rdd2.saveAsTextFile()





Re: problem about broadcast variable in iteration

2014-05-15 Thread Earthson
Is the RDD not cached?

Because recomputation may be required, every broadcast object is included in
the dependencies of the RDDs. This may also cause memory issues (if n and kv
are too large in your case).
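
One general way to keep an RDD's dependency chain from growing across iterations is checkpointing. The sketch below only shows the mechanics; whether it stops the repeated broadcast reads in this job is not confirmed anywhere in this thread. The checkpoint directory, the sample data, and the update rule are all assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointSketch {
  def run(sc: SparkContext, n: Int, checkpointDir: String): Int = {
    sc.setCheckpointDir(checkpointDir) // assumed writable directory
    var rdd: RDD[Int] = sc.parallelize(1 to 4) // assumed sample data
    var offset = 1                             // plays the role of kv
    for (i <- 0 until n) {
      val bc = sc.broadcast(offset)
      val next = rdd.map(_ + bc.value).cache()
      next.checkpoint()           // must be requested before the first action on next
      offset = next.reduce(_ + _) // action: computes next, then the checkpoint is written
      rdd = next                  // later iterations can start from the checkpointed data
    }
    offset
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val dir = java.nio.file.Files.createTempDirectory("ckpt").toString
    try println(run(sc, 3, dir)) finally sc.stop()
  }
}
```

Caching before checkpointing avoids computing the RDD a second time when the checkpoint files are written.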





problem about broadcast variable in iteration

2014-05-15 Thread randylu
My code is as follows:
 1  var rdd1 = ...
 2  var rdd2 = ...
 3  var kv = ...
 4  for (i <- 0 until n) {
 5    var kvGlobal = sc.broadcast(kv)   // broadcast kv
 6    rdd1 = rdd2.map {
 7      case t => doSomething(t, kvGlobal.value)
 8    }
 9    var tmp = rdd1.reduceByKey().collect()
10    kv = updateKV(tmp)                // update kv for each iteration
11    rdd2 = rdd1
12  }
13  rdd2.saveAsTextFile()
  In the 1st iteration, when line 9 is processed, each slave needs to read
broadcast_1;
  In the 2nd iteration, when line 9 is processed, each slave needs to read
broadcast_1 and broadcast_2;
  In the 3rd iteration, when line 9 is processed, each slave needs to read
broadcast_1, broadcast_2 and broadcast_3;
  ...
  Each broadcast_n corresponds to kvGlobal at a different iteration.
  Why, in the nth iteration, does each slave need to read broadcast_1 through
broadcast_n, rather than just broadcast_n?





Re: problem about broadcast variable in iteration

2014-05-10 Thread randylu
I am running Spark 1.0.0, the newest under-development version.


