Re: problem about broadcast variable in iteration
hi, Andrew Ash, thanks for your reply. In fact, I have already used unpersist(), but it doesn't take effect. One reason that I select 1.0.0 version is just for it providing unpersist() interface. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p6519.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: problem about broadcast variable in iteration
Hi Randy, In Spark 1.0 there was a lot of work done to allow unpersisting data that's no longer needed. See the below pull request. Try running kvGlobal.unpersist() on line 11 before the re-broadcast of the next variable to see if you can cut the dependency there. https://github.com/apache/spark/pull/126 Alternatively, it sounds like your algorithm needs some additional state to join against to produce each successive iteration of RDD. Have you considered storing that data in an RDD rather than a broadcast variable? Andrew On Wed, May 7, 2014 at 10:02 PM, randylu wrote: > But when i put broadcast variable out of for-circle, it workes well(if not > concerned about memory issue as you pointed out): > 1 var rdd1 = ... > 2 var rdd2 = ... > 3 var kv = ... > 4 var kvGlobal = sc.broadcast(kv) // broadcast kv > 5 for (i <- 0 until n) { > 6rdd1 = rdd2.map { > 7 case t => doSomething(t, kvGlobal.value) > 8}.cache() > 9var tmp = rdd1.reduceByKey().collect() > 10kv = updateKV(tmp) // update kv for > each > iteration > 11kvGlobal = sc.broadcast(kv) // broadcast kv > 12rdd2 = rdd1 > 13 } > 14 rdd2.saveAsTextFile() > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p5497.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. >
Re: problem about broadcast variable in iteration
But when i put broadcast variable out of for-circle, it workes well(if not concerned about memory issue as you pointed out): 1 var rdd1 = ... 2 var rdd2 = ... 3 var kv = ... 4 var kvGlobal = sc.broadcast(kv) // broadcast kv 5 for (i <- 0 until n) { 6rdd1 = rdd2.map { 7 case t => doSomething(t, kvGlobal.value) 8}.cache() 9var tmp = rdd1.reduceByKey().collect() 10kv = updateKV(tmp) // update kv for each iteration 11kvGlobal = sc.broadcast(kv) // broadcast kv 12rdd2 = rdd1 13 } 14 rdd2.saveAsTextFile() -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p5497.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: problem about broadcast variable in iteration
rdd1 is cached, but it has no effect: 1 var rdd1 = ... 2 var rdd2 = ... 3 var kv = ... 4 for (i <- 0 until n) { 5var kvGlobal = sc.broadcast(kv) // broadcast kv 6rdd1 = rdd2.map { 7 case t => doSomething(t, kvGlobal.value) 8}.cache() 9var tmp = rdd1.reduceByKey().collect() 10kv = updateKV(tmp) // update kv for each iteration 11rdd2 = rdd1 12 } 13 rdd2.saveAsTextFile() -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p5496.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: problem about broadcast variable in iteration
RDD is not cached? Because recomputing may be required, every broadcast object is included in the dependences of RDDs, this may also have memory issue(when n and kv is too large in your case). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p5495.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
problem about broadcast variable in iteration
My code just like follows: 1 var rdd1 = ... 2 var rdd2 = ... 3 var kv = ... 4 for (i <- 0 until n) { 5var kvGlobal = sc.broadcast(kv) // broadcast kv 6rdd1 = rdd2.map { 7 case t => doSomething(t, kvGlobal.value) 8} 9var tmp = rdd1.reduceByKey().collect() 10kv = updateKV(tmp) // update kv for each iteration 11rdd2 = rdd1 12 } 13 rdd2.saveAsTextFile() In 1st itreation, when processed line9, each slave need to read broadcast_1; In 2nd iteration, when processed line9, each slave need to read broadcast_1 and broadcast_2; In 3rd iteration, when processed line9, each slave need to read broadcast_1, broadcast_2 and broadcast_3; ... broadcast_/n/ all correspond to kvGlobal at different iterations. why in /n/th iteration, each slave need to read from broadcast_1 to broadcast_/n/, why not just reading broadcast_/n/. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: problem about broadcast variable in iteration
i run in spark 1.0.0, the newest under-development version. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-broadcast-variable-in-iteration-tp5479p5480.html Sent from the Apache Spark User List mailing list archive at Nabble.com.