Those results look very good for the larger workloads (100MB and 1GB). Were you also able to run experiments with smaller amounts of data, for instance broadcasting a single variable to the entire cluster? In the paper you state that HDFS-based mechanisms performed well only for small amounts of data. Do you have an estimate of the crossover point below which the HDFS-based mechanism becomes more favorable than the BitTorrent-like one? I also read that the minimum size transmitted using a broadcast variable is 4MB. Maybe I should look for a different way of sharing this constant?

Use case: I am looking for the most efficient way to apply a transformation that involves a constant (whose value is determined at runtime) to a large input file. Scala example:

    // The actual value (2 in this case) would be the result of a
    // different function, computed at runtime.
    val constant1 = sc.broadcast(2)
    val result = input.map(x => x + constant1.value)

On 11 March 2015 at 21:13, Mosharaf Chowdhury <mosharafka...@gmail.com> wrote:

> The current broadcast algorithm in Spark approximates the one described
> in the Section 5 of this paper
> <http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf>.
> It is expected to scale sub-linearly; i.e., O(log N), where N is the
> number of machines in your cluster.
> We evaluated up to 100 machines, and it does follow O(log N) scaling.
>
> --
> Mosharaf Chowdhury
> http://www.mosharaf.com/
>
> On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen <thubregt...@gmail.com>
> wrote:
>
>> Thanks Mosharaf, for the quick response! Can you maybe give me some
>> pointers to an explanation of this strategy? Or elaborate a bit more on it?
>> Which parts are involved in which way? Where are the time penalties and how
>> scalable is this implementation?
>>
>> Thanks again,
>>
>> Tom
>>
>> On 11 March 2015 at 16:01, Mosharaf Chowdhury <mosharafka...@gmail.com>
>> wrote:
>>
>>> Hi Tom,
>>>
>>> That's an outdated document from 4/5 years ago.
>>>
>>> Spark currently uses a BitTorrent like mechanism that's been tuned for
>>> datacenter environments.
>>>
>>> Mosharaf
>>> ------------------------------
>>> From: Tom <thubregt...@gmail.com>
>>> Sent: 3/11/2015 4:58 PM
>>> To: user@spark.apache.org
>>> Subject: Which strategy is used for broadcast variables?
>>>
>>> In "Performance and Scalability of Broadcast in Spark" by Mosharaf
>>> Chowdhury
>>> I read that Spark uses HDFS for its broadcast variables. This seems
>>> highly inefficient.
>>> In the same paper alternatives are proposed, among which
>>> "BitTorrent Broadcast (BTB)". While studying "Learning Spark," page 105,
>>> second paragraph about Broadcast Variables, I read "The value is sent to
>>> each node only once, using an efficient, BitTorrent-like communication
>>> mechanism."
>>>
>>> - Is the book talking about the proposed BTB from the paper?
>>>
>>> - Is this currently the default?
>>>
>>> - If not, what is?
>>>
>>> Thanks,
>>>
>>> Tom
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Which-strategy-is-used-for-broadcast-variables-tp22004.html