Those results look very good for the larger workloads (100MB and 1GB). Were
you also able to run experiments with smaller amounts of data, for instance
broadcasting a single variable to the entire cluster? In the paper you
state that HDFS-based mechanisms performed well only for small amounts of
data. Do you have an estimate of the crossover point below which the
HDFS-based mechanism becomes more favorable than the BitTorrent-like one?
I also read that the minimum size transmitted using a broadcast variable is
4MB. Maybe I should look for a different way of sharing this constant?

Use case: I am looking for the most efficient way to perform a
transformation involving a constant (whose value is determined at
runtime) on a large input file.

Scala example:
// The actual value (2 in this case) would be the result of a different
// function, computed at runtime.
val constant1 = sc.broadcast(2)
val result = input.map(x => x + constant1.value)
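
A minimal, self-contained sketch of two ways to share such a runtime constant, for comparison. It assumes a local SparkContext; the object name RuntimeConstantSketch and the helper computeConstant() are hypothetical stand-ins, not part of Spark. The first variant references a plain local value in the closure, so it is serialized along with each task; the second uses an explicit broadcast variable. The sketch only shows that both give the same result and makes no claim about which is faster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RuntimeConstantSketch {
  // Hypothetical stand-in for the function that produces the constant at runtime.
  def computeConstant(): Int = 2

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads, for illustration only.
    val conf = new SparkConf()
      .setAppName("runtime-constant-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val input = sc.parallelize(1 to 10)
      val constant = computeConstant()

      // Variant 1: plain closure capture; the value travels with each task.
      val viaClosure = input.map(x => x + constant)

      // Variant 2: explicit broadcast variable, read through .value on the executors.
      val constantBc = sc.broadcast(constant)
      val viaBroadcast = input.map(x => x + constantBc.value)

      // Both variants compute the same result.
      assert(viaClosure.collect().sameElements(viaBroadcast.collect()))
    } finally {
      sc.stop()
    }
  }
}
```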

On 11 March 2015 at 21:13, Mosharaf Chowdhury <mosharafka...@gmail.com>
wrote:

> The current broadcast algorithm in Spark approximates the one described
> in Section 5 of this paper
> <http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf>.
> It is expected to scale sub-linearly; i.e., O(log N), where N is the
> number of machines in your cluster.
> We evaluated up to 100 machines, and it does follow O(log N) scaling.
>
> --
> Mosharaf Chowdhury
> http://www.mosharaf.com/
>
> On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen <thubregt...@gmail.com>
> wrote:
>
>> Thanks Mosharaf, for the quick response! Can you maybe give me some
>> pointers to an explanation of this strategy? Or elaborate a bit more on it?
>> Which parts are involved in which way? Where are the time penalties and how
>> scalable is this implementation?
>>
>> Thanks again,
>>
>> Tom
>>
>> On 11 March 2015 at 16:01, Mosharaf Chowdhury <mosharafka...@gmail.com>
>> wrote:
>>
>>> Hi Tom,
>>>
>>> That's an outdated document from 4-5 years ago.
>>>
>>> Spark currently uses a BitTorrent-like mechanism that's been tuned for
>>> datacenter environments.
>>>
>>> Mosharaf
>>> ------------------------------
>>> From: Tom <thubregt...@gmail.com>
>>> Sent: 3/11/2015 4:58 PM
>>> To: user@spark.apache.org
>>> Subject: Which strategy is used for broadcast variables?
>>>
>>> In "Performance and Scalability of Broadcast in Spark" by Mosharaf
>>> Chowdhury I read that Spark uses HDFS for its broadcast variables. This
>>> seems highly inefficient. In the same paper alternatives are proposed,
>>> among them "BitTorrent Broadcast (BTB)". While studying "Learning Spark,"
>>> page 105, second paragraph about Broadcast Variables, I read: "The value
>>> is sent to each node only once, using an efficient, BitTorrent-like
>>> communication mechanism."
>>>
>>> - Is the book talking about the proposed BTB from the paper?
>>>
>>> - Is this currently the default?
>>>
>>> - If not, what is?
>>>
>>> Thanks,
>>>
>>> Tom
>>>
>>>
>>>
>>
>
