Hi Yiannis,

Broadcast variables are meant for *immutable* data.  They are not meant for
data structures that you intend to update.  (It might *happen* to work when
running local mode, though I doubt it, and it would probably be a bug if it
did.  It will certainly not work when running on a cluster.)

This probably seems like a huge restriction, but its really fundamental to
spark's execution model.  B/c they are immutable, spark can make
optimizations around when & how the broadcast variable is shared.
Furthermore, its very important for having clearly defined semantics.  Eg.,
imagine that your broadcast variable was a hashmap.  What would the
eventual result be if task 1 updated key X to have value A, but task 2
updated key X to have value B?  How should the updates from each task be
combined together?

You have a few alternatives.  It really depends a lot on your use case
which one is right, their are a lot of factors to consider.

1) put your updates in another RDD, collect() it, update your variable on
the driver, rebroadcast it.  (least scalable)

2) use an accumulator to get the updates from each stage.  (maybe a bit
more efficient, b)

3) use some completely different mechanism for storing the data in your
broadcast var.  Eg., use a distributed key-value store.  Or put the data in
another RDD, which you join against your data.  (most scalable, but may not
be applicable at all.)

which one is right depends a lot on what you are trying to do.

Imran



On Wed, Feb 25, 2015 at 8:02 AM, Yiannis Gkoufas <johngou...@gmail.com>
wrote:

> What I think is happening that the map operations are executed
> concurrently and the map operation in rdd2 has the initial copy of
> myObjectBroadcated.
> Is there a way to apply the transformations sequentially? First
> materialize rdd1 and then rdd2.
>
> Thanks a lot!
>
> On 24 February 2015 at 18:49, Yiannis Gkoufas <johngou...@gmail.com>
> wrote:
>
>> Sorry for the mistake, I actually have it this way:
>>
>> val myObject = new MyObject();
>> val myObjectBroadcasted = sc.broadcast(myObject);
>>
>> val rdd1 = sc.textFile("/file1").map(e =>
>> {
>>  myObjectBroadcasted.value.insert(e._1);
>>  (e._1,1)
>> });
>> rdd.cache.count(); //to make sure it is transformed.
>>
>> val rdd2 = sc.textFile("/file2").map(e =>
>> {
>>  val lookedUp = myObjectBroadcasted.value.lookup(e._1);
>>  (e._1, lookedUp)
>> });
>>
>> On 24 February 2015 at 17:36, Ganelin, Ilya <ilya.gane...@capitalone.com>
>> wrote:
>>
>>>  You're not using the broadcasted variable within your map operations.
>>> You're attempting to modify myObjrct directly which won't work because you
>>> are modifying the serialized copy on the executor. You want to do
>>> myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup.
>>>
>>>
>>>
>>> Sent with Good (www.good.com)
>>>
>>>
>>>
>>> -----Original Message-----
>>> *From: *Yiannis Gkoufas [johngou...@gmail.com]
>>> *Sent: *Tuesday, February 24, 2015 12:12 PM Eastern Standard Time
>>> *To: *user@spark.apache.org
>>> *Subject: *Brodcast Variable updated from one transformation and used
>>> from another
>>>
>>> Hi all,
>>>
>>> I am trying to do the following.
>>>
>>> val myObject = new MyObject();
>>> val myObjectBroadcasted = sc.broadcast(myObject);
>>>
>>> val rdd1 = sc.textFile("/file1").map(e =>
>>> {
>>>  myObject.insert(e._1);
>>>  (e._1,1)
>>> });
>>> rdd.cache.count(); //to make sure it is transformed.
>>>
>>> val rdd2 = sc.textFile("/file2").map(e =>
>>> {
>>>  val lookedUp = myObject.lookup(e._1);
>>>  (e._1, lookedUp)
>>> });
>>>
>>> When I check the contents of myObject within the map of rdd1 everything
>>> seems ok.
>>> On the other hand when I check the contents of myObject within the map
>>> of rdd2 it seems to be empty.
>>> I am doing something wrong?
>>>
>>> Thanks a lot!
>>>
>>> ------------------------------
>>>
>>> The information contained in this e-mail is confidential and/or
>>> proprietary to Capital One and/or its affiliates. The information
>>> transmitted herewith is intended only for use by the individual or entity
>>> to which it is addressed.  If the reader of this message is not the
>>> intended recipient, you are hereby notified that any review,
>>> retransmission, dissemination, distribution, copying or other use of, or
>>> taking of any action in reliance upon this information is strictly
>>> prohibited. If you have received this communication in error, please
>>> contact the sender and delete the material from your computer.
>>>
>>
>>
>

Reply via email to