Re: Brodcast Variable updated from one transformation and used from another

2015-02-25 Thread Yiannis Gkoufas
What I think is happening that the map operations are executed concurrently
and the map operation in rdd2 has the initial copy of myObjectBroadcated.
Is there a way to apply the transformations sequentially? First materialize
rdd1 and then rdd2.

Thanks a lot!

On 24 February 2015 at 18:49, Yiannis Gkoufas johngou...@gmail.com wrote:

 Sorry for the mistake, I actually have it this way:

 val myObject = new MyObject();
 val myObjectBroadcasted = sc.broadcast(myObject);

 val rdd1 = sc.textFile(/file1).map(e =
 {
  myObjectBroadcasted.value.insert(e._1);
  (e._1,1)
 });
 rdd.cache.count(); //to make sure it is transformed.

 val rdd2 = sc.textFile(/file2).map(e =
 {
  val lookedUp = myObjectBroadcasted.value.lookup(e._1);
  (e._1, lookedUp)
 });

 On 24 February 2015 at 17:36, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

  You're not using the broadcasted variable within your map operations.
 You're attempting to modify myObjrct directly which won't work because you
 are modifying the serialized copy on the executor. You want to do
 myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup.



 Sent with Good (www.good.com)



 -Original Message-
 *From: *Yiannis Gkoufas [johngou...@gmail.com]
 *Sent: *Tuesday, February 24, 2015 12:12 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Brodcast Variable updated from one transformation and used
 from another

 Hi all,

 I am trying to do the following.

 val myObject = new MyObject();
 val myObjectBroadcasted = sc.broadcast(myObject);

 val rdd1 = sc.textFile(/file1).map(e =
 {
  myObject.insert(e._1);
  (e._1,1)
 });
 rdd.cache.count(); //to make sure it is transformed.

 val rdd2 = sc.textFile(/file2).map(e =
 {
  val lookedUp = myObject.lookup(e._1);
  (e._1, lookedUp)
 });

 When I check the contents of myObject within the map of rdd1 everything
 seems ok.
 On the other hand when I check the contents of myObject within the map of
 rdd2 it seems to be empty.
 I am doing something wrong?

 Thanks a lot!

 --

 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed.  If the reader of this message is not the
 intended recipient, you are hereby notified that any review,
 retransmission, dissemination, distribution, copying or other use of, or
 taking of any action in reliance upon this information is strictly
 prohibited. If you have received this communication in error, please
 contact the sender and delete the material from your computer.





Re: Brodcast Variable updated from one transformation and used from another

2015-02-25 Thread Imran Rashid
Hi Yiannis,

Broadcast variables are meant for *immutable* data.  They are not meant for
data structures that you intend to update.  (It might *happen* to work when
running local mode, though I doubt it, and it would probably be a bug if it
did.  It will certainly not work when running on a cluster.)

This probably seems like a huge restriction, but its really fundamental to
spark's execution model.  B/c they are immutable, spark can make
optimizations around when  how the broadcast variable is shared.
Furthermore, its very important for having clearly defined semantics.  Eg.,
imagine that your broadcast variable was a hashmap.  What would the
eventual result be if task 1 updated key X to have value A, but task 2
updated key X to have value B?  How should the updates from each task be
combined together?

You have a few alternatives.  It really depends a lot on your use case
which one is right, their are a lot of factors to consider.

1) put your updates in another RDD, collect() it, update your variable on
the driver, rebroadcast it.  (least scalable)

2) use an accumulator to get the updates from each stage.  (maybe a bit
more efficient, b)

3) use some completely different mechanism for storing the data in your
broadcast var.  Eg., use a distributed key-value store.  Or put the data in
another RDD, which you join against your data.  (most scalable, but may not
be applicable at all.)

which one is right depends a lot on what you are trying to do.

Imran



On Wed, Feb 25, 2015 at 8:02 AM, Yiannis Gkoufas johngou...@gmail.com
wrote:

 What I think is happening that the map operations are executed
 concurrently and the map operation in rdd2 has the initial copy of
 myObjectBroadcated.
 Is there a way to apply the transformations sequentially? First
 materialize rdd1 and then rdd2.

 Thanks a lot!

 On 24 February 2015 at 18:49, Yiannis Gkoufas johngou...@gmail.com
 wrote:

 Sorry for the mistake, I actually have it this way:

 val myObject = new MyObject();
 val myObjectBroadcasted = sc.broadcast(myObject);

 val rdd1 = sc.textFile(/file1).map(e =
 {
  myObjectBroadcasted.value.insert(e._1);
  (e._1,1)
 });
 rdd.cache.count(); //to make sure it is transformed.

 val rdd2 = sc.textFile(/file2).map(e =
 {
  val lookedUp = myObjectBroadcasted.value.lookup(e._1);
  (e._1, lookedUp)
 });

 On 24 February 2015 at 17:36, Ganelin, Ilya ilya.gane...@capitalone.com
 wrote:

  You're not using the broadcasted variable within your map operations.
 You're attempting to modify myObjrct directly which won't work because you
 are modifying the serialized copy on the executor. You want to do
 myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup.



 Sent with Good (www.good.com)



 -Original Message-
 *From: *Yiannis Gkoufas [johngou...@gmail.com]
 *Sent: *Tuesday, February 24, 2015 12:12 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Brodcast Variable updated from one transformation and used
 from another

 Hi all,

 I am trying to do the following.

 val myObject = new MyObject();
 val myObjectBroadcasted = sc.broadcast(myObject);

 val rdd1 = sc.textFile(/file1).map(e =
 {
  myObject.insert(e._1);
  (e._1,1)
 });
 rdd.cache.count(); //to make sure it is transformed.

 val rdd2 = sc.textFile(/file2).map(e =
 {
  val lookedUp = myObject.lookup(e._1);
  (e._1, lookedUp)
 });

 When I check the contents of myObject within the map of rdd1 everything
 seems ok.
 On the other hand when I check the contents of myObject within the map
 of rdd2 it seems to be empty.
 I am doing something wrong?

 Thanks a lot!

 --

 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed.  If the reader of this message is not the
 intended recipient, you are hereby notified that any review,
 retransmission, dissemination, distribution, copying or other use of, or
 taking of any action in reliance upon this information is strictly
 prohibited. If you have received this communication in error, please
 contact the sender and delete the material from your computer.






Re: Brodcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
Sorry for the mistake, I actually have it this way:

val myObject = new MyObject();
val myObjectBroadcasted = sc.broadcast(myObject);

val rdd1 = sc.textFile(/file1).map(e =
{
 myObjectBroadcasted.value.insert(e._1);
 (e._1,1)
});
rdd.cache.count(); //to make sure it is transformed.

val rdd2 = sc.textFile(/file2).map(e =
{
 val lookedUp = myObjectBroadcasted.value.lookup(e._1);
 (e._1, lookedUp)
});

On 24 February 2015 at 17:36, Ganelin, Ilya ilya.gane...@capitalone.com
wrote:

  You're not using the broadcasted variable within your map operations.
 You're attempting to modify myObjrct directly which won't work because you
 are modifying the serialized copy on the executor. You want to do
 myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup.



 Sent with Good (www.good.com)



 -Original Message-
 *From: *Yiannis Gkoufas [johngou...@gmail.com]
 *Sent: *Tuesday, February 24, 2015 12:12 PM Eastern Standard Time
 *To: *user@spark.apache.org
 *Subject: *Brodcast Variable updated from one transformation and used
 from another

 Hi all,

 I am trying to do the following.

 val myObject = new MyObject();
 val myObjectBroadcasted = sc.broadcast(myObject);

 val rdd1 = sc.textFile(/file1).map(e =
 {
  myObject.insert(e._1);
  (e._1,1)
 });
 rdd.cache.count(); //to make sure it is transformed.

 val rdd2 = sc.textFile(/file2).map(e =
 {
  val lookedUp = myObject.lookup(e._1);
  (e._1, lookedUp)
 });

 When I check the contents of myObject within the map of rdd1 everything
 seems ok.
 On the other hand when I check the contents of myObject within the map of
 rdd2 it seems to be empty.
 I am doing something wrong?

 Thanks a lot!

 --

 The information contained in this e-mail is confidential and/or
 proprietary to Capital One and/or its affiliates. The information
 transmitted herewith is intended only for use by the individual or entity
 to which it is addressed.  If the reader of this message is not the
 intended recipient, you are hereby notified that any review,
 retransmission, dissemination, distribution, copying or other use of, or
 taking of any action in reliance upon this information is strictly
 prohibited. If you have received this communication in error, please
 contact the sender and delete the material from your computer.



RE: Brodcast Variable updated from one transformation and used from another

2015-02-24 Thread Ganelin, Ilya
You're not using the broadcasted variable within your map operations. You're 
attempting to modify myObjrct directly which won't work because you are 
modifying the serialized copy on the executor. You want to do 
myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup.



Sent with Good (www.good.com)


-Original Message-
From: Yiannis Gkoufas [johngou...@gmail.commailto:johngou...@gmail.com]
Sent: Tuesday, February 24, 2015 12:12 PM Eastern Standard Time
To: user@spark.apache.org
Subject: Brodcast Variable updated from one transformation and used from another

Hi all,

I am trying to do the following.

val myObject = new MyObject();
val myObjectBroadcasted = sc.broadcast(myObject);

val rdd1 = sc.textFile(/file1).map(e =
{
 myObject.insert(e._1);
 (e._1,1)
});
rdd.cache.count(); //to make sure it is transformed.

val rdd2 = sc.textFile(/file2).map(e =
{
 val lookedUp = myObject.lookup(e._1);
 (e._1, lookedUp)
});

When I check the contents of myObject within the map of rdd1 everything seems 
ok.
On the other hand when I check the contents of myObject within the map of rdd2 
it seems to be empty.
I am doing something wrong?

Thanks a lot!


The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Brodcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
Hi all,

I am trying to do the following.

val myObject = new MyObject();
val myObjectBroadcasted = sc.broadcast(myObject);

val rdd1 = sc.textFile(/file1).map(e =
{
 myObject.insert(e._1);
 (e._1,1)
});
rdd.cache.count(); //to make sure it is transformed.

val rdd2 = sc.textFile(/file2).map(e =
{
 val lookedUp = myObject.lookup(e._1);
 (e._1, lookedUp)
});

When I check the contents of myObject within the map of rdd1 everything
seems ok.
On the other hand when I check the contents of myObject within the map of
rdd2 it seems to be empty.
I am doing something wrong?

Thanks a lot!