Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Paweł Szulc
Correct me if I'm wrong, but he can actually run this code without broadcasting the users map; however, the code will be less efficient. On Thu, Feb 26, 2015, 12:31 PM, Sean Owen so...@cloudera.com wrote: Yes, but there is no concept of executors 'deleting' an RDD. And you would want

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
Yes, that's correct; it works, but broadcasting would be more efficient. On Thu, Feb 26, 2015 at 1:20 PM, Paweł Szulc paul.sz...@gmail.com wrote: Correct me if I'm wrong, but he can actually run this code without broadcasting the users map; however, the code will be less efficient. Thu, 26

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
No. That code is just Scala code executing on the driver. usersMap is a local object. This bit has nothing to do with Spark. Yes, you would have to broadcast it to use it efficiently in functions (not on the driver). On Thu, Feb 26, 2015 at 10:24 AM, Guillermo Ortiz konstt2...@gmail.com wrote: So,

CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
I have a question. If I execute this code: val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map(v => (v(0), v(1))) val contacts = sc.textFile("/tmp/contacts.log").map(y => y.split(",")).map(v => (v(0), v(1))) val usersMap = contacts.collectAsMap() contacts.map(v => (v._1, (usersMap(v._1),

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
No, it exists only on the driver, not the executors. Executors don't retain partitions unless they are supposed to be persisted. Generally, broadcasting a small Map to accomplish a join 'manually' is more efficient than a join, but you are right that this is mostly because joins usually involve
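To make the comparison in this reply concrete, here is a minimal sketch that produces the same pairs twice, once with a regular RDD join and once with a 'manual' join against a broadcast Map. The object name ManualJoin, the local[2] master, and the inline sample records are assumptions for illustration; none of them come from the thread. It also assumes a Spark version where the pair-RDD implicits are in scope without an extra import.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; names and sample data are not from the thread.
object ManualJoin {
  type Row = (String, (String, String))

  def run(): (Set[Row], Set[Row]) = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("ManualJoin").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val users = sc.parallelize(Seq(("1", "alice"), ("2", "bob")))
      val contacts = sc.parallelize(Seq(("1", "a@example.com"), ("2", "b@example.com")))

      // Regular join: both RDDs are combined by key.
      val joined = users.join(contacts).collect().toSet

      // Manual join: collect the small side to the driver, broadcast it,
      // and look each key up inside the tasks.
      val usersBc = sc.broadcast(users.collectAsMap())
      val manual = contacts.flatMap { case (id, contact) =>
        usersBc.value.get(id).map(name => (id, (name, contact))).toList
      }.collect().toSet

      (joined, manual)
    } finally {
      sc.stop()
    }
  }
}
```

The manual variant only makes sense when the collected Map comfortably fits in memory on the driver and on every executor.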

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
So, in my example, when I execute: val usersMap = contacts.collectAsMap() -- Map goes to the driver and just lives there in the beginning. contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect When I execute usersMap(v._1), does the driver have to send executorX the value it needs? I

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
Isn't contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect() executed on the executors? Why is it executed on the driver? contacts is not a local object, right? 2015-02-26 11:27 GMT+01:00 Sean Owen so...@cloudera.com: No. That code is just Scala code executing on the driver. usersMap

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
Yes, in that code, usersMap has been serialized to every executor. I thought you were referring to accessing the copy in the driver. On Thu, Feb 26, 2015 at 10:47 AM, Guillermo Ortiz konstt2...@gmail.com wrote: Isn't contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect() executed on the

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
One last time to be sure I got it right: does the execution sequence here go like this? val usersMap = contacts.collectAsMap() # The contacts RDD is collected by the executors and sent to the driver; the executors delete the RDD. contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect() # The userMap

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Sean Owen
Yes, but there is no concept of executors 'deleting' an RDD. And you would want to broadcast the usersMap if you're using it this way. On Thu, Feb 26, 2015 at 11:26 AM, Guillermo Ortiz konstt2...@gmail.com wrote: One last time to be sure I got it right, the executing sequence here goes like
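As a rough sketch of the two variants discussed in this thread, the code below runs the same lookup once with usersMap captured in the closure and once through a broadcast variable. The object name BroadcastLookup, the local[2] master, and the inline sample data are assumptions for illustration; the original code reads /tmp/users.log and /tmp/contacts.log instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; names and sample data are not from the thread.
object BroadcastLookup {
  type Row = (String, (String, String))

  def run(): (Array[Row], Array[Row]) = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("BroadcastLookup").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val contacts = sc.parallelize(Seq(("u1", "alice"), ("u2", "bob")))

      // collectAsMap() runs a job and returns a plain local Map on the driver.
      val usersMap = contacts.collectAsMap()

      // Variant 1: usersMap is captured by the closure and serialized with the tasks.
      val viaClosure = contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()

      // Variant 2: usersMap is wrapped in a broadcast variable and read via .value.
      val usersBc = sc.broadcast(usersMap)
      val viaBroadcast = contacts.map(v => (v._1, (usersBc.value(v._1), v._2))).collect()

      (viaClosure, viaBroadcast)
    } finally {
      sc.stop()
    }
  }
}
```

Both variants return the same rows; the difference is only in how the Map reaches the executors.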