Isn't it "contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()"
executed in the executors?  why is it executed in the driver?
contacts are not a local object, right?


2015-02-26 11:27 GMT+01:00 Sean Owen <so...@cloudera.com>:
> No. That code is just Scala code executing on the driver. usersMap is
> a local object. This bit has nothing to do with Spark.
>
> Yes you would have to broadcast it to use it efficient in functions
> (not on the driver).
>
> On Thu, Feb 26, 2015 at 10:24 AM, Guillermo Ortiz <konstt2...@gmail.com> 
> wrote:
>> So, on my example, when I execute:
>> val usersMap = contacts.collectAsMap() --> Map goes to the driver and
>> just lives there in the beginning.
>> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect
>>
>> When I execute usersMap(v._1),
>> Does driver has to send to the executorX the "value" which it needs? I
>> guess I'm missing something.
>> How does the data transfer among usersMap(just in the driver) and
>> executors work?
>>
>> On this case it looks like better to use broadcasting like:
>> val usersMap = contacts.collectAsMap()
>> val bc = sc.broadcast(usersMap)
>> contacts.map(v => (v._1, (bc.value(v._1), v._2))).collect()
>>
>> 2015-02-26 11:16 GMT+01:00 Sean Owen <so...@cloudera.com>:
>>> No, it exists only on the driver, not the executors. Executors don't
>>> retain partitions unless they are supposed to be persisted.
>>>
>>> Generally, broadcasting a small Map to accomplish a join 'manually' is
>>> more efficient than a join, but you are right that this is mostly
>>> because joins usually involve shuffles. If not, it's not as clear
>>> which way is best. I suppose that if the Map is large-ish, it's safer
>>> to not keep pulling it to the driver.
>>>
>>> On Thu, Feb 26, 2015 at 10:00 AM, Guillermo Ortiz <konstt2...@gmail.com> 
>>> wrote:
>>>> I have a question,
>>>>
>>>> If I execute this code,
>>>>
>>>> val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map(
>>>> v => (v(0), v(1)))
>>>> val contacts = sc.textFile("/tmp/contacts.log").map(y =>
>>>> y.split(",")).map( v => (v(0), v(1)))
>>>> val usersMap = contacts.collectAsMap()
>>>> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()
>>>>
>>>> When I execute collectAsMap, where is data? in each Executor?? I guess
>>>> than each executor has data that it proccesed. The result is sent to
>>>> the driver, but I guess that each executor keeps its piece of
>>>> processed data.
>>>>
>>>> I guess that it's more efficient that to use a join in this case
>>>> because there's not shuffle but If I save usersMap as a broadcast
>>>> variable, wouldn't it be less efficient because I'm sending data to
>>>> executors and don't need it?
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Reply via email to