Re: CollectAsMap, Broadcasting.

Sean Owen Thu, 26 Feb 2015 02:29:35 -0800

No. That code is just Scala code executing on the driver. usersMap is
a local object. This bit has nothing to do with Spark.


Yes, you would have to broadcast it to use it efficiently in functions
that run on the executors (not on the driver).
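
A minimal sketch of the two variants side by side, in case it helps. The
local[2] master, the object name and the sample pairs are assumptions for
illustration only; they stand in for your cluster and /tmp/contacts.log.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local master and made-up data, for illustration only.
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val contacts = sc.parallelize(Seq(("u1", "c1"), ("u2", "c2"), ("u3", "c3")))

      // collectAsMap() is an action: the result is a plain local Map on the driver.
      val usersMap = contacts.collectAsMap()

      // Variant 1: the closure references usersMap, so the map is serialized
      // along with the closure and shipped with the tasks.
      val viaClosure = contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()

      // Variant 2: the map is broadcast; the closure carries only the small
      // broadcast handle and each executor fetches the value once.
      val bc = sc.broadcast(usersMap)
      val viaBroadcast = contacts.map(v => (v._1, (bc.value(v._1), v._2))).collect()

      // Both variants compute the same result; only the shipping differs.
      assert(viaClosure.sameElements(viaBroadcast))
      viaBroadcast.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Both versions give the same answer. The difference only matters once the
Map is big enough that shipping it with every task becomes noticeable.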

On Thu, Feb 26, 2015 at 10:24 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
> So, in my example, when I execute:
> val usersMap = contacts.collectAsMap() --> Map goes to the driver and
> just lives there in the beginning.
> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect
>
> When I execute usersMap(v._1),
> does the driver have to send executorX the "value" that it needs? I
> guess I'm missing something.
> How does the data transfer between usersMap (which lives only on the
> driver) and the executors work?
>
> In this case it looks better to use broadcasting, like:
> val usersMap = contacts.collectAsMap()
> val bc = sc.broadcast(usersMap)
> contacts.map(v => (v._1, (bc.value(v._1), v._2))).collect()
>
> 2015-02-26 11:16 GMT+01:00 Sean Owen <so...@cloudera.com>:
>> No, it exists only on the driver, not the executors. Executors don't
>> retain partitions unless they are supposed to be persisted.
>>
>> Generally, broadcasting a small Map to accomplish a join 'manually' is
>> more efficient than a join, but you are right that this is mostly
>> because joins usually involve shuffles. If not, it's not as clear
>> which way is best. I suppose that if the Map is large-ish, it's safer
>> to not keep pulling it to the driver.
>>
>> On Thu, Feb 26, 2015 at 10:00 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>>> I have a question,
>>>
>>> If I execute this code,
>>>
>>> val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map(
>>> v => (v(0), v(1)))
>>> val contacts = sc.textFile("/tmp/contacts.log").map(y =>
>>> y.split(",")).map( v => (v(0), v(1)))
>>> val usersMap = contacts.collectAsMap()
>>> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()
>>>
>>> When I execute collectAsMap, where is the data? In each executor? I
>>> guess that each executor has the data that it processed. The result is
>>> sent to the driver, but I guess that each executor keeps its piece of
>>> processed data.
>>>
>>> I guess that it's more efficient than using a join in this case
>>> because there's no shuffle, but if I save usersMap as a broadcast
>>> variable, wouldn't it be less efficient because I'm sending data to
>>> the executors that they don't need?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>

