Re: CollectAsMap, Broadcasting.

Sean Owen Thu, 26 Feb 2015 02:19:28 -0800

No, the Map returned by collectAsMap exists only on the driver, not on
the executors. Executors don't retain partitions unless the RDD is
supposed to be persisted.
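
To make the persistence point concrete, here is a minimal sketch. It is
not taken from the original code: the sample data, the object name
PersistSketch, the helper collectAndCount, and the local[2] master are
all assumptions for illustration. It also assumes Spark 1.3 or later, so
the pair-RDD implicits are in scope without extra imports.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Hypothetical helper: collects a pair RDD to the driver, then counts it.
  def collectAndCount(sc: SparkContext): (scala.collection.Map[String, String], Long) = {
    // Made-up sample data standing in for a parsed log file.
    val contacts = sc
      .parallelize(Seq(("u1", "c10"), ("u2", "c20"), ("u3", "c30")))
      .persist(StorageLevel.MEMORY_ONLY) // ask executors to keep computed partitions

    // Action: partitions are computed on the executors, cached there because
    // of persist(), and the key/value pairs are copied to the driver.
    // Note that collectAsMap keeps only one value per key.
    val asMap = contacts.collectAsMap()

    // Second action: can reuse the cached partitions instead of recomputing.
    val n = contacts.count()

    contacts.unpersist() // release the cached partitions
    (asMap, n)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (asMap, n) = collectAndCount(sc)
      println(s"driver-side map has ${asMap.size} entries, RDD has $n records")
    } finally {
      sc.stop()
    }
  }
}
```

Without the persist call, the partitions computed for collectAsMap are
not kept on the executors, and the later count recomputes them from the
RDD's lineage.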


Generally, broadcasting a small Map to accomplish a join 'manually' is
more efficient than a join, but you are right that this is mostly
because joins usually involve shuffles. If the join would not need a
shuffle (for example, both RDDs are already partitioned the same way),
it's not as clear which way is best. I suppose that if the Map is
large-ish, it's safer to not keep pulling it to the driver.
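
A rough sketch of the 'manual' join with an explicit broadcast variable,
under stated assumptions: the sample data, the object name
BroadcastJoinSketch, the helper broadcastJoin, and the local[2] master
are hypothetical, and Spark 1.3 or later is assumed for the pair-RDD
implicits. The lookup uses get rather than apply, so contacts without a
matching user are dropped instead of throwing an exception.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastJoinSketch {
  // Hypothetical helper: joins contacts against a small users RDD
  // without shuffling the contacts RDD.
  def broadcastJoin(
      sc: SparkContext,
      users: RDD[(String, String)],
      contacts: RDD[(String, String)]): RDD[(String, (String, String))] = {
    // Pull the small side to the driver. This is only safe if it fits in
    // driver memory, and only one value per key survives.
    val usersMap = users.collectAsMap()

    // Ship one read-only copy to each executor instead of one per task closure.
    val usersBc = sc.broadcast(usersMap)

    // Map-side lookup: each contact is matched locally on its executor.
    contacts.flatMap { case (id, contact) =>
      usersBc.value.get(id).map(name => (id, (name, contact)))
    }
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("broadcast-join-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up data standing in for /tmp/users.log and /tmp/contacts.log.
      val users = sc.parallelize(Seq(("u1", "Alice"), ("u2", "Bob")))
      val contacts = sc.parallelize(Seq(("u1", "c10"), ("u2", "c20"), ("u3", "c30")))
      broadcastJoin(sc, users, contacts).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

For comparison, users.join(contacts) gives the same pairs for unique user
keys, but normally shuffles both RDDs unless they already share a
partitioner.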

On Thu, Feb 26, 2015 at 10:00 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
> I have a question,
>
> If I execute this code,
>
> val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map(v => (v(0), v(1)))
> val contacts = sc.textFile("/tmp/contacts.log").map(y => y.split(",")).map(v => (v(0), v(1)))
> val usersMap = users.collectAsMap()
> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()
>
> When I execute collectAsMap, where is the data? On each executor? I guess
> that each executor has the data that it processed. The result is sent to
> the driver, but I guess that each executor keeps its piece of
> processed data.
>
> I guess that it's more efficient than using a join in this case
> because there's no shuffle. But if I save usersMap as a broadcast
> variable, wouldn't it be less efficient, because I'm sending data to
> executors that don't need it?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
