I have a question,

If I execute this code,

val users = sc.textFile("/tmp/users.log")
  .map(_.split(","))
  .map(v => (v(0), v(1)))
val contacts = sc.textFile("/tmp/contacts.log")
  .map(_.split(","))
  .map(v => (v(0), v(1)))
val usersMap = users.collectAsMap()
contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()

When I execute collectAsMap, where does the data end up? In each
executor? I assume each executor holds the data for the partitions it
processed, and the result is sent to the driver, but does each
executor also keep its own piece of the processed data?

I guess this is more efficient than using a join in this case, because
there is no shuffle. But if I turn usersMap into a broadcast variable,
wouldn't that be less efficient, since I'd be shipping data to
executors that don't need it?
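
For clarity, this is the broadcast version I mean (just a sketch,
reusing the same usersMap collected on the driver above):

val usersBroadcast = sc.broadcast(usersMap)
contacts.map { case (id, contact) =>
  // look up the user on the executor via the broadcast copy
  (id, (usersBroadcast.value(id), contact))
}.collect()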
