I have a question. If I execute this code:
    val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map(v => (v(0), v(1)))
    val contacts = sc.textFile("/tmp/contacts.log").map(y => y.split(",")).map(v => (v(0), v(1)))
    val usersMap = users.collectAsMap()
    contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()

When I execute collectAsMap, where is the data? On each executor? I guess each executor holds the data it processed: the result is sent to the driver, but each executor keeps its piece of the processed data.

I guess this is more efficient than using a join in this case, because there is no shuffle. But if I store usersMap in a broadcast variable, wouldn't that be less efficient, since I would be sending the data to executors that don't need it?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
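For intuition, the lookup that the last line performs is just a map-side join against an in-memory Map, which is why no shuffle is involved. Here is a minimal sketch in plain Scala (no Spark, hypothetical sample data standing in for the collected RDD); with a broadcast variable the only change would be wrapping the map in sc.broadcast(...) and reading it via .value:

    // usersMap plays the role of the collectAsMap() result: id -> name
    val usersMap = Map("1" -> "alice", "2" -> "bob")

    // contacts plays the role of the contacts RDD: id -> email
    val contacts = Seq(("1", "alice@example.com"), ("2", "bob@example.com"))

    // each contact row is joined against the in-memory map; no shuffle needed
    val joined = contacts.map { case (id, mail) => (id, (usersMap(id), mail)) }

    println(joined)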