No, the Map returned by collectAsMap exists only on the driver, not on the executors. Executors don't retain partitions after computing them unless the RDD is supposed to be persisted.
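
Since the collected Map lives only on the driver, any task that reads it needs a copy shipped to it. Here is a minimal sketch of doing that explicitly with a broadcast variable. The object name, the helper joinWithBroadcast, and the sample data are made up for illustration, and I am assuming the lookup is meant to go against users; contacts with no matching user are dropped, as an inner join would do.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastJoinSketch {
  // Hypothetical helper: map-side join of contacts against a small users table.
  def joinWithBroadcast(sc: SparkContext,
                        users: RDD[(String, String)],
                        contacts: RDD[(String, String)]): RDD[(String, (String, String))] = {
    // collectAsMap() materializes the result on the driver only.
    val usersMap = users.collectAsMap()
    // Broadcasting ships one read-only copy per executor rather than one per task.
    val usersBc = sc.broadcast(usersMap)
    contacts.flatMap { case (id, contact) =>
      // Keep only contacts whose key is present in the users table.
      usersBc.value.get(id).map(name => (id, (name, contact)))
    }
  }

  def main(args: Array[String]): Unit = {
    // local[*] is assumed here only so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("broadcast-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample data, invented for this example.
      val users = sc.parallelize(Seq(("1", "ana"), ("2", "bob")))
      val contacts = sc.parallelize(Seq(("1", "ana@example.com"), ("3", "nobody@example.com")))
      joinWithBroadcast(sc, users, contacts).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Without the broadcast, referencing usersMap directly inside the closure also works, but the Map is then serialized with every task.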

Generally, broadcasting a small Map to accomplish a join 'manually' is more efficient than a join, but you are right that this is mostly because joins usually involve shuffles. If not, it's not as clear which way is best. I suppose that if the Map is large-ish, it's safer to not keep pulling it to the driver.

On Thu, Feb 26, 2015 at 10:00 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
> I have a question,
>
> If I execute this code,
>
> val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map(
> v => (v(0), v(1)))
> val contacts = sc.textFile("/tmp/contacts.log").map(y =>
> y.split(",")).map( v => (v(0), v(1)))
> val usersMap = contacts.collectAsMap()
> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()
>
> When I execute collectAsMap, where is the data? In each executor? I guess
> that each executor has the data that it processed. The result is sent to
> the driver, but I guess that each executor keeps its piece of
> processed data.
>
> I guess that it's more efficient than using a join in this case
> because there's no shuffle, but if I save usersMap as a broadcast
> variable, wouldn't it be less efficient because I'm sending data to
> executors that don't need it?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org