Isn't it "contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()" executed in the executors? why is it executed in the driver? contacts are not a local object, right?
2015-02-26 11:27 GMT+01:00 Sean Owen <so...@cloudera.com>: > No. That code is just Scala code executing on the driver. usersMap is > a local object. This bit has nothing to do with Spark. > > Yes you would have to broadcast it to use it efficient in functions > (not on the driver). > > On Thu, Feb 26, 2015 at 10:24 AM, Guillermo Ortiz <konstt2...@gmail.com> > wrote: >> So, on my example, when I execute: >> val usersMap = contacts.collectAsMap() --> Map goes to the driver and >> just lives there in the beginning. >> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect >> >> When I execute usersMap(v._1), >> Does driver has to send to the executorX the "value" which it needs? I >> guess I'm missing something. >> How does the data transfer among usersMap(just in the driver) and >> executors work? >> >> On this case it looks like better to use broadcasting like: >> val usersMap = contacts.collectAsMap() >> val bc = sc.broadcast(usersMap) >> contacts.map(v => (v._1, (bc.value(v._1), v._2))).collect() >> >> 2015-02-26 11:16 GMT+01:00 Sean Owen <so...@cloudera.com>: >>> No, it exists only on the driver, not the executors. Executors don't >>> retain partitions unless they are supposed to be persisted. >>> >>> Generally, broadcasting a small Map to accomplish a join 'manually' is >>> more efficient than a join, but you are right that this is mostly >>> because joins usually involve shuffles. If not, it's not as clear >>> which way is best. I suppose that if the Map is large-ish, it's safer >>> to not keep pulling it to the driver. >>> >>> On Thu, Feb 26, 2015 at 10:00 AM, Guillermo Ortiz <konstt2...@gmail.com> >>> wrote: >>>> I have a question, >>>> >>>> If I execute this code, >>>> >>>> val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map( >>>> v => (v(0), v(1))) >>>> val contacts = sc.textFile("/tmp/contacts.log").map(y => >>>> y.split(",")).map( v => (v(0), v(1))) >>>> val usersMap = contacts.collectAsMap() >>>> contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect() >>>> >>>> When I execute collectAsMap, where is data? in each Executor?? I guess >>>> than each executor has data that it proccesed. The result is sent to >>>> the driver, but I guess that each executor keeps its piece of >>>> processed data. >>>> >>>> I guess that it's more efficient that to use a join in this case >>>> because there's not shuffle but If I save usersMap as a broadcast >>>> variable, wouldn't it be less efficient because I'm sending data to >>>> executors and don't need it? >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>>> For additional commands, e-mail: user-h...@spark.apache.org >>>> --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org