Matei, Thanks for reply.
I don't worry that much about more code because I migrate from mapreduce, so I have existing code to handle it. But if I want to use a new tech, I will always prefer "right" way not a temporary easy way!. I will go with RDD first to test the performance. Thanks! Shuai -----Original Message----- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Wednesday, November 05, 2014 6:27 PM To: Shuai Zheng Cc: user@spark.apache.org Subject: Re: Any "Replicated" RDD in Spark? If you start with an RDD, you do have to collect to the driver and broadcast to do this. Between the two options you listed, I think this one is simpler to implement, and there won't be a huge difference in performance, so you can go for it. Opening InputStreams to a distributed file system by hand can be a lot of code. Matei > On Nov 5, 2014, at 12:37 PM, Shuai Zheng <szheng.c...@gmail.com> wrote: > > And another similar case: > > If I have get a RDD from previous step, but for next step it should be > a map side join (so I need to broadcast this RDD to every nodes). What > is the best way for me to do that? Collect RDD in driver first and > create broadcast? Or any shortcut in spark for this? > > Thanks! > > -----Original Message----- > From: Shuai Zheng [mailto:szheng.c...@gmail.com] > Sent: Wednesday, November 05, 2014 3:32 PM > To: 'Matei Zaharia' > Cc: 'user@spark.apache.org' > Subject: RE: Any "Replicated" RDD in Spark? > > Nice. > > Then I have another question, if I have a file (or a set of files: > part-0, part-1, might be a few hundreds MB csv to 1-2 GB, created by > other program), need to create hashtable from it, later broadcast it > to each node to allow query (map side join). I have two options to do it: > > 1, I can just load the file in a general code (open a inputstream, > etc), parse content and then create the broadcast from it. > 2, I also can use a standard way to create the RDD from these file, > run the map to parse it, then collect it as map, wrap the result as > broadcast to push to all nodes again. > > I think the option 2 might be more consistent with spark's concept > (and less code?)? But how about the performance? The gain is can > parallel load and parse the data, penalty is after load we need to > collect and broadcast result again? Please share your opinion. I am > not sure what is the best practice here (in theory, either way works, > but in real world, which one is better?). > > Regards, > > Shuai > > -----Original Message----- > From: Matei Zaharia [mailto:matei.zaha...@gmail.com] > Sent: Monday, November 03, 2014 4:15 PM > To: Shuai Zheng > Cc: user@spark.apache.org > Subject: Re: Any "Replicated" RDD in Spark? > > You need to use broadcast followed by flatMap or mapPartitions to do > map-side joins (in your map function, you can look at the hash table > you broadcast and see what records match it). Spark SQL also does it > by default for tables smaller than the > spark.sql.autoBroadcastJoinThreshold setting (by default 10 KB, which > is really small, but you can bump this up with set > spark.sql.autoBroadcastJoinThreshold=1000000 for example). > > Matei > >> On Nov 3, 2014, at 1:03 PM, Shuai Zheng <szheng.c...@gmail.com> wrote: >> >> Hi All, >> >> I have spent last two years on hadoop but new to spark. >> I am planning to move one of my existing system to spark to get some > enhanced features. >> >> My question is: >> >> If I try to do a map side join (something similar to "Replicated" key >> word > in Pig), how can I do it? Is it anyway to declare a RDD as "replicated" > (means distribute it to all nodes and each node will have a full copy)? >> >> I know I can use accumulator to get this feature, but I am not sure >> what > is the best practice. And if I accumulator to broadcast the data set, > can then (after broadcast) convert it into a RDD and do the join? >> >> Regards, >> >> Shuai > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For > additional commands, e-mail: user-h...@spark.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org