Hi Ted,

Thanks very much. Yes, using a broadcast variable is much faster.

Best,
Peng

On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> You can use a broadcast variable.
>
> See also this thread:
> http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variable&subj=How+Broadcast+variable+scale+
>
> > On Mar 31, 2015, at 4:43 AM, Peng Xia <sparkpeng...@gmail.com> wrote:
> >
> > Hi,
> >
> > I have an RDD (rdd1) where each line is split into an array, e.g. ["a", "b", "c"].
> > I also have a local dictionary (dict1) that stores key-value pairs: {"a": 1, "b": 2, "c": 3}.
> > I want to replace the keys in the RDD with their corresponding values in the dict:
> >
> > rdd1.map(lambda line: [dict1[item] for item in line])
> >
> > But this task is not distributed; I believe the reason is that dict1 is a local instance.
> > Can anyone suggest how to parallelize this?
> >
> > Thanks,
> > Best,
> > Peng
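
For anyone finding this thread later, here is a minimal sketch of the broadcast approach discussed above. It is not the exact code from this thread: the helper name replace_keys, the app name, the "local[2]" master, and the sample data are all made up for illustration. The only Spark calls used are SparkContext.broadcast and the .value attribute of the resulting broadcast variable.

```python
def replace_keys(line, mapping):
    """Replace each item in a line with its value from the mapping.

    Hypothetical helper; raises KeyError if an item is missing from mapping.
    """
    return [mapping[item] for item in line]


def main():
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkContext

    # "local[2]" and the app name are assumptions for a local test run.
    sc = SparkContext("local[2]", "broadcast-lookup-sketch")
    try:
        dict1 = {"a": 1, "b": 2, "c": 3}
        # Made-up sample data standing in for the rdd1 described in the question.
        rdd1 = sc.parallelize([["a", "b", "c"], ["c", "a"]])

        # Ship the dictionary to each executor once instead of with every task.
        bc_dict1 = sc.broadcast(dict1)

        # Inside the closure, read the dictionary through the broadcast handle.
        result = rdd1.map(lambda line: replace_keys(line, bc_dict1.value)).collect()
        print(result)
    finally:
        sc.stop()
```

Calling main() on a machine with PySpark installed runs the lookup locally. The point of the design is that the lambda captures only the small broadcast handle (bc_dict1) rather than the dictionary itself, which tends to matter more as the dictionary grows.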