Hi Ted,

Thanks very much. Yes, using a broadcast variable is much faster.
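
For anyone finding this thread later, here is a minimal sketch of what the broadcast version of the lookup might look like. It is not necessarily the exact code used here. The names replace_with_broadcast and demo are hypothetical, and the sample data is made up to match the example in the original question; sc is assumed to be a SparkContext and rdd an RDD of lists of keys.

```python
def replace_with_broadcast(sc, rdd, mapping):
    """Replace every item in each row of `rdd` with mapping[item].

    Assumes `sc` is a SparkContext and `rdd` is an RDD whose rows are
    lists of keys that all exist in `mapping`.
    """
    # Send the dictionary to each executor once, rather than
    # serializing it into the closure of every task.
    lookup = sc.broadcast(mapping)
    # Inside the closure, read the dictionary through `.value`.
    return rdd.map(lambda line: [lookup.value[item] for item in line])


def demo():
    """Hypothetical local run; requires a PySpark installation."""
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-lookup-sketch")
    try:
        dict1 = {"a": 1, "b": 2, "c": 3}
        rdd1 = sc.parallelize([["a", "b", "c"], ["c", "a"]])
        return replace_with_broadcast(sc, rdd1, dict1).collect()
    finally:
        sc.stop()
```

Calling demo() on a machine with PySpark installed runs the lookup in local mode. A key missing from the dictionary would raise a KeyError inside the task, so lookup.value.get(item, item) may be a safer choice if the data can contain unknown keys.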

Best,
Peng

On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> You can use a broadcast variable.
>
> See also this thread:
>
> http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variable&subj=How+Broadcast+variable+scale+
>
>
>
> > On Mar 31, 2015, at 4:43 AM, Peng Xia <sparkpeng...@gmail.com> wrote:
> >
> > Hi,
> >
> > I have an RDD (rdd1) where each line is split into an array, e.g.
> > ["a", "b", "c"].
> > I also have a local dictionary (dict1) that stores key-value pairs:
> > {"a": 1, "b": 2, "c": 3}
> > I want to replace the keys in the RDD with their corresponding values
> > in the dict:
> > rdd1.map(lambda line: [dict1[item] for item in line])
> >
> > But this task is not distributed; I believe the reason is that dict1 is a
> > local instance.
> > Can anyone provide suggestions on how to parallelize this?
> >
> >
> > Thanks,
> > Best,
> > Peng
> >
>
