You can use a broadcast variable.

See also this thread:
http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variable&subj=How+Broadcast+variable+scale+



> On Mar 31, 2015, at 4:43 AM, Peng Xia <sparkpeng...@gmail.com> wrote:
> 
> Hi,
> 
> I have an RDD (rdd1) where each line is split into an array ["a", "b", "c"], etc.
> I also have a local dictionary (dict1) that stores key-value pairs {"a": 1, 
> "b": 2, "c": 3}.
> I want to replace the keys in the RDD with their corresponding values in the 
> dict:
> rdd1.map(lambda line: [dict1[item] for item in line])
> 
> But this task is not distributed; I believe the reason is that dict1 is a 
> local instance.
> Can anyone suggest how to parallelize this?
> 
> 
> Thanks,
> Best,
> Peng
> 
