Re: refer to dictionary

2015-03-31 Thread Ted Yu
You can use a broadcast variable.

See also this thread:
http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variablesubj=How+Broadcast+variable+scale+
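
For illustration, here is a minimal sketch of what the broadcast approach might look like in PySpark. The names replace_keys, run_with_spark and dict1_bc, the sample data, and the local[2] master are assumptions made for this example and do not come from the original question. The dictionary is shipped to the executors once via sc.broadcast and read through .value inside the map.

```python
def replace_keys(line, lookup):
    """Map every item of one record to its value in the lookup dict."""
    return [lookup[item] for item in line]


def run_with_spark():
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark import SparkContext

    # "local[2]" and the app name are placeholders for this sketch.
    sc = SparkContext("local[2]", "broadcast-dict-sketch")
    try:
        # Hypothetical sample data standing in for rdd1 and dict1.
        rdd1 = sc.parallelize([["a", "b", "c"], ["c", "a"]])
        dict1 = {"a": 1, "b": 2, "c": 3}

        # Ship the dictionary to the executors once as a read-only broadcast.
        dict1_bc = sc.broadcast(dict1)

        # Read the broadcast value inside the task instead of the driver-side dict.
        return rdd1.map(lambda line: replace_keys(line, dict1_bc.value)).collect()
    finally:
        sc.stop()
```

Calling run_with_spark() on a machine with PySpark installed should return the records with every key replaced by its value. A key missing from the dictionary raises KeyError in this sketch; lookup.get(item) would be one way to tolerate that.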



 On Mar 31, 2015, at 4:43 AM, Peng Xia sparkpeng...@gmail.com wrote:
 
 Hi,
 
 I have an RDD (rdd1) where each line is split into an array, e.g. [a, b, c].
 I also have a local dictionary (dict1) that stores key-value pairs, e.g.
 {a: 1, b: 2, c: 3}.
 I want to replace each key in the RDD with its corresponding value in the
 dict:
 rdd1.map(lambda line: [dict1[item] for item in line])
 
 But this task is not distributed. I believe the reason is that dict1 is a
 local instance.
 Can anyone suggest how to parallelize this?
 
 
 Thanks,
 Best,
 Peng
 




Re: refer to dictionary

2015-03-31 Thread Peng Xia
Hi Ted,

Thanks very much. Yes, using a broadcast variable is much faster.

Best,
Peng
