Re: refer to dictionary
You can use a broadcast variable. See also this thread: http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variablesubj=How+Broadcast+variable+scale+

On Mar 31, 2015, at 4:43 AM, Peng Xia sparkpeng...@gmail.com wrote:

Hi,

I have an RDD (rdd1) where each line is split into an array [a, b, c], etc. I also have a local dictionary (dict1) that stores key-value pairs {a: 1, b: 2, c: 3}. I want to replace the keys in the RDD with their corresponding values from the dict:

rdd1.map(lambda line: [dict1[item] for item in line])

But this task is not distributed efficiently; I believe the reason is that dict1 is a local instance. Can anyone suggest how to parallelize this?

Thanks,
Best,
Peng
Re: refer to dictionary
Hi Ted,

Thanks very much. Yes, using a broadcast variable is much faster.

Best,
Peng

On Tue, Mar 31, 2015 at 8:49 AM, Ted Yu yuzhih...@gmail.com wrote:

You can use a broadcast variable. See also this thread: http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variablesubj=How+Broadcast+variable+scale+