Re: how to set random seed
Thanks for the reply. I have not tried it out (I will today and report on my results), but I think what I need to do is to call mapPartitions and pass it a function that sets the seed. I was planning to pass the seed value in the closure. Something like:

    my_seed = 42
    def f(iterator):
        random.seed(my_seed)
        yield my_seed
    rdd.mapPartitions(f)

From: ayan guha guha.a...@gmail.com
Sent: Thursday, May 14, 2015 2:29 AM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed

Sorry for the late reply. Here is what I was thinking:

    import random as r

    def main():
        # get SparkContext
        # just for fun, let's assume the seed is an id
        filename = "bin.dat"
        seed = id(filename)
        # broadcast it
        br = sc.broadcast(seed)
        # set up a dummy list
        lst = []
        for i in range(4):
            x = []
            for j in range(4):
                x.append(j)
            lst.append(x)
        print lst
        base = sc.parallelize(lst)
        print base.map(randomize).collect()

randomize looks like:

    def randomize(lst):
        local_seed = br.value
        r.seed(local_seed)
        r.shuffle(lst)
        return lst

Let me know if this helps...

On Wed, May 13, 2015 at 11:41 PM, Charles Hayden charles.hay...@atigeo.com wrote:

Can you elaborate? Broadcast will distribute the seed, which is only one number. But what construct do I use to plant the seed (call random.seed()) once on each worker?

From: ayan guha guha.a...@gmail.com
Sent: Tuesday, May 12, 2015 11:17 PM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed

Easiest way is to broadcast it.

On 13 May 2015 10:40, Charles Hayden charles.hay...@atigeo.com wrote:

In pySpark, I am writing a map with a lambda that calls random.shuffle. For testing, I want to be able to give it a seed, so that successive runs will produce the same shuffle. I am looking for a way to set this same random seed once on each worker. Is there any simple way to do it?

--
Best Regards,
Ayan Guha
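Putting the two suggestions in this thread together (broadcast the seed, then plant it once per partition with mapPartitions), a minimal sketch might look like the following. It assumes a live SparkContext named sc and the same Python 2 / Spark 1.x environment as the rest of this thread; the seed value and the toy data are only illustrative.

    import random

    br_seed = sc.broadcast(42)

    def shuffle_partition(iterator):
        # seed the worker-side RNG once per partition, then shuffle each record
        random.seed(br_seed.value)
        for lst in iterator:
            shuffled = list(lst)
            random.shuffle(shuffled)
            yield shuffled

    data = sc.parallelize([[0, 1, 2, 3] for _ in range(4)], 2)
    print data.mapPartitions(shuffle_partition).collect()

Successive runs produce the same result as long as the partitioning (and the order of records within each partition) stays the same, since every partition's shuffles are driven by the same seeded generator.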
how to set random seed
In pySpark, I am writing a map with a lambda that calls random.shuffle. For testing, I want to be able to give it a seed, so that successive runs will produce the same shuffle. I am looking for a way to set this same random seed once on each worker. Is there any simple way to do it?
pyspark error with zip
The following program fails in the zip step:

    x = sc.parallelize([1, 2, 3, 1, 2, 3])
    y = sc.parallelize([1, 2, 3])
    z = x.distinct()
    print x.zip(y).collect()

The error that is produced depends on whether multiple partitions have been specified or not. I understand that the two RDDs must have the same number of partitions and the same number of elements in each partition.

What is the best way to work around this restriction? I have been performing the operation with the following code, but I am hoping to find something more efficient:

    def safe_zip(left, right):
        ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0]))
        ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0]))
        return ix_left.join(ix_right).sortByKey().values()
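For reference, a small usage sketch of safe_zip under the same assumptions as the snippet above (a SparkContext named sc, Python 2 / Spark 1.x); the point is that the two inputs only need the same number of elements, not the same partitioning:

    a = sc.parallelize(["a", "b", "c"], 2)   # 2 partitions
    b = sc.parallelize([1, 2, 3], 3)         # 3 partitions, same element count
    print safe_zip(a, b).collect()           # [('a', 1), ('b', 2), ('c', 3)]

The join and the sortByKey each introduce a shuffle, which is the extra cost the question is asking how to avoid.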
Re: How to get a top X percent of a distribution represented as RDD
You could also consider using a count-min data structure such as in https://github.com/laserson/dsq to get approximate quantiles, then use whatever values you want to filter the original sequence.

From: Debasish Das debasish.da...@gmail.com
Sent: Thursday, March 26, 2015 9:45 PM
To: Aung Htet
Cc: user
Subject: Re: How to get a top X percent of a distribution represented as RDD

The idea is to use a heap and get the topK elements from every partition... then use aggregateBy, and for combOp do a merge routine from mergesort... basically get 100 items from partition 1 and 100 items from partition 2, merge them so that you get 200 sorted items, and take 100... for the merge you can use a heap as well... Matei had a BPQ inside Spark which we use all the time... Passing arrays over the wire is better than passing full heap objects, and a merge routine on arrays should run faster, but that needs experiment...

On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet aung@gmail.com wrote:

Hi Debasish,

Thanks for your suggestions. The in-memory version is quite useful. I do not quite understand how you can use aggregateBy to get the top 10% of elements. Can you please give an example?

Thanks,
Aung

On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com wrote:

You can do it in-memory as well... get the topK (10%) elements from each partition and use a merge from any sort algorithm like timsort... basically aggregateBy. Your version uses a shuffle, but this version is 0 shuffle... assuming your data set is cached, you will be using in-memory allReduce through treeAggregate...

But this is only good for the top 10% or bottom 10%... if you need to do it for the top 30%, then maybe the shuffle version will work better...

On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:

Hi all,

I have a distribution represented as an RDD of tuples, in rows of (segment, score). For each segment, I want to discard the tuples with the top X percent of scores. This seems hard to do in Spark RDD.

A naive algorithm would be:

1) Sort the RDD by segment score (descending).
2) Within each segment, number the rows from top to bottom.
3) For each segment, calculate the cut-off index, e.g. 90 for a 10% cut-off out of a segment with 100 rows.
4) For the entire RDD, filter rows with row num <= cut-off index.

This does not look like a good algorithm. I would really appreciate it if someone can suggest a better way to implement this in Spark.

Regards,
Aung
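A minimal PySpark sketch of the naive per-segment cut-off described in the original question (not Debasish's zero-shuffle treeAggregate variant): group each segment's scores, compute the cut-off position, and keep only the rows below it. It assumes rows are (segment, score) tuples and a SparkContext named sc, as in the question; the 10% fraction is illustrative, and groupByKey pulls each segment into memory on one executor, so this only suits segments of modest size.

    def drop_top_fraction(rdd, fraction=0.10):
        def keep_bottom(pair):
            segment, scores = pair
            scores = sorted(scores)                       # ascending by score
            cutoff = int(len(scores) * (1.0 - fraction))  # e.g. 90 of 100 rows survive
            return [(segment, s) for s in scores[:cutoff]]
        return rdd.groupByKey().flatMap(keep_bottom)

    rows = sc.parallelize([("seg1", float(i)) for i in range(100)])
    print drop_top_fraction(rows).count()                 # 90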