Re: how to set random seed

2015-05-14 Thread Charles Hayden
...): random.seed(my_seed) yield my_seed rdd.mapPartitions(f)
From: ayan guha guha.a...@gmail.com / Sent: Thursday, May 14, 2015 2:29 AM / To: Charles Hayden / Cc: user / Subject: Re: how to set random seed
Sorry for late reply. Here is what I was thinking: import random ...
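The quoted code is cut off, but the idea appears to be seeding Python's random module once per partition via mapPartitions. A minimal sketch of that approach, applied to the per-row shuffle described in the original question below (MY_SEED, shuffle_rows, and the switch to mapPartitionsWithIndex so each partition gets a distinct but reproducible seed are illustrative choices, not from the thread):

    import random

    MY_SEED = 42  # any fixed value; reusing it across runs keeps the shuffle reproducible

    def shuffle_rows(index, iterator):
        # Seed once per partition. Offsetting by the partition index keeps
        # partitions from emitting identical pseudo-random streams while
        # still making every run deterministic.
        random.seed(MY_SEED + index)
        for row in iterator:
            row = list(row)
            random.shuffle(row)
            yield row

    shuffled = rdd.mapPartitionsWithIndex(shuffle_rows)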

how to set random seed

2015-05-12 Thread Charles Hayden
In PySpark, I am writing a map with a lambda that calls random.shuffle. For testing, I want to be able to give it a seed so that successive runs produce the same shuffle. I am looking for a way to set the same random seed once on each worker. Is there a simple way to do that?

pyspark error with zip

2015-03-31 Thread Charles Hayden
The following program fails in the zip step:
    x = sc.parallelize([1, 2, 3, 1, 2, 3])
    y = sc.parallelize([1, 2, 3])
    z = x.distinct()
    print x.zip(y).collect()
The error that is produced depends on whether or not multiple partitions have been specified. I understand that the two RDDs must ...
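For reference, RDD.zip requires both RDDs to have the same number of partitions and the same number of elements in every partition, which distinct() and unequal input lengths easily break. One common workaround, sketched below, is to pair elements by an explicit index with zipWithIndex and join instead of relying on zip's alignment assumption (the names a, b, and pairs are illustrative):

    a = sc.parallelize([1, 2, 3, 1, 2, 3]).distinct()
    b = sc.parallelize([1, 2, 3])

    # zipWithIndex yields (element, index); swap to (index, element) and join on the index key.
    # Note: element order after distinct() is whatever order Spark produces, not the input order.
    a_by_pos = a.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
    b_by_pos = b.zipWithIndex().map(lambda kv: (kv[1], kv[0]))

    pairs = a_by_pos.join(b_by_pos).values()
    print pairs.collect()  # [(a_elem, b_elem), ...] paired by position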

Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Charles Hayden
You could also consider using a count-min data structure such as in https://github.com/laserson/dsq to get approximate quantiles, then use whatever values you want to filter the original sequence.
From: Debasish Das debasish.da...@gmail.com / Sent: Thursday, ...
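A minimal sketch of that quantile-then-filter idea, using a plain driver-side sample to estimate the cutoff rather than the dsq library (top_fraction, the 1% sample rate, and the values RDD are illustrative assumptions):

    top_fraction = 0.10  # keep roughly the top 10% of the distribution

    # Estimate the (1 - top_fraction) quantile from a small sample collected to the driver.
    sample = sorted(values.sample(False, 0.01, 42).collect())
    cutoff = sample[int(len(sample) * (1.0 - top_fraction))]  # assumes a non-empty sample

    # Filter the full RDD against the estimated threshold.
    top_values = values.filter(lambda v: v >= cutoff)

The cutoff here is only as good as the sample; the count-min/dsq route suggested above is the more principled way to get an approximate quantile over the full RDD.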