def f(iterator):
    random.seed(my_seed)
    yield my_seed

rdd.mapPartitions(f)
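A minimal plain-Python sketch of the idea (no Spark required; lists stand in for partitions, and my_seed is taken from the snippet above): seeding the module-level RNG at the top of the partition function makes every shuffle inside that partition deterministic across runs.

```python
import random

my_seed = 42

def f(iterator):
    # Runs once per partition on the worker: seed the module-level
    # RNG before any random.shuffle calls in this partition.
    random.seed(my_seed)
    data = list(iterator)
    random.shuffle(data)
    yield data

# Simulate two partitions of an RDD with plain lists.
partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]

run1 = [next(f(iter(p))) for p in partitions]
run2 = [next(f(iter(p))) for p in partitions]
assert run1 == run2  # successive runs produce the same shuffle
```

In real PySpark, rdd.mapPartitions(f) would invoke f with each partition's iterator, so the seed is set once per partition on each worker.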
From: ayan guha guha.a...@gmail.com
Sent: Thursday, May 14, 2015 2:29 AM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed
Sorry for the late reply.
Here is what I was thinking:
import random
In pySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs will
produce the same shuffle.
I am looking for a way to set the same random seed once on each worker. Is
there any simple way to do it?
The following program fails in the zip step.
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print x.zip(y).collect()
The error that is produced depends on whether multiple partitions have been
specified or not.
I understand that the two RDDs must have the same number of partitions and
the same number of elements in each partition.
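A plain-Python sketch of the constraint (no Spark; nested lists stand in for partitions), plus the usual workaround of keying both datasets by position (what zipWithIndex followed by a join does in Spark), which needs no partition alignment:

```python
# Each inner list stands in for one RDD partition.
x_parts = [[1, 2, 3], [1, 2, 3]]   # 2 partitions, 6 elements
y_parts = [[1, 2, 3]]              # 1 partition, 3 elements

def can_zip(a_parts, b_parts):
    # Spark's zip requires the same number of partitions AND the
    # same number of elements in each corresponding partition.
    return (len(a_parts) == len(b_parts) and
            all(len(a) == len(b) for a, b in zip(a_parts, b_parts)))

assert not can_zip(x_parts, y_parts)  # this is why x.zip(y) fails

# Workaround analogous to zipWithIndex + join: key both datasets by
# position, then join on the key.
def by_index(parts):
    flat = [v for p in parts for v in p]
    return dict(enumerate(flat))

xi, yi = by_index(x_parts), by_index(y_parts)
pairs = [(xi[i], yi[i]) for i in sorted(xi.keys() & yi.keys())]
assert pairs == [(1, 1), (2, 2), (3, 3)]
```

The names can_zip and by_index are illustrative, not Spark APIs; in PySpark the equivalent would be x.zipWithIndex().map(lambda kv: (kv[1], kv[0])) joined against the same transform of y.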
You could also consider using a count-min data structure such as in
https://github.com/laserson/dsq
to get approximate quantiles, then use whatever values you want to filter the
original sequence.
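For a sense of the underlying idea, here is a minimal count-min sketch in plain Python (hypothetical width/depth parameters, not the dsq API): it gives approximate per-value counts in fixed memory, and quantile estimation can be built on top of such counts.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hashed column per row, derived from a salted hash.
        for row in range(self.depth):
            h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate cells, so the minimum across
        # rows is the tightest (over-)estimate of the true count.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for v in [1, 2, 2, 3, 3, 3]:
    cms.add(v)
assert cms.estimate(3) >= 3  # count-min never undercounts
```

The sketch trades exactness for bounded memory: estimates are upper bounds whose error shrinks as width and depth grow.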
From: Debasish Das debasish.da...@gmail.com
Sent: Thursday,