Re: how to set random seed

2015-05-14 Thread Charles Hayden
Thanks for the reply.


I have not tried it out (I will today and report on my results) but I think 
what I need to do is to call mapPartitions and pass it a function that sets the 
seed.  I was planning to pass the seed value in the closure.


Something like:

import random

my_seed = 42

def f(iterator):
    random.seed(my_seed)   # seed once per partition; my_seed comes in via the closure
    for x in iterator:
        yield x

rdd.mapPartitions(f)
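
Since the end goal in the original post is a seeded random.shuffle, a fuller sketch of the same pattern might look like this (assuming an existing SparkContext sc, as in the pyspark shell; the data, partition count, and function name are made up for illustration):

import random

my_seed = 42

def seeded_shuffle(iterator):
    random.seed(my_seed)        # seed once per partition
    for row in iterator:
        random.shuffle(row)     # shuffle each list-valued record in place
        yield row

# hypothetical data: four list-valued records across two partitions
rdd = sc.parallelize([[0, 1, 2, 3] for _ in range(4)], 2)
print(rdd.mapPartitions(seeded_shuffle).collect())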



From: ayan guha guha.a...@gmail.com
Sent: Thursday, May 14, 2015 2:29 AM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed

Sorry for the late reply.

Here is what I was thinking:

import random as r
from pyspark import SparkContext

def main():
    sc = SparkContext()   # get a SparkContext
    # Just for fun, let's assume the seed is an id
    filename = "bin.dat"
    seed = id(filename)
    # broadcast it; note randomize() below reads br, so in a real script br
    # would need to be visible to that function (e.g. defined at module level)
    br = sc.broadcast(seed)

    # set up a dummy list
    lst = []
    for i in range(4):
        x = []
        for j in range(4):
            x.append(j)
        lst.append(x)
    print(lst)

    base = sc.parallelize(lst)
    print(base.map(randomize).collect())

randomize looks like:

def randomize(lst):
    local_seed = br.value
    r.seed(local_seed)
    r.shuffle(lst)
    return lst


Let me know if this helps...





On Wed, May 13, 2015 at 11:41 PM, Charles Hayden charles.hay...@atigeo.com wrote:

Can you elaborate? Broadcast will distribute the seed, which is only one
number.  But what construct do I use to plant the seed (call random.seed()) 
once on each worker?


From: ayan guha guha.a...@gmail.com
Sent: Tuesday, May 12, 2015 11:17 PM
To: Charles Hayden
Cc: user
Subject: Re: how to set random seed


Easiest way is to broadcast it.

On 13 May 2015 10:40, Charles Hayden charles.hay...@atigeo.com wrote:
In pySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs will 
produce the same shuffle.
I am looking for a way to set this same random seed once on each worker. Is
there any simple way to do it?




--
Best Regards,
Ayan Guha


how to set random seed

2015-05-12 Thread Charles Hayden
In pySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs will 
produce the same shuffle.
I am looking for a way to set this same random seed once on each worker. Is
there any simple way to do it?



pyspark error with zip

2015-03-31 Thread Charles Hayden

The following program fails in the zip step.

x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print(z.zip(y).collect())


The error that is produced depends on whether multiple partitions have been 
specified or not.

I understand that

the two RDDs [must] have the same number of partitions and the same number of 
elements in each partition.
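
One way to see where this bites is to look at the partition layout directly; a small check along these lines (data and partition count are just for illustration) shows that distinct() redistributes elements by hash, so its per-partition counts need not line up with y's:

x = sc.parallelize([1, 2, 3, 1, 2, 3], 2)
y = sc.parallelize([1, 2, 3], 2)
z = x.distinct()

# glom() turns each partition into a list so the layout is visible
print(y.glom().collect())
print(z.glom().collect())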

What is the best way to work around this restriction?

I have been performing the operation with the following code, but I am hoping 
to find something more efficient.

def safe_zip(left, right):
    ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0]))
    ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0]))
    return ix_left.join(ix_right).sortByKey().values()
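
A possible call site, assuming the RDDs from the snippet at the top of this message:

z = sc.parallelize([1, 2, 3, 1, 2, 3]).distinct()
y = sc.parallelize([1, 2, 3])

# pairs elements by position index; the order of z's elements depends on distinct()
print(safe_zip(z, y).collect())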




Re: How to get a top X percent of a distribution represented as RDD

2015-03-26 Thread Charles Hayden
You could also consider using a count-min data structure such as the one in
https://github.com/laserson/dsq
to get approximate quantiles, then use whatever values you want to filter the original sequence.
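
The dsq API itself is not shown in this thread, so purely as an illustration of the same two-pass idea, here is a sketch that estimates a per-segment cut-off from a plain sample instead of a sketch structure (the RDD name, sample fraction, and keep fraction are all assumptions):

keep_fraction = 0.10   # hypothetical: keep the top 10% of scores in each segment

def estimate_cutoff(scores, frac):
    s = sorted(scores)
    return s[int(len(s) * (1.0 - frac))]

# rdd is assumed to hold (segment, score) pairs, as in the original question
sampled_cutoffs = (rdd.sample(False, 0.05)
                      .groupByKey()
                      .mapValues(lambda scores: estimate_cutoff(list(scores), keep_fraction))
                      .collectAsMap())
cutoffs = sc.broadcast(sampled_cutoffs)

# keep rows at or above the estimated cut-off; segments missing from the sample pass through
top = rdd.filter(lambda kv: kv[1] >= cutoffs.value.get(kv[0], float("-inf")))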


From: Debasish Das debasish.da...@gmail.com
Sent: Thursday, March 26, 2015 9:45 PM
To: Aung Htet
Cc: user
Subject: Re: How to get a top X percent of a distribution represented as RDD

The idea is to use a heap and get the top-K elements from every partition, then use
aggregateBy and, for combOp, do a merge routine as in mergesort: basically get 100 items
from partition 1 and 100 items from partition 2, merge them so that you get 200 sorted
items, and take 100. For the merge you can use a heap as well. Matei had a BPQ (bounded
priority queue) inside Spark which we use all the time. Passing arrays over the wire is
better than passing full heap objects, and a merge routine on arrays should run faster,
but that needs experiment.
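
A rough sketch of that per-partition top-K plus merge, using Python's heapq with treeAggregate (K, the RDD name, and treating each element as a plain score are assumptions):

import heapq

K = 100

def seq_op(heap, score):
    # keep at most K of the largest scores seen so far in this partition
    heapq.heappush(heap, score)
    if len(heap) > K:
        heapq.heappop(heap)
    return heap

def comb_op(a, b):
    # merge two partial top-K results and keep the K largest overall
    return heapq.nlargest(K, a + b)

top_k = scores_rdd.treeAggregate([], seq_op, comb_op)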

On Thu, Mar 26, 2015 at 9:26 PM, Aung Htet aung@gmail.com wrote:
Hi Debasish,

Thanks for your suggestions. The in-memory version is quite useful. I do not quite
understand how you can use aggregateBy to get the top 10% (top-K) elements. Can you
please give an example?

Thanks,
Aung

On Fri, Mar 27, 2015 at 2:40 PM, Debasish Das debasish.da...@gmail.com wrote:
You can do it in-memory as well: get the top 10% (top-K) elements from each partition
and use a merge from any sort algorithm like timsort, basically aggregateBy.

Your version uses a shuffle, but this version is zero-shuffle; assuming your data set
is cached, you will be using an in-memory allReduce through treeAggregate.

But this is only good for the top 10% or bottom 10%; if you need to do it for the top
30%, then maybe the shuffle version will work better.
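
For the per-segment layout in the original question, the same pattern can be keyed with aggregateByKey; a sketch (K and the RDD name are assumptions, and this keeps the K highest scores per segment rather than a percentage):

import heapq

K = 100

def add_score(heap, score):
    # bounded heap of the K largest scores seen for this segment so far
    heapq.heappush(heap, score)
    if len(heap) > K:
        heapq.heappop(heap)
    return heap

def merge_heaps(a, b):
    return heapq.nlargest(K, a + b)

# pairs_rdd is assumed to hold (segment, score) pairs
top_k_per_segment = pairs_rdd.aggregateByKey([], add_score, merge_heaps)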

On Thu, Mar 26, 2015 at 8:31 PM, Aung Htet aung@gmail.com wrote:
Hi all,

I have a distribution represented as an RDD of tuples, in rows of (segment, score).
For each segment, I want to discard the tuples with the top X percent of scores. This
seems hard to do with Spark RDDs.

A naive algorithm would be:

1) Sort the RDD by segment and score (descending)
2) Within each segment, number the rows from top to bottom.
3) For each segment, calculate the cut-off index, e.g. 90 for a 10% cut-off in a segment with 100 rows.
4) For the entire RDD, filter rows with row num <= cut-off index
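
As a point of reference, a compact way to express the cut-off above is sketched below; it simplifies steps 1-4 by sorting each segment's scores in memory instead of numbering rows, and the 10% figure is just the example from step 3:

cut = 0.10   # discard the top 10% of scores in each segment

def drop_top(scores):
    s = sorted(scores, reverse=True)   # step 1: sort within the segment, descending
    k = int(len(s) * cut)              # steps 2-3: rows falling in the top 10%
    return s[k:]                       # step 4: keep everything below the cut-off

# rdd is assumed to hold (segment, score) pairs
kept = rdd.groupByKey().flatMapValues(lambda scores: drop_top(list(scores)))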

This does not look like a good algorithm. I would really appreciate it if someone
could suggest a better way to implement this in Spark.

Regards,
Aung