Fwd: sampling in spark

Chengi Liu Tue, 28 Oct 2014 23:38:40 -0700

---------- Forwarded message ----------
From: Chengi Liu <chengi.liu...@gmail.com>
Date: Tue, Oct 28, 2014 at 11:23 PM
Subject: Re: sampling in spark
To: Davies Liu <dav...@databricks.com>



Any suggestions? Thanks.

On Tue, Oct 28, 2014 at 12:53 AM, Chengi Liu <chengi.liu...@gmail.com>
wrote:

> Is there an equivalent way of doing the following:
>
> a = [1,2,3,4]
>
> reduce(lambda x, y: x+[x[-1]+y], a, [0])[1:]
>
> ??
>
>
> The issue with the above suggestion is that population is a hefty data
> structure :-/
>
> On Tue, Oct 28, 2014 at 12:42 AM, Davies Liu <dav...@databricks.com>
> wrote:
>
>>         _cumm = [p[0]]
>>         for i in range(1, len(p)):
>>             _cumm.append(_cumm[-1] + p[i])
>>         index = set([bisect(_cumm, random.random()) for i in range(k)])
>>
>>         chosed_x = (X.zipWithIndex()
>>                      .filter(lambda vi: vi[1] in index)
>>                      .map(lambda vi: vi[0]))
>>         chosed_y = [v for i, v in enumerate(y) if i in index]
>>
>>
>> On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com>
>> wrote:
>> > Oops, the reference for the above code:
>> >
>> http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
>> >
>> > On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>   I have three RDDs: X, y and p.
>> >> X is a matrix RDD (m x n), y is an (m x 1) vector,
>> >> and p is an (m x 1) probability vector.
>> >> Now, I am trying to sample k rows from X and the corresponding entries in y
>> >> based on the probability vector p.
>> >> Here is the Python implementation:
>> >>
>> >> import random
>> >> from bisect import bisect
>> >> from operator import itemgetter
>> >>
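>> >> def sample(population, k, prob):
>> >>
>> >>     def cdf(population, k, prob):
>> >>         # Sort by probability only, and reorder prob the same way so the
>> >>         # cumulative sums line up with the reordered population.
>> >>         pairs = sorted(zip(prob, population), key=itemgetter(0))
>> >>         population = [item for _, item in pairs]
>> >>         prob = [pr for pr, _ in pairs]
>> >>         cumm = [prob[0]]
>> >>         for i in range(1, len(prob)):
>> >>             cumm.append(cumm[-1] + prob[i])
>> >>         return [population[bisect(cumm, random.random())] for i in range(k)]
>> >>
>> >>     return cdf(population, k, prob)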
>> >
>> >
>>
>
>
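
For reference, the reduce expression quoted above builds a running sum of a. Here is a minimal sketch of the same computation with itertools.accumulate from the standard library. It assumes Python 3 and a list that fits in memory on the driver; it is a local computation, not a distributed Spark operation.

```python
from functools import reduce
from itertools import accumulate

a = [1, 2, 3, 4]

# Formulation from the thread: carry the running total in the last list element.
via_reduce = reduce(lambda x, y: x + [x[-1] + y], a, [0])[1:]

# Same running sum, without rebuilding the list on every step.
via_accumulate = list(accumulate(a))

print(via_reduce)      # [1, 3, 6, 10]
print(via_accumulate)  # [1, 3, 6, 10]
```

The reduce version copies the accumulated list at each step, so it is quadratic in the length of a; accumulate does a single pass.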

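The index-based approach in the quoted reply can be tried locally before moving to RDDs. Below is a rough sketch under some assumptions: Python 3, p small enough to hold as a list, and toy values for X, y and p that are purely illustrative. The helper name sample_indices and the chosen_x / chosen_y names are hypothetical and not part of any library.

```python
import random
from bisect import bisect
from itertools import accumulate

def sample_indices(p, k, rng=random):
    """Draw k indices with replacement; index i is drawn with probability p[i]."""
    cumm = list(accumulate(p))  # cumulative distribution of p
    last = len(p) - 1
    # min(...) guards against rounding that leaves cumm[-1] slightly below 1.0
    return [min(bisect(cumm, rng.random()), last) for _ in range(k)]

# Toy local stand-ins for the data in the thread (hypothetical values).
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [10, 20, 30, 40]
p = [0.1, 0.2, 0.3, 0.4]

# Duplicates collapse in the set, so it may hold fewer than k indices.
index = set(sample_indices(p, 3, random.Random(0)))
chosen_x = [row for i, row in enumerate(X) if i in index]
chosen_y = [v for i, v in enumerate(y) if i in index]
```

Only the small index set has to be shipped around, so the same set can be used to filter a large X after zipWithIndex(), as in the reply above, without collecting X itself.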