Oops, the reference for the above code:
http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945

On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com>
wrote:

> Hi,
>   I have three RDDs: X, y and p.
> X is a matrix RDD (m x n), y is an (m x 1) vector,
> and p is an (m x 1) probability vector.
> Now, I am trying to sample k rows from X, and the corresponding entries in y,
> based on the probability vector p.
> Here is the Python implementation:
>
> import random
> from bisect import bisect
> from operator import itemgetter
>
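> def sample(population, k, prob):
>
>     def cdf(population, k, prob):
>         # Sort items by probability, keeping each item paired with its weight
>         pairs = sorted(zip(prob, population), key=itemgetter(0))
>         population = [item for _, item in pairs]
>         prob = [p for p, _ in pairs]
>         # Running sum of the sorted probabilities (the CDF)
>         cumm = [prob[0]]
>         for i in range(1, len(prob)):
>             cumm.append(cumm[-1] + prob[i])
>         # Invert the CDF k times; clamp in case rounding leaves cumm[-1] < 1
>         last = len(cumm) - 1
>         return [population[min(bisect(cumm, random.random()), last)] for i in range(k)]
>
>     return cdf(population, k, prob)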
>
>
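For the part of the question about keeping the rows of X and the entries of y aligned, one option is to sample row indices rather than the rows themselves. The following is only a minimal sketch on plain in-memory lists, not on RDDs, and it swaps the hand-written CDF for the standard library's random.choices (Python 3.6+), which samples with replacement. The name sample_rows and the seed parameter are made up for illustration.

```python
import random


def sample_rows(X, y, p, k, seed=None):
    """Draw k (row, entry) pairs with replacement, weighted by p.

    Hypothetical helper: X is a list of rows, y and p are lists with
    one entry per row of X.
    """
    if not (len(X) == len(y) == len(p)):
        raise ValueError("X, y and p must have the same length")
    rng = random.Random(seed)
    # Sample indices, not rows, so that X and y stay aligned.
    idx = rng.choices(range(len(p)), weights=p, k=k)
    return [X[i] for i in idx], [y[i] for i in idx]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [0, 1, 0]
p = [0.0, 1.0, 0.0]  # all the probability mass on the second row

X_s, y_s = sample_rows(X, y, p, k=4, seed=0)
```

With all of the weight on one row, every draw returns that row together with its matching y entry, which is a quick way to check that the pairing holds.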
