Oops, the reference for the above code: http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Hi,
> I have three RDDs: X, y and p.
> X is a matrix RDD (m x n), y is an (m x 1) vector,
> and p is an (m x 1) probability vector.
> Now, I am trying to sample k rows from X, and the corresponding entries in y,
> based on the probability vector p.
> Here is the Python implementation:
>
> import random
> from bisect import bisect
> from operator import itemgetter
>
> def sample(population, k, prob):
>
>     def cdf(population, k, prob):
>         # Sort the items by probability; sort prob as well so the two stay aligned.
>         population = list(map(itemgetter(1), sorted(zip(prob, population))))
>         prob = sorted(prob)
>         # Build the cumulative distribution.
>         cumm = [prob[0]]
>         for i in range(1, len(prob)):
>             cumm.append(cumm[-1] + prob[i])
>         # Draw k items (with replacement) by inverting the CDF.
>         return [population[bisect(cumm, random.random())] for i in range(k)]
>
>     return cdf(population, k, prob)
>
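For what it's worth, here is a minimal sketch of the same idea that keeps the rows of X and the entries of y paired. It works on plain local Python lists, not RDDs, and assumes Python 3 (for itertools.accumulate). The name sample_rows and its rng parameter are made up for illustration; they are not from the original post or from Spark.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def sample_rows(X, y, p, k, rng=random):
    """Draw k (row, label) pairs with replacement; index i is picked with weight p[i].

    Hypothetical helper for illustration: X, y and p are local sequences here.
    """
    if not (len(X) == len(y) == len(p)):
        raise ValueError("X, y and p must have the same length")

    cdf = list(accumulate(p))  # running totals of the weights
    total = cdf[-1]

    picks = []
    for _ in range(k):
        # Scale the draw by the total so p need not sum to exactly 1.
        u = rng.random() * total
        # Clamp the index to guard against rounding at the top end.
        i = min(bisect_right(cdf, u), len(cdf) - 1)
        picks.append(i)

    # Use the same indices for both, so rows and labels stay matched.
    return [X[i] for i in picks], [y[i] for i in picks]
```

Sampling indices first, and only then looking up X and y, avoids sorting the data at all and keeps each row tied to its y entry. No sort is needed because inverting the cumulative sums works for weights in any order.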