---------- Forwarded message ----------
From: Chengi Liu <chengi.liu...@gmail.com>
Date: Tue, Oct 28, 2014 at 11:23 PM
Subject: Re: sampling in spark
To: Davies Liu <dav...@databricks.com>
Any suggestions? Thanks

On Tue, Oct 28, 2014 at 12:53 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Is there an equivalent way of doing the following in Spark:
>
>     a = [1, 2, 3, 4]
>     reduce(lambda x, y: x + [x[-1] + y], a, [0])[1:]
>
> ??
>
> The issue with the above suggestion is that population is a hefty data
> structure :-/
>
> On Tue, Oct 28, 2014 at 12:42 AM, Davies Liu <dav...@databricks.com> wrote:
>>
>> _cumm = [p[0]]
>> for i in range(1, len(p)):
>>     _cumm.append(_cumm[-1] + p[i])
>> index = set([bisect(_cumm, random.random()) for i in range(k)])
>>
>> chosed_x = X.zipWithIndex().filter(lambda (v, i): i in index).map(lambda (v, i): v)
>> chosed_y = [v for i, v in enumerate(y) if i in index]
>>
>> On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
>> > Oops, the reference for the above code:
>> > http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
>> >
>> > On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>   I have three RDDs: X, y and p.
>> >> X is a matrix RDD (m x n), y is an (m x 1) vector,
>> >> and p is an (m x 1) probability vector.
>> >> Now, I am trying to sample k rows from X, and the corresponding entries in y,
>> >> based on the probability vector p.
>> >> Here is the Python implementation:
>> >>
>> >> import random
>> >> from bisect import bisect
>> >> from operator import itemgetter
>> >>
>> >> def sample(population, k, prob):
>> >>     def cdf(population, k, prob):
>> >>         population = map(itemgetter(1), sorted(zip(prob, population)))
>> >>         cumm = [prob[0]]
>> >>         for i in range(1, len(prob)):
>> >>             cumm.append(cumm[-1] + prob[i])
>> >>         return [population[bisect(cumm, random.random())]
>> >>                 for i in range(k)]
>> >>     return cdf(population, k, prob)
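
For reference, a minimal sketch of the approach Davies describes, under the assumption that y and p are small enough to collect to the driver while X stays an RDD. numpy.cumsum stands in for the explicit running-sum loop (and for the reduce() one-liner asked about above); weighted_sample and the variable names vi are just illustrative, not an existing API.

import numpy as np

def weighted_sample(X, y, p, k):
    # X: RDD of rows; y, p: plain Python lists on the driver
    # (collect() them first if they start out as RDDs).
    probs = np.asarray(p, dtype=float)
    cdf = np.cumsum(probs / probs.sum())   # running sum, same idea as the reduce() above

    # k draws against the CDF; side='right' matches bisect.bisect.
    # Duplicate draws collapse into a set of row indices, as in Davies' snippet.
    index = set(int(i) for i in np.searchsorted(cdf, np.random.random(k), side='right'))

    # Pull the matching rows out of X without collecting X itself.
    chosen_x = (X.zipWithIndex()
                 .filter(lambda vi: vi[1] in index)
                 .map(lambda vi: vi[0]))
    chosen_y = [v for i, v in enumerate(y) if i in index]
    return chosen_x, chosen_y

The zipWithIndex/filter step keeps X distributed, so only the probability vector and the sampled indices ever live on the driver; if p itself were too large to collect, a fully distributed cumulative-sum strategy would be needed instead.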