On 4/14/12 7:39 AM, Joseph Rushton Wakeling wrote:
In short, I've been implementing some random sampling algorithms in D to
help teach myself how to use the language effectively:
https://github.com/WebDrake/SampleD

This is an adaptation of some code I wrote in C a couple of years ago to
help deal with simulations where I had to take a very large random
sample from an even larger total[1]. The question came up whether this
might be useful for the random sampling tools provided in Phobos.

Absolutely. We could and should integrate Vitter's algorithm; the only reason it's not there yet is because I didn't have the time.

Phobos' std.random currently contains a RandomSample struct and
randomSample wrapper function which output a random subsample of the
input data. I would identify 2 problems with the current implementation.

(1) The algorithm used is highly inefficient. As far as I can see the code
implements Algorithm S as described in vol. 2 of Knuth's "The Art of
Computer Programming" (pp. 142-143). This algorithm sequentially goes
over each of the possible data elements deciding whether to accept or
reject it. Consequently, it's of o(N) complexity (where N is data size)
and requires o(N) random variates to be generated.

That is correct. However the distinction is often academic because the cost of generating random numbers is lower than the cost of scanning the data, and skipping N steps in the data is comparable to the cost of skipping one step N times.

Algorithms developed by Jeffrey Scott Vitter in the 1980s allow this to be
reduced to o(n) where n is sample size. These are the algorithms I've
implemented.

(2) RandomSample requires the whole data (and whole sample) to be loaded
into
memory while you're doing the sampling. Great if you can do it, but it's
sometimes prohibitive[1]. Often you don't _need_ the data to be loaded --
just the index values of the selected data -- and you may not even need
to store those as a collection.

Actually that's not correct. RandomSample works fine with an input range and does not keep it in memory.

The former is just to rewrite RandomSample to use Jeffrey Vitter's
Algorithm D (nice coincidence:-). This should be fairly easy and I'm
happy to prepare patches for this.

This would be great, but you'd need to show with benchmarks that the proposed implementation does better than the extant one for most cases that matter.



Thanks,

Andrei

Reply via email to