On 2/2/11 6:03 AM, Magnus Lie Hetland wrote:
Reading the doc for std.random.randomSample, I saw that "The total
length of r must be known". There are rather straightforward algorithms
for drawing random samples *without* knowing this. This might be useful
if one wants to support input ranges, I guess?

Take, for example, the method described by Knuth (TAoP 2), for selecting
n elements uniformly at random from an input range:

- Select the first n elements as the current sample.
- Each subsequent element is rejected with a probability of 1 - n/t,
where t is the number seen so far.
- If a new item is selected, it replaces a random item in the current
sample.

A cool property of this is that at any time, the current sample is one
drawn randomly (i.e., uniformly, without replacement) from the items
you've seen so far, so you could really stop at any point. That is, stop
iterating over the input; you can't really give the output as a
small-footprint range here (as far as I can see). Seems you have to
allocate room for n pointers. (Or you *could* just keep track of which
objects were swapped in -- might be worth the overhead if n is large
compared to the input size.)

I posted a problem solved by the algorithm above (and others, more sophisticated ones) as a challenge to this group a couple of years ago.

randomSample is designed to subsample a large stream in constant space and without needing to scan the entire stream in order to output the first element. I used in in my dissertation where e.g. I had to select 100K samples from a stream of many millions.

Having a reservoir sample available would be nice. I'd be thrilled if you coded up a reservoirSample(r, n) function for addition to std.random.


Andrei

Reply via email to