Re: sampling from frequency distribution / histogram without replacement

2019-01-18 Thread duncan smith
On 14/01/2019 20:11, duncan smith wrote:
> Hello,
>   Just checking to see if anyone has attacked this problem before
> for cases where the population size is unfeasibly large. i.e. The number
> of categories is manageable, but the sum of the frequencies, N,
> precludes simple solutions such as creating a list, shuffling it and
> using the first n items to populate the sample (frequency distribution /
> histogram).
> 
> I note that numpy.random.hypergeometric will allow me to generate a
> sample when I only have two categories, and that I could probably
> implement some kind of iterative / partitioning approach calling this
> repeatedly. But before I do I thought I'd ask if anyone has tackled this
> before. Can't find much on the web. Cheers.
> 
> Duncan
> 

After much tinkering I came up with the following:


import numpy as np

def hypgeom_variate(freqs, n):
# recursive partitioning approach
sample = [0] * len(freqs)
cumsum = np.cumsum(list(chain([0], freqs)))
if n > cumsum[-1]:
raise ValueError('n cannot be greater than population size')
hypergeometric = np.random.hypergeometric
argslist = [(0, len(freqs), 0, cumsum[-1], n)]
for i, k, ci, ck, m in argslist:
if k == i + 1:
sample[i] = m
else:
j = (i + k) // 2
cj = cumsum[j]
x = hypergeometric(cj - ci, ck - cj, m, 1)[0]
y = m-x
if x:
argslist.append((i, j, ci, cj, x))
if y:
argslist.append((j, k, cj, ck, y))
return sample


Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-15 Thread duncan smith
On 15/01/2019 17:59, Ian Hobson wrote:
> Hi,
> 
> If I understand your problem you can do it in two passes through the
> population.
> 

The thing is that I start with the population histogram and I want to
generate a sample histogram. The population itself is too large to deal
with each population member individually.

> First, however, lets work through taking a sample of 2 from 7 to
> demonstrate the method.
> 
> Take the first element with a probability of 2/7. (Note 1).
> If you took it, you only want 1 more, so the probability drops to 1/6.
> If you didn't take it you want 2 from 6, so probability goes to 2/6.
> Take the next in the population with probability 1/6 or 2/6 as appropriate.
> Continue in similar manner until the probability
> drops to 0 (when you have your whole sample). When the
> denominator drops to zero the population is expired.
> 

Yes, based on the chain rule.

> Your first pass has to categorise the population and create your
> histogram, (index N) of frequencies Y(N).
> 
> Then divide up the sample size you wish to take into the histogram,
> giving array X(N) of sample sizes. X(N) need not be integer.
> 
> Then pass through the population again, for each entry:
>    Compute the N it falls in the histogram.
>    Take this entry as a sample with a probability of X(N)/Y(N).  Note 2.
>    If the element was taken, decrement X(N).
>    Decrement Y(N).
>    step to next element.
> 

Ah, I'm not quota sampling. I want a simple random sample without
replacement. I just happen to have the data in the form of categories
and frequencies, and that's the form of output that I want.

> Note 1 - In most languages you can generate a pseudo-random number
> with a uniform distribution from 0 to Y(N)-1. Take the element if it is
> in range 0 to floor(X(N))-1.
> 
> Note 2 - X(N) need not be integer, but you can't actually take a sample
> of 6.5 out of 1000. You will either run out of population having taken
> 6, or, if you take 7, the probability will go negative, and no more
> should be taken (treat as zero). The number taken in slot N will be
> floor(X(N)) or ceiling(X(N)). The average over many tries will however
> be X(N).

> Sorry I did not come back to you sooner. It took a while to drag the
> method out of my memory from some 35 years ago when I was working on an
> audit package. 

Well I'd already forgotten that I'd coded up something for srs without
replacement only a few years ago. In fact I coded up a few algorithms
(that I can't take credit for) that allowed weighted sampling with
replacement, and at least one that didn't require a priori knowledge of
the population size (a single pass algorithm). The problem is that they
also (mostly) require scanning the whole population.

That was where I learned two things you may be interested
> in.

> 1) Auditors significantly under sample. Our Auditors actually took
> samples that were between 10% and 25% of what was necessary to support
> their claims.
> 

It's not just auditors :-(. The journals are full of claims based on
positive results from low powered tests or from "null fields". i.e. A
very high proportion are likely to be false positives (like 99% when it
comes to foodstuffs and the risks of various diseases). A while ago a
mate of mine (Prof. of statistics in Oz) told me about a student who
engineered a statistically significant result by copying and pasting her
data to double her sample size. That's no worse than some of the stuff
I've come across in the (usually medical) journals.

> 2) Very very few standard pseudo-random number generators are actually
> any good.
> 
> Regards
> 
> Ian

[snip]

BTW, the approach I'm currently using is also based on the chain rule.
Generate the number of sample units for the first category by sampling
from a (bivariate) hypergeometric. The number of sample units for the
second category (conditional on the number sampled for the first) is
another hypergeometric. Iterate until the full sample is obtained. It
helps to order the categories from largest to smallest. But I think I'll
get better performance by recursive partitioning (when I have the time
to try it). Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-15 Thread Ian Hobson

Hi,

If I understand your problem you can do it in two passes through the 
population.


First, however, lets work through taking a sample of 2 from 7 to 
demonstrate the method.


Take the first element with a probability of 2/7. (Note 1).
If you took it, you only want 1 more, so the probability drops to 1/6.
If you didn't take it you want 2 from 6, so probability goes to 2/6.
Take the next in the population with probability 1/6 or 2/6 as appropriate.
Continue in similar manner until the probability
drops to 0 (when you have your whole sample). When the
denominator drops to zero the population is expired.

Your first pass has to categorise the population and create your 
histogram, (index N) of frequencies Y(N).


Then divide up the sample size you wish to take into the histogram,
giving array X(N) of sample sizes. X(N) need not be integer.

Then pass through the population again, for each entry:
   Compute the N it falls in the histogram.
   Take this entry as a sample with a probability of X(N)/Y(N).  Note 2.
   If the element was taken, decrement X(N).
   Decrement Y(N).
   step to next element.

Note 1 - In most languages you can generate a pseudo-random number
with a uniform distribution from 0 to Y(N)-1. Take the element if it is 
in range 0 to floor(X(N))-1.


Note 2 - X(N) need not be integer, but you can't actually take a sample 
of 6.5 out of 1000. You will either run out of population having taken 
6, or, if you take 7, the probability will go negative, and no more 
should be taken (treat as zero). The number taken in slot N will be 
floor(X(N)) or ceiling(X(N)). The average over many tries will however 
be X(N).


Sorry I did not come back to you sooner. It took a while to drag the 
method out of my memory from some 35 years ago when I was working on an 
audit package. That was where I learned two things you may be interested in.


1) Auditors significantly under sample. Our Auditors actually took 
samples that were between 10% and 25% of what was necessary to support 
their claims.


2) Very very few standard pseudo-random number generators are actually 
any good.


Regards

Ian

On 14/01/2019 20:11, duncan smith wrote:

Hello,
   Just checking to see if anyone has attacked this problem before
for cases where the population size is unfeasibly large. i.e. The number
of categories is manageable, but the sum of the frequencies, N,
precludes simple solutions such as creating a list, shuffling it and
using the first n items to populate the sample (frequency distribution /
histogram).

I note that numpy.random.hypergeometric will allow me to generate a
sample when I only have two categories, and that I could probably
implement some kind of iterative / partitioning approach calling this
repeatedly. But before I do I thought I'd ask if anyone has tackled this
before. Can't find much on the web. Cheers.

Duncan



--
Ian Hobson
Tel (+351) 910 418 473
--
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-15 Thread duncan smith
On 15/01/2019 02:41, Spencer Graves wrote:
> 
> 
> On 2019-01-14 18:40, duncan smith wrote:
>> On 14/01/2019 22:59, Gregory Ewing wrote:
>>> duncan smith wrote:
 Hello,
    Just checking to see if anyone has attacked this problem before
 for cases where the population size is unfeasibly large.
>>> The fastest way I know of is to create a list of cumulative
>>> frequencies, then generate uniformly distributed numbers and
>>> use a binary search to find where they fall in the list.
>>> That's O(log n) per sample in the size of the list once it's
>>> been set up.
>>>
>> That's the sort of thing I've been thinking about. But once I'd found
>> the relevant category I'd need to reduce its frequency by 1 and
>> correspondingly update the cumulative frequencies. Alternatively, I
>> could add an extra step where I selected a unit from the relevant
>> category with probability equal to the proportion of non-sampled units
>> from the category. I could maybe set up an alias table and do something
>> similar.
>>
>> The other thing I was thinking about was iterating through the
>> categories (ideally from largest frequency to smallest frequency),
>> generating the numbers to be sampled from the current category and the
>> remaining categories (using numpy.random.hypergeometric). With a few
>> large frequencies and lots of small frequencies that could be quite
>> quick (on average). Alternatively I could partition the categories into
>> two sets, generate the number to be sampled from each partition, then
>> partition the partitions etc. binary search style.
>>
>> I suppose I'll try the both the alias table + rejection step and the
>> recursive partitioning approach and see how they turn out. Cheers.
> 
> 
>   R has functions "sample" and "sample.int";  see
> "https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample;.
> You can call R from Python,
> "https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python;.
> 
> 
> 
>   These are in the "base" package.  I believe they have been an
> important part of the base R language almost since its inception and
> have been used extensively.  You'd have to work really hard to do
> better, in my judgment.
> 
> 
>       Spencer Graves
> 
> 
> DISCLAIMER:  I'm primarily an R guy and only use Python when I can't
> find a sensible way to do what I want in R.
>>
>> Duncan
> 

Despite being a statistician I'm primarily a Python guy and only use R
when I can't find a sensible way to do what I want in Python :-). The
problem with the R solution is that it doesn't seem to get round the
issue of having an unfeasibly large population size, but a reasonable
number of categories. It turns out I've already coded up a few reservoir
based algorithms for sampling without replacement that work with data
streams. So I can get round the space issues, but it still means
processing each data point rather than generating the sample frequencies
directly.

After much searching all I've been able to find is the approach I
suggested above, iterating through the frequencies. My implementation:


import numpy

def hypgeom_variate(freqs, n):
sample = [0] * len(freqs)
nbad = sum(freqs)
hypergeometric = numpy.random.hypergeometric
for i, ngood in enumerate(freqs):
nbad -= ngood
x = hypergeometric(ngood, nbad, n, 1)[0]
if x:
sample[i] = x
n -= x
if not n:
break
return sample


Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-14 Thread Spencer Graves



On 2019-01-14 18:40, duncan smith wrote:

On 14/01/2019 22:59, Gregory Ewing wrote:

duncan smith wrote:

Hello,
   Just checking to see if anyone has attacked this problem before
for cases where the population size is unfeasibly large.

The fastest way I know of is to create a list of cumulative
frequencies, then generate uniformly distributed numbers and
use a binary search to find where they fall in the list.
That's O(log n) per sample in the size of the list once it's
been set up.


That's the sort of thing I've been thinking about. But once I'd found
the relevant category I'd need to reduce its frequency by 1 and
correspondingly update the cumulative frequencies. Alternatively, I
could add an extra step where I selected a unit from the relevant
category with probability equal to the proportion of non-sampled units
from the category. I could maybe set up an alias table and do something
similar.

The other thing I was thinking about was iterating through the
categories (ideally from largest frequency to smallest frequency),
generating the numbers to be sampled from the current category and the
remaining categories (using numpy.random.hypergeometric). With a few
large frequencies and lots of small frequencies that could be quite
quick (on average). Alternatively I could partition the categories into
two sets, generate the number to be sampled from each partition, then
partition the partitions etc. binary search style.

I suppose I'll try the both the alias table + rejection step and the
recursive partitioning approach and see how they turn out. Cheers.



  R has functions "sample" and "sample.int";  see 
"https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample;. 
You can call R from Python, 
"https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python;. 




  These are in the "base" package.  I believe they have been an 
important part of the base R language almost since its inception and 
have been used extensively.  You'd have to work really hard to do 
better, in my judgment.



      Spencer Graves


DISCLAIMER:  I'm primarily an R guy and only use Python when I can't 
find a sensible way to do what I want in R.


Duncan


--
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-14 Thread duncan smith
On 14/01/2019 22:59, Gregory Ewing wrote:
> duncan smith wrote:
>> Hello,
>>   Just checking to see if anyone has attacked this problem before
>> for cases where the population size is unfeasibly large.
> 
> The fastest way I know of is to create a list of cumulative
> frequencies, then generate uniformly distributed numbers and
> use a binary search to find where they fall in the list.
> That's O(log n) per sample in the size of the list once it's
> been set up.
> 

That's the sort of thing I've been thinking about. But once I'd found
the relevant category I'd need to reduce its frequency by 1 and
correspondingly update the cumulative frequencies. Alternatively, I
could add an extra step where I selected a unit from the relevant
category with probability equal to the proportion of non-sampled units
from the category. I could maybe set up an alias table and do something
similar.

The other thing I was thinking about was iterating through the
categories (ideally from largest frequency to smallest frequency),
generating the numbers to be sampled from the current category and the
remaining categories (using numpy.random.hypergeometric). With a few
large frequencies and lots of small frequencies that could be quite
quick (on average). Alternatively I could partition the categories into
two sets, generate the number to be sampled from each partition, then
partition the partitions etc. binary search style.

I suppose I'll try the both the alias table + rejection step and the
recursive partitioning approach and see how they turn out. Cheers.

Duncan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: sampling from frequency distribution / histogram without replacement

2019-01-14 Thread Gregory Ewing

duncan smith wrote:

Hello,
  Just checking to see if anyone has attacked this problem before
for cases where the population size is unfeasibly large.


The fastest way I know of is to create a list of cumulative
frequencies, then generate uniformly distributed numbers and
use a binary search to find where they fall in the list.
That's O(log n) per sample in the size of the list once it's
been set up.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list