----- Forwarded Message ----- From: Lee Rhodes <[email protected]>To: 
Eric Bax <[email protected]>Sent: Friday, October 18, 2019, 03:38:37 PM 
PDTSubject: Re: Sampling for sketches
 Eric, this is great.  But in the spirit of openness, would you mind moving 
this discussion to either [email protected] or to our Apache slack 
workspace:
the-asf.slack.com / channel: datasketches-dev.
Thanks,
Lee.




On Fri, Oct 18, 2019 at 3:23 PM Eric Bax <[email protected]> wrote:

 Thanks for the links. I see that it really is like selecting a sample of a 
specified size. 
In that case, I think hypergeometric bounds apply. 
Letn = population size.m = number of "successes" in population (e.g. number of 
people who visited the site 15 days last month.)s = sample size (e.g. number of 
users in our little sketch sample)k = number of "successes" in sample.
Then the probability of seeing k / s in our sample, given m / n at large, is 
hypergeometric:
p(n, m, s, k) = C(m, k) C(n - m, s - k) / C(n, s).
So an upper bound for m, with probability of bound failure delta, is:
max {m | sum(i = 0 to k) p(n, m, s, i) >= delta},
the maximum m-value that has k /s or less sample successes with probability at 
least delta. 
A lower bound is:
min {m | sum(i = k to s) p(n, m, s, i) >= delta}.
I'll attach some python code that computes these bounds.



    On Thursday, October 17, 2019, 05:28:27 PM PDT, Lee Rhodes 
<[email protected]> wrote:  
 
 Eric,
It is more like the second case.  
Example 1:  Reservoir sampling sketch.  Choose k, then pick samples from the 
stream of n samples such that each sample in the stream has the same 
probability, k/n, of ending up in the reservoir.   However, if there are 
duplicates in the stream, you can have duplicates in the reservoir.
Example 2: "Sampling" distinct values using a Theta (or KMV) Sketch  (which 
retains sample proxies).     
   - Remove all duplicates
   - Use Reservoir sampling to select your samples.
This is not HOW the sketches work, but from a probability point-of-view, it 
should be the same.
There is a very rudimentary explanation of how Theta (and KMV) sketches work 
on-line at https://datasketches.github.io/docs/Theta/InverseEstimate.html

The paper for Theta sketches is 
https://github.com/DataSketches/DataSketches.github.io/blob/master/docs/pdf/ThetaSketchFramework.pdf

HLL sketches are another animal altogether and don't retain samples.
Lee.





On Thu, Oct 17, 2019 at 4:36 PM Eric Bax <[email protected]> wrote:

Is the sampling like: 
For each user in the population, I randomly determine whether to include them 
in the sample with a fixed probability?
Rather than:
I decide how many samples to take. Then I select that many users at random from 
the population? 
I think the former. Just want to make sure for thinking about whether samples 
of categories are independent. In other words union bounds would not be needed 
for multiple categories. 


Sent from Yahoo Mail for iPhone

  
  

Reply via email to