I have not seen any response to this one yet.

On Sun, 16 Jul 2000 18:36:00 +0200, "D. Wright"
<[EMAIL PROTECTED]> wrote:

> 
> I'm a newbie to the group but not to statistics.  Here is my problem: I
> have a probability distribution over a discrete (or countably infinite,
> in any case "naturally binned") space and I want to test whether a
> sample agrees with the distribution when some bins have few events.
> 
> Here is an example.  The ranks of random 8 x 8 binary matrices (all
> entries are 0 or 1) are distributed as follows:
> 
>   rank  probab   sample (eg N=100)
>   8     0.2899   28
>   7     0.5776   54
>   6     0.1273   15
>   5     0.00512  2
>   4     4.4e-05
>   3     8.5e-08  1
>   2     1.7e-11
>   1     3.5e-15
>   0     5.3e-20
> 
> I want to test whether a sample of matrices is random at some given
> confidence level.  Most matrices are nearly full-rank, so there will be
> very few events in the low-rank bins.  Here is the progress of my
> thought:
> 
> 1) The traditional trick is to combine enough low-event bins together
> that the expected number of events is 10 or so, and then do a chi2
> test.  But this patently throws away information.  A rank-1 matrix event
> is telling me much more than a single rank-3 matrix event, but this
> technique gives them equal weight.  So I want a test that doesn't
> require me to combine bins.

Okay.  How many thousands or millions or billions of matrices are you
looking at, anyway?  It does sound as if, perhaps, you are trying to
compare two sets of PROPORTIONS, and that just does not work; you
always have to work the N into it if there is going to be a test.
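
For instance, with the N worked in, the comparison is between observed
and expected COUNTS, not proportions.  A minimal sketch (assuming
Python with scipy, using the N = 100 sample quoted above and pooling
ranks 6 and below, as in the "traditional trick"):

import numpy as np
from scipy.stats import chisquare

# Model probabilities for rank 8, rank 7, and "rank <= 6" pooled so
# that every expected count is comfortably large.
p = np.array([0.2899, 0.5776, 1.0 - 0.2899 - 0.5776])

# Observed counts from the N = 100 example: 28, 54, and 15 + 2 + 1 = 18.
obs = np.array([28, 54, 18])
N = obs.sum()

expected = N * p      # expected counts -- this is where the N comes in
stat, pval = chisquare(obs, f_exp=expected)
print("chi2 = %.3f, p = %.3f" % (stat, pval))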

And, WHY do you want to know?  Is it enough to say, "this is rare, so
take note ..." if you ever see rank = 0, 1, or 2?  Or 3?  Or maybe 4?
 
> 2) How about a KS test?  I tried this, but it came back with garbage
> (told me that data which I know to be good, and which did fine in a
> chi2 test, was "too good to be true").  I believe one assumption in
> the KS test is that the distribution being tested is continuous,
> i.e. not discrete.  So this doesn't work.

 - this is what makes me suspect you were not using N....
 
Where do you get your model?  Are these numbers, or likelihoods,
supposed to help you estimate some parameters that have to do with
rank?
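
(For what it's worth: one natural model here is the standard count of
rank-r matrices over GF(2).  A quick sketch, assuming Python, that
computes the resulting probabilities for the 8 x 8 case:)

from fractions import Fraction

def rank_prob(n, r, q=2):
    # Probability that a random n x n matrix over GF(q) has rank r,
    # from the standard count of rank-r matrices:
    #   prod_{i=0}^{r-1} (q^n - q^i)^2 / (q^r - q^i),  divided by q^(n*n).
    count = Fraction(1)
    for i in range(r):
        count *= Fraction((q**n - q**i) * (q**n - q**i), q**r - q**i)
    return count / q**(n * n)

for r in range(8, -1, -1):
    print(r, float(rank_prob(8, r)))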

> 3) So what do I do?  I would really like a test statistic that becomes
 < snip, rest >

How much do you want to rely on those exact p's?  Once you know how
you want to "weight" some tests, you can look at each rank-size and
consider the exact probability of its having non-zero contents.  Do
you want to combine p-levels across tests?  A simpler hypothesis might
be the one of "any difference at all," where you assign (say) 4/5 of
the 5% test to the larger values, rank 8 down to rank 5 - by requiring
that the chi-squared on those bins beat the 4% level instead of the 5%
level.  Then you figure the rest, the low-rank bins, at 1%, for 5% in
total.
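
A rough sketch of that split (assuming Python with scipy; the 4% / 1%
division, and a binomial tail for the pooled low-rank bins, are just
one way to carve it up):

import numpy as np
from scipy.stats import chisquare, binom

N = 100
obs_hi = np.array([28, 54, 15, 2])         # observed counts, ranks 8..5
p_hi   = np.array([0.2899, 0.5776, 0.1273, 0.00512])
n_low  = 1                                 # matrices seen with rank <= 4
p_low  = 4.4e-05 + 8.5e-08 + 1.7e-11 + 3.5e-15 + 5.3e-20

# Part 1: 4% of the alpha goes to a chi-squared on the high-rank bins.
# Renormalize so observed and expected totals agree within this subset.
exp_hi = (N - n_low) * p_hi / p_hi.sum()
stat, p1 = chisquare(obs_hi, f_exp=exp_hi)

# Part 2: the remaining 1% goes to the exact (binomial) tail probability
# of seeing this many rank <= 4 matrices, or more, in N draws.
p2 = binom.sf(n_low - 1, N, p_low)         # P(X >= n_low)

reject = (p1 < 0.04) or (p2 < 0.01)
print("chi2 p = %.3f, low-rank tail p = %.5f, reject at 5%%: %s"
      % (p1, p2, reject))

Either piece can trip the overall 5% test on its own; the 1% exact
piece is the part that lets a single very-low-rank matrix register.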

Or do the others matter more, and your count *is* in the billions, so
your p-level is not .0001% by default for the low ranks?

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html

