On Wed, 24 May 2000, Mike Stiso wrote:

> I'm hoping some of you might be able to help me with the following
> statistical problem. 
                        Not sure I can, but here's a start.

>  I have a sample of about 2500 subjects, each of which has been grouped 
> into 1 of a possible 70 categories (based on their behavior and other 
> criteria).
                Well-defined, pre-existing categories?  Or categories 
derived from data supplied by the Ss, e.g. by textual analysis? 
Are all possible categories represented in the data?  If not, is it 
reasonable for some not to be present?

> Out of that data, I've created a table listing those categories in 
> descending order according to the frequency with which each occurred in 
> the sample. For example: 

> 1)   category 6     17%
> 2)   category 35    13%
> 3)   category 65     7%
> 4)   category 2      3%
> .
> 70) category 40      0.1%
> -------------------------
> total              100%
> 
        <  snip  >
> ... it's important to my project to ensure the accuracy of the ordering 
> within the table for the top fifth of the categories -- in other words, 
> I need to be reasonably certain (p = .01?) that the top, say, 15 
> categories in my table actually are the 15 most common categories ...

Why "top fifth", or "15"?  (One could get here by deciding to account 
for, say, 90% of the responses [in which case, why 90%?]; or by deciding 
to take categories representing at least k% of the responses for some 
(evidently small -- if your example reflects your reality, you're already 
down to about k = 1 at the 5th or 6th category) value of k [again, why k, 
in particular?]; or by deciding to take categories down to the smallest % 
that can be distinguished from the next smallest % by a statistical test. 
If all of this is just blind guessing on your part (or on the part of 
your employers/sponsors/whoever), it might pay you to select a manner of 
deciding that appears to make some sense.)

> ... occurring within the population, and also (less importantly) that
> their relative frequencies correspond to the ranking indicated by the 
> table.  Is such a determination possible? Or, looking at the question 
> another way, is it possible to determine the optimal sample size for 
> achieving a stable frequency table? And if so, can you point me in the 
> right direction? 

In the table above, 4 categories already account for 40% of the 
respondents.  If your 70th category really does have as many as 0.1% in 
it (which is only 2 or 3 respondents, after all), there must be lots of 
categories whose relative frequencies are indistinguishable from each 
other.  For 2500 respondents, the margin of error for a 99% confidence 
interval around a proportion of 1% is about +/- 0.005, that is, about 
half a percentage point.  Around a proportion of 3% it's a bit under one 
percentage point.  This implies that at the p = .01 level you 
suggest, you can tell the 3% for the fourth category above from the 7% 
for the third category, but possibly not from the (say) 2% for the fifth 
category, with 2500 respondents.
        Do you still think you want a decent stability of order among as 
many as 15 categories?
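
If you want to check this sort of arithmetic for your own table, here is 
a minimal Python sketch.  It assumes the usual normal approximation to 
the binomial, and the function names (margin_of_error, 
difference_margin) are mine, purely for illustration.  The second 
function is one reasonable way to compare two categories from the same 
sample: their counts are negatively correlated, so the variance of the 
difference is (p1 + p2 - (p1 - p2)^2) / n.  Neither function corrects 
for making many comparisons at once, so treat the results as a rough 
guide rather than a formal test.

```python
import math
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.99):
    """Half-width of a normal-approximation confidence interval for a
    single proportion p estimated from n respondents."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)


def difference_margin(p1, p2, n, confidence=0.99):
    """Half-width of a normal-approximation interval for p1 - p2 when
    both proportions are categories of the same sample of size n
    (multinomial sampling, so the two estimates are correlated)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return z * math.sqrt(variance)


if __name__ == "__main__":
    n = 2500  # sample size quoted in the original question

    # Margins for single proportions, in percentage points.
    for p in (0.01, 0.03, 0.07):
        print(f"p = {p:.2f}: +/- {100 * margin_of_error(p, n):.2f} points")

    # Adjacent categories: is the observed gap wider than the margin?
    # The 2% figure is the hypothetical fifth category from the text.
    for p_hi, p_lo in ((0.07, 0.03), (0.03, 0.02)):
        gap = p_hi - p_lo
        margin = difference_margin(p_hi, p_lo, n)
        verdict = "distinguishable" if gap > margin else "not distinguishable"
        print(f"{p_hi:.2f} vs {p_lo:.2f}: gap {gap:.3f}, "
              f"margin {margin:.4f} -> {verdict}")
```

Running the same comparison down successive rows of the real table would 
show roughly where the ordering stops being trustworthy at this sample 
size.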

 ------------------------------------------------------------------------
 Donald F. Burrill                                 [EMAIL PROTECTED]
 348 Hyde Hall, Plymouth State College,          [EMAIL PROTECTED]
 MSC #29, Plymouth, NH 03264                                 603-535-2597
 184 Nashua Road, Bedford, NH 03110                          603-471-7128  



