I have to ask ... if you have a known database, wouldn't you know how many entries are in it? For example, if you are sampling n from it, then you get n from SOME fixed database ...
If you have access to the database, it seems like you could ask for THE mean, THE frequency distribution, etc., on specified variables ... or am I missing something?

At 10:21 AM 6/25/02 -0400, William R. Pearson wrote:
>I have a sampling problem that I suspect has already been solved by
>statisticians, but I do not know the solution.
>
>I wish to estimate the extreme value location and scale parameters
>from a list of similarity scores for a biological sequence comparison
>program - FASTA.
>
>Right now, when FASTA samples scores, it takes the first 60,000 scores
>and estimates the mean and variance from them. However, some
>databases have more than 1,000,000 entries. It is important that this
>sample have a random distribution of sequence lengths and
>compositions.
>
>Sometimes (well, at least once) people use databases that are sorted by
>length - this completely screws up my sampling strategy.
>
>It would be easy to sample through the entire database, if I knew how
>big the database was. But I don't for most of my database types.
>
>So my question is:
>
>  Give me a simple algorithm that will relatively uniformly sample
>60,000 scores from a potentially very large number of scores, where the
>number is not known.
>
>One possibility I have considered:
>
>(0) Save the first 60,000
>
>(1) For the next 60,000, take every other score and use it to replace
>    every other of the first 60K
>
>(2) For the next 120,000, take every 4th score and use it to replace
>    every other of the previous
>
>(3) For the next 240,000, take every 8th score and use it to replace
>    every other of the previous
>
>The problem with this strategy is that the same scores keep getting
>replaced.
>
>A possible solution is to do the replacements in blocks. At step (1),
>replace every other single score; at step (2) replace every 4th block
>of 2, etc.
>
>My questions for you folks -
>
>(1) does the block strategy work?
>(2) is there a better way (short of simply replacing random positions,
>which would be easy, but would require a random number generator).
>
>I realize that a combination of sampling every n-th score (with n
>increasing 2-fold every so often), combined with random replacement
>of the 60,000 scores, is a simple solution. I'm looking for a more
>elegant one.
>
>Bill Pearson
>
>.
>.
>=================================================================
>Instructions for joining and leaving this list, remarks about the
>problem of INAPPROPRIATE MESSAGES, and archives are available at:
>.                  http://jse.stat.ncsu.edu/                    .
>=================================================================

Dennis Roberts, 208 Cedar Bldg., University Park PA 16802
<Emailto: [EMAIL PROTECTED]>
WWW: http://roberts.ed.psu.edu/users/droberts/drober~1.htm
AC 8148632401
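
On the "replacing random positions" option raised in the quoted message: that idea is commonly known as reservoir sampling, and it gives every score the same chance of ending up in the sample without knowing the total count in advance. Below is a minimal Python sketch of the basic version (often called Algorithm R). It is not FASTA's code; the names `reservoir_sample`, `scores`, `k`, and `rng` are made up for illustration, and it assumes a random number generator is acceptable.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    scores: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items drawn uniformly from a stream of unknown length.

    Hypothetical helper for illustration only (not part of FASTA).
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, score in enumerate(scores):
        if i < k:
            # Fill the reservoir with the first k scores.
            reservoir.append(score)
        else:
            # Pick a slot uniformly from 0..i (inclusive). The new score is
            # kept with probability k / (i + 1), and when kept it overwrites
            # a uniformly chosen existing entry.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = score
    return reservoir


if __name__ == "__main__":
    # A stream sorted by value, standing in for a length-sorted database.
    sample = reservoir_sample(iter(range(1_000_000)), 60_000)
    print(len(sample), min(sample), max(sample))
```

Because each incoming score is compared against a fresh random slot, the order of the database (sorted by length or not) does not bias which scores survive. The cost is one random number per score after the first k.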