Tom Lane <t...@sss.pgh.pa.us> wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> Well, the standard syntax apparently aims to reduce the number of
>> returned rows, which ORDER BY does not.  Maybe you could do it
>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
>> like to sample the table without reading all of it first, so that
>> seems to miss the point.
>
> I think actually the traditional locution is more like
>     WHERE random() < constant
> where the constant is the fraction of the table you want.  And
> yeah, the presumption is that you'd like it to not actually read
> every row.  (Though unless the sampling density is quite a bit
> less than 1 row per page, it's not clear how much you're really
> going to win.)

It's all going to depend on the use cases, which I don't think I've
heard described very well yet.

I've had to pick random rows from, for example, a table of
disbursements to support a financial audit.  In those cases it has
been the sample size that mattered, and order didn't.  One
interesting twist there is that for some of these financial audits
they wanted the probability of a row being selected to be
proportional to the dollar amount of the disbursement.  I don't
think you can do this without a first pass across the whole data
set.

I've also been involved in developing software to pick random
people for jury selection processes.  In some of these cases you
don't want a certain percentage; you want a particular number of
people, and you want the list to be ordered so that an initial set
of potential jurors can be seated from the top of the list, and
then, as the voir dire[1] process progresses, you can replace
excused jurors from progressive positions on the randomized list.

In both cases you need to be able to explain the technique used in
lay terms and show why it is very hard for anyone to predict or
control which rows are chosen, but also use statistical analysis to
prove that there is no significant correlation between the rows
chosen and identifiable characteristics of the rows.  For example,
selecting all the rows from random blocks would be right out for
juror selection, because a list from the state DOT of people with
driver's licenses and state ID cards would probably be in license
number order when loaded, and since the start of the driver's
license number is a soundex of the last name, there is a strong
correlation between blocks of the table and ethnicity.

One technique which might be suitably random without reading the
whole table would be to figure out a maximum block number and tuple
ID for the table, and generate a series of random ctid values to
read.  If the tuple doesn't exist or is not visible to the
snapshot, you ignore it and continue, until you have read the
requisite number of rows.  You could try to generate them in
advance and sort them by block number, but then you need to solve
the problems of what to do if that set of ctids yields too many
rows or too few rows, both of which have sticky issues.

-Kevin

[1] http://en.wikipedia.org/wiki/Voir_dire#Use_in_the_United_States
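
A minimal SQL sketch of the dollar-weighted audit case described
above (untested; the disbursements table and its columns are
hypothetical).  It orders rows by an exponentially distributed key
so heavier disbursements are proportionally more likely to be drawn
(Efraimidis/Spirakis-style weighted sampling without replacement),
and, consistent with the point above, it still scans the whole
table:

    -- Weighted sample of 100 disbursements; rows with larger
    -- amounts are proportionally more likely to be chosen.
    -- 1.0 - random() keeps the argument to ln() strictly positive.
    SELECT id, payee, amount
      FROM disbursements                       -- hypothetical table
     WHERE amount > 0
     ORDER BY -ln(1.0 - random()) / amount::float8
     LIMIT 100;                                -- sample size, order irrelevant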
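
For the jury-pool case, a sketch of producing the whole pool as an
explicitly numbered random ordering (untested; jury_pool and its
columns are made up for illustration), so the first N can be seated
and replacements for excused jurors taken from the following
positions:

    -- Shuffle the pool once, then number the shuffled positions.
    WITH shuffled AS (
        SELECT person_id, full_name, random() AS sort_key
          FROM jury_pool                       -- hypothetical table
    )
    SELECT row_number() OVER (ORDER BY sort_key) AS list_position,
           person_id, full_name
      FROM shuffled
     ORDER BY list_position;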
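
And a rough PL/pgSQL sketch of the ctid-probe technique described in
the last paragraph (untested; the people table and the sample_people
function name are hypothetical, and 291 is used as the usual line
pointer limit for 8 kB heap pages).  It probes one random ctid at a
time and keeps going until enough visible rows have been returned:

    CREATE OR REPLACE FUNCTION sample_people(sample_size int)
    RETURNS SETOF people
    LANGUAGE plpgsql AS
    $$
    DECLARE
        max_block   bigint;
        probe       tid;
        n_returned  int := 0;
        candidate   people%ROWTYPE;
    BEGIN
        -- Highest block number currently in the heap.
        max_block := pg_relation_size('people')
                     / current_setting('block_size')::int - 1;
        IF max_block < 0 THEN
            RETURN;                  -- empty table, nothing to sample
        END IF;

        WHILE n_returned < sample_size LOOP
            -- Generate one random (block, offset) position to read.
            probe := format('(%s,%s)',
                            floor(random() * (max_block + 1))::bigint,
                            1 + floor(random() * 291)::int)::tid;

            SELECT p.* INTO candidate
              FROM people p
             WHERE p.ctid = probe;

            -- Nonexistent or invisible tuples simply don't match;
            -- ignore them and keep probing.  (Nothing here prevents
            -- the same ctid from being drawn, and returned, twice.)
            IF FOUND THEN
                n_returned := n_returned + 1;
                RETURN NEXT candidate;
            END IF;
        END LOOP;
    END;
    $$;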