Re: [GENERAL] Obtaining random rows from a result set

Alban Hertroys Sat, 01 Sep 2007 04:12:21 -0700

On Aug 31, 2007, at 15:54, Martijn van Oosterhout wrote:

On Fri, Aug 31, 2007 at 02:42:18PM +0200, Alban Hertroys wrote:

Examples:
* random(maxrows) would return random rows from the result set.
* median() would return the rows in the middle of the result set (this
would require ordering to be meaningful).

It would be possible to write an aggregate that returns a single random
value from a set. The algorithm is something like:

n = 1
v = null
for each row
  if random() < 1/n:
     v = value of row
  n = n + 1

return v

Doesn't this always return the first record, since random() is always less than 1/1? I don't think this method has a linear distribution, but then again I don't understand what 'value of row' refers to...
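
A minimal Python sketch of the quoted pseudocode may help with the question above. The names `pick_one` and `rng` are made up for illustration, and 'value of row' is assumed to mean whatever the current row holds. The first row is indeed always picked at n = 1, but each later row n overwrites that pick with probability 1/n, so after a full pass every row is equally likely to be the one returned.

```python
import random
from collections import Counter


def pick_one(rows, rng=random):
    """Return one uniformly chosen element of an iterable in a single pass."""
    chosen = None
    n = 1
    for row in rows:
        # Row n replaces the current choice with probability 1/n.
        # For n == 1 the test is always true, so the first row is only
        # the initial choice; later rows can still overwrite it.
        if rng.random() < 1.0 / n:
            chosen = row
        n += 1
    return chosen


if __name__ == "__main__":
    # Tally many runs over four rows; the counts should be roughly equal.
    print(Counter(pick_one(range(4)) for _ in range(20000)))
```

The whole input still has to be read once, which matches the remark about needing a seqscan.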

It does require a seqscan though.

I doubt that a seqscan can be entirely avoided to fetch random rows from a set, at least not until the last random result has been returned, _unless_ the number of matching records would be known before starting taking random samples.

If you're asking for 5 random rows
you probably mean 5 random but distinct rows, which is different to
just running the above set 5 times in parallel.

Indeed, that is one of the distinctions that need some thought for my original proposition. I left it out, as it's an implementation detail (an important one, admittedly).
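
For the "5 random but distinct rows" case, one common single-pass approach is reservoir sampling. Nobody in this thread proposed it in this form, so treat the following Python as a sketch of a swapped-in technique; `pick_distinct` and its parameters are hypothetical names. It keeps the first k rows and then lets row n replace a random earlier pick with probability k/n.

```python
import random


def pick_distinct(rows, k, rng=random):
    """Return up to k distinct rows from an iterable in one pass."""
    reservoir = []
    for n, row in enumerate(rows, start=1):
        if n <= k:
            # The first k rows fill the reservoir unconditionally.
            reservoir.append(row)
        else:
            # Draw a slot in [0, n); the row is kept only if the slot
            # falls inside the reservoir, i.e. with probability k/n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = row
    return reservoir
```

Because a row is either in the reservoir or not, the same row can never be returned twice, unlike running the single-value algorithm five times side by side.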

I don't know if there's a similar method for median...

I'm not entirely sure, but I think your method is the only one suggested that doesn't involve calculating random() a million times (for a million records) to return 5 (random) records.

My suggestion involved a way to calculate random() only when retrieving records from the result set (only 5 times for a million records in this case). For a linearly distributed random set it does require knowing the number of records in the set though; an estimate would make it non-linear (although only a little bit if accurate enough).

OTOH, I'm starting to think that the last sort step of an order by can be postponed to the result set fetching cycle under the conditions that:

- the ordering expression is unrelated to the records involved, and
- only a fraction of the total number of records will be returned.

(Which is somewhat similar to the condition for an index being more efficient than a seqscan, btw)

Comparing records with each other for something not related seems a waste of effort, while the result set has already been determined (just not ordered in any particular way), am I right?

With that change (postponing sorting) my original ORDER BY random() LIMIT 5 would perform quite adequately, I think - it'd only involve calculating random() at least 5 times, not as often as the number of records in the result set.

Or is order by random() acting as some kind of shuffling method? Is that a requirement to get a linearly distributed set to randomly draw from? I can see how it wouldn't be linear if you'd start randomly comparing records from the beginning of the result set... (Which would be the logical method if you don't know the size of the set beforehand)
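
To make the shuffling question concrete, here is a small Python model of what ORDER BY random() LIMIT k computes. It is only a sketch under stated assumptions, not a description of how the PostgreSQL executor is implemented, and `order_by_random_limit` is a made-up name. Each row gets its own random sort key and the k smallest keys win; a bounded heap avoids sorting the whole set, but a key is still drawn once per row.

```python
import heapq
import random


def order_by_random_limit(rows, k, rng=random):
    """Model of ORDER BY random() LIMIT k over an iterable of rows."""
    # One random key per row, as if random() were evaluated for each row.
    # The index i breaks ties so the rows themselves are never compared.
    keyed = ((rng.random(), i, row) for i, row in enumerate(rows))
    # nsmallest keeps only k candidates at a time instead of a full sort.
    return [row for _, _, row in heapq.nsmallest(k, keyed)]
```

In this model the result is a uniformly random set of k distinct rows, which is why it behaves like drawing from a shuffled set, and the number of random() calls equals the number of rows rather than k.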

I thought of another solution (with only a few calculations of random()) that can be deployed in existing versions of PG, using a set-returning function with a scrolling cursor that accepts the query string as input like this (in pseudoish-code):


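----
create function random(_query text, _limit integer)
returns setof record
volatile
as $$
DECLARE
        _cur refcursor;
        _cnt bigint;
        _idx integer;
        _rowpos bigint;
        _rec record;
BEGIN
        -- the cursor must be scrollable to allow absolute positioning
        open _cur scroll for execute _query;

        -- run to the end of the result set to learn the total nr of records
        move forward all from _cur;
        get diagnostics _cnt = row_count;

        for _idx in 1.._limit loop
                -- pick a row position between 1 and _cnt (inclusive)
                _rowpos := floor(random() * _cnt)::bigint + 1;

                fetch absolute _rowpos from _cur into _rec;
                exit when not found;    -- empty result set
                return next _rec;
        end loop;

        close _cur;
        return;
END;
$$
language plpgsql;
----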

This method could return the same record twice though, I'll need to build in some accounting for used up rowpos'es.
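
One possible shape for that accounting, sketched in Python rather than PL/pgSQL: remember the positions already handed out and redraw on a collision. The function name `distinct_positions` and its parameters are hypothetical, and positions are assumed to be 1-based, as an absolute cursor fetch would count them.

```python
import random


def distinct_positions(count, limit, rng=random):
    """Pick up to `limit` distinct 1-based row positions out of `count` rows."""
    used = set()
    positions = []
    # Never ask for more distinct positions than there are rows.
    target = min(limit, count)
    while len(positions) < target:
        pos = rng.randrange(count) + 1  # uniform in 1..count
        if pos not in used:
            # Only unseen positions are accepted; collisions are redrawn.
            used.add(pos)
            positions.append(pos)
    return positions
```

Redrawing is cheap as long as the limit is small compared to the number of rows (5 out of a million will almost never collide); for a limit close to the row count, shuffling the positions once would be the safer choice.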

Would it be more efficient than the usual methods?

Sorry for the brain dump, I tried to get everything into this single message. I hope it is at least comprehensible and useful, or interesting or at least mildly amusing if not.


Regards,

Alban Hertroys
magproductions b.v.




