On Aug 31, 2007, at 15:54, Martijn van Oosterhout wrote:

On Fri, Aug 31, 2007 at 02:42:18PM +0200, Alban Hertroys wrote:
Examples:
* random(maxrows) would return random rows from the resultset.
* median() would return the rows in the middle of the result set (this
would require ordering to be meaningful).

It would be possible to write an aggregate that returns a single random
value from a set. The algorithm is something like:

n = 1
v = null
for each row
  if random() < 1/n:
     v = value of row
  n = n + 1

return v

Doesn't this always return the first record, since random() is always less than 1/1? I don't think this method has a linear distribution, but then again I don't understand what 'value of row' refers to...

It does require a seqscan though.

I doubt that a seqscan can be entirely avoided to fetch random rows from a set, at least not until the last random result has been returned, _unless_ the number of matching records would be known before starting taking random samples.

If you're asking for 5 random rows
you probably mean 5 random but distinct rows, which is different to
just running the above set 5 times in parallel.

Indeed, that is one of the distinctions that need some thought for my original preposition. I left it out, as it's an implementation detail (an important one, admittedly).

I don't know if there's a similar method for median...

I'm not entirely sure, but I think your method is the only one suggested that doesn't involve calculating random() a million times (for a million records) to return 5 (random) records.


My suggestion involved a way to calculate random() only when retrieving records from the result set (only 5 times for a million records in this case). For a linearly distributed random set it does require knowing the number of records in the set though, an estimate would make it non- linear (although only a little bit if accurate enough).


OTOH, I'm starting to think that the last sort step of an order by can be postponed to the result set fetching cycle under the conditions that:
- the ordering expression is unrelated to the records involved, and
- only a fraction of the total number of records will be returned.
(Which is somewhat similar to the condition for an index being more efficient than a seqscan, btw)

Comparing records with each other for something not related seems a waste of effort, while the result set has already been determined (just not ordered in any particular way), am I right?

With that change (postponing sorting) my original ORDER BY random() LIMIT 5 would perform quite adequately, I think - it'd only involve calculating random() at least 5 times, not as often as the number of records in the result set.


Or is order by random() acting as some kind of shuffling method? Is that a requirement to get a linearly distributed set to randomly draw from? I can see how it wouldn't be linear if you'd start randomly comparing records from the beginning of the result set... (Which would be the logical method if you don't know the size of the set before hand)


I thought of another solution (with only a few calculations of random ()) that can be deployed in existing versions of PG, using a set- returning function with a scrolling cursor that accepts the query string as input like this (in pseudoish-code):

----
create function random(text _query, integer _limit)
returns set
volatile
as $$
DECLARE
        _cur cursor;
        _cnt bigint;
        _idx integer;
        _rowpos bigint;

        _rec record;
BEGIN
        open _cur for execute query;
        fetch forward all into _rec;
        -- select total nr of records into _cnt

        for _idx in 1.._limit loop
                _rowpos := random() * _cnt;

                fetch absolute _rowpos into _rec;
                return next _rec;
        end loop;

        return;
END;
$$
language 'plpgsql';
----

This method could return the same record twice though, I'll need to build in some accounting for used up rowpos'es.
Would it be more efficient than the usual methods?

Sorry for the brain dump, I tried to get everything into this single message. I hope it is at least comprehensible and useful, or interesting or at least mildly amusing if not.

Regards,

Alban Hertroys
magproductions b.v.

!DSPAM:737,46d93d9b289906550616460!



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
      choose an index scan if your joining column's datatypes do not
      match

Reply via email to