Re: [PERFORM] Selecting random rows efficiently

2003-09-03 Thread scott.marlowe
Can you just create an extra serial column and make sure that it is always
in order with no holes in it (e.g. maintained by a nightly process)?

If so, then something like this truly flies:

select * from accounts where aid = (select cast(floor(random()*100000)+1 as int));

My times on it on a 100,000 row table are < 1 millisecond.

Note that you have to have a hole-free sequence AND know how many rows 
there are, but if you can meet those needs, this is screamingly fast.
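
Sketched out (illustrative names, and modern PostgreSQL syntax assumed), the
nightly rebuild plus the lookup might look like:

-- Nightly: rebuild a dense, hole-free surrogate key (the serial default
-- backfills existing rows with sequential values during the rewrite).
ALTER TABLE accounts DROP COLUMN IF EXISTS dense_id;
ALTER TABLE accounts ADD COLUMN dense_id serial;
CREATE INDEX accounts_dense_id_idx ON accounts (dense_id);

-- Query time: pick a random id in [1, N].  With a dense key,
-- max(dense_id) equals the row count, and max() is cheap given the index.
SELECT a.*
FROM accounts a
WHERE a.dense_id = (
    SELECT cast(floor(random() * max(dense_id)) + 1 AS int)
    FROM accounts
);

Using max(dense_id) instead of a hard-coded count keeps the lookup correct
between rebuilds, as long as no rows are deleted during the day.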

On Sat, 30 Aug 2003, Russell Garrett wrote:

> Considering that we'd have to index the random field too, it'd be neater in
> the long term to re-number the primary key. Although, being a primary key,
> that's foreign-keyed from absolutely everywhere, so that'd probably take an
> amusingly long time.
> 
> ...and no we're not from Micronesia, we're from ever so slightly less exotic
> London. Though Micronesia might be nice...
> 
> Russ (also from last.fm but without the fancy address)
> 
> [EMAIL PROTECTED] wrote:
> > On Sat, 2003-08-30 at 09:01, Rod Taylor wrote:
> >>> i was hoping there was some trickery with sequences that would
> >>> allow me to easily pick a random valid sequence number..?
> >>
> >> I would suggest renumbering the data.
> >>
> >> ALTER SEQUENCE ... RESTART WITH 1;
> >> UPDATE table SET pkey = DEFAULT;
> >>
> >> Of course, PostgreSQL may have trouble with that update due to
> >> evaluation of the unique constraint immediately -- so drop the
> >> primary key first, and add it back after.
> >
> > And if there are child tables, they'd all have to be updated, too.




Re: [PERFORM] Selecting random rows efficiently

2003-08-30 Thread Russell Garrett
Considering that we'd have to index the random field too, it'd be neater in
the long term to re-number the primary key. Although, being a primary key,
that's foreign-keyed from absolutely everywhere, so that'd probably take an
amusingly long time.

...and no we're not from Micronesia, we're from ever so slightly less exotic
London. Though Micronesia might be nice...

Russ (also from last.fm but without the fancy address)

[EMAIL PROTECTED] wrote:
> On Sat, 2003-08-30 at 09:01, Rod Taylor wrote:
>>> i was hoping there was some trickery with sequences that would
>>> allow me to easily pick a random valid sequence number..?
>>
>> I would suggest renumbering the data.
>>
>> ALTER SEQUENCE ... RESTART WITH 1;
>> UPDATE table SET pkey = DEFAULT;
>>
>> Of course, PostgreSQL may have trouble with that update due to
>> evaluation of the unique constraint immediately -- so drop the
>> primary key first, and add it back after.
>
> And if there are child tables, they'd all have to be updated, too.





Re: [PERFORM] Selecting random rows efficiently

2003-08-30 Thread Ron Johnson
On Sat, 2003-08-30 at 09:01, Rod Taylor wrote:
> > i was hoping there was some trickery with sequences that would allow me to 
> > easily pick a random valid sequence number..?
> 
> I would suggest renumbering the data.
> 
> ALTER SEQUENCE ... RESTART WITH 1;
> UPDATE table SET pkey = DEFAULT;
> 
> Of course, PostgreSQL may have trouble with that update due to
> evaluation of the unique constraint immediately -- so drop the primary
> key first, and add it back after.

And if there are child tables, they'd all have to be updated, too.

-- 
-----------------------------------------------------------------
Ron Johnson, Jr. [EMAIL PROTECTED]
Jefferson, LA USA

"Whatever may be the moral ambiguities of the so-called 
demoratic nations and however serious may be their failure to 
conform perfectly to their democratic ideals, it is sheer moral 
perversity to equate the inconsistencies of a democratic 
civilization with the brutalities which modern tyrannical states
practice."
Reinhold Nieburhr, ca. 1940




Re: [PERFORM] Selecting random rows efficiently

2003-08-30 Thread Tom Lane
I said:
> 3. Your query now looks like
>   SELECT * FROM table WHERE random_id >= random()
>   ORDER BY random_id LIMIT 1;

Correction: the above won't give quite the right query plan, because random()
is marked as a volatile function.  You can hide the random() call inside
a user-defined function that you (misleadingly) mark stable, or you can
just stick it into a sub-select:

regression=# explain select * from foo WHERE random_id >= (select random())
regression-# ORDER BY random_id LIMIT 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.01..0.15 rows=1 width=8)
   InitPlan
     ->  Result  (cost=0.00..0.01 rows=1 width=0)
   ->  Index Scan using fooi on foo  (cost=0.00..45.50 rows=334 width=8)
         Index Cond: (random_id >= $0)
(5 rows)

This technique is probably safer against future planner changes,
however:

regression=# create function oneshot_random() returns float8 as
regression-# 'select random()' language sql stable;
CREATE FUNCTION
regression=# explain select * from foo WHERE random_id >= oneshot_random()
regression-# ORDER BY random_id LIMIT 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..0.14 rows=1 width=8)
   ->  Index Scan using fooi on foo  (cost=0.00..46.33 rows=334 width=8)
         Index Cond: (random_id >= oneshot_random())
(3 rows)

The point here is that an indexscan boundary condition has to use stable
or immutable functions.  By marking oneshot_random() stable, you
essentially say that it's okay to evaluate it only once per query,
rather than once at each row.

regards, tom lane



Re: [PERFORM] Selecting random rows efficiently

2003-08-30 Thread Tom Lane
Richard Jones <[EMAIL PROTECTED]> writes:
>>> i have a table of around 3 million rows from which i regularly (twice a
>>> second at the moment) need to select a random row

> i was hoping there was some trickery with sequences that would allow me to 
> easily pick a random valid sequence number..?

There is no magic bullet here, but if you expect that requirement to
persist then it is worth your trouble to expend effort on a real
solution.  A real solution in my mind would look like

1. Add a column "random_id float8 default random()".  The idea here
   is that you assign a random ID to each row as it is created.

2. Add an index on the above column.

3. Your query now looks like

SELECT * FROM table WHERE random_id >= random()
ORDER BY random_id LIMIT 1;

This gives you a plan on the order of

 Limit  (cost=0.00..0.17 rows=1 width=8)
   ->  Index Scan using fooi on foo  (cost=0.00..57.00 rows=334 width=8)
         Filter: (random_id >= random())

which is fast and gives a genuinely random result row.  At least up
until you have enough rows that there start being duplicate random_ids,
which AFAIK would be 2 billion rows with a decent random()
implementation.  If you're concerned about that, you could periodically
re-randomize with
UPDATE table SET random_id = random();
so that any rows that were "hidden" because they had a duplicate
random_id have another shot at being choosable.  But with only a few mil
rows I don't think you need to worry.
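
Putting the three steps together as one runnable sketch (names follow the
examples here; the sub-select form from the correction above is used so the
comparison can serve as an index condition):

-- 1. One-time setup: tag every row with a random value and index it.
ALTER TABLE foo ADD COLUMN random_id float8 DEFAULT random();
UPDATE foo SET random_id = random();   -- backfill/re-randomize existing rows
CREATE INDEX fooi ON foo (random_id);

-- 2. Per query: take the first row at or above a random point in the index.
SELECT *
FROM foo
WHERE random_id >= (SELECT random())
ORDER BY random_id
LIMIT 1;

One edge worth noting: if the random point lands above the largest random_id
in the table, the query returns no row, so a caller should be prepared to
retry.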

regards, tom lane



Re: [PERFORM] Selecting random rows efficiently

2003-08-30 Thread Ron Johnson
On Sat, 2003-08-30 at 08:09, Richard Jones wrote:
> Hi,
> i have a table of around 3 million rows from which i regularly (twice a second 
> at the moment) need to select a random row
> 
> currently i'm doing "order by rand() limit 1" - but i suspect this is 
> responsible for the large load on my db server - i guess that PG is doing far 
> too much work just to pick one row.

What datatype is the key you're selecting by?

Also, where is rand() defined?  Is that a UDF?

Could it be that there is a type mismatch?

> one way i can think of is to read in all the primary keys from my table, and 
> select one of the keys at random then directly fetch that row. 
> 
> are there any other ways to do this? i need to keep the load down :)
> 
> Thanks,
> Richard 

Are you really in Micronesia?

-- 
-----------------------------------------------------------------
Ron Johnson, Jr. [EMAIL PROTECTED]
Jefferson, LA USA

"The greatest dangers to liberty lurk in insidious encroachment 
by men of zeal, well-meaning, but without understanding."
Justice Louis Brandeis, dissenting, Olmstead v US (1928)




Re: [PERFORM] Selecting random rows efficiently

2003-08-30 Thread Rod Taylor
> i was hoping there was some trickery with sequences that would allow me to 
> easily pick a random valid sequence number..?

I would suggest renumbering the data.

ALTER SEQUENCE ... RESTART WITH 1;
UPDATE table SET pkey = DEFAULT;

Of course, PostgreSQL may have trouble with that update due to
evaluation of the unique constraint immediately -- so drop the primary
key first, and add it back after.
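
Spelled out, that looks something like this (a sketch with illustrative
names, assuming pkey keeps a nextval() default; as noted elsewhere in the
thread, child tables referencing these values would need updating too):

BEGIN;
ALTER TABLE mytable DROP CONSTRAINT mytable_pkey;  -- avoid transient duplicates
ALTER SEQUENCE mytable_pkey_seq RESTART WITH 1;
UPDATE mytable SET pkey = DEFAULT;                 -- renumber densely from 1
ALTER TABLE mytable ADD PRIMARY KEY (pkey);
COMMIT;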




Re: [PERFORM] Selecting random rows efficiently

2003-08-30 Thread Richard Jones
On Saturday 30 August 2003 1:08 pm, you wrote:
> On Sat, 30 Aug 2003, Richard Jones wrote:
> > Hi,
> > i have a table of around 3 million rows from which i regularly (twice a
> > second at the moment) need to select a random row
> >
> > currently i'm doing "order by rand() limit 1" - but i suspect this is
> > responsible for the large load on my db server - i guess that PG is doing
> > far too much work just to pick one row.
>
> If you have an int id (aka serial) column then it is simple - just pick a
> random number between 1 and currval('id_seq')...
>
> or offset rand() limit 1 perhaps?
>
> since you want random there is no need to bother with an order and that'll
> save a sort.


Yes, the pkey is a SERIAL but the problem is that the sequence is rather 
sparse

for example, it goes something like 1 -> 5000, then jumps to a much higher 
range, and then continues upwards from there

this is due to chunks being deleted etc.. 

if i pick a random number for the key it will not be a random enough 
distribution, because the sequence is sparse.. sometimes it will pick a key 
that doesn't exist.
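
a quick way to see how sparse it is (illustrative table name) is to compare 
the highest key to the row count:

-- a big gap between max(pkey) and count(*) means lots of holes
SELECT max(pkey) AS highest_key, count(*) AS n_rows FROM mytable;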

i'm currently reading all the keys into an array and selecting randoms from 
there - but this is no good long-term as i need to refresh the array of keys 
to take into account newly added rows to the table (daily)

i was hoping there was some trickery with sequences that would allow me to 
easily pick a random valid sequence number..?

Thanks,
Rich.

>
> --
> Jeff Trout <[EMAIL PROTECTED]>
> http://www.jefftrout.com/
> http://www.stuarthamm.net/




Re: [PERFORM] Selecting random rows efficiently

2003-08-30 Thread Jeff
On Sat, 30 Aug 2003, Richard Jones wrote:

> Hi,
> i have a table of around 3 million rows from which i regularly (twice a second
> at the moment) need to select a random row
>
> currently i'm doing "order by rand() limit 1" - but i suspect this is
> responsible for the large load on my db server - i guess that PG is doing far
> too much work just to pick one row.
>

If you have an int id (aka serial) column then it is simple - just pick a
random number between 1 and currval('id_seq')...

or offset rand() limit 1 perhaps?

since you want random there is no need to bother with an order and that'll
save a sort.
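
Sketched out, the offset variant looks like this (in PostgreSQL the built-in
is random(), not rand(); the ~3 million row count from the original post is
assumed known and reasonably current):

-- Skip a random number of rows, then take the next one.  The skipped
-- rows are still read, so cost grows with the size of the offset.
SELECT *
FROM mytable
OFFSET cast(floor(random() * 3000000) AS int)
LIMIT 1;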

--
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/


