Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-17 Thread Alban Hertroys

On Jan 15, 2008, at 3:03 PM, Ivan Sergio Borgonovo wrote:


On Tue, 15 Jan 2008 14:43:35 +0100
Alban Hertroys [EMAIL PROTECTED] wrote:


You need to scroll to the last row to find the size of the result
set, but after that it's pretty easy to return random rows by
scrolling to them (and marking them 'read' in some way to prevent
accidentally returning the same row again).


Could you post a snippet of code or something giving a more detailed
idea of it?

BTW, since cursors support offsets, if you don't care whether the
order of the retrieved rows is random too, you don't even have to
remember which ones you've read, I think.


I posted it on this list a while ago when I came up with this  
solution. I had some trouble finding my old post in the pgsql-general  
archives though - I could find the thread, just not my final posting,  
and searching didn't even turn up the thread.


I did find it here: http://www.mail-archive.com/pgsql-[EMAIL PROTECTED]/msg103670.html
The thread contains several other approaches to the problem; which one
fits your bill really depends on your problem domain.


I think the function in my original posting could do with clearer  
comments though, so here's the function again:


/*
 * Return $limit random rows from the result set of SQL query $query
 */
function randomSet(
    $query, // The query to execute
    $limit  // The (max) number of random rows required
) {
    // SQL to declare the cursor
    query("DECLARE _cur SCROLL CURSOR WITHOUT HOLD FOR $query");

    /* Get the range for random(1, n)
     *
     * Determined by scrolling the cursor to the last row.
     * Equivalent to select count(*), but without a separate query.
     */
    query("MOVE FORWARD ALL IN _cur");
    $count = pg_affected_rows();

    $uniques = array(); // A list of used cursor offsets
    $resultSet = array();

    // Fetch random rows until we have enough or there are no more
    while ($limit > 0 && count($uniques) < $count) {
        // Determine random scroll offset
        $idx = random(1, $count);

        // Skip records with an index we already used
        if (in_array($idx, $uniques))
            continue;

        // Fetch the random row
        $record = query("FETCH ABSOLUTE $idx FROM _cur");

        // Add the row offset to the list of used offsets
        $uniques[] = $idx;

        $resultSet[] = $record;
        $limit--;
    }

    // Close the cursor
    query("CLOSE _cur");

    return $resultSet;
}



Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-15 Thread Alban Hertroys

On Jan 9, 2008, at 8:07 PM, Scott Marlowe wrote:

I could see a use for an approximate count(*) with where clause, just
like I could see a use for the ability to retrieve random rows from a
table without using order by random() on it.  And those are both
things that would require some form of hacking in the db that I'm
certainly not capable of pulling off...


About returning random rows... I've successfully applied a scrolling  
cursor for that.


You need to scroll to the last row to find the size of the result  
set, but after that it's pretty easy to return random rows by  
scrolling to them (and marking them 'read' in some way to prevent  
accidentally returning the same row again).


It does require some specific application code though - doing it
server side would mean passing the query as a function argument
(which still requires unnatural SQL statements in your application
code) or writing a function for each query (*cough*).
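
For reference, the bare SQL sequence looks something like this (a
sketch; the table name is made up):

BEGIN;
DECLARE _cur SCROLL CURSOR WITHOUT HOLD FOR SELECT * FROM items;
MOVE FORWARD ALL IN _cur;     -- the command tag "MOVE n" gives the row count
FETCH ABSOLUTE 42 FROM _cur;  -- jump straight to a (randomly chosen) offset
CLOSE _cur;
COMMIT;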


Performance was quite adequate (a few 100 ms) for a query returning 5
random rows from 3 or more joined tables, some of which had a few 100k
rows. Calculating random() for each record in the result set (to sort
on) was taking much longer. That was on a dual 64-bit opteron with 4GB
RAM, iirc.


Of course a built-in statement would be preferable, I just felt like  
pointing out that order by random() isn't necessarily the best  
alternative ;)


Regards,
Alban Hertroys.



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-15 Thread Ivan Sergio Borgonovo
On Tue, 15 Jan 2008 14:43:35 +0100
Alban Hertroys [EMAIL PROTECTED] wrote:

 You need to scroll to the last row to find the size of the result  
 set, but after that it's pretty easy to return random rows by  
 scrolling to them (and marking them 'read' in some way to prevent  
 accidentally returning the same row again).

Could you post a snippet of code or something giving a more detailed
idea of it?

BTW, since cursors support offsets, if you don't care whether the
order of the retrieved rows is random too, you don't even have to
remember which ones you've read, I think.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-10 Thread Martijn van Oosterhout
On Wed, Jan 09, 2008 at 03:28:04PM +0100, Ivan Sergio Borgonovo wrote:
 Let me consider an everyday use where count() looks as the most
 obvious solution: paging.
 
 I search through a table and I need to know which is the last page.

There's an often overlooked solution to this. Let's say your count
returns 100,000 records; are you going to give them links to 1,000
different pages? Not really. Probably about 10, so really your count
is only interested in an exact result below 100, or in the fact that
it's more than 100.

By placing the where clause in a subselect with a limit of 101 and a
count() around it you have an upper bound on the cost of the count; a
result of 101 simply means more than 100. Depending on the data you
might just put the limit on the query that fetches the data and use
everything after the 10th record to determine your count on the client
side, skipping the extra round trip.
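
For instance, a minimal sketch of the capped count ("items" and its
WHERE condition are hypothetical):

SELECT count(*)
FROM (SELECT 1 FROM items WHERE category = 42 LIMIT 101) AS capped;

A result of 101 just means "more than 100"; anything smaller is exact,
and the subselect never scans more than 101 matching rows.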

 Is there a way to count based on indexes without taking into account
 deleted rows so as to count faster?

I've also often pondered whether the SQL standard support for table
sampling would be good here. Sure, you still need to check visibility,
but if you specify that the DB only needs to check 10% of the tuples
and to extrapolate the results from that, you could get a fast yet
reasonably accurate result. IIRC patches for this have been floated on
the lists.
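
For reference, the standard's sampling syntax looks roughly like this
(a sketch - PostgreSQL did not support it at the time of this thread):

SELECT count(*) FROM bigtable TABLESAMPLE BERNOULLI (10);

i.e. count roughly 10% of the rows and extrapolate by multiplying the
result by 10.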

Have a nice day,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Those who make peaceful revolution impossible will make violent revolution 
 inevitable.
  -- John F Kennedy




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-10 Thread Harald Fuchs
In article [EMAIL PROTECTED],
Chris Browne [EMAIL PROTECTED] writes:

 There may be a further optimization to be had by doing a
 per-statement trigger that counts the number of INSERTs/DELETEs done,
 so that inserting 30 tuples (in the table being tracked) leads to
 adding a single tuple with count of 30 in the summary table.

This would be nice, but at least the 8.2.4 docs say

  Statement-level triggers do not currently have any way to examine
  the individual row(s) modified by the statement.

Is this restriction removed in a later version?




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-10 Thread Alvaro Herrera
Harald Fuchs wrote:
 In article [EMAIL PROTECTED],
 Chris Browne [EMAIL PROTECTED] writes:
 
  There may be a further optimization to be had by doing a
  per-statement trigger that counts the number of INSERTs/DELETEs done,
  so that inserting 30 tuples (in the table being tracked) leads to
  adding a single tuple with count of 30 in the summary table.
 
 This would be nice, but at least the 8.2.4 docs say
 
   Statement-level triggers do not currently have any way to examine
   the individual row(s) modified by the statement.
 
 Is this restriction removed in a later version?

Nope.

-- 
Alvaro Herrera                               http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-10 Thread Ron Mayer
Chris Browne wrote:
 [EMAIL PROTECTED] (Zoltan Boszormenyi) writes:
 SELECT COUNT(*)
 [Waving hands for a moment]

Would what Chris describes below be a good candidate for
a pgfoundry project that has functions that'll create the
triggers for you?  (yeah, I might be volunteering, but would
undoubtedly need help)

Then when people ask it again and again the response could
be "install http://pgfoundry.org/fast_count_star" rather
than "go read the mailing list archives and roll
your own - but remember to worry about deadlock and
contention on the table containing the counts."


 What I would do *instead* would be for each INSERT to add a tuple with
 a count of 1, and for each DELETE to add a tuple with count of -1, and
 then to periodically have a single process walk through to summarize
 the table.  There may be a further optimization to be had by doing a
 per-statement trigger that counts the number of INSERTs/DELETEs done,
 so that inserting 30 tuples (in the table being tracked) leads to
 adding a single tuple with count of 30 in the summary table.
 
 That changes the tradeoffs, again...
 
  - Since each INSERT/DELETE is simply doing an INSERT into the summary
table, the ongoing activity is *never* blocking anything
 
  - You get the count by requesting 
 SELECT SUM(rec_cnt) as record_count from rec_count where tablename = 'foo';
 
  - Note that the query is MVCC-consistent with the table!
 
  - Once in a while, you'll want to run a single procedure that, for
each table, deletes all the existing records, and replaces them
with a single one consisting of the sum of the individual values.
 
  - You can re-sync a table by running the query:
    begin;
    delete from record_count where tablename = 'foo';
    insert into record_count(tablename, rec_cnt)
        select 'foo', (select count(*) from foo);
    commit;



[GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Ivan Sergio Borgonovo
On Wed, 09 Jan 2008 00:06:45 -0800
Joshua D. Drake [EMAIL PROTECTED] wrote:

 Granted there are scenarios where others are FASTER (SELECT
 COUNT(*)) but I find that if you are doing those items, you
 normally have a weird design anyway.

 Sincerely,

Sincerely, would you make an example of such a bad design?

Or did you just mean that count(*) is bad design in postgresql since
there are usually better alternatives in postgresql?

I'm not joking. I'd like to learn.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Dann Corbit
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:pgsql-general-
 [EMAIL PROTECTED] On Behalf Of Ivan Sergio Borgonovo
 Sent: Wednesday, January 09, 2008 1:30 AM
 To: pgsql-general@postgresql.org
 Subject: [GENERAL] count(*) and bad design was: Experiences with
 extensibility
 
 On Wed, 09 Jan 2008 00:06:45 -0800
 Joshua D. Drake [EMAIL PROTECTED] wrote:
 
  Granted there are scenarios where others are FASTER (SELECT
  COUNT(*)) but I find that if you are doing those items, you
  normally have a weird design anyway.
 
  Sincerely,
 
 Sincerely, would you make an example of such a bad design?

A program that estimates cardinality by doing SELECT COUNT(*) is a bad
design.  Assuming you have the wherewithal to vacuum your tables (or
have autovacuum enabled) a query against the system tables will be a
much better estimate of cardinality.

Now (some may argue) what if we want an _EXACT_ value for COUNT(*)?  We
had better ask ourselves (in that circumstance) "Am I willing to lock
the entire table and scan it?" because that is what will be necessary to
get a truly exact value.  Otherwise, you can get totals that are wildly
off-base if someone is doing a bulk import or deleting a large number of
records.
 
So:
SELECT reltuples FROM pg_class WHERE relname = 'table_name';

Is more often what is really wanted.

 Or did you just mean that count(*) is bad design in postgresql since
 there are usually better alternatives in postgresql?

If you are using COUNT(*) as an existence test, then substitute:

WHERE EXISTS(criteria)
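
For example (a hedged sketch; "orders" and "customer_id" are made-up
names):

SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = 7);

EXISTS can stop at the first matching row instead of counting them all.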


Use the indexes (if possible) via a WHERE clause restriction:

SELECT count(1) FROM table_name WHERE condition_list

will use indexes if appropriate.

 
 I'm not joking. I'd like to learn.

I think this should be a FAQ because it is a (F)requently (A)sked
(Q)uestion.

IMO-YMMV.


 
 --
 Ivan Sergio Borgonovo
 http://www.webthatworks.it
 
 



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Ivan Sergio Borgonovo
On Wed, 9 Jan 2008 01:39:34 -0800
Dann Corbit [EMAIL PROTECTED] wrote:

  On Wed, 09 Jan 2008 00:06:45 -0800
  Joshua D. Drake [EMAIL PROTECTED] wrote:

   Granted there are scenarios where others are FASTER (SELECT
   COUNT(*)) but I find that if you are doing those items, you
   normally have a weird design anyway.

   Sincerely,

  Sincerely, would you make an example of such a bad design?

 A program that estimates cardinality by doing SELECT COUNT(*) is a
 bad design.  Assuming you have the wherewithal to vacuum your
 tables (or have autovacuum enabled) a query against the system
 tables will be a much better estimate of cardinality.
 
 Now (some may argue) what if we want an _EXACT_ value for
 COUNT(*)?  We had better ask ourselves (in that circumstance) "Am I
 willing to lock the entire table and scan it?" because that is what
 will be necessary to get a truly exact value.  Otherwise, you can
 get totals that are wildly off-base if someone is doing a bulk
 import or deleting a large number of records.

Please forgive my naiveness in this field, but what does an exact
count mean, what do other DBs mean by an exact count, and how do
other DBs deal with it?

How is count defined in the SQL standard?

Is there a real situation then where you really need the exact
count?

Am I right saying that:

select count(*) from ...
-- here count may already be different

and that:
select for update count(*)

could be very expensive? Or what would it mean to do a
select for update count(*) ...?


 I think this should be a FAQ because it is a (F)requently (A)sked
 (Q)uestion.

After reading your email I think the real problem is not how to do
otherwise but to understand what count(*) really means, when and if it
is really useful, and when it can be avoided.

I'd write in the FAQ something along the lines of:

- What does count(*) really do?
- When can it be avoided?
- When can't it be avoided?

In my everyday use of count(*), after your email I can hardly spot a
place where I need an exact count.
But to better understand and be convinced that using count(*) is
bad design, I think the last question could help a lot.

How does count(*) with a where clause generally perform on postgresql
compared to other DBs?
I'd expect it to perform as well as or better than other DBs, since now
the bottleneck should be how efficiently it can filter records... but
still a count(*) with a where clause will incur the same problem
of what exact means.

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Harald Armin Massa
Ivan,

 Please forgive my naiveness in this field, but what does an exact
 count mean, what do other DBs mean by an exact count, and how do
 other DBs deal with it?

PostgreSQL will give you an exact count of the contents of the
database as it is at the moment you begin your count (i.e. when the
transaction starts).

BUT as the table is not locked, somebody can bulkload MANY items
into the database in parallel, so at the moment (start of your
transaction) + 1 msec your count may already be invalid.

 I'd expect it to perform as well as or better than other DBs, since now
 the bottleneck should be how efficiently it can filter records... but
 still a count(*) with a where clause will incur the same problem
 of what exact means.

I know of 3 concepts to answer count() faster than PostgreSQL:

1) just lie, present an estimate
2) do not have MVCC
3) store record-deleted info in the index, so you can answer count()
with only scanning the index

Harald


-- 
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
fx 01212-5-13695179
-
EuroPython 2008 will take place in Vilnius, Lithuania - Stay tuned!



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Ivan Sergio Borgonovo
On Wed, 9 Jan 2008 13:04:39 +0100
Harald Armin Massa [EMAIL PROTECTED] wrote:

 Ivan,

  Please forgive my naiveness in this field, but what does an exact
  count mean, what do other DBs mean by an exact count, and
  how do other DBs deal with it?

 PostgreSQL will give you an exact count of the contents of the
 database as it is at the moment you begin your count (i.e. when the
 transaction starts).

 BUT as the table is not locked, somebody can bulkload MANY items
 into the database in parallel, so at the moment (start of your
 transaction) + 1 msec your count may already be invalid.

That's reasonable. What do other DBs do, and what is the SQL standard
requirement for count(*)?

  I'd expect it to perform as well as or better than other DBs, since
  now the bottleneck should be how efficiently it can filter records...
  but still a count(*) with a where clause will incur the same
  problem of what exact means.
 
 I know of 3 concepts to answer count() faster than PostgreSQL:
 
 1) just lie, present an estimate
 2) do not have MVCC
 3) store record-deleted info in the index, so you can answer count()
 with only scanning the index

Sorry if I insist, but I think this is a well-worn selling point against
postgresql.
One of the most recurring arguments about why someone else's db is
better than pg is "count is slow".

Who lies?
If it is possible to make count run faster, what are the trade-offs
that make it unreasonable to implement in pg?

This is not a very useful question, but 3) implies that a select scans
the index, returns the rows and only later checks if they are still
there. Is that it?
And since indexes aren't updated on the fly you may get back a
larger number than the real value.

Let me consider an everyday use where count() looks like the most
obvious solution: paging.

I search through a table and I need to know which is the last page.
Concurrently someone is deleting a ton of records.
No matter whether count() is fast or not, when I output the pager it
will be wrong. But I'll still need an estimate of the number of pages;
it is not fun if that estimate is wrong *and* slow.
And once you add the where clauses there is no way to cache the count.
Is there a way to count based on indexes without taking into account
deleted rows so as to count faster?

I can make the search faster using indexes as Dann Corbit suggested,
but as you imply that won't make count as fast as the
competition[1] that lies or doesn't use MVCC or stores deleted info in
indexes.

SELECT reltuples FROM pg_class WHERE relname = 'table_name';

doesn't apply since you can't add conditions.

Please be patient. I hear this over and over and over: postgresql is
faulty because it can't count fast.
And the only replies I've seen are: it is bad design to use count,
you're a dumb ass. I admit I may be a dumb ass, but it is hard to
sell Postgres if I continue to be a dumb ass ;)

- What does count(*) really do?
Now I understand that count(*) returns the # of rows as it sees them
at the moment it was invoked. That should be true for other DBs as
well. That means that unless other DBs lock the table they can't take
into account records that are going to be deleted once the count has
been issued.

- When can count be avoided?
Well, since even other DBs may report the wrong number, this makes
count() look less like a Holy Grail. But still, if you need an estimate,
wouldn't it be better to have it quickly?
How does postgresql compare to other DBs when you run:
select count(*) from table where conditions
once you use indexes?
If such a query will have to scan the results anyway to see if
they are still there, since info about deleted records isn't stored
in indexes, is there a way to ignore this and just get a faster
estimate?
I still can't see why it is bad design to use count().

- When can't count() be avoided?
In all the situations where you may really need count() I think you
also need to lock the table, but well, I'd be curious to see an example
where you need count().

Still, can somebody give an example of bad design, and one where
count() couldn't be avoided, if any?

Consider that while it makes little sense to rely on wrong numbers in
a business environment, where data integrity/coherence makes *a lot*
of sense, it is not so clear-cut in the CMS world where most of those
criticisms come from.

I know that the arguments to promote postgres in the business world
where DB2, Oracle and MS SQL play (?) may be different, and count()
may lose its importance in that context, and you could say that other
advantages amply pay for the slowness of an operation that in such
a context is rare.


thanks


[1] or does postgres perform like the competition once you add where
clauses?


-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Sim Zacks
Using count(*) is not bad design, though generally it makes sense to use it with 
a where.


Saying using count(*) is bad design means that the only design that you can 
visualize is the specific one that you are using.


There are tons of real world examples where you need count. That is why so many 
people use it as a benchmark.


Obviously if you have an application where millions of rows are added and 
subtracted every minute, then the value of count is kind of vague.


However, if you are querying a data warehouse that gets populated once a month, 
then count has a huge value. You could use statistics in that case, but it is 
much harder to join against other tables when you are using statistics. It is 
also less intuitive than using the SQL standard for finding the number of rows.


Sim



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Ivan Sergio Borgonovo
On Wed, 09 Jan 2008 16:33:54 +0200
Sim Zacks [EMAIL PROTECTED] wrote:

 Using count(*) is not bad design, though generally it makes sense
 to use it with a where.

I got the impression from others' comments that postgresql
underperforms other DBs even when a where clause on an indexed column
is involved.

I may have misinterpreted this, but could someone clarify?

Harald Armin Massa wrote:

 1) just lie, present an estimate
 2) do not have MVCC
 3) store record deleted info in index, so you can answer count()
 with only scanning the index

I got the impression that even counting with clauses on indexed
columns means you'll have to check whether the rows are still there.
That seems to imply that the extra cost makes pg underperform compared
to other DBs even in that scenario.

I beg Harald's pardon if I misinterpreted his words.

 Saying using count(*) is bad design means that the only design that
 you can visualize is the specific one that you are using.

I'd be interested in some example by Joshua otherwise I can't
understand what he meant.

If you're interested in all the records in a table, there is no way to
have an engraved-in-stone answer, and since there are no where
clauses you can cache that info and update it once in a while.
If you have a where clause I'm not expecting an engraved-in-stone
answer, but I'd expect a quick way to get an estimate, and I
still haven't understood whether:
a) I could improve my design to avoid count
b) postgres performs as well as other dbs on count when there is a
where clause
c) there is a way to get a quick estimate avoiding count when there
is a where clause

 There are tons of real world examples where you need count. That is
 why so many people use it as a benchmark.

 Obviously if you have an application where millions of rows are
 added and subtracted every minute, then the value of count is kind
 of vague.

 However, if you are querying a data warehouse that gets populated
 once a month, then count has a huge value. You could use statistics
 in that case, but it is much harder to join against other tables
 when you are using statistics. It is also less intuitive then using
 the SQL standard for finding the number of rows.

Again: paging records. You can't do statistics.
Surely you could optimise and lie... but that comes at a cost
compared to the simplicity of count.

Still, everybody knows that a frequent complaint about postgresql is
that it has a slow count.

I can understand grey answers provided they are coherent.
a) the above claim is false
b) this claim is true just in cases where you could opt for a better
design
c) this claim is false for count without a where clause
d) this claim is true
e) ... 

details on b) would be much appreciated. Other cases require just a
yes/no answer.

thanks

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Scott Marlowe
On Jan 9, 2008 8:33 AM, Sim Zacks [EMAIL PROTECTED] wrote:
 Using count(*) is not bad design, though generally it makes sense to use it 
 with
 a where.

 Saying using count(*) is bad design means that the only design that you can
 visualize is the specific one that you are using.

 There are tons of real world examples where you need count. That is why so 
 many
 people use it as a benchmark.

 Obviously if you have an application where millions of rows are added and
 subtracted every minute, then the value of count is kind of vague.

 However, if you are querying a data warehouse that gets populated once a 
 month,
 then count has a huge value. You could use statistics in that case, but it is
 much harder to join against other tables when you are using statistics. It is
 also less intuitive than using the SQL standard for finding the number of 
 rows.

I think part of the problem is people think of count() as something
other than an aggregate function.

If I ran select avg(i) from table on a 20M row table, I'd expect it
to take a few seconds, after all, I'm running a function across 20
Million rows.

Some databases have the ability to short-circuit count(*) without a
where clause, some with a where clause.  But the basic model of
count(*) is that it's an aggregate function, and what you're asking
the db to do is to count every single row in the db that matches your
where clause.

Now, everything's a tradeoff.  If PostgreSQL had visibility
information in the indexes, it would have to lock both the table and
index for every write, thus slowing down all the other queries that
are trying to access the table.  It would be a tradeoff that
sacrificed write speed for read speed. In a db that was used mostly
for writing, it would likely be a fair trade.  In a db that did a lot
of writing, it might slow the whole thing to a crawl.

So, the slow count(*) performance of postgresql, especially count(*)
without a where clause, is a result of the TANSTAAFL principle (there
ain't no such thing as a free lunch).

Now, if there's a where clause that's selective enough, then a
count(*) query may use the index and be a bit faster, but remember, in
pgsql, it's still got to actually hit the table to see if each tuple
really is visible to this transaction, so the index needs to be fairly
selective to be a win.

A possible workaround is to have something like a separate table with
nothing but the IDs and whatever would be in your where clause for the
tables you're accessing with a foreign key to it, and use THAT for a
count(*).  Since the rows are skinnier, the count(*) will be faster.
Another alternative is to have a trigger fire that keeps a track of
the size of the table in a summary table when rows are added and
deleted.
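
A minimal sketch of the first idea (all names are hypothetical):

CREATE TABLE bigtable_skinny (
    id         integer PRIMARY KEY REFERENCES bigtable (id),
    inserttime timestamp
);
CREATE INDEX bigtable_skinny_inserttime_idx
    ON bigtable_skinny (inserttime);

-- counting the narrow table touches far fewer pages
SELECT count(*) FROM bigtable_skinny
WHERE inserttime > now() - interval '1 week';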

Each of these methods costs you something, in time and effort or
performance, and that's why they're not automatic.  For instance, I
have an 80M row stats db that grows by about 1M rows a week.  I do NOT
need to count the whole thing, hardly ever, and can live with the fact
that I don't know exactly how many rows it has at any given time.
When I do a select count(*) with a where clause it's usually
restricted to < 1 week's data and can use an index and come back pretty
quickly.

select count(*) from bigtable where inserttime > now() - interval '1 week';
  count
---------
 1254269
(1 row)

Time: 21422.368 ms

Second run:
select count(*) from bigtable where inserttime > now() - interval '1 week';
  count
---------
 1254320
(1 row)

Time: 2001.204 ms

With the data loaded into shared_buffers / linux kernel cache, that's
not too bad.

Now, I run the same query against our production oracle machine, which
is a MUCH more powerful server...

SQL> select count(*) from bigtable where inserttime > SYSDATE-7;

  COUNT(*)
----------
   1255972

Elapsed: 00:00:18.62

second run:
SQL> select count(*) from bigtable where inserttime > SYSDATE-7;

  COUNT(*)
----------
   1255973
Elapsed: 00:00:00.98

Now, obviously, Oracle's got some optimizations for what it's got in
the buffer there, but the first run isn't really any faster. In fact,
for a much more powerful machine, the performance was, relatively
speaking, pretty bad compared to my little 1-CPU software-RAID-10
reporting server.

So, while PostgreSQL's count(*) performance isn't blindingly fast,
it's not the dog some people make it out to be either.



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Scott Marlowe
On Jan 9, 2008 10:21 AM, Ivan Sergio Borgonovo [EMAIL PROTECTED] wrote:
 On Wed, 09 Jan 2008 16:33:54 +0200
 Sim Zacks [EMAIL PROTECTED] wrote:

  Using count(*) is not bad design, though generally it makes sense
  to use it with a where.

 I got the impression from others' comments that postgresql
 underperforms other DBs even when a where clause on an indexed column
 is involved.

Correct, see my other post.  Again though, it's a tradeoff.  This one
lower-performing action allows other actions to be faster.

 Again: paging records. You can't do statistics.

Sure you can.  When you're near the front, no one cares how exact it
is.  Ever search for a common term on google?  You don't get an exact
count, you get an approximation, and you get it for a reason.

Other sites often lie, and give what looks like an exact count, but if
it's in the 100,000s who really cares?  Seriously, are you going to the
10,000th page on google for a search term?  If someone does start
going that far out, the costs of limit/offset are going to kick in, and
not just in postgresql, and queries are going to take longer and
longer.  At that point, you can switch to an exact count(*) if you
need to have it.  It won't be the most expensive thing you're doing.

 Still, everybody knows that a frequent complaint about postgresql is
 that it has a slow count.

I've never heard it before (just kidding).  I do think it's frequent.
I also think it's overstated.



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Andrew Sullivan
On Wed, Jan 09, 2008 at 05:21:24PM +0100, Ivan Sergio Borgonovo wrote:
 
 I got the impression that even counting with clauses on indexed
 columns means you'll have to check whether the rows are still there.
 That seems to imply that the extra cost makes pg underperform compared
 to other DBs even in that scenario.

You have to do this for any row you need to see, for _any_ database
operation in Postgres.  But that's no different from any other database
system: they all have to locate all the rows that satisfy the condition, and
then tell you how many there are.

Many other systems, however, know how many rows there are in the table. 
In some sense, they have optimised for that case at the expense of other
cases (like, for instance, more aggressive locks than Postgres takes, or
failures due to rollback segment exhaustion, or whatever your favourite
limitation of your system of choice is).  When you build a system, you're
going to trade some features for others more than likely, and the real
question is what things you trade away.  The speed of counting all the rows
in the table seems to me to be a good thing to trade away, because it's very
rare that you actually need to know that.

 If you're interested in all the records in a table, there is no way to
 have an engraved-in-stone answer 

Sure there is: SELECT count(*) from table.  That tells you how many rows
there were in the table when your transaction started (modulo read
visibility rules), just like in any other database system.

A



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Scott Marlowe
On Jan 9, 2008 10:46 AM, Andrew Sullivan [EMAIL PROTECTED] wrote:
 On Wed, Jan 09, 2008 at 05:21:24PM +0100, Ivan Sergio Borgonovo wrote:
 
  I got the impression that even counting with clauses on indexed
  columns means you'll have to check whether the rows are still there.
  That seems to imply that the extra cost makes pg underperform compared
  to other DBs even in that scenario.

 You have to do this for any row you need to see, for _any_ database
 operation in Postgres.  But that's no different from any other database
 system: they all have to locate all the rows that satisfy the condition, and
 then tell you how many there are.

 Many other systems, however, know how many rows there are in the table.
 In some sense, they have optimised for that case at the expense of other
 cases (like, for instance, more aggressive locks than Postgres takes, or
 failures due to rollback segment exhaustion, or whatever your favourite
 limitation of your system of choice is).  When you build a system, you're
 going to trade some features for others more than likely, and the real
 question is what things you trade away.  The speed of counting all the rows
 in the table seems to me to be a good thing to trade away, because it's very
 rare that you actually need to know that.

  If you're interested in all the records in a table, there is no way to
  have an engraved-in-stone answer

 Sure there is: SELECT count(*) from table.  That tells you how many rows
 there were in the table when your transaction started (modulo read
 visibility rules), just like in any other database system.

And if, for some god forsaken reason, you need to operate on that
number, there's always lock table...

I feel dirty. :)



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Andrew Sullivan
On Wed, Jan 09, 2008 at 11:03:59AM -0600, Scott Marlowe wrote:
 
 And if, for some god forsaken reason, you need to operate on that
 number, there's always lock table...

Yes.  You could also store the data in ISAM :-P

 I feel dirty. :)

You should.  Go wash your brain out with soap.  LOCK TABLE indeed.

A




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Joshua D. Drake

On Wed, 9 Jan 2008 20:01:05 +0100
Ivan Sergio Borgonovo [EMAIL PROTECTED] wrote:

 On Wed, 9 Jan 2008 10:30:45 -0600
 Scott Marlowe [EMAIL PROTECTED] wrote:
 
  Now, everything's a tradeoff.  If PostgreSQL had visibility
  information in the indexes, it would have to lock both the table and
  index for every write, thus slowing down all the other queries that
  are trying to access the table.  It would be a tradeoff that
  sacrificed write speed for read speed. In a db that was used mostly
  for writing, it would likely be a fair trade.  In a db that did a
  lot of writing, it might slow the whole thing to a crawl.
 
 OK... we are getting near to the point. I understand the trade-off
 problem in storing in indexes whether the row is still there.
 Is there a way to get the count of the rows that *may be* there, 

If you analyze regularly you can use pg_class. It isn't exact but is
usually close enough (especially if you are just using it for something
like pagination).
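
For example (a sketch; 'mytable' is a made-up name):

-- the planner's estimate, refreshed by ANALYZE / autovacuum
SELECT reltuples::bigint AS approx_rows
FROM pg_class WHERE relname = 'mytable';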

Joshua D. Drake



-- 
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/ 
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'





Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Ivan Sergio Borgonovo
On Wed, 9 Jan 2008 10:30:45 -0600
Scott Marlowe [EMAIL PROTECTED] wrote:

 Now, everything's a tradeoff.  If PostgreSQL had visibility
 information in the indexes, it would have to lock both the table and
 index for every write, thus slowing down all the other queries that
 are trying to access the table.  It would be a tradeoff that
 sacrificed write speed for read speed. In a db that was used mostly
 for writing, it would likely be a fair trade.  In a db that did a
 lot of writing, it might slow the whole thing to a crawl.

OK... we are getting near to the point. I understand the trade-off
problem in storing in indexes whether the row is still there.
Is there a way to get the count of the rows that *may be* there, I
mean the ones that satisfy the where clauses no matter whether they
were deleted in the meanwhile, and reindex on a regular basis?
That would be simple, and I think fast, and an acceptable solution for
most CMSes.

 Now, if there's a where clause that's selective enough, then a
 count(*) query may use the index and be a bit faster, but remember,
 in pgsql, it's still got to actually hit the table to see if each
 tuple really is visible to this transaction, so the index needs to
 be fairly selective to be a win.

But well, if the query is not selective enough I think the problem is
shared with other DBs as well.

 A possible workaround is to have something like a separate table
 with nothing but the IDs and whatever would be in your where clause
 for the tables you're accessing with a foreign key to it, and use
 THAT for a count(*).  Since the rows are skinnier, the count(*)
 will be faster. Another alternative is to have a trigger fire that
 keeps a track of the size of the table in a summary table when rows
 are added and deleted.

If the where clause is dynamic, how can it help?

 select count(*) from bigtable where inserttime > now() - interval '1 week';
   count
 ---------
  1254269
 (1 row)
 
 Time: 21422.368 ms
 
 Second run:
 select count(*) from bigtable where inserttime > now() - interval '1 week';
   count
 ---------
  1254320
 (1 row)
 
 Time: 2001.204 ms
 
 With the data loaded into shared_buffers / linux kernel cache,
 that's not too bad.
 
 Now, I run the same query against our production oracle machine,
 which is a MUCH more powerful server...
 
 SQL> select count(*) from bigtable where inserttime > SYSDATE-7;
 
   COUNT(*)
 ----------
    1255972
 
 Elapsed: 00:00:18.62
 
 second run:
 SQL> select count(*) from bigtable where inserttime > SYSDATE-7;
 
   COUNT(*)
 ----------
    1255973
 Elapsed: 00:00:00.98
 
 Now, obviously, Oracle's got some optimizations for what it's got in
 the buffer there, but the first run isn't really any faster. In
 fact, for a much more powerful machine, the performance was,
 relatively speaking, pretty bad compared to my little 1 CPU 1 sw
 RAID-10 reporting server.

Interesting... oh let's put some emotions in it: cool ;)

 So, while PostgreSQL's count(*) performance isn't blindingly fast,
 it's not the dog some people make it out to be either.

Continuing here from your previous post... while you guys are
concerned with competing with big boxes running Oracle, or thinking
about Google's webfarms, I'm thinking about stuff that is in between a
mere CMS for mortals and stuff that deals with money, where you don't
have budgets and hits enough to justify complex logic to do stats
analysis or collect enough stats to make any forecast reasonable.

In this context a simpler, faster, even if less accurate count may be
very handy.

thanks again

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Ivan Sergio Borgonovo
On Wed, 9 Jan 2008 10:58:29 -0800
Joshua D. Drake [EMAIL PROTECTED] wrote:

  OK... we are getting near to the point. I understand the trade-off
  problem in storing into indexes id the row is still there.
  Is there a way to get the count of the rows that *may be* there, 

 If you analyze regularly you can use pg_class. It isn't exact but is
 usually close enough (especially if you are just using it for
 something like pagination).

But what if I have a

select count(*) from table where condition;

where condition involves just indexed columns, and I want to trust the
indexes and am not concerned about deleted rows?
Just to get an estimate between reindexing cycles; that would be
perfect for paging.

pg_class looks like it returns the count for all the rows.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Zoltan Boszormenyi

Ivan Sergio Borgonovo wrote:

 On Wed, 9 Jan 2008 13:04:39 +0100
 Harald Armin Massa [EMAIL PROTECTED] wrote:

  Ivan,

   Please forgive my naiveness in this field, but what does an exact
   count mean, what do other DBs mean by an exact count, and
   how do other DBs deal with it?

  PostgreSQL will give you an exact count of the contents of the
  database as it is at the moment you begin your count (i.e. when the
  transaction starts).

  BUT as the table is not locked, somebody can bulkload MANY items
  into the database in parallel, so at the moment (start of your
  transaction) + 1 msec your count may already be invalid.

 That's reasonable. What do other DBs do, and what is the SQL standard
 requirement for count(*)?

   I'd expect it to perform as well as or better than other DBs, since
   now the bottleneck should be how efficiently it can filter records...
   but still a count(*) with a where clause will incur the same
   problem of what exact means.

  I know of 3 concepts to answer count() faster than PostgreSQL:

  1) just lie, present an estimate
  2) do not have MVCC
  3) store record-deleted info in the index, so you can answer count()
  with only scanning the index

 Sorry if I insist, but I think this is a well-worn selling point against
 postgresql.
 One of the most recurring arguments about why someone else's db is
 better than pg is "count is slow".

 Who lies?
 If it is possible to make count run faster, what are the trade-offs
 that make it unreasonable to implement in pg?

The decision to use MVCC in PostgreSQL makes the point moot.
What your transaction cannot see (e.g. another transaction inserted
a new record but didn't COMMIT yet) cannot be counted as visible.
It's not only theoretical but also practical. Compare that to the
MaxDB way. At the very lowest level MaxDB keeps the records
in a balanced b-tree based on its primary key. Look at this URL:
http://blog.ulf-wendel.de/?p=76
MaxDB keeps track of how many records are in the tree at all times,
which is handy if your database is close to idle. Consequences:

- very fast COUNT(*) when idle
- records unCOMMITted by other transactions can be seen by COUNT(*)
 if your transaction is in read committed isolation level, but not the
 actual records
- COUNT(*) waits for all other transactions that modified the table
 in question to finish if your transaction is in repeatable read or
 serializable isolation level. Consider that transactions can take
 a long time to finish if they process many things. This way your
 SELECT COUNT(*) doesn't respond instantly but doesn't slow down
 your server either. But the end user perception is the same:
 COUNT(*) is slow!

In PostgreSQL, COUNT(*) responds at close to the same speed regardless
of other transactions. Which way do you prefer?


 This is not a very useful question, but 3) implies that a select scans
 the index, returns the rows and only later checks if they are still
 there. Is that it?
 And since indexes aren't updated on the fly you may get back a
 larger number than the real value.

 Let me consider an everyday use where count() looks like the most
 obvious solution: paging.

 I search through a table and I need to know which is the last page.
 Concurrently someone is deleting a ton of records.
 No matter whether count() is fast or not, when I output the pager it
 will be wrong. But I'll still need an estimate of the number of pages;
 it is not fun if that estimate is wrong *and* slow.
 And once you add the where clauses there is no way to cache the count.
 Is there a way to count based on indexes without taking into account
 deleted rows so as to count faster?

 I can make the search faster using indexes as Dann Corbit suggested,
 but as you imply that won't make count as fast as the
 competition[1] that lies or doesn't use MVCC or stores deleted info in
 indexes.

 SELECT reltuples FROM pg_class WHERE relname = 'table_name';

 doesn't apply since you can't add conditions.

 Please be patient. I hear this over and over and over: postgresql is
 faulty because it can't count fast.
 And the only replies I've seen are: it is bad design to use count,
 you're a dumb ass. I admit I may be a dumb ass, but it is hard to
 sell Postgres if I continue to be a dumb ass ;)

 - What does count(*) really do?
 Now I understand that count(*) returns the # of rows as it sees them
 at the moment it was invoked. That should be true for other DBs as
 well. That means that unless other DBs lock the table they can't take
 into account records that are going to be deleted once the count has
 been issued.

 - When can count be avoided?
 Well, since even other DBs may report the wrong number, this makes
 count() look less like a Holy Grail. But still, if you need an
 estimate, wouldn't it be better to have it quickly?
 How does postgresql compare to other DBs when you run:
 select count(*) from table where conditions
 once you use indexes?
 If such a query will have to scan the results anyway to see if
 they are still there, since info about deleted records isn't stored
 in indexes, 

Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Scott Marlowe
On Jan 9, 2008 12:58 PM, Joshua D. Drake [EMAIL PROTECTED] wrote:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 On Wed, 9 Jan 2008 20:01:05 +0100
 Ivan Sergio Borgonovo [EMAIL PROTECTED] wrote:

  On Wed, 9 Jan 2008 10:30:45 -0600
  Scott Marlowe [EMAIL PROTECTED] wrote:
 
   Now, everything's a tradeoff.  If PostgreSQL had visibility
   information in the indexes, it would have to lock both the table and
   index for every write, thus slowing down all the other queries that
   are trying to access the table.  It would be a tradeoff that
   sacrificed write speed for read speed. In a db that was used mostly
   for writing, it would likely be a fair trade.  In a db that did a
   lot of writing, it might slow the whole thing to a crawl.
 
  OK... we are getting near to the point. I understand the trade-off
  problem in storing in indexes whether the row is still there.
  Is there a way to get the count of the rows that *may be* there,

 If you analyze regularly you can use pg_class. It isn't exact but is
 usually close enough (especially if you are just using it for something
 like pagination).

Yeah, but the OP's point was that it doesn't work if you have a where clause.

I could see a use for an approximate count(*) with where clause, just
like I could see a use for the ability to retrieve random rows from a
table without using order by random() on it.  And those are both
things that would require some form of hacking in the db that I'm
certainly not capable of pulling off...



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Ivan Sergio Borgonovo
On Wed, 09 Jan 2008 20:29:39 +0100
Zoltan Boszormenyi [EMAIL PROTECTED] wrote:

 The decision to use MVCC in PostgreSQL makes the point moot.

...

thanks.

 In PostgreSQL, COUNT(*) responds closely at the same speed
 regardless of other transactions. Which way do you prefer?

Considering the relative value of count my interest was for something
that is even less precise than the usual count but performs better.
I'm not proposing to turn Postgres into MySQL.

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it




Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Zoltan Boszormenyi

Ivan Sergio Borgonovo wrote:

 On Wed, 09 Jan 2008 20:29:39 +0100
 Zoltan Boszormenyi [EMAIL PROTECTED] wrote:

  The decision to use MVCC in PostgreSQL makes the point moot.

 ...

 thanks.

  In PostgreSQL, COUNT(*) responds at close to the same speed
  regardless of other transactions. Which way do you prefer?

 Considering the relative value of count my interest was for something
 that is even less precise than the usual count but performs better.
 I'm not proposing to turn Postgres into MySQL.


This below might be a good compromise,
although every INSERT/DELETE will be a bit slower
because of the additional UPDATE on the administrative table.

create table rec_count (tablename text unique, rec_cnt bigint)
with (fillfactor = 50);


Add any tables you want to it with their current record count and
for any tables you want to watch:

create or replace function inc_tablename_rec_cnt()
returns trigger as $$
begin
    update rec_count set rec_cnt = rec_cnt + 1
        where tablename = 'tablename';
    return new;
end; $$ language plpgsql;

create or replace function dec_tablename_rec_cnt()
returns trigger as $$
begin
    update rec_count set rec_cnt = rec_cnt - 1
        where tablename = 'tablename';
    return old;
end; $$ language plpgsql;

create trigger ins_tablename_rec_cnt after insert on tablename
    for each row execute procedure inc_tablename_rec_cnt();
create trigger del_tablename_rec_cnt after delete on tablename
    for each row execute procedure dec_tablename_rec_cnt();


The administrative table will be a highly updated one if you want
to watch a high-traffic table, hence the FILLFACTOR setting.
You may need to adjust max_fsm_pages. Later, you can do a

SELECT rec_cnt FROM rec_count WHERE tablename = 'tablename';

which will be fast and depending on the initial value of COUNT(*)
it will be very close to the exact figure. You can extend the example
with more columns if you know your SELECT COUNT(*) ... WHERE
conditions in advance but this way you have to keep several administrative
tables for different monitored tables. Again, this trades some disk space
and INSERT/DELETE operation speed on the monitored tables for
quicker count.

-- 
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/





Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Chris Browne
[EMAIL PROTECTED] (Zoltan Boszormenyi) writes:
 which will be fast and depending on the initial value of COUNT(*)
 it will be very close to the exact figure. You can extend the example
 with more columns if you know your SELECT COUNT(*) ... WHERE
 conditions in advance but this way you have to keep several administrative
 tables for different monitored tables. Again, this trades some disk space
 and INSERT/DELETE operation speed on the monitored tables for
 quicker count.

Actually, this approach will be Really Terrible for any cases where
multiple connections are adding/deleting tuples concurrently, as it
will force ALL updates to serialize behind the update to the central
table.

Occasionally, you'll have something even worse, namely a deadlock,
where two or more of the updates fighting over the single summary
tuple fall into a bad state, and one of them is forced to give up,
potentially rolling back its whole transaction.

[Waving hands for a moment]

What I would do *instead* would be for each INSERT to add a tuple with
a count of 1, and for each DELETE to add a tuple with count of -1, and
then to periodically have a single process walk through to summarize
the table.  There may be a further optimization to be had by doing a
per-statement trigger that counts the number of INSERTs/DELETEs done,
so that inserting 30 tuples (in the table being tracked) leads to
adding a single tuple with count of 30 in the summary table.

That changes the tradeoffs, again...

 - Since each INSERT/DELETE is simply doing an INSERT into the summary
   table, the ongoing activity is *never* blocking anything

 - You get the count by requesting 
SELECT SUM(rec_cnt) as record_count from rec_count where tablename = 'foo';

 - Note that the query is MVCC-consistent with the table!

 - Once in a while, you'll want to run a single procedure that, for
   each table, deletes all the existing records, and replaces them
   with a single one consisting of the sum of the individual values.

 - You can re-sync a table by running the query:
   begin;
   delete from record_count where tablename = 'foo';
   insert into record_count(tablename, rec_cnt)
       select 'foo', (select count(*) from foo);
   commit;
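
A hedged sketch of the row-level variant of this scheme, assuming the
rec_count table from upthread but *without* its unique constraint on
tablename (the table name 'foo' is just an example):

create or replace function foo_count_delta() returns trigger as $$
begin
    if TG_OP = 'INSERT' then
        insert into rec_count(tablename, rec_cnt) values ('foo', 1);
    else  -- TG_OP = 'DELETE'
        insert into rec_count(tablename, rec_cnt) values ('foo', -1);
    end if;
    return null;  -- AFTER trigger: the return value is ignored
end; $$ language plpgsql;

create trigger foo_count_delta after insert or delete on foo
    for each row execute procedure foo_count_delta();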
-- 
output = reverse(ofni.sesabatadxunil @ enworbbc)
http://www3.sympatico.ca/cbbrowne/languages.html
Rules of  the Evil Overlord #174.  If I am dangling  over a precipice
and the hero reaches  his hand down to me, I will  not attempt to pull
him down with  me. I will allow him to rescue  me, thank him properly,
then return to the safety of my fortress and order his execution.
http://www.eviloverlord.com/



Re: [GENERAL] count(*) and bad design was: Experiences with extensibility

2008-01-09 Thread Sim Zacks
It would be an administrative nightmare unless you had very few where clauses 
that you were tracking.


Instead of using a trigger, you could use Listen/Notify to call a daemon on the 
server to run the procedure and then you have no insert/delete overhead.

Or you could call the function on a cron job every 10 minutes...
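
A rough sketch of the Listen/Notify variant (the channel name is made
up):

-- fired from the trigger or the application after changing the table:
NOTIFY refresh_rec_count;

-- the daemon's session registers once with:
LISTEN refresh_rec_count;
-- and on each notification runs the summarizing procedure from above.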

Chris Browne wrote:

[EMAIL PROTECTED] (Zoltan Boszormenyi) writes:

which will be fast and depending on the initial value of COUNT(*)
it will be very close to the exact figure. You can extend the example
with more columns if you know your SELECT COUNT(*) ... WHERE
conditions in advance but this way you have to keep several administrative
tables for different monitored tables. Again, this trades some disk space
and INSERT/DELETE operation speed on the monitored tables for
quicker count.


Actually, this approach will be Really Terrible for any cases where
multiple connections are adding/deleting tuples concurrently, as it
will force ALL updates to serialize behind the update to the central
table.

Occasionally, you'll have something even worse, namely a deadlock,
where two or more of the updates fighting over the single summary
tuple fall into a bad state, and one of them is forced to give up,
potentially rolling back its whole transaction.

[Waving hands for a moment]

What I would do *instead* would be for each INSERT to add a tuple with
a count of 1, and for each DELETE to add a tuple with count of -1, and
then to periodically have a single process walk through to summarize
the table.  There may be a further optimization to be had by doing a
per-statement trigger that counts the number of INSERTs/DELETEs done,
so that inserting 30 tuples (in the table being tracked) leads to
adding a single tuple with count of 30 in the summary table.

That changes the tradeoffs, again...

 - Since each INSERT/DELETE is simply doing an INSERT into the summary
   table, the ongoing activity is *never* blocking anything

 - You get the count by requesting 
SELECT SUM(rec_cnt) as record_count from rec_count where tablename = 'foo';


 - Note that the query is MVCC-consistent with the table!

 - Once in a while, you'll want to run a single procedure that, for
   each table, deletes all the existing records, and replaces them
   with a single one consisting of the sum of the individual values.

 - You can re-sync a table by running the query:
   begin;
   delete from record_count where tablename = 'foo';
   insert into record_count(tablename, rec_cnt)
       select 'foo', (select count(*) from foo);
   commit;

