Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-06 Thread D'Arcy J.M. Cain
On January 6, 2004 01:42 am, Shridhar Daithankar wrote:
 On Tuesday 06 January 2004 01:22, Rod Taylor wrote:
  Anyway, with Rules you can force this:
 
  ON INSERT UPDATE counter SET tablecount = tablecount + 1;
 
  ON DELETE UPDATE counter SET tablecount = tablecount - 1;

 That would generate a lot of dead tuples in the counter table. How about

 select relpages,reltuples from pg_class where relname=tablename;

 Assuming the stats are recent enough, it would be much faster and
 accurate..

Well, I did this:

cert=# select relpages,reltuples from pg_class where relname= 'certificate';
 relpages |  reltuples
--+-
   399070 | 2.48587e+07
(1 row)

Casting seemed to help:

cert=# select relpages,reltuples::bigint from pg_class where relname= 
'certificate';
 relpages | reltuples
--+---
   399070 |  24858736
(1 row)

But:

cert=# select count(*) from certificate;
[*Crunch* *Crunch* *Crunch*]
  count
--
 19684668
(1 row)

Am I missing something?  Max certificate_id is 20569544 btw.

-- 
D'Arcy J.M. Cain [EMAIL PROTECTED]|vex}.net   |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)   |  what's for dinner.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-06 Thread Shridhar Daithankar
On Tuesday 06 January 2004 17:48, D'Arcy J.M. Cain wrote:
 On January 6, 2004 01:42 am, Shridhar Daithankar wrote:
 cert=# select relpages,reltuples::bigint from pg_class where relname=
 'certificate';
  relpages | reltuples
 --+---
399070 |  24858736
 (1 row)

 But:

 cert=# select count(*) from certificate;
 [*Crunch* *Crunch* *Crunch*]
   count
 --
  19684668
 (1 row)

 Am I missing something?  Max certificate_id is 20569544 btw.

Do 'vacuum analyze certificate' and try..:-)

The numbers from pg_class are estimates updated by vacuum/analyze. Of course
you need to run vacuum frequently enough for the statistics to stay up to
date, or run the autovacuum daemon..

Ran into the same problem on my machine till I remembered about vacuum..:-)

 Shridhar




Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-06 Thread Robert Treat
On Tue, 2004-01-06 at 07:20, Shridhar Daithankar wrote:
 On Tuesday 06 January 2004 17:48, D'Arcy J.M. Cain wrote:
  On January 6, 2004 01:42 am, Shridhar Daithankar wrote:
  cert=# select relpages,reltuples::bigint from pg_class where relname=
  'certificate';
   relpages | reltuples
  --+---
 399070 |  24858736
  (1 row)
 
  But:
 
  cert=# select count(*) from certificate;
  [*Crunch* *Crunch* *Crunch*]
count
  --
   19684668
  (1 row)
 
  Am I missing something?  Max certificate_id is 20569544 btw.
 
 Do 'vacuum analyze certificate' and try..:-)
 
 The numbers from pg_class are estimates updated by vacuum/analyze. Of course
 you need to run vacuum frequently enough for the statistics to stay up to
 date, or run the autovacuum daemon..
 
 Ran into the same problem on my machine till I remembered about vacuum..:-)
 

Actually you only need to run analyze to update the statistics.

Robert Treat
-- 
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL




Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-06 Thread Shridhar Daithankar
Robert Treat wrote:
On Tue, 2004-01-06 at 07:20, Shridhar Daithankar wrote:

The numbers from pg_class are estimates updated by vacuum/analyze. Of course
you need to run vacuum frequently enough for the statistics to stay up to
date, or run the autovacuum daemon..
Ran into the same problem on my machine till I remembered about vacuum..:-)
Actually you only need to run analyze to update the statistics.
Old habits die hard..:-)

 shridhar



Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-06 Thread Mark Kirkwood
If this situation persists after 'analyze certificate', then you need to:

increase the statistics target: 'alter table certificate alter column
certificate_id set statistics 100'

or

'vacuum full certificate'

i.e. there are lots of (dead) updated or deleted tuples in the relation,
distributed in such a way as to throw off ANALYZE's estimate.
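Spelled out as statements, Mark's two remedies look like this (the target of 100 is simply the value he suggests; the default at the time was much smaller):

```sql
-- Sample the column more thoroughly, then re-analyze:
ALTER TABLE certificate ALTER COLUMN certificate_id SET STATISTICS 100;
ANALYZE certificate;

-- Or reclaim the dead space outright (takes an exclusive lock
-- on the table for the duration):
VACUUM FULL certificate;
```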

regards

Mark

D'Arcy J.M. Cain wrote:

Well, I did this:

cert=# select relpages,reltuples from pg_class where relname= 'certificate';
relpages |  reltuples
--+-
  399070 | 2.48587e+07
(1 row)
Casting seemed to help:

cert=# select relpages,reltuples::bigint from pg_class where relname= 
'certificate';
relpages | reltuples
--+---
  399070 |  24858736
(1 row)

But:

cert=# select count(*) from certificate;
[*Crunch* *Crunch* *Crunch*]
 count
--
19684668
(1 row)
Am I missing something?  Max certificate_id is 20569544 btw.



Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-06 Thread D'Arcy J.M. Cain
On January 6, 2004 07:20 am, Shridhar Daithankar wrote:
 On Tuesday 06 January 2004 17:48, D'Arcy J.M. Cain wrote:
  On January 6, 2004 01:42 am, Shridhar Daithankar wrote:
  cert=# select relpages,reltuples::bigint from pg_class where relname=
  'certificate';
   relpages | reltuples
  --+---
 399070 |  24858736
  (1 row)
 
  But:
 
  cert=# select count(*) from certificate;
  [*Crunch* *Crunch* *Crunch*]
count
  --
   19684668
  (1 row)
 
  Am I missing something?  Max certificate_id is 20569544 btw.

 Do 'vacuum analyze certificate' and try..:-)

Kind of invalidates the part about being accurate then, don't it?  Besides, I 
vacuum that table every day (*) and we have reorganized the schema so that we 
never update it except in exceptional cases.  I would be less surprised if 
the result was less than the real count since we only insert into that table.

In any case, if I have to vacuum a 20,000,000 row table to get an accurate 
count then I may as well run count(*) on it.

(*): Actually I only analyze but I understand that that should be sufficient.

-- 
D'Arcy J.M. Cain [EMAIL PROTECTED]|vex}.net   |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)   |  what's for dinner.



Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-06 Thread Tom Lane
D'Arcy J.M. Cain [EMAIL PROTECTED] writes:
 In any case, if I have to vacuum a 20,000,000 row table to get an accurate 
 count then I may as well run count(*) on it.
 (*): Actually I only analyze but I understand that that should be sufficient.

ANALYZE without VACUUM will deliver a not-very-accurate estimate, since it
only looks at a sample of the table's pages and doesn't grovel through
every one.  Any of the VACUUM variants, on the other hand, will set
pg_class.reltuples reasonably accurately (as the number of rows actually
seen and left undeleted by the VACUUM pass).

There are pathological cases where ANALYZE's estimate of the overall row
count can be horribly bad --- mainly, when the early pages of the table
are empty or nearly so, but there are well-filled pages out near the
end.  I have a TODO item to try to make ANALYZE less prone to getting
fooled that way...
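So, to get a trustworthy reltuples, the sequence is (hypothetical table name from earlier in the thread):

```sql
-- Any VACUUM variant counts the rows it actually sees and leaves
-- undeleted, so the stored estimate is close to exact:
VACUUM certificate;

-- reltuples is stored as a float4; cast for a readable row count:
SELECT relpages, reltuples::bigint
  FROM pg_class
 WHERE relname = 'certificate';
```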

regards, tom lane



[PERFORM] Select max(foo) and select count(*) optimization

2004-01-05 Thread John Siracusa
Speaking of special cases (well, I was on the admin list) there are two
kinds that would really benefit from some attention.

1. The query "select max(foo) from bar" where the column foo has an index.
Aren't indexes ordered?  If not, an ordered index would be useful in this
situation so that this query, rather than doing a sequential scan of the
whole table, would just ask the index for the max value and return nearly
instantly.

2. The query "select count(*) from bar".  Surely the total number of rows in
a table is kept somewhere convenient.  If not, it would be nice if it could
be :)  Again, rather than doing a sequential scan of the entire table, this
type of query could return instantly.

I believe MySQL does both of these optimizations (which are probably a lot
easier in that product, given its data storage system).  These were the
first areas where I noticed a big performance difference between MySQL and
Postgres.

Especially with very large tables, hearing the disks grind as Postgres scans
every single row in order to determine the number of rows in a table or the
max value of a column (even a primary key created from a sequence) is pretty
painful.  If the implementation is not too horrendous, this is an area where
an orders-of-magnitude performance increase can be had.

-John




Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-05 Thread John Siracusa
On 1/5/04 2:52 PM, Rod Taylor wrote:
 max(foo) optimizations requires an extension to the aggregates system.
 It will likely happen within a few releases.

Looking forward to it.

 A work around can be accomplished today through the use of LIMIT and ORDER BY.

Wowzers, I never imagined that that'd be so much faster.  Thanks! :)
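The workaround Rod mentions exploits an ordered index on the aggregated column; a sketch, using the table and column names from John's example:

```sql
-- Instead of the aggregate form, which (in 2004-era PostgreSQL)
-- falls back to a sequential scan:
--   SELECT max(foo) FROM bar;
-- walk the index from the high end and stop after one row:
SELECT foo FROM bar ORDER BY foo DESC LIMIT 1;
```

One caveat: if foo is nullable, add WHERE foo IS NOT NULL, since max() ignores NULLs but the ORDER BY form does not.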

-John




Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-05 Thread Christopher Browne
Oops! [EMAIL PROTECTED] (John Siracusa) was seen spray-painting on a wall:
 Speaking of special cases (well, I was on the admin list) there are two
 kinds that would really benefit from some attention.

 1. The query select max(foo) from bar where the column foo has an
 index.  Aren't indexes ordered?  If not, an ordered index would be
 useful in this situation so that this query, rather than doing a
 sequential scan of the whole table, would just ask the index for
 the max value and return nearly instantly.

 2. The query select count(*) from bar Surely the total number of
 rows in a table is kept somewhere convenient.  If not, it would be
 nice if it could be :) Again, rather than doing a sequential scan of
 the entire table, this type of query could return instantly.

 I believe MySQL does both of these optimizations (which are probably
 a lot easier in that product, given its data storage system).  These
 were the first areas where I noticed a big performance difference
 between MySQL and Postgres.

 Especially with very large tables, hearing the disks grind as
 Postgres scans every single row in order to determine the number of
 rows in a table or the max value of a column (even a primary key
 created from a sequence) is pretty painful.  If the implementation
 is not too horrendous, this is an area where an orders-of-magnitude
 performance increase can be had.

These are both VERY frequently asked questions.

In the case of question #1, the optimization you suggest could be
accomplished via some Small Matter Of Programming.  None of the people
that have wanted the optimization have, however, offered to actually
DO the programming.

In the case of #2, the answer is surely NOT.  In MVCC databases,
that information CANNOT be stored anywhere convenient because queries
requested by transactions started at different points in time must get
different answers.

I think we need to add these questions and their answers to the FAQ so
that the answer can be "See FAQ Item #17" rather than people having to
gratuitously explain it over and over and over again.
-- 
(reverse (concatenate 'string moc.enworbbc @ enworbbc))
http://www.ntlug.org/~cbbrowne/finances.html
Rules of  the Evil Overlord #127.  Prison guards will  have their own
cantina featuring  a wide  variety of tasty  treats that  will deliver
snacks to the  guards while on duty. The guards  will also be informed
that  accepting food or  drink from  any other  source will  result in
execution. http://www.eviloverlord.com/



Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-05 Thread Paul Tuckfield
Not that I'm offering to do the programming, mind you :) but . . 


In the case of select count(*), one optimization is to do a scan of the
primary key, not the table itself, if the table has a primary key. In a
certain commercial, lesser database, this is called an "index fast full
scan".  It would be important to scan the index in physical order
(sequential physical IO) and not in key order (random physical IO).

I'm guessing the payoff, as well as the real-world utility, of a max(xxx)
optimization are much higher than for a count(*) optimization, though.


On Mon, 2004-01-05 at 12:26, Christopher Browne wrote:
 Oops! [EMAIL PROTECTED] (John Siracusa) was seen spray-painting on a wall:
  Speaking of special cases (well, I was on the admin list) there are two
  kinds that would really benefit from some attention.
 
  1. The query select max(foo) from bar where the column foo has an
  index.  Aren't indexes ordered?  If not, an ordered index would be
  useful in this situation so that this query, rather than doing a
  sequential scan of the whole table, would just ask the index for
  the max value and return nearly instantly.
 
  2. The query select count(*) from bar Surely the total number of
  rows in a table is kept somewhere convenient.  If not, it would be
  nice if it could be :) Again, rather than doing a sequential scan of
  the entire table, this type of query could return instantly.
 
  I believe MySQL does both of these optimizations (which are probably
  a lot easier in that product, given its data storage system).  These
  were the first areas where I noticed a big performance difference
  between MySQL and Postgres.
 
  Especially with very large tables, hearing the disks grind as
  Postgres scans every single row in order to determine the number of
  rows in a table or the max value of a column (even a primary key
  created from a sequence) is pretty painful.  If the implementation
  is not too horrendous, this is an area where an orders-of-magnitude
  performance increase can be had.
 
 These are both VERY frequently asked questions.
 
 In the case of question #1, the optimization you suggest could be
 accomplished via some Small Matter Of Programming.  None of the people
 that have wanted the optimization have, however, offered to actually
 DO the programming.
 
 In the case of #2, the answer is surely NOT.  In MVCC databases,
 that information CANNOT be stored anywhere convenient because queries
 requested by transactions started at different points in time must get
 different answers.
 
 I think we need to add these questions and their answers to the FAQ so
 that the answer can be See FAQ Item #17 rather than people having to
 gratuitously explain it over and over and over again.




Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-05 Thread Doug McNaught
Paul Tuckfield [EMAIL PROTECTED] writes:

 In the case of select count(*), one optimization is to do  a scan of the
 primary key, not the table itself, if the table has a primary key. In a
 certain commercial, lesser database, this is called an index fast full
 scan.  It would be important to scan the index in physical order
 (sequential physical IO) and not in key order (random physical IO)

That won't work because you still have to hit the actual tuple to
determine visibility.

-Doug




Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-05 Thread Christopher Browne
Martha Stewart called it a Good Thing when [EMAIL PROTECTED] (Paul Tuckfield) wrote:
 Not that I'm offering to do the programming, mind you :) but . . 

 In the case of select count(*), one optimization is to do  a scan of the
 primary key, not the table itself, if the table has a primary key. In a
 certain commercial, lesser database, this is called an index fast full
 scan.  It would be important to scan the index in physical order
 (sequential physical IO) and not in key order (random physical IO)

The problem is that this optimization does not actually work.  The
index does not contain transaction visibility information, so you have
to go to the pages of tuples in order to determine if any given tuple
is visible.

 I'm guessing the payoff as well as real-world-utility of a max(xxx)
 optimization are much higher than a count(*) optimization tho

That's probably so.

In many cases, approximations, such as page counts, may be good
enough; and, pray consider, an approximation is probably all you were
getting from the database systems that had an optimization to store
the count in a counter.
-- 
let name=cbbrowne and tld=ntlug.org in name ^ @ ^ tld;;
http://www3.sympatico.ca/cbbrowne/linuxxian.html
"No, you misunderstand. Microsoft asked some hackers how they could
make their system secure - the hackers replied 'Turn it off.' So they
did." -- Anthony Ord



Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-05 Thread Christopher Browne
[EMAIL PROTECTED] (Rod Taylor) wrote:
 Especially with very large tables, hearing the disks grind as Postgres scans
 every single row in order to determine the number of rows in a table or the
 max value of a column (even a primary key created from a sequence) is pretty
 painful.  If the implementation is not too horrendous, this is an area where
 an orders-of-magnitude performance increase can  be had.

 Actually, it's very painful. For MySQL, they've accepted the concurrency
 hit in order to accomplish it -- PostgreSQL would require a more subtle
 approach.

 Anyway, with Rules you can force this:

 ON INSERT UPDATE counter SET tablecount = tablecount + 1;

 ON DELETE UPDATE counter SET tablecount = tablecount - 1;

 You need to create a table counter with a single row that will keep
 track of the number of rows in the table. Just remember, you've now
 serialized all writes to the table, but in your situation it may be
 worth while.

There's a still more subtle approach that relieves the serialization
constraint, at some cost...

- You add rules that _insert_ a row each time there is an
  insert/delete
   ON INSERT insert into counts(table, value) values ('our_table', 1);
   ON DELETE insert into counts(table, value) values ('our_table', -1);

- The select count(*) from our_table is replaced by select
  sum(value) from counts where table = 'our_table'

- Periodically, a compression process goes through and either:

a) Deletes the rows for 'our_table' and replaces them with one
   row with a conventionally-scanned 'count(*)' value, or

b) Computes select table, sum(value) as value from counts group
   by table, deletes all the existing rows in counts, and replaces
   them by the preceding selection, or

c) Perhaps does something incremental that's like b), but which
   only processes parts of the count table at once.  Process
   500 rows, then COMMIT, or something of the sort...

Note that this counts table can potentially grow _extremely_ large.
The win comes when it gets compressed, so that instead of scanning
through 500K items, it index-scans through 27 rows: the one that holds
the 497000 that was the state of the table at the last compression, and
then 26 singletons.

A further win comes if an INSERT that adds 50 rows can lead to
inserting ('our_table', 50) into COUNTS, or a delete that eliminates
5000 rows puts in ('our_table', -5000).

It's vital to run the compression reasonably often (much like VACUUM
:-)) in order that the COUNTS summary table stays relatively small.
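A minimal sketch of this scheme; table and rule names are illustrative, and the column is called tablename because "table" is a reserved word:

```sql
CREATE TABLE counts (tablename text, value bigint);

-- Delta rows instead of a single counter row, so concurrent
-- writers do not serialize on one hot tuple:
CREATE RULE our_table_count_ins AS ON INSERT TO our_table
    DO INSERT INTO counts (tablename, value) VALUES ('our_table', 1);
CREATE RULE our_table_count_del AS ON DELETE TO our_table
    DO INSERT INTO counts (tablename, value) VALUES ('our_table', -1);

-- Replacement for SELECT count(*) FROM our_table:
SELECT sum(value) FROM counts WHERE tablename = 'our_table';

-- Periodic compression (variant b): collapse all deltas to one
-- row per table, inside a transaction.
BEGIN;
CREATE TEMP TABLE counts_rollup AS
    SELECT tablename, sum(value) AS value FROM counts GROUP BY tablename;
DELETE FROM counts;
INSERT INTO counts SELECT tablename, value FROM counts_rollup;
DROP TABLE counts_rollup;
COMMIT;
```

One caveat worth flagging: because these rule actions do not reference NEW, they fire once per statement, not once per affected row, so a multi-row INSERT ... SELECT or DELETE would be miscounted; per-row triggers avoid that problem.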
-- 
let name=cbbrowne and tld=cbbrowne.com in String.concat @ [name;tld];;
http://www3.sympatico.ca/cbbrowne/wp.html
Debugging is twice  as hard as writing   the code in the first  place.
Therefore, if you write the code as cleverly as  possible, you are, by
definition, not smart enough to debug it.  -- Brian W. Kernighan



Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-05 Thread Shridhar Daithankar
On Tuesday 06 January 2004 07:16, Christopher Browne wrote:
 Martha Stewart called it a Good Thing when [EMAIL PROTECTED] (Paul 
Tuckfield) wrote:
  Not that I'm offering to do the programming, mind you :) but . .
 
  In the case of select count(*), one optimization is to do  a scan of the
  primary key, not the table itself, if the table has a primary key. In a
  certain commercial, lesser database, this is called an index fast full
  scan.  It would be important to scan the index in physical order
  (sequential physical IO) and not in key order (random physical IO)

 The problem is that this optimization does not actually work.  The
 index does not contain transaction visibility information, so you have
 to go to the pages of tuples in order to determine if any given tuple
 is visible.

Adding transaction visibility information to indexes was proposed and
rejected. At the time I proposed it, my idea was to vacuum tuples at the
page level while postgresql pushes buffers out of the shared cache. If
indexes had visibility information, they could be cleaned in a different
order than heap tuples.

This wouldn't have eliminated vacuum entirely, but at least frequently hit
data would be clean.

But it was rejected because of the associated overhead.

Just thought it worth a mention..

 Shridhar





Re: [PERFORM] Select max(foo) and select count(*) optimization

2004-01-05 Thread Shridhar Daithankar
On Tuesday 06 January 2004 01:22, Rod Taylor wrote:
 Anyway, with Rules you can force this:

 ON INSERT UPDATE counter SET tablecount = tablecount + 1;

 ON DELETE UPDATE counter SET tablecount = tablecount - 1;

That would generate a lot of dead tuples in the counter table. How about

select relpages,reltuples from pg_class where relname=tablename;

Assuming the stats are recent enough, it would be much faster and accurate..

 Shridhar

