Re: [PERFORM] Select max(foo) and select count(*) optimization
On January 6, 2004 01:42 am, Shridhar Daithankar wrote:
> On Tuesday 06 January 2004 01:22, Rod Taylor wrote:
> > Anyway, with Rules you can force this:
> >
> >   ON INSERT UPDATE counter SET tablecount = tablecount + 1;
> >   ON DELETE UPDATE counter SET tablecount = tablecount - 1;
>
> That would generate a lot of dead tuples in the counter table. How about
>
>   select relpages, reltuples from pg_class where relname = 'tablename';
>
> Assuming the stats are recent enough, it would be much faster and
> accurate..

Well, I did this:

  cert=# select relpages, reltuples from pg_class where relname = 'certificate';
   relpages |  reltuples
  ----------+-------------
     399070 | 2.48587e+07
  (1 row)

Casting seemed to help:

  cert=# select relpages, reltuples::bigint from pg_class where relname = 'certificate';
   relpages | reltuples
  ----------+-----------
     399070 |  24858736
  (1 row)

But:

  cert=# select count(*) from certificate;
  [*Crunch* *Crunch* *Crunch*]
    count
  ----------
   19684668
  (1 row)

Am I missing something?  Max certificate_id is 20569544, btw.

--
D'Arcy J.M. Cain [EMAIL PROTECTED]|vex}.net |  Democracy is three wolves
http://www.druid.net/darcy/              |  and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP)        |  what's for dinner.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Select max(foo) and select count(*) optimization
On Tuesday 06 January 2004 17:48, D'Arcy J.M. Cain wrote:
> cert=# select relpages, reltuples::bigint from pg_class where relname = 'certificate';
>  relpages | reltuples
> ----------+-----------
>    399070 |  24858736
> (1 row)
>
> But:
>
> cert=# select count(*) from certificate;
> [*Crunch* *Crunch* *Crunch*]
>   count
> ----------
>  19684668
> (1 row)
>
> Am I missing something?  Max certificate_id is 20569544 btw.

Do 'vacuum analyze certificate' and try..:-)

The numbers from pg_class are estimates updated by vacuum/analyze. Of course
you need to run vacuum frequently enough for those statistics to stay up to
date, or run the autovacuum daemon.

Ran into the same problem on my machine till I remembered about vacuum..:-)

Shridhar
Re: [PERFORM] Select max(foo) and select count(*) optimization
On Tue, 2004-01-06 at 07:20, Shridhar Daithankar wrote:
> Do 'vacuum analyze certificate' and try..:-)
>
> The numbers from pg_class are estimates updated by vacuum/analyze. Of course
> you need to run vacuum frequently enough for those statistics to stay up to
> date, or run the autovacuum daemon. Ran into the same problem on my machine
> till I remembered about vacuum..:-)

Actually you only need to run analyze to update the statistics.

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Re: [PERFORM] Select max(foo) and select count(*) optimization
Robert Treat wrote:
> On Tue, 2004-01-06 at 07:20, Shridhar Daithankar wrote:
> > The numbers from pg_class are estimates updated by vacuum/analyze. Of
> > course you need to run vacuum frequently enough for those statistics to
> > stay up to date, or run the autovacuum daemon. Ran into the same problem
> > on my machine till I remembered about vacuum..:-)
>
> Actually you only need to run analyze to update the statistics.

Old habits die hard..:-)

shridhar
Re: [PERFORM] Select max(foo) and select count(*) optimization
If this situation persists after 'analyze certificate', then you need to
either:

- increase the statistics target:

    alter table certificate alter column certificate_id set statistics 100;

- or run 'vacuum full certificate'

i.e. there are lots of (dead) updated or deleted tuples in the relation,
distributed in such a way as to throw off analyze's estimate.

regards

Mark

D'Arcy J.M. Cain wrote:
> Well, I did this:
>
> cert=# select relpages, reltuples from pg_class where relname = 'certificate';
>  relpages |  reltuples
> ----------+-------------
>    399070 | 2.48587e+07
> (1 row)
>
> Casting seemed to help:
>
> cert=# select relpages, reltuples::bigint from pg_class where relname = 'certificate';
>  relpages | reltuples
> ----------+-----------
>    399070 |  24858736
> (1 row)
>
> But:
>
> cert=# select count(*) from certificate;
> [*Crunch* *Crunch* *Crunch*]
>   count
> ----------
>  19684668
> (1 row)
>
> Am I missing something?  Max certificate_id is 20569544 btw.
Re: [PERFORM] Select max(foo) and select count(*) optimization
On January 6, 2004 07:20 am, Shridhar Daithankar wrote:
> Do 'vacuum analyze certificate' and try..:-)

Kind of invalidates the part about being accurate then, don't it? Besides, I
vacuum that table every day (*) and we have reorganized the schema so that
we never update it except in exceptional cases. I would be less surprised if
the result was less than the real count, since we only insert into that
table.

In any case, if I have to vacuum a 20,000,000 row table to get an accurate
count then I may as well run count(*) on it.

(*): Actually I only analyze, but I understand that that should be
sufficient.

--
D'Arcy J.M. Cain [EMAIL PROTECTED]|vex}.net |  Democracy is three wolves
http://www.druid.net/darcy/              |  and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP)        |  what's for dinner.
Re: [PERFORM] Select max(foo) and select count(*) optimization
D'Arcy J.M. Cain [EMAIL PROTECTED]|vex}.net writes:
> In any case, if I have to vacuum a 20,000,000 row table to get an accurate
> count then I may as well run count(*) on it.
>
> (*): Actually I only analyze but I understand that that should be
> sufficient.

ANALYZE without VACUUM will deliver a not-very-accurate estimate, since it
only looks at a sample of the table's pages and doesn't grovel through every
one. Any of the VACUUM variants, on the other hand, will set
pg_class.reltuples reasonably accurately (as the number of rows actually
seen and left undeleted by the VACUUM pass).

There are pathological cases where ANALYZE's estimate of the overall row
count can be horribly bad: mainly, when the early pages of the table are
empty or nearly so, but there are well-filled pages out near the end. I have
a TODO item to try to make ANALYZE less prone to getting fooled that way...

regards, tom lane
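[A minimal sketch of the distinction Tom describes, using the thread's table
name. This is illustrative, not from the original messages: any VACUUM
variant refreshes pg_class.reltuples from a full scan, while ANALYZE alone
only samples pages.]

```sql
-- ANALYZE samples a subset of pages, so the reltuples estimate can drift:
ANALYZE certificate;

-- A plain VACUUM visits every page, leaving reltuples close to the
-- true (undeleted) row count:
VACUUM certificate;

-- Then the catalog estimate should track count(*) closely:
SELECT relpages, reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'certificate';
```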
[PERFORM] Select max(foo) and select count(*) optimization
Speaking of special cases (well, I was on the admin list) there are two
kinds that would really benefit from some attention.

1. The query "select max(foo) from bar" where the column foo has an index.
   Aren't indexes ordered? If not, an ordered index would be useful in this
   situation so that this query, rather than doing a sequential scan of the
   whole table, would just ask the index for the max value and return nearly
   instantly.

2. The query "select count(*) from bar". Surely the total number of rows in
   a table is kept somewhere convenient. If not, it would be nice if it
   could be :) Again, rather than doing a sequential scan of the entire
   table, this type of query could return instantly.

I believe MySQL does both of these optimizations (which are probably a lot
easier in that product, given its data storage system). These were the first
areas where I noticed a big performance difference between MySQL and
Postgres.

Especially with very large tables, hearing the disks grind as Postgres scans
every single row in order to determine the number of rows in a table or the
max value of a column (even a primary key created from a sequence) is pretty
painful. If the implementation is not too horrendous, this is an area where
an orders-of-magnitude performance increase can be had.

-John
Re: [PERFORM] Select max(foo) and select count(*) optimization
On 1/5/04 2:52 PM, Rod Taylor wrote:
> max(foo) optimizations require an extension to the aggregates system. It
> will likely happen within a few releases.

Looking forward to it.

> A work around can be accomplished today through the use of LIMIT and
> ORDER BY.

Wowzers, I never imagined that that'd be so much faster. Thanks! :)

-John
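[The workaround Rod refers to is not spelled out in the thread; a hedged
sketch, assuming an index on bar.foo, would look like this:]

```sql
-- In PostgreSQL of this era, max() cannot use an index and always
-- scans the whole table:
--   SELECT max(foo) FROM bar;
-- The equivalent ORDER BY / LIMIT form can walk the index from the
-- high end and stop after the first matching row:
SELECT foo
FROM bar
ORDER BY foo DESC
LIMIT 1;
```

One caveat: if foo is nullable, max() ignores NULLs but this query may
return one (NULLs sort as largest by default), so adding WHERE foo IS NOT
NULL restores the equivalence.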
Re: [PERFORM] Select max(foo) and select count(*) optimization
Oops! [EMAIL PROTECTED] (John Siracusa) was seen spray-painting on a wall:
> Speaking of special cases (well, I was on the admin list) there are two
> kinds that would really benefit from some attention.
>
> 1. The query "select max(foo) from bar" where the column foo has an index.
>    [...] rather than doing a sequential scan of the whole table, would
>    just ask the index for the max value and return nearly instantly.
>
> 2. The query "select count(*) from bar". Surely the total number of rows
>    in a table is kept somewhere convenient. [...]
>
> If the implementation is not too horrendous, this is an area where an
> orders-of-magnitude performance increase can be had.

These are both VERY frequently asked questions.

In the case of question #1, the optimization you suggest could be
accomplished via some Small Matter Of Programming. None of the people that
have wanted the optimization have, however, offered to actually DO the
programming.

In the case of #2, the answer is surely NOT. In MVCC databases, that
information CANNOT be stored anywhere convenient, because queries requested
by transactions started at different points in time must get different
answers.

I think we need to add these questions and their answers to the FAQ so that
the answer can be "See FAQ Item #17" rather than people having to
gratuitously explain it over and over and over again.
--
(reverse (concatenate 'string "moc.enworbbc" "@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/finances.html
Rules of the Evil Overlord #127. "Prison guards will have their own cantina
featuring a wide variety of tasty treats that will deliver snacks to the
guards while on duty. The guards will also be informed that accepting food
or drink from any other source will result in execution."
http://www.eviloverlord.com/
Re: [PERFORM] Select max(foo) and select count(*) optimization
Not that I'm offering to do the programming, mind you :) but...

In the case of select count(*), one optimization is to do a scan of the
primary key, not the table itself, if the table has a primary key. In a
certain commercial, lesser database, this is called an "index fast full
scan". It would be important to scan the index in physical order (sequential
physical IO) and not in key order (random physical IO).

I'm guessing the payoff, as well as the real-world utility, of a max(xxx)
optimization is much higher than a count(*) optimization, tho.

On Mon, 2004-01-05 at 12:26, Christopher Browne wrote:
> These are both VERY frequently asked questions.
>
> In the case of question #1, the optimization you suggest could be
> accomplished via some Small Matter Of Programming. None of the people that
> have wanted the optimization have, however, offered to actually DO the
> programming.
>
> In the case of #2, the answer is surely NOT. In MVCC databases, that
> information CANNOT be stored anywhere convenient, because queries requested
> by transactions started at different points in time must get different
> answers.
>
> I think we need to add these questions and their answers to the FAQ so
> that the answer can be "See FAQ Item #17" rather than people having to
> gratuitously explain it over and over and over again.
Re: [PERFORM] Select max(foo) and select count(*) optimization
Paul Tuckfield [EMAIL PROTECTED] writes:
> In the case of select count(*), one optimization is to do a scan of the
> primary key, not the table itself, if the table has a primary key. In a
> certain commercial, lesser database, this is called an "index fast full
> scan". It would be important to scan the index in physical order
> (sequential physical IO) and not in key order (random physical IO).

That won't work because you still have to hit the actual tuple to determine
visibility.

-Doug
Re: [PERFORM] Select max(foo) and select count(*) optimization
Martha Stewart called it a Good Thing when [EMAIL PROTECTED] (Paul Tuckfield)
wrote:
> Not that I'm offering to do the programming, mind you :) but...
>
> In the case of select count(*), one optimization is to do a scan of the
> primary key, not the table itself, if the table has a primary key. In a
> certain commercial, lesser database, this is called an "index fast full
> scan". It would be important to scan the index in physical order
> (sequential physical IO) and not in key order (random physical IO).

The problem is that this optimization does not actually work. The index does
not contain transaction visibility information, so you have to go to the
pages of tuples in order to determine if any given tuple is visible.

> I'm guessing the payoff, as well as the real-world utility, of a max(xxx)
> optimization is much higher than a count(*) optimization, tho.

That's probably so. In many cases, approximations, such as page counts, may
be good enough, and pray consider that an approximation is probably all you
were getting from the database systems that had an optimization to store the
count in a counter.
--
let name = "cbbrowne" and tld = "ntlug.org" in name ^ "@" ^ tld;;
http://www3.sympatico.ca/cbbrowne/linuxxian.html
"No, you misunderstand. Microsoft asked some hackers how they could make
their system secure - the hackers replied 'Turn it off.' So they did."
-- Anthony Ord
Re: [PERFORM] Select max(foo) and select count(*) optimization
[EMAIL PROTECTED] (Rod Taylor) wrote:
> > Especially with very large tables, hearing the disks grind as Postgres
> > scans every single row in order to determine the number of rows in a
> > table or the max value of a column (even a primary key created from a
> > sequence) is pretty painful. If the implementation is not too
> > horrendous, this is an area where an orders-of-magnitude performance
> > increase can be had.
>
> Actually, it's very painful. For MySQL, they've accepted the concurrency
> hit in order to accomplish it; PostgreSQL would require a more subtle
> approach.
>
> Anyway, with Rules you can force this:
>
>   ON INSERT UPDATE counter SET tablecount = tablecount + 1;
>   ON DELETE UPDATE counter SET tablecount = tablecount - 1;
>
> You need to create a table "counter" with a single row that will keep
> track of the number of rows in the table. Just remember, you've now
> serialized all writes to the table, but in your situation it may be worth
> while.

There's a still more subtle approach that relieves the serialization
constraint, at some cost...

- You add rules that _insert_ a row each time there is an insert/delete:

    ON INSERT insert into counts(table, value) values ('our_table', 1);
    ON DELETE insert into counts(table, value) values ('our_table', -1);

- The "select count(*) from our_table" is replaced by "select sum(value)
  from counts where table = 'our_table'".

- Periodically, a compression process goes through and either:

  a) Deletes the rows for 'our_table' and replaces them with one row with a
     conventionally-scanned 'count(*)' value, or

  b) Computes "select table, sum(value) as value from counts group by
     table", deletes all the existing rows in counts, and replaces them by
     the preceding selection, or

  c) Perhaps does something incremental that's like b), but which only
     processes parts of the counts table at once. Process 500 rows, then
     COMMIT, or something of the sort...

Note that this counts table can potentially grow _extremely_ large.

The win comes when it gets compressed, so that instead of scanning through
500K items, it index-scans through 27: the 1 that has the 497000 that was
the state of the table at the last compression, and then 26 singletons.

A further win comes in if an INSERT that adds 50 rows can lead to inserting
('our_table', 50) into COUNTS, or a delete that eliminates 5000 rows puts in
('our_table', -5000).

It's vital to run the compression reasonably often (much like VACUUM :-)) in
order that the COUNTS summary table stays relatively small.
--
let name = "cbbrowne" and tld = "cbbrowne.com" in String.concat "@" [name;tld];;
http://www3.sympatico.ca/cbbrowne/wp.html
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it." -- Brian W. Kernighan
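[The messages above give the rule actions in shorthand; a hedged sketch of
the full non-serializing variant follows. All names are illustrative, and
the column is called tablename here because "table" is a reserved word. The
compression step corresponds to variant b); a concurrent-safe version would
need to lock counts or work incrementally as in variant c).]

```sql
-- Summary table: one +1/-1 row per insert/delete, compressed periodically.
CREATE TABLE counts (
    tablename text   NOT NULL,
    value     bigint NOT NULL
);

-- Each write to our_table appends a delta row instead of updating a
-- single shared counter row, so writers don't serialize on one tuple:
CREATE RULE our_table_count_ins AS ON INSERT TO our_table
    DO INSERT INTO counts (tablename, value) VALUES ('our_table', 1);
CREATE RULE our_table_count_del AS ON DELETE TO our_table
    DO INSERT INTO counts (tablename, value) VALUES ('our_table', -1);

-- Replacement for: SELECT count(*) FROM our_table;
SELECT sum(value) FROM counts WHERE tablename = 'our_table';

-- Periodic compression: roll all deltas up into one row per table.
BEGIN;
CREATE TEMP TABLE counts_rollup AS
    SELECT tablename, sum(value) AS value FROM counts GROUP BY tablename;
DELETE FROM counts;
INSERT INTO counts SELECT * FROM counts_rollup;
DROP TABLE counts_rollup;
COMMIT;
```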
Re: [PERFORM] Select max(foo) and select count(*) optimization
On Tuesday 06 January 2004 07:16, Christopher Browne wrote:
> The problem is that this optimization does not actually work. The index
> does not contain transaction visibility information, so you have to go to
> the pages of tuples in order to determine if any given tuple is visible.

Adding transaction visibility information to indexes was rejected as an
idea. At the time I proposed it, my idea was to vacuum tuples at the page
level while postgresql pushes buffers out of the shared cache. If indexes
had visibility information, they could be cleaned in a different order than
the heap tuples. This wouldn't have eliminated vacuum entirely, but at least
frequently hit data would be clean.

But it was rejected because of the associated overhead. Just thought it
worth a mention..

Shridhar
Re: [PERFORM] Select max(foo) and select count(*) optimization
On Tuesday 06 January 2004 01:22, Rod Taylor wrote:
> Anyway, with Rules you can force this:
>
>   ON INSERT UPDATE counter SET tablecount = tablecount + 1;
>   ON DELETE UPDATE counter SET tablecount = tablecount - 1;

That would generate a lot of dead tuples in the counter table. How about

  select relpages, reltuples from pg_class where relname = 'tablename';

Assuming the stats are recent enough, it would be much faster and accurate..

Shridhar