Re: [PERFORM] Performance of count(*)

2007-03-23 Thread Tino Wildenhain

Michael Stone wrote:

On Fri, Mar 23, 2007 at 01:01:02PM +0100, Tino Wildenhain wrote:

This discussion is a bit theoretical until we see the actual problem
and the proposed solution here.


It's good to see you back off a bit from your previous stance of 
assuming that someone doesn't know what they're doing and that their 
solution is absolutely wrong without actually knowing anything about 
what they are trying to do.


Well I'm sure it's very likely wrong :-) At least the core part of
it, with the statement about "keeping index data outside postgres".

What I meant with my comment about it being theoretical: we cannot
make educated suggestions about alternative solutions to the problem
until we know the problem, and maybe the current solution, in detail.

Regards
Tino

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Performance of count(*)

2007-03-23 Thread Michael Stone

On Fri, Mar 23, 2007 at 01:01:02PM +0100, Tino Wildenhain wrote:

This discussion is a bit theoretical until we see the actual problem
and the proposed solution here.


It's good to see you back off a bit from your previous stance of 
assuming that someone doesn't know what they're doing and that their 
solution is absolutely wrong without actually knowing anything about 
what they are trying to do.


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-23 Thread Tino Wildenhain

Michael Stone wrote:

On Thu, Mar 22, 2007 at 06:27:32PM +0100, Tino Wildenhain wrote:

Craig A. James wrote:
You guys can correct me if I'm wrong, but the key feature that's 
missing from Postgres's flexible indexing is the ability to maintain 
state across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)


Did you read the email before correcting it? From the part you trimmed out:

The problem is that relational databases were invented before the web 
and its stateless applications.  In the "good old days", you could 
connect to a database and work for hours, and in that environment 
cursors and such work well -- the RDBMS maintains the internal state 
of the indexing system.  But in a web environment, state information 
is very difficult to maintain.  There are all sorts of systems that 
try (Enterprise Java Beans, for example), but they're very complex.


Yes, but actually this is not true. They are not so complex in this
regard. All you have to do is look in the pg_cursors view to see if
your cursor is there and, if not, create it in your session.
All you need to maintain is the cursor name, which maps to your
session + the specific query you run. This should be easy
in any web application.
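A minimal sketch of the naming scheme Tino describes (session + query mapping to a cursor name). The helper name is hypothetical, and the SQL shown in the comments assumes a PostgreSQL backend:

```python
import hashlib

def cursor_name(session_id: str, query: str) -> str:
    # Derive a stable, valid SQL identifier from the session id and the
    # query text, so the same (session, query) pair always maps to the
    # same cursor name across requests.
    digest = hashlib.sha1((session_id + "\x00" + query).encode()).hexdigest()
    return "cur_" + digest[:16]

# On each request the application would then (against PostgreSQL):
#   SELECT 1 FROM pg_cursors WHERE name = %s;        -- does it exist?
#   DECLARE cur_... CURSOR WITH HOLD FOR <query>;    -- create if missing
#   FETCH 10 FROM cur_...;                           -- page through it
name = cursor_name("sess-42", "SELECT a, b FROM foo ORDER BY a")
```

Because the name is derived deterministically, no server-side session table is needed just to remember which cursor belongs to which user and query.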

It sounds like they wrote their own middleware to handle the problem, 
which is basically what you suggested (a "clever pooling web 
application") after saying "wrong".


I read about "building index data outside postgres", which is still
the wrong approach imho.

This discussion is a bit theoretical until we see the actual problem
and the proposed solution here.

Regards
Tino



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tino Wildenhain

Craig A. James wrote:

Tino Wildenhain wrote:
You guys can correct me if I'm wrong, but the key feature that's 
missing from Postgres's flexible indexing is the ability to maintain 
state across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)


That's my whole point: If relational databases had a simple mechanism 
for storing their internal state in an external application, the need 
for cursors, connection pools, and all those other tricks would be 
eliminated.


Well, the cursor is exactly the simple handle to the internal
state of the relational db you are looking for.
Do you really think transferring the whole query tree, open index
and data files to the client over the network would really improve
the situation?

As I said earlier, relational technology was invented in an earlier era, 
and hasn't caught up with the reality of modern web apps.


There is nothing modern about today's web apps.


If it's a naive web application, you just store your session
in tables, where you can easily maintain the scroll state
as well.


One thing I've learned in 25 years of software development is that 
people who use my software have problems I never imagined.  I've been 
the one who was naive when I said similar things about my customers, and 
was later embarrassed to learn that their problems were more complex 
than I ever imagined.


Sure, it really depends on the application what the best solution
would look like, but I'm quite certain that counterfeiting internal
machinery of the underlying relational database in the application
creates more problems than it solves. If you can't handle SQL,
don't use SQL; you can build web applications w/o any relational
database if you want.

Regards
Tino Wildenhain



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tino Wildenhain

Michael Stone wrote:

On Thu, Mar 22, 2007 at 02:24:39PM -0400, Merlin Moncure wrote:

Tino was saying that rather than build a complete indexing storage
management solution that lives outside the database, it is better to
do intelligent session management so that you get the simplicity of a
two-tier client/server system but the scalability of a web app.


No, what he was saying was "there's this thing called a cursor". I 
thought there was enough information in the original message to indicate 
that the author knew about cursors. There are certainly pros and cons 
and nuances to different approaches, but Tino's message didn't touch on 
anything that specific.


Sure, message threads sometimes lose history, so I wasn't 100% sure
what the framework really is - although I assumed it could be a web
solution. With stand-alone applications you usually have a limited
number of users connecting, and they stay connected during the session,
so you can easily use cursors there.

And even if you do use some kind of "intelligent session management", 
how many simultaneous cursors can postgres sanely keep track of? 
Thousands? Millions? Tens of Millions? I suspect there's a scalability 
limit in there somewhere. Luckily I don't spend much time in the web 
application space, so I don't need to know.  :)


Depending on the application, you can even simulate the above situation
with a web framework, if you manage sessions in the web framework
with persistent connections for a limited number of users working at
the same time (certainly not feasible for a public web shop, but fine
for in-house data management systems). In this case, cursors
would be perfect too.

In any other case I fail to see the advantage of storing "index
data" outside the database, with all the round-tripping involved.

If the query is complex and rerunning it for every batch is
expensive, fetching the whole result to the application in
case users really traverse the complete batch (how often is
that really done? I mean, who browses to the end of a huge
result set?) is costly as well, without any real benefit.

It would be much easier and cleaner, imho, in this case
to fetch the data into a scratch table linked to the
session and batch.

If it's fast, or you can prepare a batch helper table
with an index, you can just select the batch-equivalent
portion of the result.

You don't need extensive session management in the
web application to scroll through result sets in this
way. This can all be encoded in forms or links.
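The scratch-table approach described above can be illustrated like this (a minimal sketch using SQLite for portability; the table and column names are hypothetical, and a real deployment would use a Postgres helper table keyed by session id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO foo VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1000)])

# Run the expensive query once and materialize it, numbered, into a
# scratch table linked to the session.
conn.execute("""
    CREATE TABLE batch_scratch AS
    SELECT 'sess-42' AS session_id,
           ROW_NUMBER() OVER (ORDER BY a) AS pos,
           a, b
    FROM foo WHERE a % 3 = 0
""")

def page(session_id, offset, limit=10):
    # Later page requests hit only the small numbered scratch table,
    # not the original expensive query.
    cur = conn.execute(
        "SELECT a, b FROM batch_scratch "
        "WHERE session_id = ? AND pos > ? AND pos <= ?",
        (session_id, offset, offset + limit))
    return cur.fetchall()

first = page("sess-42", 0)
later = page("sess-42", 100)
```

The only state the form or link has to carry is the session id and the offset, which is exactly the "encoded in forms or links" point.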

Regards
Tino





Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tom Lane
"Craig A. James" <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> You mean
>> http://archives.postgresql.org/pgsql-performance/2006-10/msg00283.php
>> ?  I don't see anything there that bears on Steve's suggestion.

> Mea culpa, it's October 8, not October 18:
>http://archives.postgresql.org/pgsql-performance/2006-10/msg00143.php

I still don't see the relevance to Steve's suggestion.

regards, tom lane



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Tom Lane wrote:

"Craig A. James" <[EMAIL PROTECTED]> writes:

Steve Atkins wrote:

As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.



We've been over this before in this forum: It doesn't work as advertised.
Look for postings by me regarding the fact that there is no way to tell
the optimizer the cost of executing a function.  There's one, for example,
on Oct 18, 2006.


You mean
http://archives.postgresql.org/pgsql-performance/2006-10/msg00283.php
?  I don't see anything there that bears on Steve's suggestion.
(The complaint is obsolete as of CVS HEAD anyway.)


Mea culpa, it's October 8, not October 18:

  http://archives.postgresql.org/pgsql-performance/2006-10/msg00143.php

The relevant part is this:

"My example, discussed previously in this forum, is a classic.  I have a VERY 
expensive function (it's in the class of NP-complete problems, so there is no faster 
way to do it).  There is no circumstance when my function should be used as a 
filter, and no circumstance when it should be done before a join.  But PG has no way 
of knowing the cost of a function, and so the optimizer assigns the same cost to 
every function.  Big disaster.

"The result?  I can't use my function in any WHERE clause that involves any 
other conditions or joins.  Only by itself.  PG will occasionally decide to use my 
function as a filter instead of doing the join or the other WHERE conditions first, 
and I'm dead.

"The interesting thing is that PG works pretty well for me on big tables -- it does 
the join first, then applies my expensive functions.  But with a SMALL (like 50K rows) 
table, it applies my function first, then does the join.  A search that completes in 1 
second on a 5,000,000 row database can take a minute or more on a 50,000 row 
database."

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Stone

On Thu, Mar 22, 2007 at 02:24:39PM -0400, Merlin Moncure wrote:

Tino was saying that rather than build a complete indexing storage
management solution that lives outside the database, it is better to
do intelligent session management so that you get the simplicity of a
two-tier client/server system but the scalability of a web app.


No, what he was saying was "there's this thing called a cursor". I 
thought there was enough information in the original message to indicate 
that the author knew about cursors. There are certainly pros and cons 
and nuances to different approaches, but Tino's message didn't touch on 
anything that specific. 

And even if you do use some kind of "intelligent session management", 
how many simultaneous cursors can postgres sanely keep track of? 
Thousands? Millions? Tens of Millions? I suspect there's a scalability 
limit in there somewhere. Luckily I don't spend much time in the web 
application space, so I don't need to know.  :)


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tom Lane
"Craig A. James" <[EMAIL PROTECTED]> writes:
> Steve Atkins wrote:
>> As long as you're ordering by some row in the table then you can do that in
>> straight SQL.
>> 
>> select a, b, ts from foo where (stuff) and foo > X order by foo limit 10
>> 
>> Then, record the last value of foo you read, and plug it in as X the next
>> time around.

> We've been over this before in this forum: It doesn't work as advertised.  
> Look for postings by me regarding the fact that there is no way to tell the 
> optimizer the cost of executing a function.  There's one, for example, on Oct 
> 18, 2006.

You mean
http://archives.postgresql.org/pgsql-performance/2006-10/msg00283.php
?  I don't see anything there that bears on Steve's suggestion.
(The complaint is obsolete as of CVS HEAD anyway.)

regards, tom lane



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Steve Atkins


On Mar 22, 2007, at 11:26 AM, Guido Neitzer wrote:


On 22.03.2007, at 11:53, Steve Atkins wrote:

As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.


This only works if you have unique values in foo. You might
have "batch breaks" inside a list of rows with equal values for foo.


If I don't have unique values in foo, I certainly have unique values  
in (foo, pk).
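Steve's point about (foo, pk) can be sketched concretely (SQLite here for portability; the table and column names are hypothetical). Paging on the composite key avoids batch breaks even when foo has duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, foo INTEGER)")
# foo has many duplicates: 0,0,0,0,1,1,1,1,...
conn.executemany("INSERT INTO t (pk, foo) VALUES (?, ?)",
                 [(i, i // 4) for i in range(100)])

def next_page(last_foo, last_pk, limit=10):
    # Row-value comparison resumes exactly after (last_foo, last_pk),
    # so rows sharing the same foo are never skipped or repeated.
    cur = conn.execute(
        "SELECT foo, pk FROM t WHERE (foo, pk) > (?, ?) "
        "ORDER BY foo, pk LIMIT ?",
        (last_foo, last_pk, limit))
    return cur.fetchall()

page1 = next_page(-1, -1)        # first page
page2 = next_page(*page1[-1])    # resume from the last row seen
```

A page boundary can fall in the middle of a run of equal foo values and the next page still picks up exactly where the last one ended.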




But: a web application that needs state and doesn't maintain it by
itself (or inside the dev toolkit) is imho broken by design. How
should the database store a "state" for a web app? It's only
possible on the web app's side, because the app is either stateless --
and so are its queries to the database; they have to be re-evaluated
for every request, as requests might come from totally different
sources (users, ...) -- or it is stateful and has to maintain the
state itself, because only the app developer knows what information
is needed for the "current state".


This is why all web application toolkits have a "session" concept.


Yes. HTTP is not very stateful. Web applications are stateful. There  
are some really obvious approaches to maintaining state cleanly that  
work well with databases and let you do some quite complex stuff  
(tying a persistent database connection to a single user, for  
instance). But they don't scale at all well.


What Craig was suggesting is, basically, to assign a persistent
database connection to each user. But rather than maintaining that
connection as a running process, to serialise all the state out of
the database connection and store it in the webapp; then, when the
next action from that user comes in, take a database connection and
stuff all that state into it again.


It's a lovely idea, but strikes me as completely infeasible in the  
general case. There's just too much state there. Doing it in the  
specific case is certainly possible, but rapidly devolves to the  
standard approach of "On the first page of results, run the query and  
record the first 5000 results. Store those in a scratch table,  
indexed by session-id, or in external storage. On displaying later  
pages of results to the same user, pull directly from the already  
calculated results."





I think the problem is more that most web developers aren't very good
at using the database, and tend to fall back on simplistic, wrong
approaches to displaying the data. There's a lot of monkey-see,
monkey-do in web UI design too, which doesn't help.


Sure. That is the other problem ... ;-) But, and I think this is
much more important: most toolkits today free you from using the
database directly and from writing lots and lots of lines of SQL code
that instantly breaks when you switch the storage backend. It's
just a matter of which side you look at it from.


The real problem is the user-interface is designed around what is  
easy to implement in elderly cgi scripts, rather than what's  
appropriate to the data being displayed or useful to the user.  
Displaying tables of results, ten at a time, is just one of the more  
egregious examples of that.


Cheers,
  Steve




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Guido Neitzer

On 22.03.2007, at 11:53, Steve Atkins wrote:

As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.


This only works if you have unique values in foo. You might have
"batch breaks" inside a list of rows with equal values for foo.


But: a web application that needs state and doesn't maintain it by
itself (or inside the dev toolkit) is imho broken by design. How
should the database store a "state" for a web app? It's only
possible on the web app's side, because the app is either stateless --
and so are its queries to the database; they have to be re-evaluated
for every request, as requests might come from totally different
sources (users, ...) -- or it is stateful and has to maintain the
state itself, because only the app developer knows what information
is needed for the "current state".


This is why all web application toolkits have a "session" concept.


I think the problem is more that most web developers aren't very good
at using the database, and tend to fall back on simplistic, wrong
approaches to displaying the data. There's a lot of monkey-see,
monkey-do in web UI design too, which doesn't help.


Sure. That is the other problem ... ;-) But, and I think this is much
more important: most toolkits today free you from using the database
directly and from writing lots and lots of lines of SQL code that
instantly breaks when you switch the storage backend. It's just a
matter of which side you look at it from.


cug



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Merlin Moncure

On 3/22/07, Michael Stone <[EMAIL PROTECTED]> wrote:

On Thu, Mar 22, 2007 at 06:27:32PM +0100, Tino Wildenhain wrote:
>Craig A. James wrote:
>>You guys can correct me if I'm wrong, but the key feature that's missing
>>from Postgres's flexible indexing is the ability to maintain state
>>across queries.  Something like this:
>>
>>  select a, b, my_index_state() from foo where ...
>>offset 100 limit 10 using my_index(prev_my_index_state);
>>
>
>Yes, you are wrong :-) The technique is called a "CURSOR",
>if you maintain a persistent connection per session
>(e.g. a stand-alone application or a clever pooling web application)

Did you read the email before correcting it? From the part you trimmed
out:

>The problem is that relational databases were invented before the web
>and its stateless applications.  In the "good old days", you could
>connect to a database and work for hours, and in that environment
>cursors and such work well -- the RDBMS maintains the internal state of
>the indexing system.  But in a web environment, state information is
>very difficult to maintain.  There are all sorts of systems that try
>(Enterprise Java Beans, for example), but they're very complex.

It sounds like they wrote their own middleware to handle the problem,
which is basically what you suggested (a "clever pooling web
application") after saying "wrong".


Tino was saying that rather than build a complete indexing storage
management solution that lives outside the database, it is better to
do intelligent session management so that you get the simplicity of a
two-tier client/server system but the scalability of a web app.

Web apps are not necessarily stateless; you just have to be a little
clever about how database connections are opened and closed.  Then you
get all the database features that come along with a persistent
connection (advisory locks, cursors, prepared statements, etc.) without
building all kinds of data management into the middleware.
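Merlin's "a little clever about how database connections are opened and closed" can be sketched as a session-pinned pool (a toy sketch; the class name is hypothetical, and real pooling would also need locking, eviction, and reconnect handling):

```python
import sqlite3

class SessionPinnedPool:
    """Map each web session to its own persistent connection, so
    per-connection state (cursors, prepared statements, advisory
    locks) survives across requests from the same user."""
    def __init__(self, factory):
        self._factory = factory
        self._by_session = {}

    def connection_for(self, session_id):
        # Reuse the session's pinned connection, creating it lazily.
        if session_id not in self._by_session:
            self._by_session[session_id] = self._factory()
        return self._by_session[session_id]

pool = SessionPinnedPool(lambda: sqlite3.connect(":memory:"))
c1 = pool.connection_for("sess-a")
c1.execute("CREATE TABLE scroll_state (pos INTEGER)")  # per-session state
c2 = pool.connection_for("sess-a")   # same connection, state preserved
c3 = pool.connection_for("sess-b")   # different session, fresh connection
```

The point is only the pinning: as long as requests from a session reach the same live connection, everything the database keeps per connection comes along for free.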

merlin



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Steve Atkins wrote:

As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.


We've been over this before in this forum: It doesn't work as advertised.  Look 
for postings by me regarding the fact that there is no way to tell the 
optimizer the cost of executing a function.  There's one, for example, on Oct 
18, 2006.


I think the problem is more that most web developers aren't very good
at using the database, and tend to fall back on simplistic, wrong
approaches to displaying the data. There's a lot of monkey-see,
monkey-do in web UI design too, which doesn't help.


Thanks, I'm sure your thoughtful comments will help me solve my problem.  
Somehow. ;-)

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Steve Atkins


On Mar 22, 2007, at 10:21 AM, Craig A. James wrote:


Tino Wildenhain wrote:

Craig A. James wrote:
...
In our case (for a variety of reasons, but this one is critical),  
we actually can't use Postgres indexing at all -- we wrote an  
entirely separate indexing system for our data...


...There is no need to store or
maintain this information along with postgres when you can store
and maintain it directly in postgres as well.


Whether we store our data inside or outside Postgres misses the  
point (in fact, most of our data is stored IN Postgres).  It's the  
code that actually performs the index operation that has to be  
external to Postgres.



On top of that, postgres has a very flexible and extensible index
system.


You guys can correct me if I'm wrong, but the key feature that's  
missing from Postgres's flexible indexing is the ability to  
maintain state across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);

The my_index_state() function would issue something like a  
"cookie", an opaque text or binary object that would record  
information about how it got from row 1 through row 99.  When you  
issue the query above, it could start looking for row 100 WITHOUT  
reexamining rows 1-99.


This could be tricky in an OLTP environment, where the "cookie"
could be invalidated by changes to the database.  But in warehouse
read-mostly or read-only environments, it could yield vastly
improved performance for database web applications.


If I'm not mistaken, neither Postgres nor Oracle, MySQL, or any
other RDBMS can do this.  I would love to be corrected.


As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.

This has the advantage over a simple offset approach of actually
displaying all the data as a user pages through it too. (Consider
the case where the user is viewing offsets 91-100, and you delete
the record at offset 15. The user goes to the next page and will
miss the record that used to be at offset 101 and is now at offset
100).
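The deletion scenario Steve describes can be sketched end-to-end (a minimal sketch using SQLite; the table name is hypothetical). Keyset paging keeps its place after a delete, where OFFSET-based paging silently skips a row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo_t (foo INTEGER PRIMARY KEY, a TEXT)")
conn.executemany("INSERT INTO foo_t VALUES (?, ?)",
                 [(i, f"val{i}") for i in range(1, 201)])

def keyset_page(last_foo, limit=10):
    # Resume strictly after the last foo value seen, as in Steve's SQL.
    return conn.execute(
        "SELECT foo, a FROM foo_t WHERE foo > ? ORDER BY foo LIMIT ?",
        (last_foo, limit)).fetchall()

page = keyset_page(90)                            # user views foo 91..100
conn.execute("DELETE FROM foo_t WHERE foo = 15")  # row deleted meanwhile

next_keyset = keyset_page(page[-1][0])            # foo 101..110, no gap
next_offset = conn.execute(
    "SELECT foo, a FROM foo_t ORDER BY foo LIMIT 10 OFFSET 100").fetchall()
```

After the delete, every row past position 15 shifts down one slot, so the OFFSET query starts at foo=102 and the user never sees foo=101; the keyset query is unaffected.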

The problem is that relational databases were invented before the  
web and its stateless applications.  In the "good old days", you  
could connect to a database and work for hours, and in that  
environment cursors and such work well -- the RDBMS maintains the  
internal state of the indexing system.  But in a web environment,  
state information is very difficult to maintain.  There are all  
sorts of systems that try (Enterprise Java Beans, for example), but  
they're very complex.


I think the problem is more that most web developers aren't very good
at using the database, and tend to fall back on simplistic, wrong
approaches to displaying the data. There's a lot of monkey-see,
monkey-do in web UI design too, which doesn't help.

Cheers,
  Steve




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Tino Wildenhain wrote:
You guys can correct me if I'm wrong, but the key feature that's 
missing from Postgres's flexible indexing is the ability to maintain 
state across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)


That's my whole point: If relational databases had a simple mechanism for 
storing their internal state in an external application, the need for cursors, 
connection pools, and all those other tricks would be eliminated.

As I said earlier, relational technology was invented in an earlier era, and 
hasn't caught up with the reality of modern web apps.


If it's a naive web application, you just store your session
in tables, where you can easily maintain the scroll state
as well.


One thing I've learned in 25 years of software development is that people who 
use my software have problems I never imagined.  I've been the one who was 
naive when I said similar things about my customers, and was later embarrassed 
to learn that their problems were more complex than I ever imagined.

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Stone

On Thu, Mar 22, 2007 at 06:27:32PM +0100, Tino Wildenhain wrote:

Craig A. James wrote:
You guys can correct me if I'm wrong, but the key feature that's missing 
from Postgres's flexible indexing is the ability to maintain state 
across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)


Did you read the email before correcting it? From the part you trimmed 
out:


The problem is that relational databases were invented before the web 
and its stateless applications.  In the "good old days", you could 
connect to a database and work for hours, and in that environment 
cursors and such work well -- the RDBMS maintains the internal state of 
the indexing system.  But in a web environment, state information is 
very difficult to maintain.  There are all sorts of systems that try 
(Enterprise Java Beans, for example), but they're very complex.


It sounds like they wrote their own middleware to handle the problem, 
which is basically what you suggested (a "clever pooling web 
application") after saying "wrong".


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tino Wildenhain

Craig A. James wrote:

Tino Wildenhain wrote:

Craig A. James wrote:
...
In our case (for a variety of reasons, but this one is critical), we 
actually can't use Postgres indexing at all -- we wrote an entirely 
separate indexing system for our data...


...There is no need to store or
maintain this information along with postgres when you can store
and maintain it directly in postgres as well.


Whether we store our data inside or outside Postgres misses the point 
(in fact, most of our data is stored IN Postgres).  It's the code that 
actually performs the index operation that has to be external to Postgres.



On top of that, postgres has a very flexible and extensible index
system.


You guys can correct me if I'm wrong, but the key feature that's missing 
from Postgres's flexible indexing is the ability to maintain state 
across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)

If it's a naive web application, you just store your session
in tables, where you can easily maintain the scroll state
as well.

Regards
Tino



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Tino Wildenhain wrote:

Craig A. James wrote:
...
In our case (for a variety of reasons, but this one is critical), we 
actually can't use Postgres indexing at all -- we wrote an entirely 
separate indexing system for our data...


...There is no need to store or
maintain this information along with postgres when you can store
and maintain it directly in postgres as well.


Whether we store our data inside or outside Postgres misses the point (in fact, 
most of our data is stored IN Postgres).  It's the code that actually performs 
the index operation that has to be external to Postgres.


On top of that, postgres has a very flexible and extensible index
system.


You guys can correct me if I'm wrong, but the key feature that's missing from 
Postgres's flexible indexing is the ability to maintain state across queries.  
Something like this:

 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);

The my_index_state() function would issue something like a "cookie", an opaque 
text or binary object that would record information about how it got from row 1 through 
row 99.  When you issue the query above, it could start looking for row 100 WITHOUT 
reexamining rows 1-99.

This could be tricky in an OLTP environment, where the "cookie" could be
invalidated by changes to the database.  But in warehouse read-mostly or read-only
environments, it could yield vastly improved performance for database web applications.

If I'm not mistaken, neither Postgres nor Oracle, MySQL, or any other RDBMS can
do this.  I would love to be corrected.

The problem is that relational databases were invented before the web and its stateless 
applications.  In the "good old days", you could connect to a database and work 
for hours, and in that environment cursors and such work well -- the RDBMS maintains the 
internal state of the indexing system.  But in a web environment, state information is 
very difficult to maintain.  There are all sorts of systems that try (Enterprise Java 
Beans, for example), but they're very complex.

Craig




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Brian Hurt wrote:
One of our biggest single problems is this very thing.  It's not a 
Postgres problem specifically, but more embedded in the idea of a 
relational database: There are no "job status" or "rough estimate of 
results" or "give me part of the answer" features that are critical to 
many real applications.


For the "give me part of the answer", I'm wondering if cursors wouldn't 
work (and if not, why not)?


There is no mechanism in Postgres (or any RDB that I know of) to say, "Give me rows 
1000 through 1010", that doesn't also execute the query on rows 1-1000.  In other 
words, the RDBMS does the work for 1010 rows, when only 10 are needed -- 100 times more 
work than is necessary.

Limit/Offset will return the correct 10 rows, but at the cost of doing the 
previous 1000 rows and discarding them.

Web applications are stateless.  To use a cursor, you'd have to keep it around for hours 
or days, and create complex "server affinity" code to direct a user back to the 
same server of your server farm (where that cursor is being held), on the chance that the 
user will come back and ask for rows 1000 through 1010.  A cursor isn't up to that 
task.

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Mario Weilguni
Am Donnerstag, 22. März 2007 16:17 schrieb Andreas Kostyrka:
> * Mario Weilguni <[EMAIL PROTECTED]> [070322 15:59]:
> > Am Donnerstag, 22. März 2007 15:33 schrieb Jonah H. Harris:
> > > On 3/22/07, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> > > > As others suggest select count(*) from table is very special case
> > > > which non-mvcc databases can optimize for.
> > >
> > > Well, other MVCC database still do it faster than we do.  However, I
> > > think we'll be able to use the dead space map for speeding this up a
> > > bit wouldn't we?
> >
> > Which MVCC DB do you mean? Just curious...
>
> Well, mysql claims InnoDB to be mvcc ;)

Ok, but last time I tried, count(*) with InnoDB tables took roughly(*) the 
same time - because InnoDB has the same problem as Postgres 
and has to do a seqscan too (I think it's mentioned somewhere in their docs).

(*) in fact, postgres was faster, but the values were comparable, 40 seconds 
vs. 48 seconds 

Maybe InnoDB has made some progress here; I tested it with MySQL 5.0.18.




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Brian Hurt

Craig A. James wrote:



One of our biggest single problems is this very thing.  It's not a 
Postgres problem specifically, but more embedded in the idea of a 
relational database: There are no "job status" or "rough estimate of 
results" or "give me part of the answer" features that are critical to 
many real applications.


For the "give me part of the answer", I'm wondering if cursors wouldn't 
work (and if not, why not)?


Brian




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Carlos Moreno



count(*).  I was just discussing general performance issues on the
phone line and when my colleague asked me about the size of the
database he just wonderd why this takes so long for a job his
MS-SQL server is much faster.  [...].
   



Simple. MSSQL is optimized for this case, and uses "older"
datastructures. PG uses a MVCC storage, 



Which version of MSSQL?  Wikipedia says that SQL Server 2005 uses the
MVCC model.

Carlos
--




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tino Wildenhain

Craig A. James schrieb:
...
In our case (for a variety of reasons, but this one is critical), we 
actually can't use Postgres indexing at all -- we wrote an entirely 
separate indexing system for our data, one that has the following 
properties:


 1. It can give out "pages" of information (i.e. "rows 50-60") without
rescanning the skipped pages the way "limit/offset" would.
 2. It can give accurate estimates of the total rows that will be returned.
 3. It can accurately estimate the time it will take.



That's certainly not entirely correct. There is no need to store or
maintain this information outside Postgres when you can store
and maintain it directly in Postgres as well. Since you have some
outside application, I think I can safely assume you are doing
fewer updates than reads, so keeping it in the database would actually pay off.

So why not store this information in separate "index" and "statistic"
tables? You would have just to join with your real data for retrival.

On top of that, Postgres has a very flexible and extensible index
system. This would mean you save on database roundtrips and
duplicated information storage (and the sync problems you certainly
get from it).
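A hedged sketch of what Tino suggests, with invented names: keep the externally computed ranking and statistics in side tables inside Postgres and join them to the real data at retrieval time.

```sql
-- Hypothetical side tables holding the externally computed index data.
CREATE TABLE doc_index (
    doc_id    integer PRIMARY KEY REFERENCES documents(id),
    rank      integer NOT NULL        -- precomputed ordering
);
CREATE INDEX doc_index_rank ON doc_index (rank);

CREATE TABLE doc_stats (
    query_tag  text PRIMARY KEY,
    est_total  integer NOT NULL       -- precomputed result-count estimate
);

-- "rows 50-60" without rescanning the skipped rows: seek on the
-- precomputed rank instead of using LIMIT/OFFSET.
SELECT d.*
  FROM doc_index i
  JOIN documents d ON d.id = i.doc_id
 WHERE i.rank BETWEEN 50 AND 60
 ORDER BY i.rank;
```

The update path for these tables is the hard part, of course; this pays off only when reads far outnumber writes, as Tino assumes.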

Regards
Tino




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Michael Stone wrote:

On Thu, Mar 22, 2007 at 01:30:35PM +0200, [EMAIL PROTECTED] wrote:

approximated count?

why? who would need it? where you can use it?


Do a google query. Look at the top of the page, where it says "results N 
to M of about O". For user interfaces (which is where a lot of this 
count(*) stuff comes from) you quite likely don't care about the exact 
count...


Right on, Michael.

One of our biggest single problems is this very thing.  It's not a Postgres problem specifically, but more 
embedded in the idea of a relational database: There are no "job status" or "rough estimate of 
results" or "give me part of the answer" features that are critical to many real applications.

In our case (for a variety of reasons, but this one is critical), we actually 
can't use Postgres indexing at all -- we wrote an entirely separate indexing 
system for our data, one that has the following properties:

 1. It can give out "pages" of information (i.e. "rows 50-60") without
rescanning the skipped pages the way "limit/offset" would.
 2. It can give accurate estimates of the total rows that will be returned.
 3. It can accurately estimate the time it will take.

For our primary business-critical data, Postgres is merely a storage system, not a search 
system, because we have to do the "heavy lifting" in our own code.  (To be 
fair, there is no relational database that can handle our data.)

Many or most web-based search engines face these exact problems.

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Andreas Kostyrka
* Mario Weilguni <[EMAIL PROTECTED]> [070322 15:59]:
> Am Donnerstag, 22. März 2007 15:33 schrieb Jonah H. Harris:
> > On 3/22/07, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> > > As others suggest select count(*) from table is very special case
> > > which non-mvcc databases can optimize for.
> >
> > Well, other MVCC database still do it faster than we do.  However, I
> > think we'll be able to use the dead space map for speeding this up a
> > bit wouldn't we?
> 
> Which MVCC DB do you mean? Just curious...
Well, mysql claims InnoDB to be mvcc ;)

Andreas



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Luke Lonergan
Andreas,

On 3/22/07 4:48 AM, "Andreas Tille" <[EMAIL PROTECTED]> wrote:

> Well, to be honest I'm not really interested in the performance of
> count(*).  I was just discussing general performance issues on the
> phone line and when my colleague asked me about the size of the
> database he just wonderd why this takes so long for a job his
> MS-SQL server is much faster.  So in principle I was just asking
> a first question that is easy to ask.  Perhaps I come up with
> more difficult optimisation questions.

This may be the clue you needed - in Postgres SELECT COUNT(*) is an
approximate way to measure the speed of your disk setup (up to about
1,200MB/s).  Given that you are having performance problems, it may be that
your disk layout is either:
- slow by design
- malfunctioning

If this is the case, then any of your queries that require a full table scan
will be affected.

You should check your sequential disk performance using the following:

time bash -c "dd if=/dev/zero of=/your_file_system/bigfile bs=8k \
  count=$((your_memory_size_in_KB * 2 / 8)) && sync"
time dd if=/your_file_system/bigfile of=/dev/null bs=8k

Report those times here and we can help you with it.

- Luke





Re: [PERFORM] Performance of count(*)

2007-03-22 Thread mark
On Thu, Mar 22, 2007 at 10:18:10AM -0400, Michael Stone wrote:
> IIRC, that's basically what you get with the mysql count anyway, since 
> there are corner cases for results in a transaction. Avoiding those 
> cases is why the postgres count takes so long; sometimes that's what's 
> desired and sometimes it is not.

Adding to this point:

In any production system, the count presented to the user is usually
wrong very shortly after it is displayed anyways. Transactions in the
background or from other users are adding or removing items, perhaps
even before the count reaches the user's display.

The idea of transaction-safety for counts doesn't apply in this case.
Both the transaction and the number are complete before the value is
displayed.

In my own systems, I rarely use count(*) for anything except user-visible
results. For the PostgreSQL system I use, I keep a table of
counts, and lock the row for update when adding or removing items.
This turns out to be best in this system anyways, as I need my new
rows to be ordered, and locking the 'count' row lets me assign a
new sequence number for the row. (I don't want to use SEQUENCE objects:
the rows are [key, sequence, data], with thousands or more keys,
so each key would need its own sequence.)
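A sketch of the pattern described above, with invented table and column names: one counts row per key, locked for the duration of the insert so the count and the per-key sequence number stay consistent.

```sql
BEGIN;
-- lock this key's counter row; concurrent writers on the same key wait
SELECT n FROM item_counts WHERE key = 42 FOR UPDATE;
-- use n + 1 as the new row's per-key sequence number
INSERT INTO items (key, seq, data)
SELECT 42, n + 1, 'payload' FROM item_counts WHERE key = 42;
UPDATE item_counts SET n = n + 1 WHERE key = 42;
COMMIT;
```

Reading the count back is then a single-row lookup instead of a count(*), at the price of serializing writers per key.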

Cheers,
mark





Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Stone

On Thu, Mar 22, 2007 at 09:39:18AM -0400, Merlin Moncure wrote:

You can get the approximate count by selecting reltuples from
pg_class.  It is valid as of last analyze.


Of course, that only works if you're not using any WHERE clause. 
Here's a (somewhat ugly) example of getting an approximate count based 
off the statistics info, which will work for more complicated queries:

http://archives.postgresql.org/pgsql-sql/2005-08/msg00046.php
The ugliness is that you have to provide the whole query as a 
parameter to the function, instead of using it as a drop-in replacement 
for count. I assume that the TODO item is to provide the latter, but for 
now this method can be useful for UI type stuff where you just want to 
know whether there's "a little" or "a lot".


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Fuhr
On Thu, Mar 22, 2007 at 01:29:46PM +0100, Andreas Kostyrka wrote:
> * Andreas Tille <[EMAIL PROTECTED]> [070322 13:24]:
> > Well, to be honest I'm not really interested in the performance of
> > count(*).  I was just discussing general performance issues on the
> > phone line and when my colleague asked me about the size of the
> > database he just wonderd why this takes so long for a job his
> > MS-SQL server is much faster.  So in principle I was just asking
> > a first question that is easy to ask.  Perhaps I come up with
> > more difficult optimisation questions.
> 
> Simple. MSSQL is optimized for this case, and uses "older"
> datastructures. PG uses a MVCC storage, which is not optimized for
> this usecase. It's quite fast for different kinds of queries.

Ask about performing concurrent selects, inserts, updates, and
deletes in SQL Server and about the implications on ACID of locking
hints such as NOLOCK.  Then consider how MVCC handles concurrency
without blocking or the need for dirty reads.

-- 
Michael Fuhr



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Mario Weilguni
Am Donnerstag, 22. März 2007 15:33 schrieb Jonah H. Harris:
> On 3/22/07, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> > As others suggest select count(*) from table is very special case
> > which non-mvcc databases can optimize for.
>
> Well, other MVCC database still do it faster than we do.  However, I
> think we'll be able to use the dead space map for speeding this up a
> bit wouldn't we?

Which MVCC DB do you mean? Just curious...



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Jonah H. Harris

On 3/22/07, Merlin Moncure <[EMAIL PROTECTED]> wrote:

As others suggest select count(*) from table is very special case
which non-mvcc databases can optimize for.


Well, other MVCC database still do it faster than we do.  However, I
think we'll be able to use the dead space map for speeding this up a
bit wouldn't we?

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 3rd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Stone

On Thu, Mar 22, 2007 at 01:30:35PM +0200, [EMAIL PROTECTED] wrote:

approximated count?

why? who would need it? where you can use it?


Do a google query. Look at the top of the page, where it says 
"results N to M of about O". For user interfaces (which is where a lot 
of this count(*) stuff comes from) you quite likely don't care about the 
exact count, because the user doesn't really care about the exact count. 

IIRC, that's basically what you get with the mysql count anyway, since 
there are corner cases for results in a transaction. Avoiding those 
cases is why the postgres count takes so long; sometimes that's what's 
desired and sometimes it is not.


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Merlin Moncure

On 3/22/07, Andreas Tille <[EMAIL PROTECTED]> wrote:

I just try to find out why a simple count(*) might last that long.
At first I tried explain, which rather quickly knows how many rows
to check, but the final count is two orders of magnitude slower.


You can get the approximate count by selecting reltuples from
pg_class.  It is valid as of last analyze.
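The statistics lookup Merlin describes, using the table name from this thread's example:

```sql
-- Approximate row count from the planner statistics; accurate only
-- as of the last ANALYZE (or VACUUM), and only for the whole table.
SELECT reltuples::bigint AS approx_count
  FROM pg_class
 WHERE relname = 'agiraw';
```

This returns instantly because it reads one catalog row instead of scanning the table, but it cannot reflect a WHERE clause.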

As others suggest, select count(*) from table is a very special case
which non-mvcc databases can optimize for.  There are many reasons why
this is the case and why it explains nothing about the relative
performance of the two databases.  This is probably the #1 most
frequently asked question on -performance... there is a wealth of
information in the archives.

merlin



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Mario Weilguni
Am Donnerstag, 22. März 2007 12:30 schrieb [EMAIL PROTECTED]:
> approximated count?
>
> why? who would need it? where you can use it?
>
> calculating costs and desiding how to execute query needs
> approximated count, but it's totally worthless information for any user
> IMO.

No, it is not useless. Try: 
http://www.google.com/search?hl=de&q=test&btnG=Google-Suche&meta=

Do you really think google counted each of those individual 895 million 
results? It doesn't. In fact, the estimate of google can be off by an order 
of magnitude, and still nobody complains...




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Bill Moran
In response to [EMAIL PROTECTED]:
> 
> approximated count?
> 
> why? who would need it? where you can use it?
> 
> calculating costs and desiding how to execute query needs 
> approximated count, but it's totally worthless information for any user 
> IMO.

I don't think so.

We have some AJAX stuff where users enter search criteria on a web form,
and the # of results updates in "real time" as they change their criteria.

Right now, this works fine with small tables using count(*) -- it's fast
enough not to be an issue, but we're aware that we can't use it on large
tables.

An estimate_count(*) or similar that would allow us to put an estimate of
how many results will be returned (not guaranteed accurate) would be very
nice to have in these cases.

We're dealing with complex sets of criteria.  It's very useful for the users
to know in "real time" how much their search criteria are affecting the
result pool.  Once they feel they've limited as much as they can without
reducing the pool too much, they can hit submit and get the actual result.

As I said, we do this with small data sets, but it's not terribly useful
there.  Where it will be useful is searches of large data sets, where
constantly submitting and then retrying is overly time-consuming.

Of course, this is count(*)ing the results of a complex query, possibly
with a bunch of joins and many limitations in the WHERE clause, so I'm
not sure what could be done overall to improve the response time.
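One way to get such an estimate_count today is the approach from the archives link quoted earlier in this thread: run EXPLAIN on the search query and pull the planner's row estimate out of the top plan node. A hedged sketch (the function name and exact parsing are illustrative, not an official API):

```sql
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec  record;
    est  integer;
BEGIN
    -- EXPLAIN returns one text row per plan line; the first line is the
    -- top plan node, whose rows= figure is the planner's estimate.
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        est := substring(rec."QUERY PLAN" FROM 'rows=([[:digit:]]+)');
        EXIT;
    END LOOP;
    RETURN est;
END;
$$ LANGUAGE plpgsql;

-- e.g.: SELECT count_estimate('SELECT * FROM agiraw WHERE ...');
```

The estimate is only as good as the statistics and can be off by an order of magnitude, which, as noted above, is often fine for "a little vs. a lot" UI feedback.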

> On Thu, 22 Mar 2007, Albert Cervera Areny wrote:
> 
> > As you can see, PostgreSQL needs to do a sequencial scan to count because 
> > its 
> > MVCC nature and indices don't have transaction information. It's a known 
> > drawback inherent to the way PostgreSQL works and which gives very good 
> > results in other areas. It's been talked about adding some kind of 
> > approximated count which wouldn't need a full table scan but I don't think 
> > there's anything there right now.
> > 
> > A Dijous 22 Març 2007 11:53, Andreas Tille va escriure:
> > > Hi,
> > >
> > > I just try to find out why a simple count(*) might last that long.
> > > At first I tried explain, which rather quickly knows how many rows
> > > to check, but the final count is two orders of magnitude slower.
> > >
> > > My MS_SQL server using colleague can't believe that.
> > >
> > > $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
> > >QUERY PLAN
> > > ---
> > >   Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
> > > ->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> > > (2 rows)
> > >
> > > real0m0.066s
> > > user0m0.024s
> > > sys 0m0.008s
> > >
> > > $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
> > >count
> > > -
> > >   4708941
> > > (1 row)
> > >
> > > real0m4.474s
> > > user0m0.036s
> > > sys 0m0.004s
> > >
> > >
> > > Any explanation?
> > >
> > > Kind regards
> > >
> > >   Andreas.
> > 
> > -- 
> > Albert Cervera Areny
> > Dept. Informàtica Sedifa, S.L.
> > 
> > Av. Can Bordoll, 149
> > 08202 - Sabadell (Barcelona)
> > Tel. 93 715 51 11
> > Fax. 93 715 51 12
> > 
> > 

Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Andreas Kostyrka
* Andreas Tille <[EMAIL PROTECTED]> [070322 13:24]:
> On Thu, 22 Mar 2007, Andreas Kostyrka wrote:
> 
> >Which version of PG?
> 
> Ahh, sorry, forgot that.  The issue occurs in Debian (Etch) packaged
> version 7.4.16.  I plan to switch soon to 8.1.8.
I'd recommend 8.2 if at all possible :)
> 
> >That's the reason why PG (check the newest releases, I seem to
> >remember that there has been some aggregate optimizations there),
> 
> I'll verify this once I moved to the new version.
8.1 won't help you I'd guess. ;)

> 
> >does
> >a SeqScan for select count(*) from table. btw, depending upon your
> >data, doing a select count(*) from table where user=X will use an
> >Index, but will still need to fetch the rows proper to validate them.
> 
> I have an index on three (out of 7 columns) of this table.  Is there
> any chance to optimize indexing regarding this.
Well, that depends upon your query pattern. It's an art and a science
at the same time ;)
> 
> Well, to be honest I'm not really interested in the performance of
> count(*).  I was just discussing general performance issues on the
> phone line and when my colleague asked me about the size of the
> database he just wonderd why this takes so long for a job his
> MS-SQL server is much faster.  So in principle I was just asking
> a first question that is easy to ask.  Perhaps I come up with
> more difficult optimisation questions.

Simple. MSSQL is optimized for this case, and uses "older"
datastructures. PG uses a MVCC storage, which is not optimized for
this usecase. It's quite fast for different kinds of queries.

The basic trouble here is that MVCC makes it a little harder to decide
what is valid for your transaction, plus the indexes seem to be
designed for lookup, not for data fetching. (Basically, PG can use
indexes only to locate potential data, but cannot return data directly
out of an index)

Andreas



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Andreas Tille

On Thu, 22 Mar 2007, Andreas Kostyrka wrote:


Which version of PG?


Ahh, sorry, forgot that.  The issue occurs in Debian (Etch) packaged
version 7.4.16.  I plan to switch soon to 8.1.8.


That's the reason why PG (check the newest releases, I seem to
remember that there has been some aggregate optimizations there),


I'll verify this once I moved to the new version.


does
a SeqScan for select count(*) from table. btw, depending upon your
data, doing a select count(*) from table where user=X will use an
Index, but will still need to fetch the rows proper to validate them.


I have an index on three (out of 7 columns) of this table.  Is there
any chance to optimize indexing regarding this.

Well, to be honest I'm not really interested in the performance of
count(*).  I was just discussing general performance issues on the
phone line and when my colleague asked me about the size of the
database he just wonderd why this takes so long for a job his
MS-SQL server is much faster.  So in principle I was just asking
a first question that is easy to ask.  Perhaps I come up with
more difficult optimisation questions.

Kind regards

Andreas.

--
http://fam-tille.de



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread ismo . tuononen

approximated count?

why? who would need it? where you can use it?

calculating costs and deciding how to execute a query needs an 
approximated count, but it's totally worthless information for any user 
IMO.

Ismo

On Thu, 22 Mar 2007, Albert Cervera Areny wrote:

> As you can see, PostgreSQL needs to do a sequencial scan to count because its 
> MVCC nature and indices don't have transaction information. It's a known 
> drawback inherent to the way PostgreSQL works and which gives very good 
> results in other areas. It's been talked about adding some kind of 
> approximated count which wouldn't need a full table scan but I don't think 
> there's anything there right now.
> 
> A Dijous 22 Març 2007 11:53, Andreas Tille va escriure:
> > Hi,
> >
> > I just try to find out why a simple count(*) might last that long.
> > At first I tried explain, which rather quickly knows how many rows
> > to check, but the final count is two orders of magnitude slower.
> >
> > My MS_SQL server using colleague can't believe that.
> >
> > $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
> >QUERY PLAN
> > ---
> >   Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
> > ->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> > (2 rows)
> >
> > real0m0.066s
> > user0m0.024s
> > sys 0m0.008s
> >
> > $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
> >count
> > -
> >   4708941
> > (1 row)
> >
> > real0m4.474s
> > user0m0.036s
> > sys 0m0.004s
> >
> >
> > Any explanation?
> >
> > Kind regards
> >
> >   Andreas.
> 
> -- 
> Albert Cervera Areny
> Dept. Informàtica Sedifa, S.L.
> 
> Av. Can Bordoll, 149
> 08202 - Sabadell (Barcelona)
> Tel. 93 715 51 11
> Fax. 93 715 51 12
> 
> 


Re: [PERFORM] Performance of count(*)

2007-03-22 Thread ismo . tuononen

explain is just "guessing" how many rows are in the table. Sometimes the guess is 
right, sometimes it's just a rough estimate.

sailabdb=# explain SELECT count(*) from sl_tuote;
  QUERY PLAN  
--
 Aggregate  (cost=10187.10..10187.11 rows=1 width=0)
   ->  Seq Scan on sl_tuote  (cost=0.00..9806.08 rows=152408 width=0)
(2 rows)

sailabdb=# SELECT count(*) from sl_tuote;
 count 
---
 62073
(1 row)


so in that case explain estimates that sl_tuote table have 152408 rows, but 
there are only 62073 rows.

after analyze estimates are better:

sailabdb=# vacuum analyze sl_tuote;
VACUUM
sailabdb=# explain SELECT count(*) from sl_tuote;
 QUERY PLAN  
-
 Aggregate  (cost=9057.91..9057.92 rows=1 width=0)
   ->  Seq Scan on sl_tuote  (cost=0.00..8902.73 rows=62073 width=0)
(2 rows)

you can never trust that estimate; you must always count it!

Ismo

On Thu, 22 Mar 2007, Andreas Tille wrote:

> Hi,
> 
> I just try to find out why a simple count(*) might last that long.
> At first I tried explain, which rather quickly knows how many rows
> to check, but the final count is two orders of magnitude slower.
> 
> My MS_SQL server using colleague can't believe that.
> 
> $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
>   QUERY PLAN
> ---
>  Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
>->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> (2 rows)
> 
> real0m0.066s
> user0m0.024s
> sys 0m0.008s
> 
> $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
>   count
> -
>  4708941
> (1 row)
> 
> real0m4.474s
> user0m0.036s
> sys 0m0.004s
> 
> 
> Any explanation?
> 
> Kind regards
> 
>  Andreas.
> 
> -- 
> http://fam-tille.de
> 


Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Albert Cervera Areny
As you can see, PostgreSQL needs to do a sequential scan to count, because of 
its MVCC nature: indices don't have transaction information. It's a known 
drawback inherent to the way PostgreSQL works, one which gives very good 
results in other areas. There has been talk about adding some kind of 
approximated count which wouldn't need a full table scan, but I don't think 
there's anything there right now.

A Dijous 22 Març 2007 11:53, Andreas Tille va escriure:
> Hi,
>
> I just try to find out why a simple count(*) might last that long.
> At first I tried explain, which rather quickly knows how many rows
> to check, but the final count is two orders of magnitude slower.
>
> My MS_SQL server using colleague can't believe that.
>
> $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
>QUERY PLAN
> ---
>   Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
> ->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> (2 rows)
>
> real0m0.066s
> user0m0.024s
> sys 0m0.008s
>
> $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
>count
> -
>   4708941
> (1 row)
>
> real0m4.474s
> user0m0.036s
> sys 0m0.004s
>
>
> Any explanation?
>
> Kind regards
>
>   Andreas.

-- 
Albert Cervera Areny
Dept. Informàtica Sedifa, S.L.

Av. Can Bordoll, 149
08202 - Sabadell (Barcelona)
Tel. 93 715 51 11
Fax. 93 715 51 12



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Andreas Kostyrka
* Andreas Tille <[EMAIL PROTECTED]> [070322 12:07]:
> Hi,
> 
> I'm just trying to find out why a simple count(*) takes that long.
> At first I tried EXPLAIN, which rather quickly knows how many rows
> to check, but the actual count is two orders of magnitude slower.

Which version of PG?

The basic problem is that EXPLAIN knows quickly, because it has its
statistics.

The SELECT proper, OTOH, has to go through the whole table to make
sure which rows are visible to your transaction.

That's the reason why PG does a SeqScan for select count(*) from
table (check the newest releases; I seem to remember there have been
some aggregate optimizations there). Btw, depending upon your data,
doing a select count(*) from table where user=X will use an
index, but will still need to fetch the rows proper to validate them.
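A hypothetical illustration of that last point, using the table from earlier in the thread (the userid column and value are invented for the example; actual plans depend on your data and indexes):

```sql
-- Unqualified count: forced to scan every row sequentially.
EXPLAIN SELECT count(*) FROM agiraw;

-- Qualified count: the planner may use an index on userid to find
-- candidate rows, but each match must still be fetched from the heap
-- to check MVCC visibility for this transaction.
EXPLAIN SELECT count(*) FROM agiraw WHERE userid = 42;
```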

Andreas

> 
> My colleague, who uses MS SQL Server, can't believe that.
> 
> $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
>   QUERY PLAN 
> ---
>  Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
>->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> (2 rows)
> 
> real0m0.066s
> user0m0.024s
> sys 0m0.008s
> 
> $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
>   count
> ---------
>  4708941
> (1 row)
> 
> real0m4.474s
> user0m0.036s
> sys 0m0.004s
> 
> 
> Any explanation?
> 
> Kind regards
> 
>  Andreas.
> 
> -- 
> http://fam-tille.de
> 
> ---(end of broadcast)---
> TIP 7: You can help support the PostgreSQL project by donating at
> 
>http://www.postgresql.org/about/donate

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Performance of count(*) on large tables vs SQL Server

2005-02-01 Thread PFC

> clause will be a cheap query - and use it to test if
> a table is empty, for instance. (because for
> Oracle/Sybase/SQL Server, count(*) is cheap).

	To test if a table is empty, use SELECT EXISTS or any SELECT with
a LIMIT 1...
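Both variants stop after the first row instead of scanning the whole table (table name borrowed from earlier in the thread):

```sql
-- Returns true/false without counting anything.
SELECT EXISTS (SELECT 1 FROM agiraw);

-- Equivalent idea: fetch at most one row; an empty result
-- set means the table is empty.
SELECT 1 FROM agiraw LIMIT 1;
```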

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] Performance of count(*) on large tables vs SQL Server

2005-02-01 Thread Stef

Hello Andrew,
Everything that Shridhar says makes perfect
sense, and, speaking from experience in dealing with
this type of 'problem', everything you say does as 
well. Such is life really :)

I would not be at -all- surprised if Sybase
and Oracle did query re-writing behind the scene's
to send un-defined count's to a temporary table which
holds the row count. For an example of such done in
postgreSQL (using triggers and a custom procedure)
look into the 'General Bits' newsletter. Specifically
http://www.varlena.com/varlena/GeneralBits/49.php

I know, giving a URL as an answer 'sucks', but,
well, it simply repeats my experience. Triggers and
Procedures.
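A minimal sketch of that trigger-and-procedure approach (table, trigger, and function names are invented for the example; note the single counter row serializes concurrent writers, so this trades write concurrency for a cheap count):

```sql
-- Side table holding one counter row per tracked table.
CREATE TABLE rowcount (tbl name PRIMARY KEY, cnt bigint NOT NULL);
INSERT INTO rowcount VALUES ('agiraw', 0);

-- Keep the counter in sync on every insert and delete.
CREATE OR REPLACE FUNCTION count_rows() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE rowcount SET cnt = cnt + 1 WHERE tbl = TG_RELNAME;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE rowcount SET cnt = cnt - 1 WHERE tbl = TG_RELNAME;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER agiraw_count AFTER INSERT OR DELETE ON agiraw
    FOR EACH ROW EXECUTE PROCEDURE count_rows();

-- The count is now a single-row lookup instead of a table scan:
SELECT cnt FROM rowcount WHERE tbl = 'agiraw';
```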

Regards
Steph

On Tue, Feb 01, 2005 at 06:32:56PM +0530, Shridhar Daithankar wrote:
> On Tuesday 01 Feb 2005 6:11 pm, Andrew Mayo wrote:
> > PG, on the other hand, appears to do a full table scan
> > to answer this question, taking nearly 4 seconds to
> > process the query.
> >
> > Doing an ANALYZE on the table and also VACUUM did not
> > seem to affect this.
> >
> > Can PG find a table's row count more efficiently?.
> > This is not an unusual practice in commercial
> > applications which assume that count(*) with no WHERE
> > clause will be a cheap query  - and use it to test if
> > a table is empty, for instance. (because for
> > Oracle/Sybase/SQL Server, count(*) is cheap).
> 
> First of all, such an assumption is no good. It will hurt concurrency under 
> heavy load, but I know people do use it.
> 
> For the specific question, after a vacuum analyze, you can use 
> 
> select reltuples from pg_class where relname='Foo';
> 
> Remember, you will get different results between 'analyze' and 'vacuum 
> analyze', since the latter actually visits every page in the table and hence 
> is expected to be more accurate.
> 
> > (sure, I appreciate there are other ways of doing
> > this, but I am curious about the way PG works here).
> 
> The answer is MVCC and PG's inability to use an index alone. This has been a 
> FAQ for a long time. Furthermore, PG has custom aggregates to complicate the 
> matter.
> 
> Most of the pg developers/users think that an unqualified select count(*) is 
> of no use. You can search the archives for more details.
> 
>  HTH
> 
>  Shridhar
> 
> ---(end of broadcast)---
> TIP 6: Have you searched our list archives?
> 
>http://archives.postgresql.org
> 




Re: [PERFORM] Performance of count(*) on large tables vs SQL Server

2005-02-01 Thread Shridhar Daithankar
On Tuesday 01 Feb 2005 6:11 pm, Andrew Mayo wrote:
> PG, on the other hand, appears to do a full table scan
> to answer this question, taking nearly 4 seconds to
> process the query.
>
> Doing an ANALYZE on the table and also VACUUM did not
> seem to affect this.
>
> Can PG find a table's row count more efficiently?.
> This is not an unusual practice in commercial
> applications which assume that count(*) with no WHERE
> clause will be a cheap query  - and use it to test if
> a table is empty, for instance. (because for
> Oracle/Sybase/SQL Server, count(*) is cheap).

First of all, such an assumption is no good. It will hurt concurrency under 
heavy load, but I know people do use it.

For the specific question, after a vacuum analyze, you can use 

select reltuples from pg_class where relname='Foo';

Remember, you will get different results between 'analyze' and 'vacuum 
analyze', since the latter actually visits every page in the table and hence is 
expected to be more accurate.
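A sketch of the difference (same lookup after each command; the post-VACUUM figure is expected to track the true count more closely):

```sql
ANALYZE "Foo";          -- samples pages; reltuples is a sampled estimate
SELECT reltuples FROM pg_class WHERE relname = 'Foo';

VACUUM ANALYZE "Foo";   -- visits every page; reltuples is near-exact
SELECT reltuples FROM pg_class WHERE relname = 'Foo';
```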

> (sure, I appreciate there are other ways of doing
> this, but I am curious about the way PG works here).

The answer is MVCC and PG's inability to use an index alone. This has been a 
FAQ for a long time. Furthermore, PG has custom aggregates to complicate the 
matter.

Most of the pg developers/users think that an unqualified select count(*) is of 
no use. You can search the archives for more details.

 HTH

 Shridhar

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org