Re: [PERFORM] Performance of count(*)

2007-03-23 Thread Tino Wildenhain

Michael Stone wrote:

On Fri, Mar 23, 2007 at 01:01:02PM +0100, Tino Wildenhain wrote:

This discussion is a bit theoretical until we see the actual problem
and the proposed solution here.


It's good to see you back off a bit from your previous stance of 
assuming that someone doesn't know what they're doing and that their 
solution is absolutely wrong without actually knowing anything about 
what they are trying to do.


Well I'm sure it's very likely wrong :-) At least the core part of
it, with the statement about "keeping index data outside postgres".

What I meant with my comment about it being theoretical: we cannot
make educated suggestions about alternative solutions to the problem
until we know the problem, and maybe the current solution, in detail.

Regards
Tino

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Performance of count(*)

2007-03-23 Thread Michael Stone

On Fri, Mar 23, 2007 at 01:01:02PM +0100, Tino Wildenhain wrote:

This discussion is a bit theoretical until we see the actual problem
and the proposed solution here.


It's good to see you back off a bit from your previous stance of 
assuming that someone doesn't know what they're doing and that their 
solution is absolutely wrong without actually knowing anything about 
what they are trying to do.


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-23 Thread Tino Wildenhain

Michael Stone wrote:

On Thu, Mar 22, 2007 at 06:27:32PM +0100, Tino Wildenhain wrote:

Craig A. James wrote:
You guys can correct me if I'm wrong, but the key feature that's 
missing from Postgres's flexible indexing is the ability to maintain 
state across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)


Did you read the email before correcting it? From the part you trimmed out:

The problem is that relational databases were invented before the web 
and its stateless applications.  In the "good old days", you could 
connect to a database and work for hours, and in that environment 
cursors and such work well -- the RDBMS maintains the internal state 
of the indexing system.  But in a web environment, state information 
is very difficult to maintain.  There are all sorts of systems that 
try (Enterprise Java Beans, for example), but they're very complex.


Yes, but actually this is not true. They are not so complex in this
regard. All you have to do is look in the pg_cursors view to see if
your cursor is there and, if not, create it in your session.
All you need to maintain is the cursor name, which maps to your
session + the specific query you run. This should be easy
in any web application.
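A minimal sketch of the naming scheme Tino describes (session + query mapping to a cursor name). The helper name is hypothetical, and the SQL shown in the comments assumes a PostgreSQL backend:

```python
import hashlib

def cursor_name(session_id: str, query: str) -> str:
    # Derive a stable, valid SQL identifier from the session id and the
    # query text, so the same (session, query) pair always maps to the
    # same cursor name across requests.
    digest = hashlib.sha1((session_id + "\x00" + query).encode()).hexdigest()
    return "cur_" + digest[:16]

# On each request the application would then (against PostgreSQL):
#   SELECT 1 FROM pg_cursors WHERE name = %s;        -- does it exist?
#   DECLARE cur_... CURSOR WITH HOLD FOR <query>;    -- create if missing
#   FETCH 10 FROM cur_...;                           -- page through it
name = cursor_name("sess-42", "SELECT a, b FROM foo ORDER BY a")
```

Because the name is derived deterministically, no server-side session table is needed just to remember which cursor belongs to which user and query.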

It sounds like they wrote their own middleware to handle the problem, 
which is basically what you suggested (a "clever pooling web 
application") after saying "wrong".


I read about "building index data outside postgres", which is still
the wrong approach imho.

This discussion is a bit theoretical until we see the actual problem
and the proposed solution here.

Regards
Tino



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tino Wildenhain

Craig A. James wrote:

Tino Wildenhain wrote:
You guys can correct me if I'm wrong, but the key feature that's 
missing from Postgres's flexible indexing is the ability to maintain 
state across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)


That's my whole point: If relational databases had a simple mechanism 
for storing their internal state in an external application, the need 
for cursors, connection pools, and all those other tricks would be 
eliminated.


Well, the cursor is exactly the simple handle to the internal
state of the relational db you are looking for.
Do you really think transferring the whole query tree, open index
and data files to the client over the network would really improve
the situation?

As I said earlier, relational technology was invented in an earlier era, 
and hasn't caught up with the reality of modern web apps.


There is nothing modern about today's web apps.


If it's a naive web application, you just store your session
in tables, where you can easily maintain the scroll state
as well.


One thing I've learned in 25 years of software development is that 
people who use my software have problems I never imagined.  I've been 
the one who was naive when I said similar things about my customers, and 
was later embarrassed to learn that their problems were more complex 
than I ever imagined.


Sure, it really depends on the application what the best solution
would look like, but I'm quite certain that counterfeiting internal
machinery of the underlying relational database in the application
creates more problems than it solves. If you can't handle SQL,
don't use SQL; you can build web applications w/o any relational
database if you want.

Regards
Tino Wildenhain



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tino Wildenhain

Michael Stone wrote:

On Thu, Mar 22, 2007 at 02:24:39PM -0400, Merlin Moncure wrote:

Tino was saying that rather than build a complete indexing storage
management solution that lives outside the database, it is better to
do intelligent session management so that you get the simplicity of a
two-tier client/server system but the scalability of a web app.


No, what he was saying was "there's this thing called a cursor". I 
thought there was enough information in the original message to indicate 
that the author knew about cursors. There are certainly pros and cons 
and nuances to different approaches, but Tino's message didn't touch on 
anything that specific.


Sure, message threads sometimes lose history, so I wasn't 100% sure
what the framework really is - although I assumed it could be a web
solution. With stand-alone applications you usually have a limited
number of users connecting, and they stay connected during the session,
so you can easily use cursors there.

And even if you do use some kind of "intelligent session management", 
how many simultaneous cursors can postgres sanely keep track of? 
Thousands? Millions? Tens of Millions? I suspect there's a scalability 
limit in there somewhere. Luckily I don't spend much time in the web 
application space, so I don't need to know.  :)


Depending on the application, you can even simulate the above situation
with a web framework, if you manage sessions in the web framework
with persistent connections for a limited number of users working at
the same time (certainly not feasible for a public web shop, but fine
for in-house data management systems). In this case, cursors
would be perfect too.

In any other case I fail to see the advantage of storing "index
data" outside the database, with all the round-tripping involved.

If the query is complex and rerunning it for every batch is
expensive, fetching the whole result to the application in
case users really traverse the complete batch (how often is
that really done? I mean, who browses to the end of a huge
result set?) is costly as well, without any real benefit.

It would be much easier and cleaner, imho, in this case
to fetch the data into a scratch table linked to the
session and batch.

If it's fast, or you can prepare a batch helper table
with an index, you can just select the batch-equivalent
portion of the result.

You don't need extensive session management in the
web application to scroll through result sets in this
way. This can all be encoded in forms or links.
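The scratch-table approach described above can be illustrated like this (a minimal sketch using SQLite for portability; the table and column names are hypothetical, and a real deployment would use a Postgres helper table keyed by session id):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO foo VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1000)])

# Run the expensive query once and materialize it, numbered, into a
# scratch table linked to the session.
conn.execute("""
    CREATE TABLE batch_scratch AS
    SELECT 'sess-42' AS session_id,
           ROW_NUMBER() OVER (ORDER BY a) AS pos,
           a, b
    FROM foo WHERE a % 3 = 0
""")

def page(session_id, offset, limit=10):
    # Later page requests hit only the small numbered scratch table,
    # not the original expensive query.
    cur = conn.execute(
        "SELECT a, b FROM batch_scratch "
        "WHERE session_id = ? AND pos > ? AND pos <= ?",
        (session_id, offset, offset + limit))
    return cur.fetchall()

first = page("sess-42", 0)
later = page("sess-42", 100)
```

The only state the form or link has to carry is the session id and the offset, which is exactly the "encoded in forms or links" point.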

Regards
Tino





Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tom Lane
"Craig A. James" <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> You mean
>> http://archives.postgresql.org/pgsql-performance/2006-10/msg00283.php
>> ?  I don't see anything there that bears on Steve's suggestion.

> Mea culpa, it's October 8, not October 18:
>http://archives.postgresql.org/pgsql-performance/2006-10/msg00143.php

I still don't see the relevance to Steve's suggestion.

regards, tom lane



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Tom Lane wrote:

"Craig A. James" <[EMAIL PROTECTED]> writes:

Steve Atkins wrote:

As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.



We've been over this before in this forum: It doesn't work as advertised.
Look for postings by me regarding the fact that there is no way to tell
the optimizer the cost of executing a function.  There's one, for example,
on Oct 18, 2006.


You mean
http://archives.postgresql.org/pgsql-performance/2006-10/msg00283.php
?  I don't see anything there that bears on Steve's suggestion.
(The complaint is obsolete as of CVS HEAD anyway.)


Mea culpa, it's October 8, not October 18:

  http://archives.postgresql.org/pgsql-performance/2006-10/msg00143.php

The relevant part is this:

"My example, discussed previously in this forum, is a classic.  I have a VERY 
expensive function (it's in the class of NP-complete problems, so there is no faster 
way to do it).  There is no circumstance when my function should be used as a 
filter, and no circumstance when it should be done before a join.  But PG has no way 
of knowing the cost of a function, and so the optimizer assigns the same cost to 
every function.  Big disaster.

"The result?  I can't use my function in any WHERE clause that involves any 
other conditions or joins.  Only by itself.  PG will occasionally decide to use my 
function as a filter instead of doing the join or the other WHERE conditions first, 
and I'm dead.

"The interesting thing is that PG works pretty well for me on big tables -- it does 
the join first, then applies my expensive functions.  But with a SMALL (like 50K rows) 
table, it applies my function first, then does the join.  A search that completes in 1 
second on a 5,000,000 row database can take a minute or more on a 50,000 row 
database."

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Stone

On Thu, Mar 22, 2007 at 02:24:39PM -0400, Merlin Moncure wrote:

Tino was saying that rather than build a complete indexing storage
management solution that lives outside the database, it is better to
do intelligent session management so that you get the simplicity of a
two-tier client/server system but the scalability of a web app.


No, what he was saying was "there's this thing called a cursor". I 
thought there was enough information in the original message to indicate 
that the author knew about cursors. There are certainly pros and cons 
and nuances to different approaches, but Tino's message didn't touch on 
anything that specific. 

And even if you do use some kind of "intelligent session management", 
how many simultaneous cursors can postgres sanely keep track of? 
Thousands? Millions? Tens of Millions? I suspect there's a scalability 
limit in there somewhere. Luckily I don't spend much time in the web 
application space, so I don't need to know.  :)


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tom Lane
"Craig A. James" <[EMAIL PROTECTED]> writes:
> Steve Atkins wrote:
>> As long as you're ordering by some row in the table then you can do that in
>> straight SQL.
>> 
>> select a, b, ts from foo where (stuff) and foo > X order by foo limit 10
>> 
>> Then, record the last value of foo you read, and plug it in as X the next
>> time around.

> We've been over this before in this forum: It doesn't work as advertised.  
> Look for postings by me regarding the fact that there is no way to tell the 
> optimizer the cost of executing a function.  There's one, for example, on Oct 
> 18, 2006.

You mean
http://archives.postgresql.org/pgsql-performance/2006-10/msg00283.php
?  I don't see anything there that bears on Steve's suggestion.
(The complaint is obsolete as of CVS HEAD anyway.)

regards, tom lane



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Steve Atkins


On Mar 22, 2007, at 11:26 AM, Guido Neitzer wrote:


On 22.03.2007, at 11:53, Steve Atkins wrote:

As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.


This only works if you have unique values in foo. You might
have "batch breaks" inside a list of rows with equal values for foo.


If I don't have unique values in foo, I certainly have unique values  
in (foo, pk).
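Steve's point about (foo, pk) can be sketched concretely (SQLite here for portability; the table and column names are hypothetical). Paging on the composite key avoids batch breaks even when foo has duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY, foo INTEGER)")
# foo has many duplicates: 0,0,0,0,1,1,1,1,...
conn.executemany("INSERT INTO t (pk, foo) VALUES (?, ?)",
                 [(i, i // 4) for i in range(100)])

def next_page(last_foo, last_pk, limit=10):
    # Row-value comparison resumes exactly after (last_foo, last_pk),
    # so rows sharing the same foo are never skipped or repeated.
    cur = conn.execute(
        "SELECT foo, pk FROM t WHERE (foo, pk) > (?, ?) "
        "ORDER BY foo, pk LIMIT ?",
        (last_foo, last_pk, limit))
    return cur.fetchall()

page1 = next_page(-1, -1)        # first page
page2 = next_page(*page1[-1])    # resume from the last row seen
```

A page boundary can fall in the middle of a run of equal foo values and the next page still picks up exactly where the last one ended.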




But: a web application that needs state and doesn't maintain it by
itself (or inside the dev toolkit) is imho broken by design. How
should the database store a "state" for a web app? It's only
possible on the web app's side, because the app is either stateless --
and so are its queries to the database; they have to be re-evaluated
for every request, as requests might come from totally different
sources (users, ...) -- or it is stateful and has to maintain the
state itself, because only the app developer knows what information
is needed for the "current state".


This is why all web application toolkits have a "session" concept.


Yes. HTTP is not very stateful. Web applications are stateful. There  
are some really obvious approaches to maintaining state cleanly that  
work well with databases and let you do some quite complex stuff  
(tying a persistent database connection to a single user, for  
instance). But they don't scale at all well.


What Craig was suggesting is, basically, to assign a persistent
database connection to each user. But rather than maintaining that
connection as a running process, to serialise all the state out of
the database connection and store it in the webapp; then, when the
next action from that user comes in, take a database connection and
stuff all that state into it again.


It's a lovely idea, but strikes me as completely infeasible in the  
general case. There's just too much state there. Doing it in the  
specific case is certainly possible, but rapidly devolves to the  
standard approach of "On the first page of results, run the query and  
record the first 5000 results. Store those in a scratch table,  
indexed by session-id, or in external storage. On displaying later  
pages of results to the same user, pull directly from the already  
calculated results."





I think the problem is more that most web developers aren't very good
at using the database, and tend to fall back on simplistic, wrong
approaches to displaying the data. There's a lot of monkey-see,
monkey-do in web UI design too, which doesn't help.


Sure. That is the other problem ... ;-) But, and I think this is
much more important: most toolkits today free you from using the
database directly and from writing lots and lots of lines of SQL code
that instantly breaks when you switch the storage backend. It's
just a matter of which side you look at it from.


The real problem is the user-interface is designed around what is  
easy to implement in elderly cgi scripts, rather than what's  
appropriate to the data being displayed or useful to the user.  
Displaying tables of results, ten at a time, is just one of the more  
egregious examples of that.


Cheers,
  Steve




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Guido Neitzer

On 22.03.2007, at 11:53, Steve Atkins wrote:

As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.


This only works if you have unique values in foo. You might have
"batch breaks" inside a list of rows with equal values for foo.


But: a web application that needs state and doesn't maintain it by
itself (or inside the dev toolkit) is imho broken by design. How
should the database store a "state" for a web app? It's only
possible on the web app's side, because the app is either stateless --
and so are its queries to the database; they have to be re-evaluated
for every request, as requests might come from totally different
sources (users, ...) -- or it is stateful and has to maintain the
state itself, because only the app developer knows what information
is needed for the "current state".


This is why all web application toolkits have a "session" concept.


I think the problem is more that most web developers aren't very good
at using the database, and tend to fall back on simplistic, wrong
approaches to displaying the data. There's a lot of monkey-see,
monkey-do in web UI design too, which doesn't help.


Sure. That is the other problem ... ;-) But, and I think this is much
more important: most toolkits today free you from using the database
directly and from writing lots and lots of lines of SQL code that
instantly breaks when you switch the storage backend. It's just a
matter of which side you look at it from.


cug



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Merlin Moncure

On 3/22/07, Michael Stone <[EMAIL PROTECTED]> wrote:

On Thu, Mar 22, 2007 at 06:27:32PM +0100, Tino Wildenhain wrote:
>Craig A. James wrote:
>>You guys can correct me if I'm wrong, but the key feature that's missing
>>from Postgres's flexible indexing is the ability to maintain state
>>across queries.  Something like this:
>>
>>  select a, b, my_index_state() from foo where ...
>>offset 100 limit 10 using my_index(prev_my_index_state);
>>
>
>Yes, you are wrong :-) The technique is called a "CURSOR",
>if you maintain a persistent connection per session
>(e.g. a stand-alone application or a clever pooling web application)

Did you read the email before correcting it? From the part you trimmed
out:

>The problem is that relational databases were invented before the web
>and its stateless applications.  In the "good old days", you could
>connect to a database and work for hours, and in that environment
>cursors and such work well -- the RDBMS maintains the internal state of
>the indexing system.  But in a web environment, state information is
>very difficult to maintain.  There are all sorts of systems that try
>(Enterprise Java Beans, for example), but they're very complex.

It sounds like they wrote their own middleware to handle the problem,
which is basically what you suggested (a "clever pooling web
application") after saying "wrong".


Tino was saying that rather than build a complete indexing storage
management solution that lives outside the database, it is better to
do intelligent session management so that you get the simplicity of a
two-tier client/server system but the scalability of a web app.

Web apps are not necessarily stateless; you just have to be a little
clever about how database connections are opened and closed.  Then you
get all the database features that come along with a persistent
connection (advisory locks, cursors, prepared statements, etc.) without
building all kinds of data management into the middleware.
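Merlin's "a little clever about how database connections are opened and closed" can be sketched as a session-pinned pool (a toy sketch; the class name is hypothetical, and real pooling would also need locking, eviction, and reconnect handling):

```python
import sqlite3

class SessionPinnedPool:
    """Map each web session to its own persistent connection, so
    per-connection state (cursors, prepared statements, advisory
    locks) survives across requests from the same user."""
    def __init__(self, factory):
        self._factory = factory
        self._by_session = {}

    def connection_for(self, session_id):
        # Reuse the session's pinned connection, creating it lazily.
        if session_id not in self._by_session:
            self._by_session[session_id] = self._factory()
        return self._by_session[session_id]

pool = SessionPinnedPool(lambda: sqlite3.connect(":memory:"))
c1 = pool.connection_for("sess-a")
c1.execute("CREATE TABLE scroll_state (pos INTEGER)")  # per-session state
c2 = pool.connection_for("sess-a")   # same connection, state preserved
c3 = pool.connection_for("sess-b")   # different session, fresh connection
```

The point is only the pinning: as long as requests from a session reach the same live connection, everything the database keeps per connection comes along for free.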

merlin



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Steve Atkins wrote:

As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.


We've been over this before in this forum: It doesn't work as advertised.  Look 
for postings by me regarding the fact that there is no way to tell the 
optimizer the cost of executing a function.  There's one, for example, on Oct 
18, 2006.


I think the problem is more that most web developers aren't very good
at using the database, and tend to fall back on simplistic, wrong
approaches to displaying the data. There's a lot of monkey-see,
monkey-do in web UI design too, which doesn't help.


Thanks, I'm sure your thoughtful comments will help me solve my problem.  
Somehow. ;-)

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Steve Atkins


On Mar 22, 2007, at 10:21 AM, Craig A. James wrote:


Tino Wildenhain wrote:

Craig A. James wrote:
...
In our case (for a variety of reasons, but this one is critical),  
we actually can't use Postgres indexing at all -- we wrote an  
entirely separate indexing system for our data...


...There is no need to store or
maintain this information along with postgres when you can store
and maintain it directly in postgres as well.


Whether we store our data inside or outside Postgres misses the  
point (in fact, most of our data is stored IN Postgres).  It's the  
code that actually performs the index operation that has to be  
external to Postgres.



On top of that, postgres has a very flexible and extensible index
system.


You guys can correct me if I'm wrong, but the key feature that's  
missing from Postgres's flexible indexing is the ability to  
maintain state across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);

The my_index_state() function would issue something like a  
"cookie", an opaque text or binary object that would record  
information about how it got from row 1 through row 99.  When you  
issue the query above, it could start looking for row 100 WITHOUT  
reexamining rows 1-99.


This could be tricky in an OLTP environment, where the "cookie"
could be invalidated by changes to the database.  But in warehouse
read-mostly or read-only environments, it could yield vastly
improved performance for database web applications.


If I'm not mistaken, neither Postgres nor Oracle, MySQL, or any
other RDBMS can do this.  I would love to be corrected.


As long as you're ordering by some row in the table then you can do that in
straight SQL.

select a, b, ts from foo where (stuff) and foo > X order by foo limit 10

Then, record the last value of foo you read, and plug it in as X the next
time around.

This has the advantage over a simple offset approach of actually
displaying all the data as a user pages through it too. (Consider
the case where the user is viewing offsets 91-100, and you delete
the record at offset 15. The user goes to the next page and will
miss the record that used to be at offset 101 and is now at offset
100).
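The deletion scenario Steve describes can be sketched end-to-end (a minimal sketch using SQLite; the table name is hypothetical). Keyset paging keeps its place after a delete, where OFFSET-based paging silently skips a row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo_t (foo INTEGER PRIMARY KEY, a TEXT)")
conn.executemany("INSERT INTO foo_t VALUES (?, ?)",
                 [(i, f"val{i}") for i in range(1, 201)])

def keyset_page(last_foo, limit=10):
    # Resume strictly after the last foo value seen, as in Steve's SQL.
    return conn.execute(
        "SELECT foo, a FROM foo_t WHERE foo > ? ORDER BY foo LIMIT ?",
        (last_foo, limit)).fetchall()

page = keyset_page(90)                            # user views foo 91..100
conn.execute("DELETE FROM foo_t WHERE foo = 15")  # row deleted meanwhile

next_keyset = keyset_page(page[-1][0])            # foo 101..110, no gap
next_offset = conn.execute(
    "SELECT foo, a FROM foo_t ORDER BY foo LIMIT 10 OFFSET 100").fetchall()
```

After the delete, every row past position 15 shifts down one slot, so the OFFSET query starts at foo=102 and the user never sees foo=101; the keyset query is unaffected.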

The problem is that relational databases were invented before the  
web and its stateless applications.  In the "good old days", you  
could connect to a database and work for hours, and in that  
environment cursors and such work well -- the RDBMS maintains the  
internal state of the indexing system.  But in a web environment,  
state information is very difficult to maintain.  There are all  
sorts of systems that try (Enterprise Java Beans, for example), but  
they're very complex.


I think the problem is more that most web developers aren't very good
at using the database, and tend to fall back on simplistic, wrong
approaches to displaying the data. There's a lot of monkey-see,
monkey-do in web UI design too, which doesn't help.

Cheers,
  Steve




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Tino Wildenhain wrote:
You guys can correct me if I'm wrong, but the key feature that's 
missing from Postgres's flexible indexing is the ability to maintain 
state across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)


That's my whole point: If relational databases had a simple mechanism for 
storing their internal state in an external application, the need for cursors, 
connection pools, and all those other tricks would be eliminated.

As I said earlier, relational technology was invented in an earlier era, and 
hasn't caught up with the reality of modern web apps.


If it's a naive web application, you just store your session
in tables, where you can easily maintain the scroll state
as well.


One thing I've learned in 25 years of software development is that people who 
use my software have problems I never imagined.  I've been the one who was 
naive when I said similar things about my customers, and was later embarrassed 
to learn that their problems were more complex than I ever imagined.

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Stone

On Thu, Mar 22, 2007 at 06:27:32PM +0100, Tino Wildenhain wrote:

Craig A. James wrote:
You guys can correct me if I'm wrong, but the key feature that's missing 
from Postgres's flexible indexing is the ability to maintain state 
across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)


Did you read the email before correcting it? From the part you trimmed 
out:


The problem is that relational databases were invented before the web 
and its stateless applications.  In the "good old days", you could 
connect to a database and work for hours, and in that environment 
cursors and such work well -- the RDBMS maintains the internal state of 
the indexing system.  But in a web environment, state information is 
very difficult to maintain.  There are all sorts of systems that try 
(Enterprise Java Beans, for example), but they're very complex.


It sounds like they wrote their own middleware to handle the problem, 
which is basically what you suggested (a "clever pooling web 
application") after saying "wrong".


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tino Wildenhain

Craig A. James wrote:

Tino Wildenhain wrote:

Craig A. James wrote:
...
In our case (for a variety of reasons, but this one is critical), we 
actually can't use Postgres indexing at all -- we wrote an entirely 
separate indexing system for our data...


...There is no need to store or
maintain this information along with postgres when you can store
and maintain it directly in postgres as well.


Whether we store our data inside or outside Postgres misses the point 
(in fact, most of our data is stored IN Postgres).  It's the code that 
actually performs the index operation that has to be external to Postgres.



On top of that, postgres has a very flexible and extensible index
system.


You guys can correct me if I'm wrong, but the key feature that's missing 
from Postgres's flexible indexing is the ability to maintain state 
across queries.  Something like this:


 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);



Yes, you are wrong :-) The technique is called a "CURSOR",
if you maintain a persistent connection per session
(e.g. a stand-alone application or a clever pooling web application)

If it's a naive web application, you just store your session
in tables, where you can easily maintain the scroll state
as well.

Regards
Tino



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Tino Wildenhain wrote:

Craig A. James wrote:
...
In our case (for a variety of reasons, but this one is critical), we 
actually can't use Postgres indexing at all -- we wrote an entirely 
separate indexing system for our data...


...There is no need to store or
maintain this information along with postgres when you can store
and maintain it directly in postgres as well.


Whether we store our data inside or outside Postgres misses the point (in fact, 
most of our data is stored IN Postgres).  It's the code that actually performs 
the index operation that has to be external to Postgres.


On top of that, postgres has a very flexible and extensible index
system.


You guys can correct me if I'm wrong, but the key feature that's missing from 
Postgres's flexible indexing is the ability to maintain state across queries.  
Something like this:

 select a, b, my_index_state() from foo where ...
   offset 100 limit 10 using my_index(prev_my_index_state);

The my_index_state() function would issue something like a "cookie", an opaque 
text or binary object that would record information about how it got from row 1 through 
row 99.  When you issue the query above, it could start looking for row 100 WITHOUT 
reexamining rows 1-99.

This could be tricky in an OLTP environment, where the "cookie" could be
invalidated by changes to the database.  But in warehouse read-mostly or read-only
environments, it could yield vastly improved performance for database web applications.

If I'm not mistaken, neither Postgres nor Oracle, MySQL, or any other RDBMS can
do this.  I would love to be corrected.

The problem is that relational databases were invented before the web and its stateless 
applications.  In the "good old days", you could connect to a database and work 
for hours, and in that environment cursors and such work well -- the RDBMS maintains the 
internal state of the indexing system.  But in a web environment, state information is 
very difficult to maintain.  There are all sorts of systems that try (Enterprise Java 
Beans, for example), but they're very complex.

Craig




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Brian Hurt wrote:
One of our biggest single problems is this very thing.  It's not a 
Postgres problem specifically, but more embedded in the idea of a 
relational database: There are no "job status" or "rough estimate of 
results" or "give me part of the answer" features that are critical to 
many real applications.


For the "give me part of the answer", I'm wondering if cursors wouldn't 
work (and if not, why not)?


There is no mechanism in Postgres (or any RDB that I know of) to say, "Give me rows 
1000 through 1010", that doesn't also execute the query on rows 1-1000.  In other 
words, the RDBMS does the work for 1010 rows, when only 10 are needed -- 100 times more 
work than is necessary.

Limit/Offset will return the correct 10 rows, but at the cost of doing the 
previous 1000 rows and discarding them.

Web applications are stateless.  To use a cursor, you'd have to keep it around for hours 
or days, and create complex "server affinity" code to direct a user back to the 
same server of your server farm (where that cursor is being held), on the chance that the 
user will come back and ask for rows 1000 through 1010.  A cursor isn't up to that 
task.

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Mario Weilguni
Am Donnerstag, 22. März 2007 16:17 schrieb Andreas Kostyrka:
> * Mario Weilguni <[EMAIL PROTECTED]> [070322 15:59]:
> > Am Donnerstag, 22. März 2007 15:33 schrieb Jonah H. Harris:
> > > On 3/22/07, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> > > > As others suggest select count(*) from table is very special case
> > > > which non-mvcc databases can optimize for.
> > >
> > > Well, other MVCC database still do it faster than we do.  However, I
> > > think we'll be able to use the dead space map for speeding this up a
> > > bit wouldn't we?
> >
> > Which MVCC DB do you mean? Just curious...
>
> Well, mysql claims InnoDB to be mvcc ;)

Ok, but last time I tried, count(*) with InnoDB tables took roughly(*) the 
same time - because InnoDB has the same problem as Postgres 
and has to do a seqscan too (I think it's mentioned somewhere in their docs).

(*) in fact, postgres was faster, but the values were comparable, 40 seconds 
vs. 48 seconds 

Maybe InnoDB has made some progress here; I tested it with MySQL 5.0.18.




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Brian Hurt

Craig A. James wrote:



One of our biggest single problems is this very thing.  It's not a 
Postgres problem specifically, but more embedded in the idea of a 
relational database: There are no "job status" or "rough estimate of 
results" or "give me part of the answer" features that are critical to 
many real applications.


For the "give me part of the answer", I'm wondering if cursors wouldn't 
work (and if not, why not)?


Brian




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Carlos Moreno



count(*).  I was just discussing general performance issues on the
phone line and when my colleague asked me about the size of the
database he just wonderd why this takes so long for a job his
MS-SQL server is much faster.  [...].
   



Simple. MSSQL is optimized for this case, and uses "older"
datastructures. PG uses a MVCC storage, 



Which version of MSSQL?  Wikipedia says that SQL Server 2005 uses the
MVCC model.

Carlos
--




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Tino Wildenhain

Craig A. James schrieb:
...
In our case (for a variety of reasons, but this one is critical), we 
actually can't use Postgres indexing at all -- we wrote an entirely 
separate indexing system for our data, one that has the following 
properties:


 1. It can give out "pages" of information (i.e. "rows 50-60") without
rescanning the skipped pages the way "limit/offset" would.
 2. It can give accurate estimates of the total rows that will be returned.
 3. It can accurately estimate the time it will take.



That's certainly not entirely correct. There is no need to store or
maintain this information outside Postgres when you can store
and maintain it directly in Postgres as well. Since you have some
outside application, I think I can safely assume you are doing
fewer updates than reads, so keeping it in the database would actually pay off.

So why not store this information in separate "index" and "statistic"
tables? You would have just to join with your real data for retrival.

On top of that, Postgres has a very flexible and extensible index
system. This would mean you save on database roundtrips and
duplicated information storage (and the sync problems you certainly
get from it).
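A hedged sketch of what Tino suggests, with invented names: keep the externally computed ranking and statistics in side tables inside Postgres and join them to the real data at retrieval time.

```sql
-- Hypothetical side tables holding the externally computed index data.
CREATE TABLE doc_index (
    doc_id    integer PRIMARY KEY REFERENCES documents(id),
    rank      integer NOT NULL        -- precomputed ordering
);
CREATE INDEX doc_index_rank ON doc_index (rank);

CREATE TABLE doc_stats (
    query_tag  text PRIMARY KEY,
    est_total  integer NOT NULL       -- precomputed result-count estimate
);

-- "rows 50-60" without rescanning the skipped rows: seek on the
-- precomputed rank instead of using LIMIT/OFFSET.
SELECT d.*
  FROM doc_index i
  JOIN documents d ON d.id = i.doc_id
 WHERE i.rank BETWEEN 50 AND 60
 ORDER BY i.rank;
```

The update path for these tables is the hard part, of course; this pays off only when reads far outnumber writes, as Tino assumes.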

Regards
Tino




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Craig A. James

Michael Stone wrote:

On Thu, Mar 22, 2007 at 01:30:35PM +0200, [EMAIL PROTECTED] wrote:

approximated count?

why? who would need it? where you can use it?


Do a google query. Look at the top of the page, where it says "results N 
to M of about O". For user interfaces (which is where a lot of this 
count(*) stuff comes from) you quite likely don't care about the exact 
count...


Right on, Michael.

One of our biggest single problems is this very thing.  It's not a Postgres problem specifically, but more 
embedded in the idea of a relational database: There are no "job status" or "rough estimate of 
results" or "give me part of the answer" features that are critical to many real applications.

In our case (for a variety of reasons, but this one is critical), we actually 
can't use Postgres indexing at all -- we wrote an entirely separate indexing 
system for our data, one that has the following properties:

 1. It can give out "pages" of information (i.e. "rows 50-60") without
rescanning the skipped pages the way "limit/offset" would.
 2. It can give accurate estimates of the total rows that will be returned.
 3. It can accurately estimate the time it will take.

For our primary business-critical data, Postgres is merely a storage system, not a search 
system, because we have to do the "heavy lifting" in our own code.  (To be 
fair, there is no relational database that can handle our data.)

Many or most web-based search engines face these exact problems.

Craig



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Andreas Kostyrka
* Mario Weilguni <[EMAIL PROTECTED]> [070322 15:59]:
> Am Donnerstag, 22. März 2007 15:33 schrieb Jonah H. Harris:
> > On 3/22/07, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> > > As others suggest select count(*) from table is very special case
> > > which non-mvcc databases can optimize for.
> >
> > Well, other MVCC database still do it faster than we do.  However, I
> > think we'll be able to use the dead space map for speeding this up a
> > bit wouldn't we?
> 
> Which MVCC DB do you mean? Just curious...
Well, mysql claims InnoDB to be mvcc ;)

Andreas



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Luke Lonergan
Andreas,

On 3/22/07 4:48 AM, "Andreas Tille" <[EMAIL PROTECTED]> wrote:

> Well, to be honest I'm not really interested in the performance of
> count(*).  I was just discussing general performance issues on the
> phone line and when my colleague asked me about the size of the
> database he just wonderd why this takes so long for a job his
> MS-SQL server is much faster.  So in principle I was just asking
> a first question that is easy to ask.  Perhaps I come up with
> more difficult optimisation questions.

This may be the clue you needed - in Postgres SELECT COUNT(*) is an
approximate way to measure the speed of your disk setup (up to about
1,200MB/s).  Given that you are having performance problems, it may be that
your disk layout is either:
- slow by design
- malfunctioning

If this is the case, then any of your queries that require a full table scan
will be affected.

You should check your sequential disk performance using the following:

time bash -c "dd if=/dev/zero of=/your_file_system/bigfile bs=8k \
  count=$((your_memory_size_in_KB * 2 / 8)) && sync"
time dd if=/your_file_system/bigfile of=/dev/null bs=8k

Report those times here and we can help you with it.

- Luke





Re: [PERFORM] Performance of count(*)

2007-03-22 Thread mark
On Thu, Mar 22, 2007 at 10:18:10AM -0400, Michael Stone wrote:
> IIRC, that's basically what you get with the mysql count anyway, since 
> there are corner cases for results in a transaction. Avoiding those 
> cases is why the postgres count takes so long; sometimes that's what's 
> desired and sometimes it is not.

Adding to this point:

In any production system, the count presented to the user is usually
wrong very shortly after it is displayed anyways. Transactions in the
background or from other users are adding or removing items, perhaps
even before the count reaches the user's display.

The idea of transaction-safety for counts doesn't apply in this case.
Both the transaction and the number are complete before the value is
displayed.

In my own systems, I rarely use count(*) for anything except user-visible
results. For the PostgreSQL system I use, I keep a table of
counts, and lock the row for update when adding or removing items.
This turns out to be best in this system anyways, as I need my new
rows to be ordered, and locking the 'count' row lets me assign a
new sequence number for the row. (I don't want to use SEQUENCE objects:
the rows are [key, sequence, data], with thousands or more keys,
so each key would need its own sequence.)
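A sketch of the pattern described above, with invented table and column names: one counts row per key, locked for the duration of the insert so the count and the per-key sequence number stay consistent.

```sql
BEGIN;
-- lock this key's counter row; concurrent writers on the same key wait
SELECT n FROM item_counts WHERE key = 42 FOR UPDATE;
-- use n + 1 as the new row's per-key sequence number
INSERT INTO items (key, seq, data)
SELECT 42, n + 1, 'payload' FROM item_counts WHERE key = 42;
UPDATE item_counts SET n = n + 1 WHERE key = 42;
COMMIT;
```

Reading the count back is then a single-row lookup instead of a count(*), at the price of serializing writers per key.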

Cheers,
mark





Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Stone

On Thu, Mar 22, 2007 at 09:39:18AM -0400, Merlin Moncure wrote:

You can get the approximate count by selecting reltuples from
pg_class.  It is valid as of last analyze.


Of course, that only works if you're not using any WHERE clause. 
Here's a (somewhat ugly) example of getting an approximate count based 
off the statistics info, which will work for more complicated queries:

http://archives.postgresql.org/pgsql-sql/2005-08/msg00046.php
The ugliness is that you have to provide the whole query as a 
parameter to the function, instead of using it as a drop-in replacement 
for count. I assume that the TODO item is to provide the latter, but for 
now this method can be useful for UI type stuff where you just want to 
know whether there's "a little" or "a lot".


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Fuhr
On Thu, Mar 22, 2007 at 01:29:46PM +0100, Andreas Kostyrka wrote:
> * Andreas Tille <[EMAIL PROTECTED]> [070322 13:24]:
> > Well, to be honest I'm not really interested in the performance of
> > count(*).  I was just discussing general performance issues on the
> > phone line and when my colleague asked me about the size of the
> > database he just wonderd why this takes so long for a job his
> > MS-SQL server is much faster.  So in principle I was just asking
> > a first question that is easy to ask.  Perhaps I come up with
> > more difficult optimisation questions.
> 
> Simple. MSSQL is optimized for this case, and uses "older"
> datastructures. PG uses a MVCC storage, which is not optimized for
> this usecase. It's quite fast for different kinds of queries.

Ask about performing concurrent selects, inserts, updates, and
deletes in SQL Server and about the implications on ACID of locking
hints such as NOLOCK.  Then consider how MVCC handles concurrency
without blocking or the need for dirty reads.

-- 
Michael Fuhr



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Mario Weilguni
Am Donnerstag, 22. März 2007 15:33 schrieb Jonah H. Harris:
> On 3/22/07, Merlin Moncure <[EMAIL PROTECTED]> wrote:
> > As others suggest select count(*) from table is very special case
> > which non-mvcc databases can optimize for.
>
> Well, other MVCC database still do it faster than we do.  However, I
> think we'll be able to use the dead space map for speeding this up a
> bit wouldn't we?

Which MVCC DB do you mean? Just curious...



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Jonah H. Harris

On 3/22/07, Merlin Moncure <[EMAIL PROTECTED]> wrote:

As others suggest select count(*) from table is very special case
which non-mvcc databases can optimize for.


Well, other MVCC database still do it faster than we do.  However, I
think we'll be able to use the dead space map for speeding this up a
bit wouldn't we?

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation| fax: 732.331.1301
33 Wood Ave S, 3rd Floor| [EMAIL PROTECTED]
Iselin, New Jersey 08830| http://www.enterprisedb.com/



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Michael Stone

On Thu, Mar 22, 2007 at 01:30:35PM +0200, [EMAIL PROTECTED] wrote:

approximated count?

why? who would need it? where you can use it?


Do a google query. Look at the top of the page, where it says 
"results N to M of about O". For user interfaces (which is where a lot 
of this count(*) stuff comes from) you quite likely don't care about the 
exact count, because the user doesn't really care about the exact count. 

IIRC, that's basically what you get with the mysql count anyway, since 
there are corner cases for results in a transaction. Avoiding those 
cases is why the postgres count takes so long; sometimes that's what's 
desired and sometimes it is not.


Mike Stone



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Merlin Moncure

On 3/22/07, Andreas Tille <[EMAIL PROTECTED]> wrote:

I just try to find out why a simple count(*) might last that long.
At first I tried explain, which rather quickly knows how many rows
to check, but the final count is two orders of magnitude slower.


You can get the approximate count by selecting reltuples from
pg_class.  It is valid as of last analyze.
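The statistics lookup Merlin describes, using the table name from this thread's example:

```sql
-- Approximate row count from the planner statistics; accurate only
-- as of the last ANALYZE (or VACUUM), and only for the whole table.
SELECT reltuples::bigint AS approx_count
  FROM pg_class
 WHERE relname = 'agiraw';
```

This returns instantly because it reads one catalog row instead of scanning the table, but it cannot reflect a WHERE clause.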

As others suggest, select count(*) from table is a very special case
which non-mvcc databases can optimize for.  There are many reasons why
this is the case and why it explains nothing about the relative
performance of the two databases.  This is probably the #1 most
frequently asked question on -performance... there is a wealth of
information in the archives.

merlin



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Mario Weilguni
Am Donnerstag, 22. März 2007 12:30 schrieb [EMAIL PROTECTED]:
> approximated count?
>
> why? who would need it? where you can use it?
>
> calculating costs and desiding how to execute query needs
> approximated count, but it's totally worthless information for any user
> IMO.

No, it is not useless. Try: 
http://www.google.com/search?hl=de&q=test&btnG=Google-Suche&meta=

Do you really think google counted each of those individual 895 million 
results? It doesn't. In fact, the estimate of google can be off by an order 
of magnitude, and still nobody complains...




Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Bill Moran
In response to [EMAIL PROTECTED]:
> 
> approximated count?
> 
> why? who would need it? where you can use it?
> 
> calculating costs and desiding how to execute query needs 
> approximated count, but it's totally worthless information for any user 
> IMO.

I don't think so.

We have some AJAX stuff where users enter search criteria on a web form,
and the # of results updates in "real time" as they change their criteria.

Right now, this works fine with small tables using count(*) -- it's fast
enough not to be an issue, but we're aware that we can't use it on large
tables.

An estimate_count(*) or similar that would allow us to put an estimate of
how many results will be returned (not guaranteed accurate) would be very
nice to have in these cases.

We're dealing with complex sets of criteria.  It's very useful for the users
to know in "real time" how much their search criteria are affecting the
result pool.  Once they feel they've limited as much as they can without
reducing the pool too much, they can hit submit and get the actual result.

As I said, we do this with small data sets, but it's not terribly useful
there.  Where it will be useful is searches of large data sets, where
constantly submitting and then retrying is overly time-consuming.

Of course, this is count(*)ing the results of a complex query, possibly
with a bunch of joins and many limitations in the WHERE clause, so I'm
not sure what could be done overall to improve the response time.
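One way to get such an estimate_count today is the approach from the archives link quoted earlier in this thread: run EXPLAIN on the search query and pull the planner's row estimate out of the top plan node. A hedged sketch (the function name and exact parsing are illustrative, not an official API):

```sql
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec  record;
    est  integer;
BEGIN
    -- EXPLAIN returns one text row per plan line; the first line is the
    -- top plan node, whose rows= figure is the planner's estimate.
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        est := substring(rec."QUERY PLAN" FROM 'rows=([[:digit:]]+)');
        EXIT;
    END LOOP;
    RETURN est;
END;
$$ LANGUAGE plpgsql;

-- e.g.: SELECT count_estimate('SELECT * FROM agiraw WHERE ...');
```

The estimate is only as good as the statistics and can be off by an order of magnitude, which, as noted above, is often fine for "a little vs. a lot" UI feedback.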

> On Thu, 22 Mar 2007, Albert Cervera Areny wrote:
> 
> > As you can see, PostgreSQL needs to do a sequencial scan to count because 
> > its 
> > MVCC nature and indices don't have transaction information. It's a known 
> > drawback inherent to the way PostgreSQL works and which gives very good 
> > results in other areas. It's been talked about adding some kind of 
> > approximated count which wouldn't need a full table scan but I don't think 
> > there's anything there right now.
> > 
> > A Dijous 22 Març 2007 11:53, Andreas Tille va escriure:
> > > Hi,
> > >
> > > I just try to find out why a simple count(*) might last that long.
> > > At first I tried explain, which rather quickly knows how many rows
> > > to check, but the final count is two orders of magnitude slower.
> > >
> > > My MS_SQL server using colleague can't believe that.
> > >
> > > $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
> > >QUERY PLAN
> > > ---
> > >   Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
> > > ->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> > > (2 rows)
> > >
> > > real0m0.066s
> > > user0m0.024s
> > > sys 0m0.008s
> > >
> > > $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
> > >count
> > > -
> > >   4708941
> > > (1 row)
> > >
> > > real0m4.474s
> > > user0m0.036s
> > > sys 0m0.004s
> > >
> > >
> > > Any explanation?
> > >
> > > Kind regards
> > >
> > >   Andreas.
> > 
> > -- 
> > Albert Cervera Areny
> > Dept. Informàtica Sedifa, S.L.
> > 
> > Av. Can Bordoll, 149
> > 08202 - Sabadell (Barcelona)
> > Tel. 93 715 51 11
> > Fax. 93 715 51 12
> > 
> > 

Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Andreas Kostyrka
* Andreas Tille <[EMAIL PROTECTED]> [070322 13:24]:
> On Thu, 22 Mar 2007, Andreas Kostyrka wrote:
> 
> >Which version of PG?
> 
> Ahh, sorry, forgot that.  The issue occurs in Debian (Etch) packaged
> version 7.4.16.  I plan to switch soon to 8.1.8.
I'd recommend 8.2 if at all possible :)
> 
> >That's the reason why PG (check the newest releases, I seem to
> >remember that there has been some aggregate optimizations there),
> 
> I'll verify this once I moved to the new version.
8.1 won't help you I'd guess. ;)

> 
> >does
> >a SeqScan for select count(*) from table. btw, depending upon your
> >data, doing a select count(*) from table where user=X will use an
> >Index, but will still need to fetch the rows proper to validate them.
> 
> I have an index on three (out of 7 columns) of this table.  Is there
> any chance to optimize indexing regarding this.
Well, that depends upon your query pattern. It's an art and a science
at the same time ;)
> 
> Well, to be honest I'm not really interested in the performance of
> count(*).  I was just discussing general performance issues on the
> phone line and when my colleague asked me about the size of the
> database he just wonderd why this takes so long for a job his
> MS-SQL server is much faster.  So in principle I was just asking
> a first question that is easy to ask.  Perhaps I come up with
> more difficult optimisation questions.

Simple. MSSQL is optimized for this case, and uses "older"
datastructures. PG uses a MVCC storage, which is not optimized for
this usecase. It's quite fast for different kinds of queries.

The basic trouble here is that MVCC makes it a little harder to decide
what is valid for your transaction, plus the indexes seem to be
designed for lookup, not for data fetching. (Basically, PG can use
indexes only to locate potential data, but cannot return data directly
out of an index)

Andreas



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Andreas Tille

On Thu, 22 Mar 2007, Andreas Kostyrka wrote:


Which version of PG?


Ahh, sorry, forgot that.  The issue occurs in Debian (Etch) packaged
version 7.4.16.  I plan to switch soon to 8.1.8.


That's the reason why PG (check the newest releases, I seem to
remember that there has been some aggregate optimizations there),


I'll verify this once I moved to the new version.


does
a SeqScan for select count(*) from table. btw, depending upon your
data, doing a select count(*) from table where user=X will use an
Index, but will still need to fetch the rows proper to validate them.


I have an index on three (out of 7 columns) of this table.  Is there
any chance to optimize indexing regarding this.

Well, to be honest I'm not really interested in the performance of
count(*).  I was just discussing general performance issues on the
phone line and when my colleague asked me about the size of the
database he just wonderd why this takes so long for a job his
MS-SQL server is much faster.  So in principle I was just asking
a first question that is easy to ask.  Perhaps I come up with
more difficult optimisation questions.

Kind regards

Andreas.

--
http://fam-tille.de



Re: [PERFORM] Performance of count(*)

2007-03-22 Thread ismo . tuononen

approximated count?

why? who would need it? where you can use it?

calculating costs and deciding how to execute a query needs an 
approximated count, but it's totally worthless information for any user 
IMO.

Ismo

On Thu, 22 Mar 2007, Albert Cervera Areny wrote:

> As you can see, PostgreSQL needs to do a sequencial scan to count because its 
> MVCC nature and indices don't have transaction information. It's a known 
> drawback inherent to the way PostgreSQL works and which gives very good 
> results in other areas. It's been talked about adding some kind of 
> approximated count which wouldn't need a full table scan but I don't think 
> there's anything there right now.
> 
> A Dijous 22 Març 2007 11:53, Andreas Tille va escriure:
> > Hi,
> >
> > I just try to find out why a simple count(*) might last that long.
> > At first I tried explain, which rather quickly knows how many rows
> > to check, but the final count is two orders of magnitude slower.
> >
> > My MS_SQL server using colleague can't believe that.
> >
> > $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
> >QUERY PLAN
> > ---
> >   Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
> > ->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> > (2 rows)
> >
> > real0m0.066s
> > user0m0.024s
> > sys 0m0.008s
> >
> > $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
> >count
> > -
> >   4708941
> > (1 row)
> >
> > real0m4.474s
> > user0m0.036s
> > sys 0m0.004s
> >
> >
> > Any explanation?
> >
> > Kind regards
> >
> >   Andreas.
> 
> -- 
> Albert Cervera Areny
> Dept. Informàtica Sedifa, S.L.
> 
> Av. Can Bordoll, 149
> 08202 - Sabadell (Barcelona)
> Tel. 93 715 51 11
> Fax. 93 715 51 12
> 
> 


Re: [PERFORM] Performance of count(*)

2007-03-22 Thread ismo . tuononen

explain is just "guessing" how many rows are in the table. Sometimes the guess is 
right, sometimes it's just a rough estimate.

sailabdb=# explain SELECT count(*) from sl_tuote;
  QUERY PLAN  
--
 Aggregate  (cost=10187.10..10187.11 rows=1 width=0)
   ->  Seq Scan on sl_tuote  (cost=0.00..9806.08 rows=152408 width=0)
(2 rows)

sailabdb=# SELECT count(*) from sl_tuote;
 count 
---
 62073
(1 row)


so in that case explain estimates that sl_tuote table have 152408 rows, but 
there are only 62073 rows.

after analyze estimates are better:

sailabdb=# vacuum analyze sl_tuote;
VACUUM
sailabdb=# explain SELECT count(*) from sl_tuote;
 QUERY PLAN  
-
 Aggregate  (cost=9057.91..9057.92 rows=1 width=0)
   ->  Seq Scan on sl_tuote  (cost=0.00..8902.73 rows=62073 width=0)
(2 rows)

you can never trust that estimate; you must always count it!

Ismo

On Thu, 22 Mar 2007, Andreas Tille wrote:

> Hi,
> 
> I just try to find out why a simple count(*) might last that long.
> At first I tried explain, which rather quickly knows how many rows
> to check, but the final count is two orders of magnitude slower.
> 
> My MS_SQL server using colleague can't believe that.
> 
> $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
>   QUERY PLAN
> ---
>  Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
>->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> (2 rows)
> 
> real0m0.066s
> user0m0.024s
> sys 0m0.008s
> 
> $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
>   count
> -
>  4708941
> (1 row)
> 
> real0m4.474s
> user0m0.036s
> sys 0m0.004s
> 
> 
> Any explanation?
> 
> Kind regards
> 
>  Andreas.
> 
> -- 
> http://fam-tille.de
> 


Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Albert Cervera Areny
As you can see, PostgreSQL needs to do a sequential scan to count, because of 
its MVCC nature: indices don't have transaction information. It's a known 
drawback inherent to the way PostgreSQL works, one which gives very good 
results in other areas. There has been talk about adding some kind of 
approximated count which wouldn't need a full table scan, but I don't think 
there's anything there right now.

A Dijous 22 Març 2007 11:53, Andreas Tille va escriure:
> Hi,
>
> I just try to find out why a simple count(*) might last that long.
> At first I tried explain, which rather quickly knows how many rows
> to check, but the final count is two orders of magnitude slower.
>
> My MS_SQL server using colleague can't believe that.
>
> $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
>QUERY PLAN
> ---
>   Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
> ->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> (2 rows)
>
> real0m0.066s
> user0m0.024s
> sys 0m0.008s
>
> $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
>count
> -
>   4708941
> (1 row)
>
> real0m4.474s
> user0m0.036s
> sys 0m0.004s
>
>
> Any explanation?
>
> Kind regards
>
>   Andreas.

-- 
Albert Cervera Areny
Dept. Informàtica Sedifa, S.L.

Av. Can Bordoll, 149
08202 - Sabadell (Barcelona)
Tel. 93 715 51 11
Fax. 93 715 51 12



---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Performance of count(*)

2007-03-22 Thread Andreas Kostyrka
* Andreas Tille <[EMAIL PROTECTED]> [070322 12:07]:
> Hi,
> 
> I'm just trying to find out why a simple count(*) takes that long.
> At first I tried EXPLAIN, which rather quickly knows how many rows
> to check, but the actual count is two orders of magnitude slower.

Which version of PG?

The basic problem is that EXPLAIN knows quickly, because it has its
statistics.

The SELECT proper, OTOH, has to go through the whole table to make
sure which rows are visible to your transaction.

That's the reason why PG does a SeqScan for select count(*) from
table (check the newest releases; I seem to remember there have been
some aggregate optimizations there). Btw, depending upon your data,
doing a select count(*) from table where user=X will use an
index, but will still need to fetch the rows proper to validate them.
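A hypothetical illustration of that last point, using the table from earlier in the thread (the userid column and value are invented for the example; actual plans depend on your data and indexes):

```sql
-- Unqualified count: forced to scan every row sequentially.
EXPLAIN SELECT count(*) FROM agiraw;

-- Qualified count: the planner may use an index on userid to find
-- candidate rows, but each match must still be fetched from the heap
-- to check MVCC visibility for this transaction.
EXPLAIN SELECT count(*) FROM agiraw WHERE userid = 42;
```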

Andreas

> 
> My colleague, who uses MS SQL Server, can't believe that.
> 
> $ psql InfluenzaWeb -c 'explain SELECT count(*) from agiraw ;'
>   QUERY PLAN 
> ---
>  Aggregate  (cost=196969.77..196969.77 rows=1 width=0)
>->  Seq Scan on agiraw  (cost=0.00..185197.41 rows=4708941 width=0)
> (2 rows)
> 
> real0m0.066s
> user0m0.024s
> sys 0m0.008s
> 
> $ psql InfluenzaWeb -c 'SELECT count(*) from agiraw ;'
>   count
> ---------
>  4708941
> (1 row)
> 
> real0m4.474s
> user0m0.036s
> sys 0m0.004s
> 
> 
> Any explanation?
> 
> Kind regards
> 
>  Andreas.
> 
> -- 
> http://fam-tille.de
> 
> ---(end of broadcast)---
> TIP 7: You can help support the PostgreSQL project by donating at
> 
>http://www.postgresql.org/about/donate

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Performance of count(*) on large tables vs SQL Server

2005-02-01 Thread PFC

> clause will be a cheap query - and use it to test if
> a table is empty, for instance. (because for
> Oracle/Sybase/SQL Server, count(*) is cheap).

	To test if a table is empty, use SELECT EXISTS or any SELECT with
a LIMIT 1...
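Both variants stop after the first row instead of scanning the whole table (table name borrowed from earlier in the thread):

```sql
-- Returns true/false without counting anything.
SELECT EXISTS (SELECT 1 FROM agiraw);

-- Equivalent idea: fetch at most one row; an empty result
-- set means the table is empty.
SELECT 1 FROM agiraw LIMIT 1;
```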

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match


Re: [PERFORM] Performance of count(*) on large tables vs SQL Server

2005-02-01 Thread Stef

Hello Andrew,
Everything that Shridhar says makes perfect
sense, and, speaking from experience in dealing with
this type of 'problem', everything you say does as 
well. Such is life really :)

I would not be at -all- surprised if Sybase
and Oracle did query re-writing behind the scene's
to send un-defined count's to a temporary table which
holds the row count. For an example of such done in
postgreSQL (using triggers and a custom procedure)
look into the 'General Bits' newsletter. Specifically
http://www.varlena.com/varlena/GeneralBits/49.php

I know, giving a URL as an answer 'sucks', but,
well, it simply repeats my experience. Triggers and
Procedures.
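A minimal sketch of that trigger-and-procedure approach (table, trigger, and function names are invented for the example; note the single counter row serializes concurrent writers, so this trades write concurrency for a cheap count):

```sql
-- Side table holding one counter row per tracked table.
CREATE TABLE rowcount (tbl name PRIMARY KEY, cnt bigint NOT NULL);
INSERT INTO rowcount VALUES ('agiraw', 0);

-- Keep the counter in sync on every insert and delete.
CREATE OR REPLACE FUNCTION count_rows() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE rowcount SET cnt = cnt + 1 WHERE tbl = TG_RELNAME;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE rowcount SET cnt = cnt - 1 WHERE tbl = TG_RELNAME;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER agiraw_count AFTER INSERT OR DELETE ON agiraw
    FOR EACH ROW EXECUTE PROCEDURE count_rows();

-- The count is now a single-row lookup instead of a table scan:
SELECT cnt FROM rowcount WHERE tbl = 'agiraw';
```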

Regards
Steph

On Tue, Feb 01, 2005 at 06:32:56PM +0530, Shridhar Daithankar wrote:
> On Tuesday 01 Feb 2005 6:11 pm, Andrew Mayo wrote:
> > PG, on the other hand, appears to do a full table scan
> > to answer this question, taking nearly 4 seconds to
> > process the query.
> >
> > Doing an ANALYZE on the table and also VACUUM did not
> > seem to affect this.
> >
> > Can PG find a table's row count more efficiently?.
> > This is not an unusual practice in commercial
> > applications which assume that count(*) with no WHERE
> > clause will be a cheap query  - and use it to test if
> > a table is empty, for instance. (because for
> > Oracle/Sybase/SQL Server, count(*) is cheap).
> 
> First of all, such an assumption is no good. It will hurt concurrency under 
> heavy load, but I know people do use it.
> 
> For the specific question, after a vacuum analyze, you can use 
> 
> select reltuples from pg_class where relname='Foo';
> 
> Remember, you will get different results between 'analyze' and 'vacuum 
> analyze', since the latter actually visits every page in the table and hence 
> is expected to be more accurate.
> 
> > (sure, I appreciate there are other ways of doing
> > this, but I am curious about the way PG works here).
> 
> The answer is MVCC and PG's inability to use an index alone. This has been a 
> FAQ for a long time. Furthermore, PG has custom aggregates to complicate the 
> matter.
> 
> Most of the pg developers/users think that an unqualified select count(*) is 
> of no use. You can search the archives for more details.
> 
>  HTH
> 
>  Shridhar
> 
> ---(end of broadcast)---
> TIP 6: Have you searched our list archives?
> 
>http://archives.postgresql.org
> 




Re: [PERFORM] Performance of count(*) on large tables vs SQL Server

2005-02-01 Thread Shridhar Daithankar
On Tuesday 01 Feb 2005 6:11 pm, Andrew Mayo wrote:
> PG, on the other hand, appears to do a full table scan
> to answer this question, taking nearly 4 seconds to
> process the query.
>
> Doing an ANALYZE on the table and also VACUUM did not
> seem to affect this.
>
> Can PG find a table's row count more efficiently?.
> This is not an unusual practice in commercial
> applications which assume that count(*) with no WHERE
> clause will be a cheap query  - and use it to test if
> a table is empty, for instance. (because for
> Oracle/Sybase/SQL Server, count(*) is cheap).

First of all, such an assumption is no good. It will hurt concurrency under 
heavy load, but I know people do use it.

For the specific question, after a vacuum analyze, you can use 

select reltuples from pg_class where relname='Foo';

Remember, you will get different results between 'analyze' and 'vacuum 
analyze', since the latter actually visits every page in the table and hence is 
expected to be more accurate.
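A sketch of the difference (same lookup after each command; the post-VACUUM figure is expected to track the true count more closely):

```sql
ANALYZE "Foo";          -- samples pages; reltuples is a sampled estimate
SELECT reltuples FROM pg_class WHERE relname = 'Foo';

VACUUM ANALYZE "Foo";   -- visits every page; reltuples is near-exact
SELECT reltuples FROM pg_class WHERE relname = 'Foo';
```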

> (sure, I appreciate there are other ways of doing
> this, but I am curious about the way PG works here).

The answer is MVCC and PG's inability to use an index alone. This has been a 
FAQ for a long time. Furthermore, PG has custom aggregates to complicate the 
matter.

Most of the pg developers/users think that an unqualified select count(*) is of 
no use. You can search the archives for more details.

 HTH

 Shridhar

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org