Re: [PERFORM] Comparative performance

2005-09-28 Thread Magnus Hagander
> > It appears that PostgreSQL is two to three times slower than MySQL.
> > For example, some pages that have some 30,000 characters (when saved
> > as HTML) take 1 to 1 1/2 seconds with MySQL but 3 to 4 seconds with
> > PostgreSQL.  I had read that the former was generally faster than the
> > latter, particularly for simple web applications but I was hoping that
> > Postgres' performance would not be that noticeably slower.
> 
> Are you comparing PostgreSQL on XP to MySQL on XP or 
> PostgreSQL on Linux to MySQL on Linux? Our performance on XP 
> is not great. Also, which version of PostgreSQL are you using?

That actually depends a lot on *how* you use it. I've seen pg-on-windows
deployments that come within a few percent of the Linux performance.
I've also seen some that are absolutely horrible by comparison.

One sure way to kill performance is to open a lot of short-lived
connections. Using persistent connections is even more important on
Windows than it is on Unix. It could easily explain a difference like
this.

//Magnus



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-28 Thread Ron Peacetree
>From: "Jeffrey W. Baker" <[EMAIL PROTECTED]>
>Sent: Sep 29, 2005 12:27 AM
>To: Ron Peacetree <[EMAIL PROTECTED]>
>Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>You are engaging in a lengthy and verbose exercise in mental
>masturbation, because you have not yet given a concrete example of a
>query where this stuff would come in handy.  A common, general-purpose
>case would be the best.
> 
??? I posted =3= specific classes of common, general-purpose query
operations where OES and the OES Btrees look like they should be
superior to current methods:
1= when splitting sorting or other operations across multiple CPUs
2= when doing joins of different tables by doing the join on these Btrees
rather than the original tables.
3= when the opportunity arises to reuse OES Btree results of previous
sorts for different keys in the same table.  Now we can combine the
existing Btrees to obtain the new order based on the composite key
without ever manipulating the original, much larger, table.  

In what way are these examples not "concrete"?


>We can all see that the method you describe might be a good way to sort
>a very large dataset with some known properties, which would be fine if
>you are trying to break the terasort benchmark.  But that's not what
>we're doing here.  We are designing and operating relational databases.
>So please explain the application.
>
This is a GENERAL method.  It's based on CPU cache efficient Btrees that
use variable length prefix keys and RIDs.
It assumes NOTHING about the data or the system in order to work.
I gave some concrete examples for the sake of easing explanation, NOT
as an indication of assumptions or limitations of the method.  I've even
gone out of my way to prove that no such assumptions or limitations exist.
Where in the world are you getting such impressions?
  

>Your main example seems to focus on a large table where a key column has
>constrained values.  This case is interesting in proportion to the
>number of possible values.  If I have billions of rows, each having one
>of only two values, I can think of a trivial and very fast method of
>returning the table "sorted" by that key: make two sequential passes,
>returning the first value on the first pass and the second value on the
>second pass.  This will be faster than the method you propose.
>
1= No, that was not my main example.  It was the simplest example, used to
frame the later, more complicated examples.  Please don't get hung up on it.

2= You are incorrect.  Since IO is the most expensive operation we can do,
any method that makes two passes through the data at top scanning speed
will take at least 2x as long as any method that only takes one such pass.


>I think an important aspect you have failed to address is how much of
>the heap you must visit after the sort is complete.  If you are
>returning every tuple in the heap then the optimal plan will be very
>different from the case when you needn't.  
>
Hmmm.  Not sure which "heap" you are referring to, but the OES Btree
index is provably the lowest (in terms of tree height) and smallest
possible CPU cache efficient data structure that one can make and still
have all of the traditional benefits associated with a Btree representation
of a data set.

Nonetheless, returning a RID, or all RIDs with(out) the same Key, or all
RIDs (not) within a range of Keys, or simply all RIDs in sorted order is
efficient.  Just as it should be for a Btree (actually it's a B+ tree variant,
to use Knuth's nomenclature).  I'm sure someone posting from acm.org
recognizes how each of these Btree operations maps to various SQL
features...  

I haven't been talking about query plans because they are orthogonal to
the issue under discussion.  If we use a layered model for PostgreSQL's
architecture, this functionality is more primal than that of a query
planner.  ALL query plans that currently involve sorts will benefit from a
more efficient way to do, or avoid, sorts.


>PS: Whatever mailer you use doesn't understand or respect threading nor
>attribution.  Out of respect for the list's readers, please try a mailer
>that supports these 30-year-old fundamentals of electronic mail.
>
That is an issue of infrastructure on the receiving side, not on the sending
(my) side, since even my web mailer seems appropriately RFC-conformant.
Everything seems to be going in the correct places and being properly 
organized on archival.postgres.org ...

Ron



Re: [PERFORM] Comparative performance

2005-09-28 Thread Dennis Bjorklund
On Wed, 28 Sep 2005, Joe wrote:

> Before I post the EXPLAIN and the table schema I'd appreciate
> confirmation that this list is the appropriate forum. 

It is, and useful things to show are:

 * the slow query
 * EXPLAIN ANALYZE of the query
 * the output of \d for each table involved in the query
 * the output of SHOW ALL;
 * the amount of memory the machine has

The settings that are most important to tune in postgresql.conf for
performance are, in my opinion, shared_buffers, effective_cache_size and
(to a lesser extent) work_mem.
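
For instance, gathering that in psql could look like this (a minimal
sketch; the table names and the query are placeholders for the real ones):

  -- for each table involved in the slow query:
  \d item
  -- current configuration:
  SHOW ALL;
  -- the actual slow query, with real timings:
  EXPLAIN ANALYZE
  SELECT i.*, c.name
    FROM item i
    JOIN category c ON c.category_id = i.category_id
   WHERE i.item_id = 42;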

-- 
/Dennis Björklund




Sequential I/O Cost (was Re: [PERFORM] A Better External Sort?)

2005-09-28 Thread Jeffrey W. Baker
On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:
> >From: "Jeffrey W. Baker" <[EMAIL PROTECTED]>
> >Perhaps I believe this because you can now buy as much sequential I/O
> >as you want.  Random I/O is the only real savings.
> >
> 1= No, you can not "buy as much sequential IO as you want".  Even if
> with an infinite budget, there are physical and engineering limits.  Long
> before you reach those limits, you will pay exponentially increasing costs
> for linearly increasing performance gains.  So even if you _can_ buy a
> certain level of sequential IO, it may not be the most efficient way to
> spend money.

This is just false.  You can buy sequential I/O for linear money up to
and beyond your platform's main memory bandwidth.  Even 1GB/sec will
severely tax memory bandwidth of mainstream platforms, and you can
achieve this rate for a modest cost.  

I have one array that can supply this rate and it has only 15 disks.  It
would fit on my desk.  I think your dire talk about the limits of
science and engineering may be a tad overblown.




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-28 Thread Jeffrey W. Baker
On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:
> >From: "Jeffrey W. Baker" <[EMAIL PROTECTED]>
> >Sent: Sep 27, 2005 1:26 PM
> >To: Ron Peacetree <[EMAIL PROTECTED]>
> >Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
> >
> >On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
> >
> >>That Btree can be used to generate a physical reordering of the data
> >>in one pass, but that's the weakest use for it.  The more powerful
> >>uses involve allowing the Btree to persist and using it for more
> >>efficient re-searches or combining it with other such Btrees (either as
> >>a step in task distribution across multiple CPUs or as a more efficient
> >>way to do things like joins by manipulating these Btrees rather than
> >>the actual records.)
> >
> >Maybe you could describe some concrete use cases.  I can see what
> >you are getting at, and I can imagine some advantageous uses, but
> >I'd like to know what you are thinking.
> >
> >Specifically I'd like to see some cases where this would beat sequential
> >scan.  I'm thinking that in your example of a terabyte table with a
> >column having only two values, all the queries I can think of would be
> >better served with a sequential scan.
> >
> In my original example, a sequential scan of the 1TB of 2KB or 4KB
> records, => 250M or 500M records of data, being sorted on a binary
> value key will take ~1000x more time than reading in the ~1GB Btree
> I described that used a Key+RID (plus node pointers) representation
> of the data.

You are engaging in a lengthy and verbose exercise in mental
masturbation, because you have not yet given a concrete example of a
query where this stuff would come in handy.  A common, general-purpose
case would be the best.

We can all see that the method you describe might be a good way to sort
a very large dataset with some known properties, which would be fine if
you are trying to break the terasort benchmark.  But that's not what
we're doing here.  We are designing and operating relational databases.
So please explain the application.

Your main example seems to focus on a large table where a key column has
constrained values.  This case is interesting in proportion to the
number of possible values.  If I have billions of rows, each having one
of only two values, I can think of a trivial and very fast method of
returning the table "sorted" by that key: make two sequential passes,
returning the first value on the first pass and the second value on the
second pass.  This will be faster than the method you propose.

I think an important aspect you have failed to address is how much of
the heap you must visit after the sort is complete.  If you are
returning every tuple in the heap then the optimal plan will be very
different from the case when you needn't.  

-jwb

PS: Whatever mailer you use doesn't understand or respect threading nor
attribution.  Out of respect for the list's readers, please try a mailer
that supports these 30-year-old fundamentals of electronic mail.




Re: [PERFORM] Comparative performance

2005-09-28 Thread Gavin Sherry
On Wed, 28 Sep 2005, Joe wrote:

> I'm converting a relatively small database (2 MB) from MySQL to PostgreSQL.
> It is used to generate web pages using PHP.  Although the actual website runs
> under Linux, the development is done under XP.  I've completed most of the data
> conversion and rewrite of the PHP scripts, so now I'm comparing relative
> performance.
>
> It appears that PostgreSQL is two to three times slower than MySQL.  For
> example, some pages that have some 30,000 characters (when saved as HTML)
> take 1 to 1 1/2 seconds with MySQL but 3 to 4 seconds with PostgreSQL.  I had
> read that the former was generally faster than the latter, particularly for
> simple web applications but I was hoping that Postgres' performance would not
> be that noticeably slower.

Are you comparing PostgreSQL on XP to MySQL on XP or PostgreSQL on Linux
to MySQL on Linux? Our performance on XP is not great. Also, which version
of PostgreSQL are you using?

>
> I'm trying to determine if the difference can be attributed to anything that
> I've done or missed.  I've run VACUUM ANALYZE on the two main tables and I'm
> looking at the results of EXPLAIN on the query that drives the retrieval of
> probably 80% of the data for the pages in question.

Good.

>
> Before I post the EXPLAIN and the table schema I'd appreciate confirmation
> that this list is the appropriate forum.  I'm a relative newcomer to PostgreSQL
> (but not to relational databases), so I'm not sure if this belongs in the
> novice or general lists.

You can post the results of EXPLAIN ANALYZE here. Please include the schema
definitions and the query string(s) themselves.

Thanks,

Gavin



[PERFORM] Comparative performance

2005-09-28 Thread Joe
I'm converting a relatively small database (2 MB) from MySQL to PostgreSQL.  It 
is used to generate web pages using PHP.  Although the actual website runs under 
Linux, the development is done under XP.  I've completed most of the data 
conversion and rewrite of the PHP scripts, so now I'm comparing relative 
performance.


It appears that PostgreSQL is two to three times slower than MySQL.  For 
example, some pages that have some 30,000 characters (when saved as HTML) take 1 
to 1 1/2 seconds with MySQL but 3 to 4 seconds with PostgreSQL.  I had read that 
the former was generally faster than the latter, particularly for simple web 
applications but I was hoping that Postgres' performance would not be that 
noticeably slower.


I'm trying to determine if the difference can be attributed to anything that 
I've done or missed.  I've run VACUUM ANALYZE on the two main tables and I'm 
looking at the results of EXPLAIN on the query that drives the retrieval of 
probably 80% of the data for the pages in question.


Before I post the EXPLAIN and the table schema I'd appreciate confirmation that 
this list is the appropriate forum.  I'm a relative newcomer to PostgreSQL (but 
not to relational databases), so I'm not sure if this belongs in the novice or 
general lists.


Joe




Re: [PERFORM] Slow concurrent update of same row in a given table

2005-09-28 Thread Gavin Sherry
On Thu, 29 Sep 2005, Rajesh Kumar Mallah wrote:

> On 9/29/05, Gavin Sherry <[EMAIL PROTECTED]> wrote:
> > On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote:
> >
> > > > > Number of Copies | Update perl Sec
> > > > >
> > > > > 1  --> 119
> > > > > 2  ---> 59
> > > > > 3  --->  38
> > > > > 4  ---> 28
> > > > > 5 --> 22
> > > > > 6 --> 19
> > > > > 7 --> 16
> > > > > 8 --> 14
> > > > > 9 --> 11
> > > > > 10 --> 11
> > > > > 11 --> 10
> > > >
> > > > So, 11 instances result in 10 updated rows per second, database wide or
> > > > per instance? If it is per instance, then 11 * 10 is close to the
> > > > performance for one connection.
> > >
> > >
> > > Sorry do not understand the difference between "database wide"
> > > and "per instance"
> >
> > Per instance.
> >
> > >
> > > >
> > > > That being said, when you've got 10 connections fighting over one row, I
> > > > wouldn't be surprised if you had bad performance.
> > > >
> > > > Also, at 119 updates a second, you're more than doubling the table's
> > > > initial size (dead tuples) each second. How often are you vacuuming and
> > > > are you using vacuum or vacuum full?
> > >
> > >
> > > Yes I realize the obvious phenomenon now, (and the uselessness of the 
> > > script)
> > >  , we should not consider it a performance degradation.
> > >
> > > I am having performance issue in my live database thats why i tried to
> > > simulate the situation(may the the script was overstresser).
> > >
> > > My original problem is  that i send 100 000s of emails carrying a
> > > beacon for tracking readership every tuesday and on wednesday i see
> > > lot of the said query in pg_stat_activity each of these query update
> > > the SAME row that corresponds to the dispatch of last day and it is
> > > then i face the performance problem.
> > >
> > > I think i can only post further details next wednesday , please lemme
> > > know how should i be dealing with the situation if each the updates takes
> > > 100times more time that normal update duration.
> >
> > I see. These problems regularly come up in database design. The best thing
> > you can do is modify your database design/application such that instead of
> > incrementing a count in a single row, you insert a row into a table,
> > recording the 'dispatch_id'. Counting the number of rows for a given
> > dispatch id will give you your count.
> >
>
> sorry i will be accumulating huge amount of rows in seperate table
> with no extra info when i really want just the count. Do you have
> a better database design in mind?
>
> Also i encounter same problem in implementing read count of
> articles in sites and in counting banner impressions where same
> row get updated by multiple processes frequently.

As I said in private email, accumulating large numbers of rows is not a
problem. In your current application, you are write bound, not read bound.
I've designed many similar systems which have hundreds of millions of rows.
It takes a while to generate the count, but you just do it periodically in
non-busy periods.

With 8.1, constraint exclusion will give you significantly better
performance with this system, as well.
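
For reference, a rough sketch of what that could look like with 8.1's
table inheritance and constraint exclusion (the table, column, and
partition names here are illustrative only, not your actual schema):

  CREATE TABLE dispatch_hit (
      dispatch_id integer NOT NULL,
      hit_at      timestamp with time zone NOT NULL DEFAULT now()
  );

  -- one child table per month, carrying a CHECK constraint on the range
  CREATE TABLE dispatch_hit_2005_10 (
      CHECK (hit_at >= '2005-10-01' AND hit_at < '2005-11-01')
  ) INHERITS (dispatch_hit);

  -- with constraint_exclusion enabled, a count restricted to one month
  -- only scans that month's child table
  SET constraint_exclusion = on;
  SELECT count(*)
    FROM dispatch_hit
   WHERE dispatch_id = 114
     AND hit_at >= '2005-10-01' AND hit_at < '2005-11-01';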

Thanks,

Gavin



Re: [PERFORM] Slow concurrent update of same row in a given table

2005-09-28 Thread Rajesh Kumar Mallah
On 9/29/05, Gavin Sherry <[EMAIL PROTECTED]> wrote:
> On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote:
>
> > > > Number of Copies | Update perl Sec
> > > >
> > > > 1  --> 119
> > > > 2  ---> 59
> > > > 3  --->  38
> > > > 4  ---> 28
> > > > 5 --> 22
> > > > 6 --> 19
> > > > 7 --> 16
> > > > 8 --> 14
> > > > 9 --> 11
> > > > 10 --> 11
> > > > 11 --> 10
> > >
> > > So, 11 instances result in 10 updated rows per second, database wide or
> > > per instance? If it is per instance, then 11 * 10 is close to the
> > > performance for one connection.
> >
> >
> > Sorry do not understand the difference between "database wide"
> > and "per instance"
>
> Per instance.
>
> >
> > >
> > > That being said, when you've got 10 connections fighting over one row, I
> > > wouldn't be surprised if you had bad performance.
> > >
> > > Also, at 119 updates a second, you're more than doubling the table's
> > > initial size (dead tuples) each second. How often are you vacuuming and
> > > are you using vacuum or vacuum full?
> >
> >
> > Yes I realize the obvious phenomenon now, (and the uselessness of the 
> > script)
> >  , we should not consider it a performance degradation.
> >
> > I am having performance issue in my live database thats why i tried to
> > simulate the situation(may the the script was overstresser).
> >
> > My original problem is  that i send 100 000s of emails carrying a
> > beacon for tracking readership every tuesday and on wednesday i see
> > lot of the said query in pg_stat_activity each of these query update
> > the SAME row that corresponds to the dispatch of last day and it is
> > then i face the performance problem.
> >
> > I think i can only post further details next wednesday , please lemme
> > know how should i be dealing with the situation if each the updates takes
> > 100times more time that normal update duration.
>
> I see. These problems regularly come up in database design. The best thing
> you can do is modify your database design/application such that instead of
> incrementing a count in a single row, you insert a row into a table,
> recording the 'dispatch_id'. Counting the number of rows for a given
> dispatch id will give you your count.
>

Sorry, but I will be accumulating a huge number of rows in a separate table
with no extra info when I really want just the count. Do you have
a better database design in mind?

Also, I encounter the same problem in implementing read counts of
articles on sites and in counting banner impressions, where the same
row gets updated by multiple processes frequently.


Thanks & Regds
mallah.

> Thanks,
>
> Gavin
>



Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-28 Thread Ron Peacetree
If I've done this correctly, there should not be anywhere near
the number of context switches we currently see while sorting.

Each unscheduled context switch represents something unexpected
occurring, or things not being where they are needed when they are
needed.  Reducing such circumstances to the absolute minimum
was one of the design goals.

Reducing the total amount of IO to the absolute minimum should
help as well. 

Ron


-Original Message-
From: Kevin Grittner <[EMAIL PROTECTED]>
Sent: Sep 27, 2005 11:21 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

I can't help wondering how a couple thousand context switches per
second would affect the attempt to load disk info into the L1 and
L2 caches.  That's pretty much the low end of what I see when the
server is under any significant load.






Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-28 Thread Ron Peacetree
In the interest of efficiency and "not reinventing the wheel", does anyone know
where I can find C or C++ source code for a Btree variant with the following
properties:

A= Data elements (RIDs) are only stored in the leaves, Keys (actually
KeyPrefixes; see "D" below) and Node pointers are only stored in the internal
nodes of the Btree.

B= Element redistribution is done as an alternative to node splitting in
overflow conditions during Inserts whenever possible.

C= Variable length Keys are supported.

D= Node buffering with a reasonable replacement policy is supported.

E= Since we will know beforehand exactly how many RIDs will be stored, we
will know a priori how much space will be needed for leaves, and will know the
worst case for how much space will be required for the Btree internal nodes
as well.  This implies that we may be able to use an array, rather than a linked
list, implementation of the Btree.  Less pointer chasing at the expense of more
CPU calculations, but that's a trade-off in the correct direction.

Such source would be a big help in getting a prototype together.

Thanks in advance for any pointers or source,
Ron



Re: [PERFORM] Logarithmic change (decrease) in performance

2005-09-28 Thread Ron Peacetree
>From: Matthew Nuzum <[EMAIL PROTECTED]>
>Sent: Sep 28, 2005 4:02 PM
>Subject: [PERFORM] Logarithmic change (decrease) in performance
>
Small nit-pick:  A "logarithmic decrease" in performance would be
a relatively good thing, being better than either a linear or
exponential decrease in performance.  What you are describing (load
growing roughly like y=x^2) is much worse than logarithmic: a
_polynomial_ decrease in performance.

>Something interesting is going on. I wish I could show you the graphs,
>but I'm sure this will not be a surprise to the seasoned veterans.
>
>A particular application server I have has been running for over a
>year now. I've been logging cpu load since mid-april.
>
>It took 8 months or more to fall from excellent performance to
>"acceptable." Then, over the course of about 5 weeks it fell from
>"acceptable" to "so-so." Then, in the last four weeks it's gone from
>"so-so" to alarming.
>
>I've been working on this performance drop since Friday but it wasn't
>until I replied to Arnau's post earlier today that I remembered I'd
>been logging the server load. I grabbed the data and charted it in
>Excel and to my surprise, the graph of the server's load average looks
>kind of like the graph of y=x^2.
>
>I've got to make a recomendation for a solution to the PHB and my
>analysis is showing that as the dataset becomes larger, the amount of
>time the disk spends seeking is increasing. This causes processes to
>take longer to finish, which causes more processes to pile up, which
>causes processes to take longer to finish, which causes more processes
>to pile up etc. It is this growing dataset that seems to be the source
>of the sharp decrease in performance.
>
>I knew this day would come, but I'm actually quite surprised that when
>it came, there was little time between the warning and the grande
>finale. I guess this message is being sent to the list to serve as a
>warning to other data warehouse admins that when you reach your
>capacity, the downward spiral happens rather quickly.
>
Yep, definitely been where you are.  Bottom line: you have to reduce
the sequential seeking behavior of the system to within an acceptable
window and then keep it there.

1= keep more of the data set in RAM
2= increase the size of your HD IO buffers
3= make your RAID sets wider (more parallel vs sequential IO)
4= reduce the atomic latency of your RAID sets
(time for Fibre Channel 15Krpm HD's vs 7.2Krpm SATA ones?)
5= make sure your data is as unfragmented as possible
6= change your DB schema to minimize the problem
a= overall good schema design
b= partitioning the data so that the system only has to manipulate a
reasonable chunk of it at a time.

In many cases, there's a number of ways to accomplish the above.
Unfortunately, most of them require CapEx.

Also, in the real world such systems tend to have this as a chronic
problem.  This is not a "fix it once and it goes away forever" situation.  This
is a part of the regular maintenance and upgrade plan(s).

Good Luck,
Ron 



Re: [PERFORM] Slow concurrent update of same row in a given table

2005-09-28 Thread Gavin Sherry
On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote:

> > > Number of Copies | Update perl Sec
> > >
> > > 1  --> 119
> > > 2  ---> 59
> > > 3  --->  38
> > > 4  ---> 28
> > > 5 --> 22
> > > 6 --> 19
> > > 7 --> 16
> > > 8 --> 14
> > > 9 --> 11
> > > 10 --> 11
> > > 11 --> 10
> >
> > So, 11 instances result in 10 updated rows per second, database wide or
> > per instance? If it is per instance, then 11 * 10 is close to the
> > performance for one connection.
>
>
> Sorry do not understand the difference between "database wide"
> and "per instance"

Per instance.

>
> >
> > That being said, when you've got 10 connections fighting over one row, I
> > wouldn't be surprised if you had bad performance.
> >
> > Also, at 119 updates a second, you're more than doubling the table's
> > initial size (dead tuples) each second. How often are you vacuuming and
> > are you using vacuum or vacuum full?
>
>
> Yes I realize the obvious phenomenon now, (and the uselessness of the script)
>  , we should not consider it a performance degradation.
>
> I am having performance issue in my live database thats why i tried to
> simulate the situation(may the the script was overstresser).
>
> My original problem is  that i send 100 000s of emails carrying a
> beacon for tracking readership every tuesday and on wednesday i see
> lot of the said query in pg_stat_activity each of these query update
> the SAME row that corresponds to the dispatch of last day and it is
> then i face the performance problem.
>
> I think i can only post further details next wednesday , please lemme
> know how should i be dealing with the situation if each the updates takes
> 100times more time that normal update duration.

I see. These problems regularly come up in database design. The best thing
you can do is modify your database design/application such that instead of
incrementing a count in a single row, you insert a row into a table,
recording the 'dispatch_id'. Counting the number of rows for a given
dispatch id will give you your count.
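
A minimal sketch of that design (the table and column names are
illustrative only):

  CREATE TABLE dispatch_hit (
      dispatch_id integer NOT NULL,
      hit_at      timestamp with time zone NOT NULL DEFAULT now()
  );

  -- the tracking beacon does a plain insert instead of updating a counter row
  INSERT INTO dispatch_hit (dispatch_id) VALUES (114);

  -- the read count is computed when needed, e.g. periodically, off-peak
  SELECT count(*) FROM dispatch_hit WHERE dispatch_id = 114;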

Thanks,

Gavin



Re: [PERFORM] Monitoring Postgresql performance

2005-09-28 Thread Alex Stapleton


On 28 Sep 2005, at 15:32, Arnau wrote:

> Hi all,
>
>   I have been "googling" a bit searching info about a way to
> monitor postgresql (CPU & Memory, num processes, ... ) and I
> haven't found anything relevant. I'm using munin to monitor others
> parameters of my servers and I'd like to include postgresql or have
> a similar tool. Any of you is using anything like that? all kind of
> hints are welcome :-)
>
> Cheers!

Have you looked at SNMP? It's a bit complex but there's lots of tools
for monitoring system data / sending alerts based on SNMP already.





[PERFORM] Logarithmic change (decrease) in performance

2005-09-28 Thread Matthew Nuzum
Something interesting is going on. I wish I could show you the graphs,
but I'm sure this will not be a surprise to the seasoned veterans.

A particular application server I have has been running for over a
year now. I've been logging cpu load since mid-april.

It took 8 months or more to fall from excellent performance to
"acceptable." Then, over the course of about 5 weeks it fell from
"acceptable" to "so-so." Then, in the last four weeks it's gone from
"so-so" to alarming.

I've been working on this performance drop since Friday but it wasn't
until I replied to Arnau's post earlier today that I remembered I'd
been logging the server load. I grabbed the data and charted it in
Excel and to my surprise, the graph of the server's load average looks
kind of like the graph of y=x^2.

I've got to make a recommendation for a solution to the PHB and my
analysis is showing that as the dataset becomes larger, the amount of
time the disk spends seeking is increasing. This causes processes to
take longer to finish, which causes more processes to pile up, which
causes processes to take longer to finish, which causes more processes
to pile up, etc. It is this growing dataset that seems to be the source
of the sharp decrease in performance.

I knew this day would come, but I'm actually quite surprised that when
it came, there was little time between the warning and the grand
finale. I guess this message is being sent to the list to serve as a
warning to other data warehouse admins that when you reach your
capacity, the downward spiral happens rather quickly.

Crud... Outlook just froze while composing the PHB memo. I've been
working on that for an hour. What a bad day.
--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Monitoring Postgresql performance

2005-09-28 Thread Matthew Nuzum
On 9/28/05, Arnau <[EMAIL PROTECTED]> wrote:
> Hi all,
>
>I have been "googling" a bit searching info about a way to monitor
> postgresql (CPU & Memory, num processes, ... ) and I haven't found
> anything relevant. I'm using munin to monitor others parameters of my
> servers and I'd like to include postgresql or have a similar tool. Any
> of you is using anything like that? all kind of hints are welcome :-)
>
> Cheers!
> --
> Arnau

I have a cronjob that runs every 5 minutes and checks the number of
processes. When things get unruly I get a text message sent to my cell
phone. It also creates a detailed log entry. I'll paste in an example
of one of my scripts that does this below. This is on a dual purpose
server and monitors both cpu load average and postgres. You can have
the text message sent to multiple email addresses; just put a
space-separated list of e-mail addresses between quotes in the
CONTACTS= line. It's simple, but it works, and it's always nice to
know when there's a problem *before the boss discovers it* ;-)

#!/bin/bash
# (assuming bash, since the script relies on (( )) arithmetic below)

# Create some messages
HOSTNAME=`hostname`
WARNING_DB="Database connections on $HOSTNAME is rather high"
WARNING_CPU="CPU load on $HOSTNAME is rather high"
CONTACTS="[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]"
WARN=0

#calculate the db load
DB_LOAD=`ps -ax | grep postgres | wc -l`
if (($DB_LOAD > 150))
then
  WARN=1
  echo "$WARNING_DB ($DB_LOAD) " | mail -s "db_load is high
($DB_LOAD)" $CONTACTS
fi

#calculate the processor load
CPU_LOAD=`cat /proc/loadavg | cut --delimiter=" " -f 2 | cut
--delimiter="." -f 1`
if (($CPU_LOAD > 8))
then
  WARN=1
  echo "$WARNING_CPU ($CPU_LOAD) " | mail -s "CPU_load is high
($CPU_LOAD)" $CONTACTS
fi

if (($WARN > 0))
then
  echo -=-=-=-=-=-=-=-=- W A R N I N G -=-=-=-=-=-=-=-=- >> /tmp/warn.txt
  NOW=`date`
  echo -=-=-=-=-=-$NOW-=-=-=-=-=- >> /tmp/warn.txt
  echo CPU LOAD: $CPU_LOAD DB LOAD: $DB_LOAD >> /tmp/warn.txt
  echo >> /tmp/warn.txt
  top -bn 1 >> /tmp/warn.txt
  echo >> /tmp/warn.txt
fi

NOW=`date`
CPU_LOAD=`cat /proc/loadavg | cut --delimiter=" " -f 1,2,3
--output-delimiter=\|`
echo -e $NOW\|$CPU_LOAD\|$DB_LOAD >> ~/LOAD_MONITOR.LOG

--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Slow concurrent update of same row in a given table

2005-09-28 Thread Rajesh Kumar Mallah
On 9/28/05, Gavin Sherry <[EMAIL PROTECTED]> wrote:
> On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote:
>
> > Hi
> >
> > While doing some stress testing for updates in a small sized table
> > we found the following results. We are not too happy about the speed
> > of the updates particularly at high concurrency (10 clients).
> >
> > Initially we get 119 updates / sec but it drops to 10 updates/sec
> > as concurrency is increased.
> >
> > PostgreSQL: 8.0.3
> > ---
> > TABLE STRUCTURE: general.stress
> > ---
> > | dispatch_id  | integer  | not null  |
> > | query_id | integer  |   |
> > | generated| timestamp with time zone |   |
> > | unsubscribes | integer  |   |
> > | read_count   | integer  |   |
> > | status   | character varying(10)|   |
> > | bounce_tracking  | boolean  |   |
> > | dispatch_hour| integer  |   |
> > | dispatch_date_id | integer  |   |
> > +--+--+---+
> > Indexes:
> > "stress_pkey" PRIMARY KEY, btree (dispatch_id)
> >
> > UPDATE STATEMENT:
> > update general.stress set read_count=read_count+1 where dispatch_id=114
>
> This means you are updating only one row, correct?

Correct.


>
> > Number of Copies | Update perl Sec
> >
> > 1  --> 119
> > 2  ---> 59
> > 3  --->  38
> > 4  ---> 28
> > 5 --> 22
> > 6 --> 19
> > 7 --> 16
> > 8 --> 14
> > 9 --> 11
> > 10 --> 11
> > 11 --> 10
>
> So, 11 instances result in 10 updated rows per second, database wide or
> per instance? If it is per instance, then 11 * 10 is close to the
> performance for one connection.


Sorry do not understand the difference between "database wide"
and "per instance"

>
> That being said, when you've got 10 connections fighting over one row, I
> wouldn't be surprised if you had bad performance.
>
> Also, at 119 updates a second, you're more than doubling the table's
> initial size (dead tuples) each second. How often are you vacuuming and
> are you using vacuum or vacuum full?


Yes, I realize the obvious phenomenon now (and the uselessness of the
script); we should not consider it a performance degradation.

I am having a performance issue in my live database; that's why I tried to
simulate the situation (maybe the script was an overstresser).

My original problem is that I send 100,000s of emails carrying a
beacon for tracking readership every Tuesday, and on Wednesday I see
a lot of the said query in pg_stat_activity. Each of these queries updates
the SAME row that corresponds to the previous day's dispatch, and it is
then that I face the performance problem.

I think I can only post further details next Wednesday. Please let me
know how I should be dealing with the situation if each of the updates takes
100 times longer than the normal update duration.

Best Regards
Mallah.


>
> Gavin
>



Re: [PERFORM] Monitoring Postgresql performance

2005-09-28 Thread Dan Harris


On Sep 28, 2005, at 8:32 AM, Arnau wrote:

> Hi all,
>
>   I have been "googling" a bit searching info about a way to
> monitor postgresql (CPU & Memory, num processes, ... )

You didn't mention your platform, but I have an xterm open pretty
much continuously for my DB server that runs plain old top.  I have
customized my settings enough that I can pretty much see anything I
need to from there.


-Dan




Re: [HACKERS] [PERFORM] A Better External Sort?

2005-09-28 Thread Ron Peacetree
>From: "Jeffrey W. Baker" <[EMAIL PROTECTED]>
>Sent: Sep 27, 2005 1:26 PM
>To: Ron Peacetree <[EMAIL PROTECTED]>
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
>
>>That Btree can be used to generate a physical reordering of the data
>>in one pass, but that's the weakest use for it.  The more powerful
>>uses involve allowing the Btree to persist and using it for more
>>efficient re-searches or combining it with other such Btrees (either as
>>a step in task distribution across multiple CPUs or as a more efficient
>>way to do things like joins by manipulating these Btrees rather than
>>the actual records.)
>
>Maybe you could describe some concrete use cases.  I can see what
>you are getting at, and I can imagine some advantageous uses, but
>I'd like to know what you are thinking.
>
1= In a 4P box, we split the data in RAM into 4 regions and create
a CPU cache friendly Btree using the method I described for each CPU.
The 4 Btrees can be merged in a more time and space efficient manner
than the original records to form a Btree that represents the sorted order
of the entire data set.  Any of these Btrees can be allowed to persist to
lower the cost of doing similar operations in the future (Updating the
Btrees during inserts and deletes is cheaper than updating the original
data files and then redoing the same sort from scratch in the future.)
Both the original sort and future such sorts are made more efficient
than current methods.

2= We use my method to sort two different tables.  We now have these
very efficient representations of a specific ordering on these tables.  A
join operation can now be done using these Btrees rather than the
original data tables that involves less overhead than many current
methods.

3=  We have multiple such Btrees for the same data set representing
sorts done using different fields (and therefore different Keys).
Calculating a sorted order for the data based on a composition of
those Keys is now cheaper than doing the sort based on the composite
Key from scratch.  When some of the Btrees exist and some of them
do not, there is a tradeoff calculation to be made.  Sometimes it will be
cheaper to do the sort from scratch using the composite Key.   


>Specifically I'd like to see some cases where this would beat sequential
>scan.  I'm thinking that in your example of a terabyte table with a
>column having only two values, all the queries I can think of would be
>better served with a sequential scan.
>
In my original example, a sequential scan of the 1TB of 2KB or 4KB
records, => 250M or 500M records of data, being sorted on a binary
value key will take ~1000x more time than reading in the ~1GB Btree
I described that used a Key+RID (plus node pointers) representation
of the data.
 
Just to clarify the point further,
1TB of 1B records => 2^40 records of at most 256 distinct values.
1TB of 2B records => 2^39 records of at most 2^16 distinct values.
1TB of 4B records => 2^38 records of at most 2^32 distinct values.
1TB of 5B records => 200B records of at most 200B distinct values.
From here on, the number of possible distinct values is limited by the
number of records.
100B records are used in the "Indy" version of Jim Gray's sorting 
contests, so 1TB => 10B records.
2KB-4KB is the most common record size I've seen in enterprise
class DBMS (so I used this value to make my initial example more
realistic).

Therefore, the vast majority of the time, representing a data set by Key
will use less space than the original records.  Less space used means
less IO to scan the data set, which means faster scan times.

This is why index files work in the first place, right?


>Perhaps I believe this because you can now buy as much sequential I/O
>as you want.  Random I/O is the only real savings.
>
1= No, you can not "buy as much sequential IO as you want".  Even
with an infinite budget, there are physical and engineering limits.  Long
before you reach those limits, you will pay exponentially increasing costs
for linearly increasing performance gains.  So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

2= Most RW IT professionals have far from an infinite budget.  Just traffic
on these lists shows how severe the typical cost constraints usually are.
OTOH, if you have an infinite IT budget, care to help a few less fortunate
than yourself?  After all, even a large constant subtracted from infinity
is still infinity... ;-)

3= No matter how fast you can do IO, IO remains the most expensive
part of the performance equation.  The fastest and cheapest IO you can
do is _no_ IO.  As long as we trade cheaper RAM and even cheaper CPU
operations for IO correctly, more space efficient data representations will
always be a Win because of this.


[PERFORM] Monitoring Postgresql performance

2005-09-28 Thread Arnau

Hi all,

  I have been "googling" a bit, searching for info about a way to monitor
postgresql (CPU & Memory, num processes, ...) and I haven't found
anything relevant. I'm using munin to monitor other parameters of my
servers and I'd like to include postgresql or have a similar tool. Are any
of you using anything like that? All kinds of hints are welcome :-)


Cheers!
--
Arnau




Re: [PERFORM] Slow concurrent update of same row in a given table

2005-09-28 Thread Gavin Sherry
On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote:

> Hi
>
> While doing some stress testing for updates in a small sized table
> we found the following results. We are not too happy about the speed
> of the updates particularly at high concurrency (10 clients).
>
> Initially we get 119 updates / sec but it drops to 10 updates/sec
> as concurrency is increased.
>
> PostgreSQL: 8.0.3
> ---
> TABLE STRUCTURE: general.stress
> ---
> | dispatch_id  | integer  | not null  |
> | query_id | integer  |   |
> | generated| timestamp with time zone |   |
> | unsubscribes | integer  |   |
> | read_count   | integer  |   |
> | status   | character varying(10)|   |
> | bounce_tracking  | boolean  |   |
> | dispatch_hour| integer  |   |
> | dispatch_date_id | integer  |   |
> +--+--+---+
> Indexes:
> "stress_pkey" PRIMARY KEY, btree (dispatch_id)
>
> UPDATE STATEMENT:
> update general.stress set read_count=read_count+1 where dispatch_id=114

This means you are updating only one row, correct?

> Number of Copies | Update perl Sec
>
> 1  --> 119
> 2  ---> 59
> 3  --->  38
> 4  ---> 28
> 5 --> 22
> 6 --> 19
> 7 --> 16
> 8 --> 14
> 9 --> 11
> 10 --> 11
> 11 --> 10

So, 11 instances result in 10 updated rows per second, database wide or
per instance? If it is per instance, then 11 * 10 is close to the
performance for one connection.

That being said, when you've got 10 connections fighting over one row, I
wouldn't be surprised if you had bad performance.

Also, at 119 updates a second, you're more than doubling the table's
initial size (dead tuples) each second. How often are you vacuuming and
are you using vacuum or vacuum full?
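
For what it's worth, one way to watch that dead-tuple buildup on the
test table is something like this (a sketch, using 8.0 syntax):

  VACUUM VERBOSE general.stress;
  -- the "removable" / "nonremovable row versions" figures in the output
  -- show how many dead row versions the updates have left behind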

Gavin



[PERFORM] Slow concurrent update of same row in a given table

2005-09-28 Thread Rajesh Kumar Mallah
Hi

While doing some stress testing for updates in a small-sized table
we found the following results. We are not too happy about the speed
of the updates, particularly at high concurrency (10 clients).

Initially we get 119 updates / sec but it drops to 10 updates/sec
as concurrency is increased.

PostgreSQL: 8.0.3
---
TABLE STRUCTURE: general.stress
---
|      Column      |           Type           | Modifiers |
+------------------+--------------------------+-----------+
| dispatch_id      | integer                  | not null  |
| query_id         | integer                  |           |
| generated        | timestamp with time zone |           |
| unsubscribes     | integer                  |           |
| read_count       | integer                  |           |
| status           | character varying(10)    |           |
| bounce_tracking  | boolean                  |           |
| dispatch_hour    | integer                  |           |
| dispatch_date_id | integer                  |           |
+------------------+--------------------------+-----------+
Indexes:
    "stress_pkey" PRIMARY KEY, btree (dispatch_id)

UPDATE STATEMENT:
update general.stress set read_count=read_count+1 where dispatch_id=114
TOOL USED: Perl/DBI, with prepared statement handles
CONCURRENCY METHOD: executing multiple copies of the same program
from different shells (Linux environment)
CLIENT SERVER LINK: 10/100 Mbits, LAN

CLIENT CODE: stress.pl
-
#!/opt/perl/bin/perl -I/usr/local/masonapache/lib/perl

#overview: update the table as fast as possible (while(1){})
#on every 100th commit , print the average update frequency
#of last 100 updates
##
use strict;
use Time::HiRes qw(gettimeofday tv_interval);
use Utils;
my $dbh = &Utils::db_connect();
my $sth = $dbh -> prepare("update general.stress set
read_count=read_count+1 where dispatch_id=114");
my $cnt=0;
my $t0 = [ gettimeofday ];
while (1) {
    $sth->execute();
    $dbh->commit();
    $cnt++;
    if ($cnt % 100 == 0) {
        my $t1 = [ gettimeofday ];
        my $elapsed = tv_interval( $t0, $t1 );
        $t0 = $t1;
        printf "Rate: %d updates / sec\n", 100.0/$elapsed;
    }
}
$sth->finish();
$dbh->disconnect();
--

--
RESULTS:
--

Number of Copies | Updates per Sec

 1 --> 119
 2 -->  59
 3 -->  38
 4 -->  28
 5 -->  22
 6 -->  19
 7 -->  16
 8 -->  14
 9 -->  11
10 -->  11
11 -->  10

-
Note that the table was vacuum analyzed during the tests
total number of records in table: 93
-

Regds
Rajesh Kumar Mallah.



[PERFORM] database bloat, but vacuums are done, and fsm seems to be setup ok

2005-09-28 Thread hubert depesz lubaczewski
Hi,

Setup: postgresql 8.0.3 on Debian, on a dual Xeon with 8GB RAM and hardware RAID.

The database just after recreation from a dump takes 15 gigabytes.
After some time (up to 3 weeks) it gets really slow and has to be dumped and restored.

as for fsm:
end of vacuum info:
INFO:  free space map: 248 relations, 1359140 pages stored; 1361856 total pages needed
DETAIL:  Allocated FSM size: 1000 relations + 1000 pages = 58659 kB shared memory.

So it looks like I have plenty of space in the fsm.

Vacuums run constantly.
4 different tasks, 3 of them doing:
while true
    vacuum table
    sleep 15m
done
with different tables (I have chosen the most-updated tables in the system).

And the fourth vacuum task does the same, but without specifying a table, so it vacuums the whole database.

After the last dump/restore cycle I noticed that doing a REINDEX on all
indices in the database made it drop in size from 40G to about 20G, so it
might be that I will be using REINDEX instead of dump/restore.
Anyway, I'm not using any special indices, just some (117 to be
exact) indices of btree type. We use simple, multi-column, partial and
multi-column partial indices. We do not have functional indices.
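
For what it's worth, a quick way to see which relations (tables and
indices) are taking the space, and to rebuild the worst offenders, is
something like this (a sketch; the table name is a placeholder, and
relpages is only as fresh as the last VACUUM/ANALYZE):

  SELECT relname, relkind, relpages
    FROM pg_class
   ORDER BY relpages DESC
   LIMIT 20;          -- 1 page = 8 kB with the default block size

  REINDEX TABLE some_bloated_table;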

The database has quite a huge load of updates, but I thought that vacuum would
guard me from database bloat; from what I observed, it seems that
vacuuming of b-tree indices is somewhat faulty.

Any suggestions? What else can I supply you with to help you help me?

best regards

depesz