Re: [PERFORM] Comparative performance
> > It appears that PostgreSQL is two to three times slower than MySQL. For example, some pages that have some 30,000 characters (when saved as HTML) take 1 to 1 1/2 seconds with MySQL but 3 to 4 seconds with PostgreSQL. I had read that the former was generally faster than the latter, particularly for simple web applications, but I was hoping that Postgres' performance would not be that noticeably slower. > Are you comparing PostgreSQL on XP to MySQL on XP, or PostgreSQL on Linux to MySQL on Linux? Our performance on XP is not great. Also, which version of PostgreSQL are you using? That actually depends a lot on *how* you use it. I've seen pg-on-windows deployments that come within a few percent of the Linux performance. I've also seen those that are absolutely horrible by comparison. One sure way to kill the performance is to do a lot of small connections. Using persistent connections is even more important on Windows than it is on Unix. It could easily explain a difference like this. //Magnus ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
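Magnus's point about connection churn can be illustrated without PostgreSQL at all: when each web request pays the full connection-setup cost, that cost dominates. A minimal sketch of a reuse-on-checkout pool — note the `connect` function here is a hypothetical stand-in for an expensive backend setup (TCP handshake, auth, process fork), not a real PostgreSQL API:

```python
# Toy connection pool: why persistent connections help.
# "connect" is a hypothetical stand-in for expensive setup work.

class Pool:
    def __init__(self, connect):
        self._connect = connect
        self._idle = []

    def acquire(self):
        # Reuse an idle connection when one exists; only pay the
        # setup cost when the pool is cold.
        return self._idle.pop() if self._idle else self._connect()

    def release(self, conn):
        self._idle.append(conn)

setup_calls = 0

def connect():
    global setup_calls
    setup_calls += 1
    return object()  # placeholder for a live connection

pool = Pool(connect)
for _ in range(100):      # 100 simulated web requests
    conn = pool.acquire()
    pool.release(conn)

print(setup_calls)  # 1 -- one setup serves all requests
```

With per-request connects the same loop would pay the setup cost 100 times, which is exactly the pattern Magnus warns about on Windows.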
Re: [HACKERS] [PERFORM] A Better External Sort?
>From: "Jeffrey W. Baker" <[EMAIL PROTECTED]> >Sent: Sep 29, 2005 12:27 AM >To: Ron Peacetree <[EMAIL PROTECTED]> >Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org >Subject: Re: [HACKERS] [PERFORM] A Better External Sort? > >You are engaging in a lengthy and verbose exercise in mental >masturbation, because you have not yet given a concrete example of a >query where this stuff would come in handy. A common, general-purpose >case would be the best. > ??? I posted =3= specific classes of common, general-purpose query operations where OES and the OES Btrees look like they should be superior to current methods: 1= when splitting sorting or other operations across multiple CPUs 2= when doing joins of different tables by doing the join on these Btrees rather than the original tables. 3= when the opportunity arises to reuse OES Btree results of previous sorts for different keys in the same table. Now we can combine the existing Btrees to obtain the new order based on the composite key without ever manipulating the original, much larger, table. In what way are these examples not "concrete"? >We can all see that the method you describe might be a good way to sort >a very large dataset with some known properties, which would be fine if >you are trying to break the terasort benchmark. But that's not what >we're doing here. We are designing and operating relational databases. >So please explain the application. > This is a GENERAL method. It's based on CPU cache efficient Btrees that use variable length prefix keys and RIDs. It assumes NOTHING about the data or the system in order to work. I gave some concrete examples for the sake of easing explanation, NOT as an indication of assumptions or limitations of the method. I've even gone out of my way to prove that no such assumptions or limitations exist. Where in the world are you getting such impressions? >Your main example seems to focus on a large table where a key column has >constrained values. 
This case is interesting in proportion to the >number of possible values. If I have billions of rows, each having one >of only two values, I can think of a trivial and very fast method of >returning the table "sorted" by that key: make two sequential passes, >returning the first value on the first pass and the second value on the >second pass. This will be faster than the method you propose. > 1= No, that was not my main example. It was the simplest example, used to frame the later more complicated examples. Please don't get hung up on it. 2= You are incorrect. Since IO is the most expensive operation we can do, any method that makes two passes through the data at top scanning speed will take at least 2x as long as any method that only takes one such pass. >I think an important aspect you have failed to address is how much of >the heap you must visit after the sort is complete. If you are >returning every tuple in the heap then the optimal plan will be very >different from the case when you needn't. > Hmmm. Not sure which "heap" you are referring to, but the OES Btree index is provably the lowest (in terms of tree height) and smallest possible CPU cache efficient data structure that one can make and still have all of the traditional benefits associated with a Btree representation of a data set. Nonetheless, returning a RID, or all RIDs with(out) the same Key, or all RIDs (not) within a range of Keys, or simply all RIDs in sorted order is efficient. Just as it should be for a Btree (actually it's a B+ tree variant, to use Knuth's nomenclature). I'm sure someone posting from acm.org recognizes how each of these Btree operations maps to various SQL features... I haven't been talking about query plans because they are orthogonal to the issue under discussion. If we use a layered model for PostgreSQL's architecture, this functionality is more primal than that of a query planner. 
ALL query plans that currently involve sorts will benefit from a more efficient way to do, or avoid, sorts. >PS: Whatever mailer you use doesn't understand or respect threading nor >attribution. Out of respect for the list's readers, please try a mailer >that supports these 30-year-old fundamentals of electronic mail. > That is an issue of infrastructure on the receiving side, not on the sending (my) side, since even my web mailer seems appropriately RFC conformant. Everything seems to be going in the correct places and being properly organized on archival.postgres.org ... Ron ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
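Ron's one-pass-vs-two-pass arithmetic is easy to check with a toy block-read counter. This is a sketch, not a benchmark: the "table" is just a list of 10,000 binary-keyed rows, and the only thing measured is how many rows each strategy scans.

```python
# Ron's claim in miniature: if IO dominates, "sorting" a binary key
# with two full scans costs twice the reads of one scan that routes
# rows into two output runs. Pure counting, not a benchmark.
import random

random.seed(1)
table = [random.choice(b"AB") for _ in range(10_000)]  # one "row" per entry

reads = 0

def scan(rows):
    # Every row yielded counts as one read against the table.
    global reads
    for r in rows:
        reads += 1
        yield r

# Strategy 1 (Baker): two passes, returning A-rows then B-rows.
reads = 0
out1 = [r for r in scan(table) if r == ord("A")]
out1 += [r for r in scan(table) if r == ord("B")]
two_pass = reads

# Strategy 2: one pass, appending each row to the right run as we go.
reads = 0
runs = {ord("A"): [], ord("B"): []}
for r in scan(table):
    runs[r].append(r)
out2 = runs[ord("A")] + runs[ord("B")]
one_pass = reads

assert out1 == out2            # same "sorted" output
assert two_pass == 2 * one_pass
```

Of course this ignores what each pass does per row; the point is only that, at top sequential-scan speed, pass count is a hard multiplier on IO.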
Re: [PERFORM] Comparative performance
On Wed, 28 Sep 2005, Joe wrote: > Before I post the EXPLAIN and the table schema I'd appreciate > confirmation that this list is the appropriate forum. It is, and useful things to show are * the slow query * EXPLAIN ANALYZE of the query * the output of \d for each table involved in the query * the output of SHOW ALL; * the amount of memory the machine has The settings that are most important to tune in postgresql.conf for performance are, in my opinion: shared_buffers, effective_cache_size and (to a lesser extent) work_mem. -- /Dennis Björklund ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Sequential I/O Cost (was Re: [PERFORM] A Better External Sort?)
On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote: > >From: "Jeffrey W. Baker" <[EMAIL PROTECTED]> > >Perhaps I believe this because you can now buy as much sequential I/O > >as you want. Random I/O is the only real savings. > > > 1= No, you can not "buy as much sequential IO as you want". Even > with an infinite budget, there are physical and engineering limits. Long > before you reach those limits, you will pay exponentially increasing costs > for linearly increasing performance gains. So even if you _can_ buy a > certain level of sequential IO, it may not be the most efficient way to > spend money. This is just false. You can buy sequential I/O for linear money up to and beyond your platform's main memory bandwidth. Even 1GB/sec will severely tax the memory bandwidth of mainstream platforms, and you can achieve this rate for a modest cost. I have one array that can supply this rate and it has only 15 disks. It would fit on my desk. I think your dire talk about the limits of science and engineering may be a tad overblown. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [PERFORM] A Better External Sort?
On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote: > >From: "Jeffrey W. Baker" <[EMAIL PROTECTED]> > >Sent: Sep 27, 2005 1:26 PM > >To: Ron Peacetree <[EMAIL PROTECTED]> > >Subject: Re: [HACKERS] [PERFORM] A Better External Sort? > > > >On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote: > > > >>That Btree can be used to generate a physical reordering of the data > >>in one pass, but that's the weakest use for it. The more powerful > >>uses involve allowing the Btree to persist and using it for more > >>efficient re-searches or combining it with other such Btrees (either as > >>a step in task distribution across multiple CPUs or as a more efficient > >>way to do things like joins by manipulating these Btrees rather than > >>the actual records.) > > > >Maybe you could describe some concrete use cases. I can see what > >you are getting at, and I can imagine some advantageous uses, but > >I'd like to know what you are thinking. > > > >Specifically I'd like to see some cases where this would beat sequential > >scan. I'm thinking that in your example of a terabyte table with a > >column having only two values, all the queries I can think of would be > >better served with a sequential scan. > > > In my original example, a sequential scan of the 1TB of 2KB or 4KB > records, => 250M or 500M records of data, being sorted on a binary > value key will take ~1000x more time than reading in the ~1GB Btree > I described that used a Key+RID (plus node pointers) representation > of the data. You are engaging in a lengthy and verbose exercise in mental masturbation, because you have not yet given a concrete example of a query where this stuff would come in handy. A common, general-purpose case would be the best. We can all see that the method you describe might be a good way to sort a very large dataset with some known properties, which would be fine if you are trying to break the terasort benchmark. But that's not what we're doing here. 
We are designing and operating relational databases. So please explain the application. Your main example seems to focus on a large table where a key column has constrained values. This case is interesting in proportion to the number of possible values. If I have billions of rows, each having one of only two values, I can think of a trivial and very fast method of returning the table "sorted" by that key: make two sequential passes, returning the first value on the first pass and the second value on the second pass. This will be faster than the method you propose. I think an important aspect you have failed to address is how much of the heap you must visit after the sort is complete. If you are returning every tuple in the heap then the optimal plan will be very different from the case when you needn't. -jwb PS: Whatever mailer you use doesn't understand or respect threading nor attribution. Out of respect for the list's readers, please try a mailer that supports these 30-year-old fundamentals of electronic mail. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Comparative performance
On Wed, 28 Sep 2005, Joe wrote: > I'm converting a relatively small database (2 MB) from MySQL to PostgreSQL. It is used to generate web pages using PHP. Although the actual website runs under Linux, the development is done under XP. I've completed most of the data conversion and rewrite of the PHP scripts, so now I'm comparing relative performance. > It appears that PostgreSQL is two to three times slower than MySQL. For example, some pages that have some 30,000 characters (when saved as HTML) take 1 to 1 1/2 seconds with MySQL but 3 to 4 seconds with PostgreSQL. I had read that the former was generally faster than the latter, particularly for simple web applications, but I was hoping that Postgres' performance would not be that noticeably slower. Are you comparing PostgreSQL on XP to MySQL on XP or PostgreSQL on Linux to MySQL on Linux? Our performance on XP is not great. Also, which version of PostgreSQL are you using? > I'm trying to determine if the difference can be attributed to anything that I've done or missed. I've run VACUUM ANALYZE on the two main tables and I'm looking at the results of EXPLAIN on the query that drives the retrieval of probably 80% of the data for the pages in question. Good. > Before I post the EXPLAIN and the table schema I'd appreciate confirmation that this list is the appropriate forum. I'm a relative newcomer to PostgreSQL (but not to relational databases), so I'm not sure if this belongs in the novice or general lists. You can post the results of EXPLAIN ANALYZE here. Please include schema definitions and the query string(s) themselves. Thanks, Gavin ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[PERFORM] Comparative performance
I'm converting a relatively small database (2 MB) from MySQL to PostgreSQL. It is used to generate web pages using PHP. Although the actual website runs under Linux, the development is done under XP. I've completed most of the data conversion and rewrite of the PHP scripts, so now I'm comparing relative performance. It appears that PostgreSQL is two to three times slower than MySQL. For example, some pages that have some 30,000 characters (when saved as HTML) take 1 to 1 1/2 seconds with MySQL but 3 to 4 seconds with PostgreSQL. I had read that the former was generally faster than the latter, particularly for simple web applications but I was hoping that Postgres' performance would not be that noticeably slower. I'm trying to determine if the difference can be attributed to anything that I've done or missed. I've run VACUUM ANALYZE on the two main tables and I'm looking at the results of EXPLAIN on the query that drives the retrieval of probably 80% of the data for the pages in question. Before I post the EXPLAIN and the table schema I'd appreciate confirmation that this list is the appropriate forum. I'm a relative newcomer to PostgreSQL (but not to relational databases), so I'm not sure if this belongs in the novice or general lists. Joe
Re: [PERFORM] Slow concurrent update of same row in a given table
On Thu, 29 Sep 2005, Rajesh Kumar Mallah wrote: > On 9/29/05, Gavin Sherry <[EMAIL PROTECTED]> wrote: > > On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote: > > > > > > > Number of Copies | Update perl Sec > > > > > > > > > > 1 --> 119 > > > > > 2 ---> 59 > > > > > 3 ---> 38 > > > > > 4 ---> 28 > > > > > 5 --> 22 > > > > > 6 --> 19 > > > > > 7 --> 16 > > > > > 8 --> 14 > > > > > 9 --> 11 > > > > > 10 --> 11 > > > > > 11 --> 10 > > > > > > > > So, 11 instances result in 10 updated rows per second, database wide or > > > > per instance? If it is per instance, then 11 * 10 is close to the > > > > performance for one connection. > > > > > > > > > Sorry do not understand the difference between "database wide" > > > and "per instance" > > > > Per instance. > > > > > > > > > > > > > That being said, when you've got 10 connections fighting over one row, I > > > > wouldn't be surprised if you had bad performance. > > > > > > > > Also, at 119 updates a second, you're more than doubling the table's > > > > initial size (dead tuples) each second. How often are you vacuuming and > > > > are you using vacuum or vacuum full? > > > > > > > > > Yes I realize the obvious phenomenon now, (and the uselessness of the > > > script) > > > , we should not consider it a performance degradation. > > > > > > I am having performance issue in my live database thats why i tried to > > > simulate the situation(may the the script was overstresser). > > > > > > My original problem is that i send 100 000s of emails carrying a > > > beacon for tracking readership every tuesday and on wednesday i see > > > lot of the said query in pg_stat_activity each of these query update > > > the SAME row that corresponds to the dispatch of last day and it is > > > then i face the performance problem. > > > > > > I think i can only post further details next wednesday , please lemme > > > know how should i be dealing with the situation if each the updates takes > > > 100times more time that normal update duration. 
> > > > I see. These problems regularly come up in database design. The best thing > > you can do is modify your database design/application such that instead of > > incrementing a count in a single row, you insert a row into a table, > > recording the 'dispatch_id'. Counting the number of rows for a given > > dispatch id will give you your count. > > > > sorry i will be accumulating huge amount of rows in separate table > with no extra info when i really want just the count. Do you have > a better database design in mind? > > Also i encounter same problem in implementing read count of > articles in sites and in counting banner impressions where same > row get updated by multiple processes frequently. As I said in private email, accumulating large numbers of rows is not a problem. In your current application, you are write bound, not read bound. I've designed many similar systems which have hundreds of millions of rows. It takes a while to generate the count, but you just do it periodically in non-busy periods. With 8.1, constraint exclusion will give you significantly better performance with this system, as well. Thanks, Gavin
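Gavin's insert-then-count design is easy to sketch, with SQLite standing in for PostgreSQL; the `beacon_hits` table and its column are made-up names for illustration, not anything from Mallah's schema.

```python
import sqlite3

# Instead of "UPDATE ... SET read_count = read_count + 1" on one hot
# row, record each beacon hit as its own insert and derive the count.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE beacon_hits (dispatch_id INTEGER)")

# Simulate many readers hitting the beacon for dispatch 114: each is
# an append, so concurrent writers never fight over a single row.
for _ in range(1000):
    db.execute("INSERT INTO beacon_hits (dispatch_id) VALUES (114)")
db.commit()

# The count is computed on demand (run periodically, in quiet hours).
(count,) = db.execute(
    "SELECT count(*) FROM beacon_hits WHERE dispatch_id = 114"
).fetchone()
print(count)  # 1000
```

The trade Gavin describes is visible here: writes become contention-free appends, and the cost moves to an occasional aggregate read.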
Re: [PERFORM] Slow concurrent update of same row in a given table
On 9/29/05, Gavin Sherry <[EMAIL PROTECTED]> wrote: > On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote: > > > > > Number of Copies | Update perl Sec > > > > > > > > 1 --> 119 > > > > 2 ---> 59 > > > > 3 ---> 38 > > > > 4 ---> 28 > > > > 5 --> 22 > > > > 6 --> 19 > > > > 7 --> 16 > > > > 8 --> 14 > > > > 9 --> 11 > > > > 10 --> 11 > > > > 11 --> 10 > > > > > > So, 11 instances result in 10 updated rows per second, database wide or > > > per instance? If it is per instance, then 11 * 10 is close to the > > > performance for one connection. > > > > > > Sorry do not understand the difference between "database wide" > > and "per instance" > > Per instance. > > > > > > > > > That being said, when you've got 10 connections fighting over one row, I > > > wouldn't be surprised if you had bad performance. > > > > > > Also, at 119 updates a second, you're more than doubling the table's > > > initial size (dead tuples) each second. How often are you vacuuming and > > > are you using vacuum or vacuum full? > > > > > > Yes I realize the obvious phenomenon now, (and the uselessness of the > > script) > > , we should not consider it a performance degradation. > > > > I am having performance issue in my live database thats why i tried to > > simulate the situation(may the the script was overstresser). > > > > My original problem is that i send 100 000s of emails carrying a > > beacon for tracking readership every tuesday and on wednesday i see > > lot of the said query in pg_stat_activity each of these query update > > the SAME row that corresponds to the dispatch of last day and it is > > then i face the performance problem. > > > > I think i can only post further details next wednesday , please lemme > > know how should i be dealing with the situation if each the updates takes > > 100times more time that normal update duration. > > I see. These problems regularly come up in database design. 
The best thing > you can do is modify your database design/application such that instead of > incrementing a count in a single row, you insert a row into a table, > recording the 'dispatch_id'. Counting the number of rows for a given > dispatch id will give you your count. > Sorry, I will be accumulating huge amounts of rows in a separate table with no extra info when I really want just the count. Do you have a better database design in mind? Also I encounter the same problem in implementing read counts of articles on sites and in counting banner impressions, where the same row gets updated by multiple processes frequently. Thanks & Regds mallah. > Thanks, > > Gavin >
Re: [HACKERS] [PERFORM] A Better External Sort?
If I've done this correctly, there should not be anywhere near the number of context switches we currently see while sorting. Each unscheduled context switch represents something unexpected occurring or things not being where they are needed when they are needed. Reducing such circumstances to the absolute minimum was one of the design goals. Reducing the total amount of IO to the absolute minimum should help as well. Ron -Original Message- From: Kevin Grittner <[EMAIL PROTECTED]> Sent: Sep 27, 2005 11:21 AM Subject: Re: [HACKERS] [PERFORM] A Better External Sort? I can't help wondering how a couple thousand context switches per second would affect the attempt to load disk info into the L1 and L2 caches. That's pretty much the low end of what I see when the server is under any significant load. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [PERFORM] A Better External Sort?
In the interest of efficiency and "not reinventing the wheel", does anyone know where I can find C or C++ source code for a Btree variant with the following properties: A= Data elements (RIDs) are only stored in the leaves; Keys (actually KeyPrefixes; see "C" below) and Node pointers are only stored in the internal nodes of the Btree. B= Element redistribution is done as an alternative to node splitting in overflow conditions during Inserts whenever possible. C= Variable length Keys are supported. D= Node buffering with a reasonable replacement policy is supported. E= Since we will know beforehand exactly how many RIDs will be stored, we will know a priori how much space will be needed for leaves, and will know the worst case for how much space will be required for the Btree internal nodes as well. This implies that we may be able to use an array, rather than linked list, implementation of the Btree. Less pointer chasing at the expense of more CPU calculations, but that's a trade-off in the correct direction. Such source would be a big help in getting a prototype together. Thanks in advance for any pointers or source, Ron
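The KeyPrefix idea Ron's property list leans on comes down to one ingredient: when two adjacent leaf keys are separated in a parent node, the parent only needs the shortest string that still routes searches correctly. A small Python sketch of just that ingredient — not the OES structure itself, and not the C/C++ source he's asking for:

```python
def shortest_separator(left: str, right: str) -> str:
    """Shortest key s with left < s <= right, for two adjacent leaf
    keys. An internal node can store s instead of the full right key,
    which is what keeps prefix-keyed internal nodes small."""
    assert left < right
    for i, (a, b) in enumerate(zip(left, right)):
        if a != b:
            # First differing character decides the ordering; one
            # extra character of the right key is enough.
            return right[: i + 1]
    # right is an extension of left: need one more character.
    return right[: len(left) + 1]

# Separators are never longer than the real key, often much shorter.
assert shortest_separator("apple", "apricot") == "apr"
assert shortest_separator("car", "cart") == "cart"
sep = shortest_separator("johannesburg", "johnson")
assert "johannesburg" < sep <= "johnson" and len(sep) <= len("johnson")
```

With long, similar keys the savings compound at every internal level, which is where the CPU-cache-efficiency claim for prefix Btrees comes from.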
Re: [PERFORM] Logarithmic change (decrease) in performance
>From: Matthew Nuzum <[EMAIL PROTECTED]> >Sent: Sep 28, 2005 4:02 PM >Subject: [PERFORM] Logarithmic change (decrease) in performance > Small nit-pick: A "logarithmic decrease" in performance would be a relatively good thing, being better than either a linear or exponential decrease in performance. What you are describing is the worst kind: an _exponential_ decrease in performance. >Something interesting is going on. I wish I could show you the graphs, >but I'm sure this will not be a surprise to the seasoned veterans. > >A particular application server I have has been running for over a >year now. I've been logging cpu load since mid-april. > >It took 8 months or more to fall from excellent performance to >"acceptable." Then, over the course of about 5 weeks it fell from >"acceptable" to "so-so." Then, in the last four weeks it's gone from >"so-so" to alarming. > >I've been working on this performance drop since Friday but it wasn't >until I replied to Arnau's post earlier today that I remembered I'd >been logging the server load. I grabbed the data and charted it in >Excel and to my surprise, the graph of the server's load average looks >kind of like the graph of y=x^2. > >I've got to make a recomendation for a solution to the PHB and my >analysis is showing that as the dataset becomes larger, the amount of >time the disk spends seeking is increasing. This causes processes to >take longer to finish, which causes more processes to pile up, which >causes processes to take longer to finish, which causes more processes >to pile up etc. It is this growing dataset that seems to be the source >of the sharp decrease in performance. > >I knew this day would come, but I'm actually quite surprised that when >it came, there was little time between the warning and the grande >finale. I guess this message is being sent to the list to serve as a >warning to other data warehouse admins that when you reach your >capacity, the downward spiral happens rather quickly. 
> Yep, definitely been where you are. Bottom line: you have to reduce the seeking behavior of the system to within an acceptable window and then keep it there. 1= keep more of the data set in RAM 2= increase the size of your HD IO buffers 3= make your RAID sets wider (more parallel vs sequential IO) 4= reduce the atomic latency of your RAID sets (time for Fibre Channel 15Krpm HDs vs 7.2Krpm SATA ones?) 5= make sure your data is as unfragmented as possible 6= change your DB schema to minimize the problem a= overall good schema design b= partitioning the data so that the system only has to manipulate a reasonable chunk of it at a time. In many cases, there's a number of ways to accomplish the above. Unfortunately, most of them require CapEx. Also, ITRW, such systems tend to have this as a chronic problem. This is not a "fix it once and it goes away forever". This is a part of the regular maintenance and upgrade plan(s). Good Luck, Ron
Re: [PERFORM] Slow concurrent update of same row in a given table
On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote: > > > Number of Copies | Update perl Sec > > > > > > 1 --> 119 > > > 2 ---> 59 > > > 3 ---> 38 > > > 4 ---> 28 > > > 5 --> 22 > > > 6 --> 19 > > > 7 --> 16 > > > 8 --> 14 > > > 9 --> 11 > > > 10 --> 11 > > > 11 --> 10 > > > > So, 11 instances result in 10 updated rows per second, database wide or > > per instance? If it is per instance, then 11 * 10 is close to the > > performance for one connection. > > > Sorry do not understand the difference between "database wide" > and "per instance" Per instance. > > > > > That being said, when you've got 10 connections fighting over one row, I > > wouldn't be surprised if you had bad performance. > > > > Also, at 119 updates a second, you're more than doubling the table's > > initial size (dead tuples) each second. How often are you vacuuming and > > are you using vacuum or vacuum full? > > > Yes I realize the obvious phenomenon now, (and the uselessness of the script) > , we should not consider it a performance degradation. > > I am having performance issue in my live database thats why i tried to > simulate the situation(may the the script was overstresser). > > My original problem is that i send 100 000s of emails carrying a > beacon for tracking readership every tuesday and on wednesday i see > lot of the said query in pg_stat_activity each of these query update > the SAME row that corresponds to the dispatch of last day and it is > then i face the performance problem. > > I think i can only post further details next wednesday , please lemme > know how should i be dealing with the situation if each the updates takes > 100times more time that normal update duration. I see. These problems regularly come up in database design. The best thing you can do is modify your database design/application such that instead of incrementing a count in a single row, you insert a row into a table, recording the 'dispatch_id'. 
Counting the number of rows for a given dispatch id will give you your count. Thanks, Gavin
Re: [PERFORM] Monitoring Postgresql performance
On 28 Sep 2005, at 15:32, Arnau wrote: Hi all, I have been "googling" a bit searching info about a way to monitor postgresql (CPU & Memory, num processes, ... ) and I haven't found anything relevant. I'm using munin to monitor others parameters of my servers and I'd like to include postgresql or have a similar tool. Any of you is using anything like that? all kind of hints are welcome :-) Cheers! Have you looked at SNMP? It's a bit complex but there's lots of tools for monitoring system data / sending alerts based on SNMP already.
[PERFORM] Logarithmic change (decrease) in performance
Something interesting is going on. I wish I could show you the graphs, but I'm sure this will not be a surprise to the seasoned veterans. A particular application server I have has been running for over a year now. I've been logging cpu load since mid-April. It took 8 months or more to fall from excellent performance to "acceptable." Then, over the course of about 5 weeks it fell from "acceptable" to "so-so." Then, in the last four weeks it's gone from "so-so" to alarming. I've been working on this performance drop since Friday but it wasn't until I replied to Arnau's post earlier today that I remembered I'd been logging the server load. I grabbed the data and charted it in Excel and to my surprise, the graph of the server's load average looks kind of like the graph of y=x^2. I've got to make a recommendation for a solution to the PHB and my analysis is showing that as the dataset becomes larger, the amount of time the disk spends seeking is increasing. This causes processes to take longer to finish, which causes more processes to pile up, which causes processes to take longer to finish, which causes more processes to pile up, etc. It is this growing dataset that seems to be the source of the sharp decrease in performance. I knew this day would come, but I'm actually quite surprised that when it came, there was little time between the warning and the grande finale. I guess this message is being sent to the list to serve as a warning to other data warehouse admins that when you reach your capacity, the downward spiral happens rather quickly. Crud... Outlook just froze while composing the PHB memo. I've been working on that for an hour. What a bad day. -- Matthew Nuzum www.bearfruit.org
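The pile-up Matthew describes is the classic queueing knee: response time grows hyperbolically as utilization approaches 1, so the slide from "acceptable" to "alarming" compresses into the last few percent of capacity. A toy M/M/1 computation makes the shape of his graph plausible (illustrative only; his server is of course not a textbook M/M/1 queue):

```python
def response_time(utilization: float, service_time: float = 1.0) -> float:
    """Mean M/M/1 response time: S / (1 - rho). As a growing dataset
    pushes per-request service time (and thus utilization) up, the
    response time explodes near rho = 1."""
    assert 0 <= utilization < 1
    return service_time / (1.0 - utilization)

# Months of slow growth, then weeks of collapse:
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"rho={rho:.2f}  response={response_time(rho):6.1f}x")
# At 50% utilization requests take 2x the idle service time;
# at 99% they take 100x -- the "little time between the warning
# and the grande finale".
```

The practical reading matches his warning: by the time the curve is visibly bending, the remaining headroom is tiny.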
Re: [PERFORM] Monitoring Postgresql performance
On 9/28/05, Arnau <[EMAIL PROTECTED]> wrote: > Hi all, > > I have been "googling" a bit searching info about a way to monitor > postgresql (CPU & Memory, num processes, ... ) and I haven't found > anything relevant. I'm using munin to monitor others parameters of my > servers and I'd like to include postgresql or have a similar tool. Any > of you is using anything like that? all kind of hints are welcome :-) > > Cheers! > -- > Arnau I have a cronjob that runs every 5 minutes and checks the number of processes. When things get unruly I get a text message sent to my cell phone. It also creates a detailed log entry. I'll paste in an example of one of my scripts that does this below. This is on a dual-purpose server and monitors both cpu load average and postgres. You can have the text message sent to multiple email addresses; just put a space-separated list of e-mail addresses between quotes in the CONTACTS= line. It's simple, but it works, and it's always nice to know when there's a problem *before the boss discovers it* ;-)

# Create some messages
HOSTNAME=`hostname`
WARNING_DB="Database connections on $HOSTNAME is rather high"
WARNING_CPU="CPU load on $HOSTNAME is rather high"
CONTACTS="[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]"
WARN=0

# Calculate the db load. The [p] trick keeps the grep process itself
# from being counted as a postgres process.
DB_LOAD=`ps ax | grep "[p]ostgres" | wc -l`
if (($DB_LOAD > 150))
then
  WARN=1
  echo "$WARNING_DB ($DB_LOAD) " | mail -s "db_load is high ($DB_LOAD)" $CONTACTS
fi

# Calculate the processor load (integer part of the 5-minute average)
CPU_LOAD=`cat /proc/loadavg | cut --delimiter=" " -f 2 | cut --delimiter="." -f 1`
if (($CPU_LOAD > 8))
then
  WARN=1
  echo "$WARNING_CPU ($CPU_LOAD) " | mail -s "CPU_load is high ($CPU_LOAD)" $CONTACTS
fi

if (($WARN > 0))
then
  echo -=-=-=-=-=-=-=-=- W A R N I N G -=-=-=-=-=-=-=-=- >> /tmp/warn.txt
  NOW=`date`
  echo -=-=-=-=-=-$NOW-=-=-=-=-=- >> /tmp/warn.txt
  echo CPU LOAD: $CPU_LOAD DB LOAD: $DB_LOAD >> /tmp/warn.txt
  echo >> /tmp/warn.txt
  top -bn 1 >> /tmp/warn.txt
  echo >> /tmp/warn.txt
fi

NOW=`date`
CPU_LOAD=`cat /proc/loadavg | cut --delimiter=" " -f 1,2,3 --output-delimiter=\|`
echo -e $NOW\|$CPU_LOAD\|$DB_LOAD >> ~/LOAD_MONITOR.LOG

-- Matthew Nuzum www.bearfruit.org
Re: [PERFORM] Slow concurrent update of same row in a given table
On 9/28/05, Gavin Sherry <[EMAIL PROTECTED]> wrote:
> On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote:
> > Hi
> >
> > While doing some stress testing for updates in a small sized table
> > we found the following results. We are not too happy about the speed
> > of the updates particularly at high concurrency (10 clients).
> >
> > Initially we get 119 updates / sec but it drops to 10 updates/sec
> > as concurrency is increased.
> >
> > PostgreSQL: 8.0.3
> > ---
> > TABLE STRUCTURE: general.stress
> > ---
> > | dispatch_id      | integer                  | not null |
> > | query_id         | integer                  |          |
> > | generated        | timestamp with time zone |          |
> > | unsubscribes     | integer                  |          |
> > | read_count       | integer                  |          |
> > | status           | character varying(10)    |          |
> > | bounce_tracking  | boolean                  |          |
> > | dispatch_hour    | integer                  |          |
> > | dispatch_date_id | integer                  |          |
> > +------------------+--------------------------+----------+
> > Indexes:
> >     "stress_pkey" PRIMARY KEY, btree (dispatch_id)
> >
> > UPDATE STATEMENT:
> > update general.stress set read_count=read_count+1 where dispatch_id=114
>
> This means you are updating only one row, correct?

Correct.

> > Number of Copies | Updates per Sec
> >
> > 1  --> 119
> > 2  --> 59
> > 3  --> 38
> > 4  --> 28
> > 5  --> 22
> > 6  --> 19
> > 7  --> 16
> > 8  --> 14
> > 9  --> 11
> > 10 --> 11
> > 11 --> 10
>
> So, 11 instances result in 10 updated rows per second, database wide or
> per instance? If it is per instance, then 11 * 10 is close to the
> performance for one connection.

Sorry, I do not understand the difference between "database wide" and
"per instance".

> That being said, when you've got 10 connections fighting over one row, I
> wouldn't be surprised if you had bad performance.
>
> Also, at 119 updates a second, you're more than doubling the table's
> initial size (dead tuples) each second. How often are you vacuuming and
> are you using vacuum or vacuum full?

Yes, I realize the obvious phenomenon now (and the uselessness of the
script); we should not consider it a performance degradation.
I am having a performance issue in my live database; that is why I tried
to simulate the situation (maybe the script was an over-stresser).

My original problem is that I send 100,000s of emails carrying a beacon
for tracking readership every Tuesday, and on Wednesday I see a lot of
the said query in pg_stat_activity. Each of these queries updates the
SAME row that corresponds to the previous day's dispatch, and it is then
that I face the performance problem.

I think I can only post further details next Wednesday. Please let me
know how I should be dealing with the situation if each of the updates
takes 100 times longer than the normal update duration.

Best Regards
Mallah.

> Gavin

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [EMAIL PROTECTED] so that your
       message can get through to the mailing list cleanly
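One way out of the Wednesday pile-up described above is to stop issuing one UPDATE per beacon hit against the same hot row, and instead have the beacon handler append hits to a flat file that a periodic job folds into a single UPDATE per dispatch_id. A minimal sketch (table and column names from the thread; the hit data is canned so it runs standalone, and the psql step is only indicated, not executed):

```shell
#!/bin/sh
# Sketch: the beacon CGI appends one dispatch_id per hit to a log file;
# a cron job then collapses them into one UPDATE per id. The sample hits
# are invented; in production the output would be piped into psql.
HITS=$(mktemp)
printf '114\n114\n115\n114\n' >> "$HITS"   # stand-in for a day of beacon hits

sort "$HITS" | uniq -c | while read n id; do
    echo "UPDATE general.stress SET read_count=read_count+$n WHERE dispatch_id=$id;"
done
rm -f "$HITS"
```

This turns N row-lock fights per dispatch into one short transaction per batch interval, which sidesteps the serialization measured in the stress test.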
Re: [PERFORM] Monitoring Postgresql performance
On Sep 28, 2005, at 8:32 AM, Arnau wrote:

> Hi all,
> I have been "googling" a bit searching info about a way to monitor
> postgresql (CPU & Memory, num processes, ... )

You didn't mention your platform, but I have an xterm open pretty much
continuously for my DB server that runs plain old top. I have customized
my settings enough that I can pretty much see anything I need to from
there.

-Dan
Re: [HACKERS] [PERFORM] A Better External Sort?
>From: "Jeffrey W. Baker" <[EMAIL PROTECTED]>
>Sent: Sep 27, 2005 1:26 PM
>To: Ron Peacetree <[EMAIL PROTECTED]>
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
>
>>That Btree can be used to generate a physical reordering of the data
>>in one pass, but that's the weakest use for it. The more powerful
>>uses involve allowing the Btree to persist and using it for more
>>efficient re-searches or combining it with other such Btrees (either as
>>a step in task distribution across multiple CPUs or as a more efficient
>>way to do things like joins by manipulating these Btrees rather than
>>the actual records.)
>
>Maybe you could describe some concrete use cases. I can see what
>you are getting at, and I can imagine some advantageous uses, but
>I'd like to know what you are thinking.
>

1= In a 4P box, we split the data in RAM into 4 regions and create a CPU
cache friendly Btree using the method I described for each CPU. The 4
Btrees can be merged in a more time and space efficient manner than the
original records to form a Btree that represents the sorted order of the
entire data set. Any of these Btrees can be allowed to persist to lower
the cost of doing similar operations in the future (updating the Btrees
during inserts and deletes is cheaper than updating the original data
files and then redoing the same sort from scratch in the future). Both
the original sort and future such sorts are made more efficient than
current methods.

2= We use my method to sort two different tables. We now have these very
efficient representations of a specific ordering on these tables. A join
operation can now be done using these Btrees rather than the original
data tables, with less overhead than many current methods.

3= We have multiple such Btrees for the same data set representing sorts
done using different fields (and therefore different Keys).
Calculating a sorted order for the data based on a composition of those
Keys is now cheaper than doing the sort based on the composite Key from
scratch. When some of the Btrees exist and some of them do not, there is
a tradeoff calculation to be made. Sometimes it will be cheaper to do
the sort from scratch using the composite Key.

>Specifically I'd like to see some cases where this would beat sequential
>scan. I'm thinking that in your example of a terabyte table with a
>column having only two values, all the queries I can think of would be
>better served with a sequential scan.
>

In my original example, a sequential scan of the 1TB of 2KB or 4KB
records, => 250M or 500M records of data, being sorted on a binary value
key will take ~1000x more time than reading in the ~1GB Btree I
described that used a Key+RID (plus node pointers) representation of the
data.

Just to clarify the point further:
1TB of 1B records => 2^40 records of at most 256 distinct values.
1TB of 2B records => 2^39 records of at most 2^16 distinct values.
1TB of 4B records => 2^38 records of at most 2^32 distinct values.
1TB of 5B records => 200B records of at most 200B distinct values.
From here on, the number of possible distinct values is limited by the
number of records.
100B records are used in the "Indy" version of Jim Gray's sorting
contests, so 1TB => 10B records.
2KB-4KB is the most common record size I've seen in enterprise class
DBMS (so I used this value to make my initial example more realistic).

Therefore, the vast majority of the time, representing a data set by Key
will use less space than the original records. Less space used means
less IO to scan the data set, which means faster scan times. This is why
index files work in the first place, right?

>Perhaps I believe this because you can now buy as much sequential I/O
>as you want. Random I/O is the only real savings.
>

1= No, you can not "buy as much sequential IO as you want". Even with an
infinite budget, there are physical and engineering limits. Long before
you reach those limits, you will pay exponentially increasing costs for
linearly increasing performance gains. So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

2= Most RW IT professionals have far from an infinite budget. Just the
traffic on these lists shows how severe the typical cost constraints
usually are. OTOH, if you have an infinite IT budget, care to help a few
less fortunate than yourself? After all, even a large constant
subtracted from infinity is still infinity... ;-)

3= No matter how fast you can do IO, IO remains the most expensive part
of the performance equation. The fastest and cheapest IO you can do is
_no_ IO. As long as we trade cheaper RAM and even cheaper CPU operations
for IO correctly, more space efficient data representations will always
be a Win because of this.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
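The Key+RID idea argued for above can be mimicked with ordinary shell tools: sort only (key, record-id) pairs, then fetch the fat records in key order in a single pass. A toy sketch with three invented records (it illustrates the representation only, and makes no claim about the actual Btree layout discussed in the thread):

```shell
#!/bin/sh
# Toy model of sorting via Key+RID pairs instead of whole records.
# Data and temp-file names are invented for illustration.
RECS=$(mktemp)
printf 'banana payload-2\napple payload-1\ncherry payload-3\n' > "$RECS"
ORDER=$(mktemp)

# Build (key, RID) pairs -- the RID here is just the line number --
# and sort only those small pairs, never the full records.
nl -ba "$RECS" | awk '{print $2, $1}' | sort | awk '{print $2}' > "$ORDER"

# One pass over the sorted RIDs yields the full records in key order.
while read rid; do sed -n "${rid}p" "$RECS"; done < "$ORDER"

rm -f "$RECS" "$ORDER"
```

The sort touches only key+RID pairs, which is the space saving Ron is describing; the cost moves to the final (random-access) fetch pass, which is exactly the sequential-vs-random IO tradeoff Jeffrey raises.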
[PERFORM] Monitoring Postgresql performance
Hi all,

I have been "googling" a bit searching info about a way to monitor
postgresql (CPU & Memory, num processes, ... ) and I haven't found
anything relevant. I'm using munin to monitor other parameters of my
servers and I'd like to include postgresql or have a similar tool. Any
of you is using anything like that? All kinds of hints are welcome :-)

Cheers!
--
Arnau

---(end of broadcast)---
TIP 4: Have you searched our list archives?
       http://archives.postgresql.org
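Since munin is already in place, one option is a custom plugin: munin just runs a script and reads `field.value N` lines from its stdout. A minimal sketch counting postgres backends from `ps`-style output (the field name is made up, and canned sample text is used here so the sketch runs anywhere; a real plugin would pipe live `ps ax` output through the same grep):

```shell
#!/bin/sh
# Minimal munin-plugin-style sketch (hypothetical field name).
# Munin expects "field.value N" on stdout when the plugin is fetched.
# SAMPLE stands in for live `ps ax` output so this runs without a server.
SAMPLE='  101 ?  S  0:00 postgres: writer process
  102 ?  S  0:01 postgres: stats collector
  103 ?  S  0:02 postgres: arnau mydb 192.168.0.1 idle'

echo "connections.value $(printf '%s\n' "$SAMPLE" | grep -c 'postgres:')"
```

Dropping such a script into munin's plugin directory gets postgres connection counts graphed next to the CPU and memory graphs already collected.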
Re: [PERFORM] Slow concurrent update of same row in a given table
On Wed, 28 Sep 2005, Rajesh Kumar Mallah wrote:

> Hi
>
> While doing some stress testing for updates in a small sized table
> we found the following results. We are not too happy about the speed
> of the updates particularly at high concurrency (10 clients).
>
> Initially we get 119 updates / sec but it drops to 10 updates/sec
> as concurrency is increased.
>
> PostgreSQL: 8.0.3
> ---
> TABLE STRUCTURE: general.stress
> ---
> | dispatch_id      | integer                  | not null |
> | query_id         | integer                  |          |
> | generated        | timestamp with time zone |          |
> | unsubscribes     | integer                  |          |
> | read_count       | integer                  |          |
> | status           | character varying(10)    |          |
> | bounce_tracking  | boolean                  |          |
> | dispatch_hour    | integer                  |          |
> | dispatch_date_id | integer                  |          |
> +------------------+--------------------------+----------+
> Indexes:
>     "stress_pkey" PRIMARY KEY, btree (dispatch_id)
>
> UPDATE STATEMENT:
> update general.stress set read_count=read_count+1 where dispatch_id=114

This means you are updating only one row, correct?

> Number of Copies | Updates per Sec
>
> 1  --> 119
> 2  --> 59
> 3  --> 38
> 4  --> 28
> 5  --> 22
> 6  --> 19
> 7  --> 16
> 8  --> 14
> 9  --> 11
> 10 --> 11
> 11 --> 10

So, 11 instances result in 10 updated rows per second, database wide or
per instance? If it is per instance, then 11 * 10 is close to the
performance for one connection.

That being said, when you've got 10 connections fighting over one row, I
wouldn't be surprised if you had bad performance.

Also, at 119 updates a second, you're more than doubling the table's
initial size (dead tuples) each second. How often are you vacuuming and
are you using vacuum or vacuum full?

Gavin

---(end of broadcast)---
TIP 6: explain analyze is your friend
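Gavin's dead-tuple point can be put in numbers: under MVCC every UPDATE leaves one dead row version behind, so 119 single-row updates a second against a 93-row table more than doubles the table's live size every second. A quick shell check of the arithmetic (figures taken from the thread):

```shell
#!/bin/sh
# Back-of-envelope from the thread: each UPDATE leaves one dead tuple,
# so the table accumulates RATE dead versions per second against ROWS
# live rows until vacuum reclaims them.
ROWS=93
RATE=119

echo "dead tuples per minute: $((RATE * 60)) (vs $ROWS live rows)"
echo "table size doublings per second: at least $((RATE / ROWS))"
```

This is why frequent plain VACUUM (not VACUUM FULL) matters so much for a tiny, hot counter table: without it, each update has to skip over an ever-growing pile of dead versions of the same row.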
[PERFORM] Slow concurrent update of same row in a given table
Hi

While doing some stress testing for updates in a small sized table
we found the following results. We are not too happy about the speed
of the updates particularly at high concurrency (10 clients).

Initially we get 119 updates / sec but it drops to 10 updates/sec
as concurrency is increased.

PostgreSQL: 8.0.3
---
TABLE STRUCTURE: general.stress
---
| dispatch_id      | integer                  | not null |
| query_id         | integer                  |          |
| generated        | timestamp with time zone |          |
| unsubscribes     | integer                  |          |
| read_count       | integer                  |          |
| status           | character varying(10)    |          |
| bounce_tracking  | boolean                  |          |
| dispatch_hour    | integer                  |          |
| dispatch_date_id | integer                  |          |
+------------------+--------------------------+----------+
Indexes:
    "stress_pkey" PRIMARY KEY, btree (dispatch_id)

UPDATE STATEMENT:
update general.stress set read_count=read_count+1 where dispatch_id=114

TOOL USED: Perl/DBI, with prepared statement handles
CONCURRENCY METHOD: executing multiple copies of the same program from
different shells (linux environment)
CLIENT SERVER LINK: 10/100 Mbits, LAN

CLIENT CODE: stress.pl
-
#!/opt/perl/bin/perl -I/usr/local/masonapache/lib/perl

#overview: update the table as fast as possible (while(1){})
#on every 100th commit, print the average update frequency
#of the last 100 updates
##
use strict;
use Time::HiRes qw(gettimeofday tv_interval);
use Utils;

my $dbh = &Utils::db_connect();
my $sth = $dbh->prepare(
    "update general.stress set read_count=read_count+1 where dispatch_id=114");
my $cnt = 0;
my $t0 = [ gettimeofday ];
while (1) {
    $sth->execute();
    $dbh->commit();
    $cnt++;
    if ($cnt % 100 == 0) {
        my $t1 = [ gettimeofday ];
        my $elapsed = tv_interval($t0, $t1);
        $t0 = $t1;
        printf "Rate: %d updates / sec\n", 100.0 / $elapsed;
    }
}
$sth->finish();
$dbh->disconnect();
-

RESULTS:

Number of Copies | Updates per Sec

1  --> 119
2  --> 59
3  --> 38
4  --> 28
5  --> 22
6  --> 19
7  --> 16
8  --> 14
9  --> 11
10 --> 11
11 --> 10
-
Note that the table was vacuum analyzed during the tests.
Total number of records in table: 93
-
Regds
Rajesh Kumar Mallah.
[PERFORM] database bloat, but vacuums are done, and fsm seems to be setup ok
hi

setup: postgresql 8.0.3 put on debian on dual xeon, 8GB ram, hardware
raid.

database just after recreation from dump takes 15 gigabytes. after some
time (up to 3 weeks) it gets really slow and has to be dump'ed and
restored.

as for fsm - end of vacuum info:
INFO: free space map: 248 relations, 1359140 pages stored; 1361856 total
pages needed
DETAIL: Allocated FSM size: 1000 relations + 1000 pages = 58659 kB
shared memory.
so it looks like i have plenty of space in fsm.

vacuums run constantly. 4 different tasks, 3 of them doing:
while true
  vacuum table
  sleep 15m
done
with different tables (i have chosen the most updated tables in the
system). the fourth vacuum task does the same, but without specifying a
table - so it vacuums the whole database.

after the last dump/restore cycle i noticed that doing reindex on all
indices in the database made it drop in size from 40G to about 20G - so
it might be that i will be using reindex instead of dump/restore.

anyway - i'm not using any special indices - just some (117 to be exact)
indices of btree type. we use simple, multi-column, partial and
multi-column partial indices. we do not have functional indices.

the database has quite a huge load of updates, and i thought that vacuum
would guard me against database bloat, but from what i observed it seems
that vacuuming of b-tree indices is somewhat faulty.

any suggestions? what else can i supply you with to help you help me?

best regards

depesz
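If REINDEX is going to replace the dump/restore cycle described above, the statements can be generated from a list of index names and fed to psql during a maintenance window. A minimal sketch; the index names are invented (in practice the list would come from a pg_class query), and the psql step is only indicated, not executed:

```shell
#!/bin/sh
# Sketch: emit REINDEX statements for a list of btree indices.
# The two names below are made up; a real job would build the list
# from pg_class and pipe this output into psql.
IDX=$(mktemp)
printf 'stress_pkey\nsome_partial_idx\n' > "$IDX"

while read i; do
    echo "REINDEX INDEX $i;"
done < "$IDX"
rm -f "$IDX"
```

Note that on 8.0 REINDEX takes an exclusive lock on the table, so a job like this belongs in a low-traffic window.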