Re: [PERFORM] Running 9 in production? Sticking with 8.4.4 for a while?
On 9/28/2010 4:45 PM, Greg Smith wrote:
> Tory M Blue wrote:
>> I'm doing an OS upgrade and have been sitting on 8.4.3 for some time. I was wondering if it's better for the short term just to bring things to 8.4.4 and let 9.0 bake a bit longer, or are people with large data sets running 9.0 in production already?
> I'm aware of two people with large data sets who have been running 9.0 in production since it was in beta. Like most code, what you have to consider is how much the code path you expect to use each day has been modified during the previous release. If you're using 9.0 as a better 8.4, the odds of your running into a problem are on the low side of the risk curve. But those using the features that are both new and were worked on until the very end of the development cycle, like the new replication features, are much more likely to run into a bug.

A conservative approach is never to use version x.0 of *anything*. The PG developers are very talented (and also very helpful on these mailing lists - thanks for that), but they are human. For work I'm paid to do (as opposed to my own or charity work), I like to stay at least one point release behind the bleeding edge.

--
Guy Rouillier
Re: [PERFORM] Query plan for NOT IN
Kevin Grittner wrote:
> Grzegorz Jaśkiewicz <gryz...@gmail.com> wrote:
> A failing of the SQL standard is that it uses the same mark (NULL) to show the absence of a value because it is unknown as for the case where it is known that no value exists (not applicable). Codd argued for a distinction there, but it hasn't come to pass, at least in the standard. If anyone could suggest a way to support standard syntax and semantics and add extensions to support this distinction, it might be another advance that would distinguish PostgreSQL from less evolved products. :-)

Theoretically, the distinction already exists. If you don't know a person's middle initial, then set it to null; if you know the person doesn't have one, set it to the empty string. But from a practical point of view, that wouldn't go very far. Most *people* equate an empty string to mean the same as null.

When I wrote my own data access layer years ago, I expressly checked for empty strings on input and changed them to null. I did this because empty strings had a nasty way of creeping into our databases, and writing queries to produce predictable results got to be very messy.

--
Guy Rouillier
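[A minimal sketch of the null-versus-empty-string convention described above; table and column names are hypothetical:

  CREATE TABLE person (
      id             serial PRIMARY KEY,
      last_name      text NOT NULL,
      middle_initial text  -- NULL = unknown; '' = known to have none
  );

  INSERT INTO person (last_name, middle_initial)
  VALUES ('Smith', NULL),  -- middle initial not known
         ('Jones', '');    -- known to have no middle initial

  SELECT last_name FROM person WHERE middle_initial IS NULL;  -- unknown
  SELECT last_name FROM person WHERE middle_initial = '';     -- none exists

As noted, this only works if every application writing the table observes the convention - which is exactly why a data access layer that normalizes one form to the other tends to win out.]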
Re: [PERFORM] Query plan for NOT IN
Craig James wrote:
> Kevin Grittner wrote:
>> Which leaves the issue open -- a flexible way to flag the *reason* (or *reasons*) for the absence of a value could be a nice enhancement, if someone could invent a good implementation. Of course, one could always add a column to indicate the reason for a NULL; and perhaps that would be as good as any scheme to attach reason flags to NULL. You'd just have to make sure the reason column was null-capable for those rows where there *was* a value, which would make the reason 'not applicable'.
> I'd argue that this is just a special case of a broader problem of metadata: data about the data. For example, I could have a temperature, 40 degrees, and an error bounds, +/- 0.25 degrees. Nobody would think twice about making these separate columns. I don't see how this is any different from a person's middle initial of NULL, plus a separate column indicating 'not known' versus 'doesn't have one' if that distinction is important. There are many examples like this, where a simple value in one column isn't sufficient, so another column contains metadata that qualifies or clarifies the information. NULL is just one such case. But this should probably be on an SQL discussion board, not PG performance...

Most DBMSs I'm aware of use a null *byte* attached to a nullable column to indicate whether the column is null or not. Yes/no takes one *bit*, which leaves 254 other possible values to describe the state of the column. That seems preferable to adding an additional column for every nullable column. But as you say, that would have to be taken up with the SQL standardization bodies, not PostgreSQL.

--
Guy Rouillier
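[A hedged sketch of the separate-metadata-column approach both posts describe; the schema is hypothetical, and the CHECK constraints are just one way to keep value and reason consistent:

  CREATE TABLE reading (
      sensor_id   integer NOT NULL,
      temperature numeric,  -- NULL when no value is available
      temp_error  numeric,  -- metadata in its own column, e.g. +/- 0.25 degrees
      temp_reason text,     -- why temperature is NULL
      CHECK (temp_reason IN ('unknown', 'not applicable')),
      CHECK (temperature IS NULL OR temp_reason IS NULL)
  );

The second constraint encodes Kevin's point: the reason column must itself be null - i.e. "not applicable" - on any row where a value is present.]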
Re: [PERFORM] Query plan for NOT IN
Grzegorz Jaśkiewicz wrote:
> Well, as a rule of thumb - unless you can't think of a default value for a column - don't use nulls. So using null as a default 'idunno' is a bad practice, but everybody's opinion on that differs.

I don't understand this point of view. The concept of null was introduced into the SQL vernacular by Codd and Date expressly to represent unknown values.

--
Guy Rouillier
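[For illustration of why null-as-'idunno' bites in practice - and on this thread's own topic - a single null in a NOT IN subquery silently empties the result. A minimal sketch; any SQL DBMS behaves this way:

  SELECT 1 WHERE 2 NOT IN (VALUES (1), (3));     -- one row
  SELECT 1 WHERE 2 NOT IN (VALUES (1), (NULL));  -- no rows: 2 <> NULL is unknown

  -- NOT EXISTS treats the null row as simply "no match", and usually plans better:
  SELECT 1 WHERE NOT EXISTS
      (SELECT 1 FROM (VALUES (1), (NULL)) AS t(x) WHERE t.x = 2);  -- one row]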
Re: [PERFORM] PostgreSQL vs Oracle
Victor Nawothnig wrote:
> Hi, I am looking for some recent and hopefully genuine comparisons between Oracle and PostgreSQL regarding their performance in large-scale applications. Tests from real-world applications would be preferable but not required. Also, differentiations in different areas (i.e. different data types, query structures, clusters, hardware, etc.) might be helpful as well.

Victor, Oracle expressly forbids, in their license agreement, anyone from publishing performance comparisons between Oracle and any other product. So you will rarely find anyone willing to publicly provide you any performance numbers. Differences in data structures, etc., are fairly easy to determine; anyone can read the Oracle documentation.

--
Guy Rouillier
Re: [PERFORM] Planning a new server - help needed
PFC wrote:
> Why do you claim that 'More platters also means slower seeks and generally slower performance.'?
> More platters -> more heads -> heavier head assembly -> slower seek time.
> But... more platters -> higher density -> less seek distance (in mm of head movement) -> faster seek time.

More platters means more tracks under the read heads at a time, so generally *better* performance. All other things (like rotational speed) being equal, of course.

--
Guy Rouillier
Re: [PERFORM] Making the most of memory?
Scott Marlowe wrote:
> I assume you're talking about solid state drives? They have their uses, but for most use cases, having plenty of RAM in your server will be a better way to spend your money. For certain high-throughput, relatively small databases (i.e. transactional work) the SSD can be quite useful.

Unless somebody has changed some physics recently, I'm not understanding the recent discussions of SSD in the general press. Flash has a limited number of writes before it becomes unreliable. On good-quality consumer grade, that's about 300,000 writes, while on industrial grade it's about 10 times that. That's fine for mp3 players and cameras; even professional photographers probably won't rewrite the same spot on a flash card that many times in a lifetime. But for database applications, 300,000 writes is trivial. 3 million will go a lot longer, but in non-archival applications, I imagine even that mark won't take more than a year or two to surpass.

--
Guy Rouillier
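[Some rough, hypothetical arithmetic behind that concern, assuming writes keep landing on the same physical flash block, i.e. no wear leveling:

  300,000 writes / (10 writes/sec)                  = 30,000 sec ~ 8.3 hours
  300,000 writes / (288 writes/day, one per 5 min)  ~ 1,042 days ~ 2.9 years

So a block behind a hot index page could in principle wear out within a working day, while a block rewritten once per stats interval lasts years. Drive-level wear leveling, which spreads writes across all blocks, is presumably what the much larger endurance figures in the press rely on.]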
Re: [PERFORM] Barcelona vs Tigerton
Mindaugas wrote:
> Hello. Now that both 4x4 platforms are out, it's time for us to decide which one should be better for our PostgreSQL and Oracle. Especially for Oracle, we really need such a server to squeeze everything from our Oracle licenses. Both databases handle an OLTP type of load. Since we plan to buy a 4U HP DL580 or 585, and only very few of them, power ratings are not very critical in our case. First benchmarks (http://www.anandtech.com/IT/showdoc.aspx?i=3091) show that Intel still has more raw CPU power, but Barcelona scales much better and also has better memory bandwidth, which I believe is quite critical with 16 cores and a DB usage pattern. On the other hand, Intel's X7350 (2.93GHz) has almost a 50% advantage in CPU frequency over the 2GHz Barcelona.

Barcelona was just announced yesterday. I wouldn't want to be betting my business on it just yet. Plus, AMD usually is able to up the clock on their chips pretty well after they've got a few production runs under their belts. If you've absolutely got to have something today, I'd say Intel would be a safer bet. If you can afford to wait 3-4 months, then you'll benefit from some industry experience as well as production maturity with Barcelona, and can judge then which better fits your needs.

--
Guy Rouillier
Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
I originally posted the question below back in Dec 2006, and many helpful suggestions resulted. Unfortunately, since this was a closet effort, my official duties pushed further exploration to the back burner, and then I lost my original test environment. So while I can no longer compare to BigDBMS, I've just made some discoveries that I thought others might find helpful.

The app (which I inherited) was implemented making exhaustive use of stored procedures. All inserts and updates are done using procs. When configuration changes produced no noticeable improvements in performance, I turned to the application architecture. In a new environment, I updated an insert/update-intensive part of the app to use embedded insert and update statements instead of invoking stored procedures that did the same work. All the remaining code, database implementation, hardware, etc. remains the same. The results were significant. Running a repeatable test set of data produced the following results:

  With stored procs:              2595 seconds
  With embedded inserts/updates:   991 seconds

So at least in this one scenario, it looks like the extensive use of stored procs is contributing significantly to long run times. (A sketch contrasting the two forms appears after this post.)

Guy Rouillier wrote:
> I don't want to violate any license agreement by discussing performance, so I'll refer to a large, commercial PostgreSQL-compatible DBMS only as BigDBMS here. I'm trying to convince my employer to replace BigDBMS with PostgreSQL for at least some of our Java applications. As a proof of concept, I started with a high-volume (but conceptually simple) network data collection application. This application collects files of 5-minute usage statistics from our network devices, and stores a raw form of these stats into one table and a normalized form into a second table. We are currently storing about 12 million rows a day in the normalized table, and each month we start new tables. For the normalized data, the app inserts rows initialized to zero for the entire current day first thing in the morning, then throughout the day, as stats are received, executes updates against existing rows. So the app has very high update activity.
>
> In my test environment, I have a dual-x86 Linux platform running the application, and an old 4-CPU Sun Enterprise 4500 running BigDBMS and PostgreSQL 8.2.0 (only one at a time.) The Sun box has 4 disk arrays attached, each with 12 SCSI hard disks (a D1000 and 3 A1000, for those familiar with these devices.) The arrays are set up with RAID5. So I'm working with a consistent hardware platform for this comparison. I'm only processing a small subset of files (144.) BigDBMS processed this set of data in 20000 seconds, with all foreign keys in place. With all foreign keys in place, PG took 54000 seconds to complete the same job. I've tried various approaches to autovacuum (none, 30-seconds) and it doesn't seem to make much difference. What does seem to make a difference is eliminating all the foreign keys; in that configuration, PG takes about 30000 seconds. Better, but BigDBMS still has it beat significantly.
>
> I've got PG configured so that the system database is on disk array 2, as are the transaction log files. The default table space for the test database is disk array 3. I've got all the reference tables (the tables to which the foreign keys in the stats tables refer) on this array. I also store the stats tables on this array. Finally, I put the indexes for the stats tables on disk array 4. I don't use disk array 1 because I believe it is a software array.
>
> I'm out of ideas how to improve this picture any further. I'd appreciate some suggestions. Thanks.

--
Guy Rouillier
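[For concreteness, a hedged sketch of the two forms being compared; all names are hypothetical, and the real procs surely do more work than this:

  -- Proc-based path: one function call per row.
  CREATE OR REPLACE FUNCTION update_stats(p_device integer,
                                          p_ts     timestamp,
                                          p_bytes  bigint)
  RETURNS void AS $$
  BEGIN
      UPDATE stats SET bytes = bytes + p_bytes
       WHERE device_id = p_device AND sample_time = p_ts;
  END;
  $$ LANGUAGE plpgsql;

  SELECT update_stats(42, '2007-01-06 10:05', 1024);

  -- Embedded path: the same statement issued directly, skipping the
  -- per-row function-call and result-set overhead.
  UPDATE stats SET bytes = bytes + 1024
   WHERE device_id = 42 AND sample_time = '2007-01-06 10:05';]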
Re: [PERFORM] How much ram is too much
Dave Cramer wrote:
> Is it possible that providing 128G of RAM is too much? Will other systems in the server bottleneck?

What CPU and OS are you considering?

--
Guy Rouillier
Re: [PERFORM] How much ram is too much
Dave Cramer wrote:
> It's an IBM x3850 using Linux Redhat 4.0

I had to look that up; the web site says it is a 4-processor, dual-core (so 8 cores) Intel Xeon system. It also says "Up to 64GB DDR II ECC memory", so are you sure you can even get 128 GB of RAM? If you could, I'd expect diminishing returns from the Xeon northbridge memory access. If you are willing to spend that kind of money on memory, you'd be better off with Opteron or Sparc.

--
Guy Rouillier
Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
Dave Cramer wrote:
> On 6-Jan-07, at 11:32 PM, Guy Rouillier wrote:
>>>> The box has 3 GB of memory. I would think that BigDBMS would be hurt by this more than PG. Here are the settings I've modified in postgresql.conf:
>>> As I said, you need to set shared_buffers to at least 750MB; this is the starting point, and it can actually go higher. Additionally, effective cache should be set to 2.25 G. Turning off fsync is not a real-world situation. Additional tuning of file systems can provide some gain; however, as Craig pointed out, some queries may need to be tweaked.
>> Dave, thanks for the hard numbers, I'll try them. I agree turning fsync off is not a production option. In another reply to my original posting, Alex mentioned that BigDBMS gets an advantage from its async IO. So simply as a test, I turned fsync off in an attempt to open wide all the pipes.
>> Regarding shared_buffers=750MB, the last discussions I remember on this subject said that anything over 10,000 (8K buffers = 80 MB) had unproven benefits. So I'm surprised to see such a large value suggested. I'll certainly give it a try and see what happens.
> That is 25% of your available memory. This is just a starting point. There are reports that going as high as 50% can be advantageous; however, you need to measure it yourself.

Ok, I ran with the settings below, but with

  shared_buffers = 768MB
  effective_cache_size = 2048MB
  fsync = on

This run took 29000 seconds. I'm beginning to think configuration changes are not going to buy significant additional improvement. Time to look at the app implementation.

>>>> autovacuum = on
>>>> stats_row_level = on
>>>> max_connections = 10
>>>> listen_addresses = 'db01,localhost'
>>>> shared_buffers = 128MB
>>>> work_mem = 16MB
>>>> maintenance_work_mem = 64MB
>>>> temp_buffers = 32MB
>>>> max_fsm_pages = 204800
>>>> checkpoint_segments = 30
>>>> redirect_stderr = on
>>>> log_line_prefix = '%t %d'

--
Guy Rouillier
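[One hedged way to check whether the larger shared_buffers is actually paying off is the buffer cache hit ratio; in this era, block-level stats must be enabled (stats_block_level = on) for the counters to be populated:

  SELECT datname, blks_hit, blks_read,
         round(blks_hit * 100.0 / nullif(blks_hit + blks_read, 0), 1) AS hit_pct
    FROM pg_stat_database
   WHERE datname = current_database();

Comparing hit_pct across runs at 128MB and 768MB would show whether the extra buffer space is being used, independent of total run time.]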
Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
Ron wrote:
> C= What file system are you using? Unlike BigDBMS, pg does not have its own native one, so you have to choose the one that best suits your needs. For update-heavy applications involving lots of small updates, jfs and XFS should both be seriously considered.

Ron, thanks for your ideas. Many of them I've addressed in response to suggestions from others. I wanted to address this one in particular. Unfortunately, I do not have the liberty to change file systems on this old Sun box. All file systems are formatted Sun UFS. BigDBMS is equally subject to whatever pluses or minuses can be attributed to this file system, so I'm thinking that this issue would be a wash between the two.

I've come to the conclusion that configuration changes to PG alone will not level the playing field. My next step is to try to determine where the biggest payback will be regarding changing the implementation.

--
Guy Rouillier
Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
Craig A. James wrote:
> I don't know if you have access to the application's SQL, or the time to experiment a bit, but unless your schema is trivial and your SQL is boneheaded simple, you're not going to get equal performance from Postgres until you do some analysis of your application under real-world conditions, and optimize the problem areas.

Craig, thanks for taking the time to think about this. Yes, I have all the application source code, and all the time in the world, as I'm doing this experimentation on my own time. The test hardware is old stuff no one intends to use for production work ever again, so I can use it as long as I want. The application is fairly straightforward, but as you say, what is working okay with BigDBMS isn't working as well under PG. I'm going to try the other configuration suggestions made by others before I attempt logic changes. The core logic is unchangeable; millions of rows of data in a single table will be updated throughout the day. If PG can't handle high-volume updates well, this may be a brick wall.

--
Guy Rouillier
Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
Dave Cramer wrote:
>> The box has 3 GB of memory. I would think that BigDBMS would be hurt by this more than PG. Here are the settings I've modified in postgresql.conf:
> As I said, you need to set shared_buffers to at least 750MB; this is the starting point, and it can actually go higher. Additionally, effective cache should be set to 2.25 G. Turning off fsync is not a real-world situation. Additional tuning of file systems can provide some gain; however, as Craig pointed out, some queries may need to be tweaked.

Dave, thanks for the hard numbers, I'll try them. I agree turning fsync off is not a production option. In another reply to my original posting, Alex mentioned that BigDBMS gets an advantage from its async IO. So simply as a test, I turned fsync off in an attempt to open wide all the pipes.

Regarding shared_buffers=750MB, the last discussions I remember on this subject said that anything over 10,000 (8K buffers = 80 MB) had unproven benefits. So I'm surprised to see such a large value suggested. I'll certainly give it a try and see what happens.

>> autovacuum = on
>> stats_row_level = on
>> max_connections = 10
>> listen_addresses = 'db01,localhost'
>> shared_buffers = 128MB
>> work_mem = 16MB
>> maintenance_work_mem = 64MB
>> temp_buffers = 32MB
>> max_fsm_pages = 204800
>> checkpoint_segments = 30
>> redirect_stderr = on
>> log_line_prefix = '%t %d'

--
Guy Rouillier
Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
I've got back access to my test system. I ran another test with the same input data set. This time I put pg_xlog on a different RAID volume (the unused one that I suspect is a software RAID), and I turned fsync=off in postgresql.conf. I left the rest of the configuration alone (all foreign keys removed), etc. Unfortunately, this only dropped elapsed time down to about 28000 seconds (from 30000), still significantly more than BigDBMS. Additional info inline below.

Shoaib Mir wrote:
> Here are my few recommendations that might help you:
> - You will need to do table partitioning (http://www.postgresql.org/docs/current/static/ddl-partitioning.html) as you are storing quite a lot of data in one table per day.

I'm focusing on the detailed perspective for now. The 144 files I'm processing represent not even two hours of data, so that surely wouldn't be split up. (A sketch of the inheritance-based approach appears at the end of this post.)

> - You are using a RAID5 setup, which is something that can also affect performance, so switching to RAID1 might help you there. But again, you have a RAID5 with 12 disks, so hmm, that shouldn't be that much of a problem.

Agreed.

> - Have you done the tuning for postgresql.conf parameters? If not, then you really need to do this for things like checkpoint segments, random page cost, shared buffers, cache size, fsm pages, vacuum cost delay, work_mem, bgwriter, etc. You can get good advice for tuning these parameters at http://www.powerpostgresql.com/PerfList/

The box has 3 GB of memory. I would think that BigDBMS would be hurt by this more than PG. Here are the settings I've modified in postgresql.conf:

  autovacuum = on
  stats_row_level = on
  max_connections = 10
  listen_addresses = 'db01,localhost'
  shared_buffers = 128MB
  work_mem = 16MB
  maintenance_work_mem = 64MB
  temp_buffers = 32MB
  max_fsm_pages = 204800
  checkpoint_segments = 30
  redirect_stderr = on
  log_line_prefix = '%t %d'

> - For autovacuuming you need to properly tune the thresholds so that the vacuum and analyze is done at the right time, without affecting the database server performance. (You can find help for this at http://www.postgresql.org/docs/current/static/routine-vacuuming.html under "22.1.4 The Auto-Vacuum Daemon".)

The real-life load on this database would be fairly constant throughout the day. Stats from network devices are received every 15 minutes from each device, but they are staggered. As a result, the database is almost constantly being updated, so there is no dead time to do vacuums.

> - You will need to separate your transaction logs, i.e. the pg_xlog folder, to a drive other than your database server drive. This can be done by creating symlinks for the pg_xlog folder.

Done, see opening remarks. Unfortunately minor impact.

> - I hope you are doing proper connection pool management, because good use of database connections can really affect the overall performance; connections can be expensive to create, and consume memory if they are not properly closed.

I probably should have mentioned this originally but was afraid of information overload. The application runs on JBoss and uses JBoss connection pools. So connections are pooled, but I don't know how they would compare to native PG connection pools. Essentially, JBoss gets native JDBC connections, and the pools simply allow them to be re-used without opening and closing each time. So if the native PG connection pools provide any pooling optimizations beyond that, those advantages are not being realized.

> Hope that helps your tests...

Thanks to everyone for providing suggestions, and I apologize for my delay in responding to each of them.

> Shoaib Mir
> EnterpriseDB (www.enterprisedb.com)
>
> On 12/28/06, Guy Rouillier [EMAIL PROTECTED] wrote:
>> I don't want to violate any license agreement by discussing performance, so I'll refer to a large, commercial PostgreSQL-compatible DBMS only as BigDBMS here. I'm trying to convince my employer to replace BigDBMS with PostgreSQL for at least some of our Java applications. As a proof of concept, I started with a high-volume (but conceptually simple) network data collection application. This application collects files of 5-minute usage statistics from our network devices, and stores a raw form of these stats into one table and a normalized form into a second table. We are currently storing about 12 million rows a day in the normalized table, and each month we start new tables. For the normalized data, the app inserts rows initialized to zero for the entire current day first thing in the morning, then throughout the day, as stats are received, executes updates against existing rows. So the app has very high update activity. In my test environment, I have a dual-x86 Linux platform running the application [...]

--
Guy Rouillier
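[As referenced above, a hedged sketch of what Shoaib's partitioning suggestion would look like with the inheritance mechanism available in this era; the schema is hypothetical:

  CREATE TABLE stats (
      device_id   integer   NOT NULL,
      sample_time timestamp NOT NULL,
      bytes       bigint
  );

  CREATE TABLE stats_2007_01 (
      CHECK (sample_time >= '2007-01-01' AND sample_time < '2007-02-01')
  ) INHERITS (stats);

  -- With constraint_exclusion = on, a query with a WHERE clause on
  -- sample_time scans only the child tables whose CHECK can match.]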
Re: [PERFORM] High update activity, PostgreSQL vs BigDBMS
Tom Lane wrote:
> Shoaib Mir [EMAIL PROTECTED] writes:
>> Here are my few recommendations that might help you:
> [ snip good advice ]
> Another thing to look at is whether you are doing inserts/updates as individual transactions, and if so see if you can batch them to reduce the per-transaction overhead.

Thank you everyone who replied with suggestions. Unfortunately, this is a background activity for me, so I can only work on it when I can squeeze in time. Right now, I can't do anything; I swapped out a broken switch in our network and the DB server is currently inaccessible ;(. I will eventually work through all suggestions, but I'll start with the ones I can respond to without further investigation.

I'm not doing updates as individual transactions. I cannot use the Java batch functionality because the code uses stored procedures to do the inserts and updates, and the PG JDBC driver cannot handle executing stored procedures in batch. Briefly, executing a stored procedure returns a result set, and Java batches don't expect result sets. So, in the code, I turn autocommit off and do a commit every 100 executions of the stored proc. The exact same code is running against BigDBMS, so any penalty from this approach should be evenly felt.

--
Guy Rouillier
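[In SQL terms, the commit-every-100 loop described above amounts to roughly the following on the wire; the proc name is hypothetical, and each SELECT returns the proc's (trivial) result set, which is why JDBC batching balks:

  BEGIN;
  SELECT update_stats(42, '2006-12-28 00:05', 1024);
  SELECT update_stats(42, '2006-12-28 00:10', 2048);
  -- ... 98 more calls ...
  COMMIT;]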
[PERFORM] High update activity, PostgreSQL vs BigDBMS
I don't want to violate any license agreement by discussing performance, so I'll refer to a large, commercial PostgreSQL-compatible DBMS only as BigDBMS here. I'm trying to convince my employer to replace BigDBMS with PostgreSQL for at least some of our Java applications. As a proof of concept, I started with a high-volume (but conceptually simple) network data collection application. This application collects files of 5-minute usage statistics from our network devices, and stores a raw form of these stats into one table and a normalized form into a second table. We are currently storing about 12 million rows a day in the normalized table, and each month we start new tables. For the normalized data, the app inserts rows initialized to zero for the entire current day first thing in the morning, then throughout the day, as stats are received, executes updates against existing rows. So the app has very high update activity.

In my test environment, I have a dual-x86 Linux platform running the application, and an old 4-CPU Sun Enterprise 4500 running BigDBMS and PostgreSQL 8.2.0 (only one at a time.) The Sun box has 4 disk arrays attached, each with 12 SCSI hard disks (a D1000 and 3 A1000, for those familiar with these devices.) The arrays are set up with RAID5. So I'm working with a consistent hardware platform for this comparison. I'm only processing a small subset of files (144.) BigDBMS processed this set of data in 20000 seconds, with all foreign keys in place. With all foreign keys in place, PG took 54000 seconds to complete the same job. I've tried various approaches to autovacuum (none, 30-seconds) and it doesn't seem to make much difference. What does seem to make a difference is eliminating all the foreign keys; in that configuration, PG takes about 30000 seconds. Better, but BigDBMS still has it beat significantly.

I've got PG configured so that the system database is on disk array 2, as are the transaction log files. The default table space for the test database is disk array 3. I've got all the reference tables (the tables to which the foreign keys in the stats tables refer) on this array. I also store the stats tables on this array. Finally, I put the indexes for the stats tables on disk array 4. I don't use disk array 1 because I believe it is a software array.

I'm out of ideas how to improve this picture any further. I'd appreciate some suggestions. Thanks.

--
Guy Rouillier
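[A hedged sketch of the daily load pattern described above, on a hypothetical schema: one row per device per 5-minute slot, pre-inserted at zero first thing in the morning, then updated as stats arrive. generate_series(int, int) is used because the timestamp variant isn't available in 8.2:

  -- Pre-insert 288 zeroed 5-minute slots per device for the day.
  INSERT INTO stats (device_id, sample_time, bytes)
  SELECT d.device_id,
         '2006-12-28'::timestamp + (s.n * interval '5 minutes'),
         0
    FROM devices d
   CROSS JOIN generate_series(0, 287) AS s(n);

  -- Throughout the day, each arriving stat updates its existing row.
  UPDATE stats SET bytes = bytes + 1024
   WHERE device_id = 42 AND sample_time = '2006-12-28 10:05';]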