Re: [PERFORM] Performace Optimization for Dummies
Carlo Stonebanks wrote: > Lots of great info here, I will see what applies to my situation. However, I > don't see bulk inserts of the tables working, because all of the tables need > to be refreshed as values to deduplicate and match will change with every > row added. In order for this to work, i would have to write queries against > the hash tables. This is where something like MySQL's in-memory tables would > have come in handy... > > What is GDB? > > Carlo Sorry, meant GDBM (disk based hash/lookup table). With Postgres if your tables fit into RAM then they are in-memory as long as they're actively being used. Hashtables and GDBM, as far as I know, are only useful for key->value lookups. However, for this they are *fast*. If you can figure out a way to make them work I'll bet things speed up. -- Matthew Nuzum newz2000 on freenode ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Performace Optimization for Dummies
Carlo Stonebanks wrote: >> are you using the 'copy' interface? > > Straightforward inserts - the import data has to transformed, normalised and > de-duped by the import program. I imagine the copy interface is for more > straightforward data importing. These are - buy necessity - single row > inserts. > I know this is an answer to a question you didn't ask, but here it is. I was once doing stuff where I processed log files and had to do many lookups to normalize the data before insertion. I started out doing everything in SQL and using postgresql tables and it took a little over 24 hours to process 24 hours worth of data. Like you, it was single process, many lookups. I found a better way. I rewrote it (once in c#, again in python) and used hashtables/dictionaries instead of tables for the lookup data. For example, I'd start by loading the data into hash tables (yes, this took a *lot* of ram) then for each row I did something like: 1. is it in the hash table? 1. If not, insert it into the db 1. Insert it into the hashtable 2. Get the lookup field out of the hash table 3. Output normalized data This allow me to create text files containing the data in COPY format which can then be inserted into the database at dramatically increased speeds. My first version in C# (mono) cut the time down to 6 hours for 24 hours worth of data. I tweaked the algorithms and rewrote it in Python and got it down to 45 min. (Python can't take all the credit for the performance boost, I used an improved technique that could have been done in C# as well) This time included the time needed to do the copy and update the indexes. I created a version that also used gdb databases instead of hash tables. It increased the time from 45 min to a little over an hour but decreased the memory usage to something like 45MB (vs dozens or hundreds of MB per hashtable) -- Matthew Nuzum newz2000 on freenode ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Sun Fire T2000 and PostgreSQL 8.1.3
On 4/6/06, Juan Casero (FL FLC) <[EMAIL PROTECTED]> wrote: > Because I plan to develop a rather large (for us anyway) data warehouse > with PostgreSQL. I am looking for the right hardware that can handle > queries on a database that might grow to over a 100 gigabytes. You need to look for a server that has fast I/O. 100 GB of data will take a long time to scan through and won't fit in RAM. > Right > now our decision support system based on postgresql 8.1.3 stores retail > sales information for about 4 four years back *but* only as weekly > summaries. I want to build the system so it can handle daily sales > transactions also. You can imagine how many more records this will > involve so I am looking for hardware that can give me the performance I > need to make this project useable. Sounds like you need to be doing a few heavy queries when you do this, not tons of small queries. That likely means you need fewer CPUs that are very fast. > In other words parsing and loading > the daily transaction logs for our stores is likely to take huge amounts > of effort. I need a machine that can complete the task in a reasonable > amount of time. See my previous comment > As people start to query the database to find sales > related reports and information I need to make sure the queries will run > reasonably fast for them. Get more than one CPU core and make sure you have a lot of drive spindles. You will definately want to be able to ensure a long running query doesn't hog your i/o system. I have a server with a single disk and when we do a long query the server load will jump from about .2 to 10 until the long query finishes. More cpus won't help this because the bottle neck is the disk. > I have already hand optimized all of my > queries on the current system. But currently I only have weekly sales > summaries. Other divisions in our company have done a similar project > using MS SQL Server on SMP hardware far outclassing the database server > I currently use and they report heavy loads on the server with less than > ideal query run times. I am sure I can do my part to optimize the > queries once I start this project but there is only so much you can do. > At some point you just need more powerful hardware. This is where I am > at right now. You say "this is where I am at right __now__" but where will you be in 9 months? Sounds like you will be i/o bound by the time you get above 10GB. > Apart from that since I will only get this one chance to > buy a new server for data processing I need to make sure that I buy > something that can grow over time as our needs change. I don't want to > buy a server only to find out later that it cannot meet our needs with > future database projects. I have to balance a limited budget, room for > future performance growth, and current system requirements. Trust me it > isn't easy. Isn't it about time we had our annual "what kind of server can I get for $8k" thread? -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] 1 TB of memory
On 3/16/06, Jim Nasby <[EMAIL PROTECTED]> wrote: > PostgreSQL tuned to the max and still too slow? Database too big to > fit into memory? Here's the solution! http://www.superssd.com/ > products/tera-ramsan/ > > Anyone purchasing one will be expected to post benchmarks! :) Pricing is tight-lipped, but searching shows $1.85 /GB. That's close to $500,000 for 250GB. One report says a person paid $219,000 for 32GB and 1TB costs "well over $1,000,000." But they "guarantee the performance." Too rich for me. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] help needed asap....
On 12 Mar 2006 11:46:25 -, Phadnis <[EMAIL PROTECTED]> wrote: > Hi. > > I am new to postgres and i need help from u.i hope i get positive response.. > though my questions mite seem silly to u... > > iam working on postgres.. i have around 1 lakh records in almost 12 tables.. > 1 ) when i try to query for count or for any thg it takes a long time to > return the result. How to avoid this > > 2) also i want to know how to increase the performance..( i do vacuum once > in a day) > These two questions are applicable to this list... your other questions may get quicker responses on the users list. However, you haven't provided enough information for anyone here to help. Here's what you should do: Find queries that you think should be faster than they are. For example, if your query is "Select count(*) from foo" you can get important performance information about the query by running: EXPLAIN ANALYZE select count(*) from foo Send the details of the query, including the output from the explain analyze output (which looks pretty meaningless until you've learned what to look for) to the list with a detailed question. Also, for general performance hints, tell the list what your setup is, what items you've tweaked (and maybe why). Generally, be as generous with your details as you can. Also, have you googled around for hints? Here's a good website with information: http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php Notice there's a section on performance tips. Also, this list works because volunteers who have knowledge and free time choose to help when they can. If you really need answers ASAP, there are a few organizations who provide paid support. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Process Time X200
On 3/10/06, NbForYou <[EMAIL PROTECTED]> wrote: > Hey Michael, you sure know your stuff! > > Versions: > > PostgreSQL 7.3.9-RH running on the webhost. > PostgreSQL 8.0.3 running on my homeserver. > > So the only solution is to ask my webhost to upgrade its postgresql? > The question is will he do that? After all a license fee is required for > commercial use. And running a webhosting service is a commercial use. > > thanks for replying and going through the effort of creating the database > and populating it. > > Nick > You can look at the explain analyze output of the query from pg 7.3, figure out why the plan is bad and tweak your query to get optimum performance. Yes, I agree with the other statements that say, "upgrade to 7.4 or 8.x if you can" but if you can't, then you can still work on it. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Postgres on VPS - how much is enough?
On 3/7/06, Dave Page wrote: > On 7/3/06 18:45, "Jim C. Nasby" <[EMAIL PROTECTED]> wrote: > > On Mon, Mar 06, 2006 at 01:14:45PM -0400, Marc G. Fournier wrote: > >> We host VPSs here (http://www.hub.org) and don't use the 'single file, > >> virtual file system' to put them into ... it must depend on where you > >> host? > > > > Yeah, but aren't you also using FreeBSD jails? AFAIK linux doesn't have > > an equivalent to jail; all their VPS stuff actually brings up a > > full-blown copy of linux, kernel and all. > > No, linux vserver is equivalent to a jail - and they work superbly imho. > developer.pgadmin.org is just one such VM that I run. > > http://www.linux-vserver.org/ > > Regards, Dave. I can confirm this. I've been using linux-vserver for years. It is a very up-to-date and active project that is extremely responsive and helpful to users of all experience levels. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Postgres on VPS - how much is enough?
On 3/6/06, Marc G. Fournier <[EMAIL PROTECTED]> wrote: > On Mon, 6 Mar 2006, Matthew Nuzum wrote: > > My problem with running PG inside of a VPS was that the VPS used a > > virtual filesystem... basically, a single file that had been formatted > > and loop mounted so that it looked like a regular hard drive. > > Unfortunately, it was very slow. The difference between my application > > and yours is that mine well more than filled the 1GB of RAM that I had > > allocated. If your data will fit comfortably into RAM then you may be > > fine. > > We host VPSs here (http://www.hub.org) and don't use the 'single file, > virtual file system' to put them into ... it must depend on where you > host? That's true... I hope I didn't imply that I am anti-vps, I run my own servers and one of them is dedicated to doing VPS for different applications. I think they're wonderful. On 3/6/06, Nagita Karunaratne <[EMAIL PROTECTED]> wrote: > From personal experience, would you run Postgres on a linux machine > (NOT a vps) with 512MB of ram? > > Assumining I can keep all my data in memory. Nagita, It all depends on performance... I have one postgres database that runs on a Pentium 350MHz with 128MB of RAM. It does 1 insert per minute 24 hours per day. Because the load is so low, I can get away with minimal hardware. If your application has a lot of inserts/updates then disk speed is important and can vary greatly from one VPS to another. If your application is not time-critical than this may be a moot point anyway. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Postgres on VPS - how much is enough?
On 3/6/06, Nagita Karunaratne <[EMAIL PROTECTED]> wrote: > How big a VPS would I need to run a Postgres DB. > > One application will add about 500 orders per day > Another will access this data to create and send about 500 emails per day > A third will access this data to create an after-sales survey for at > most 500 times per day. > > What type of VPS would I need to run a database with this type pf load? > Is 128 MB ram enough? > What percentage of a 2.8 GHz CPU would be required? My problem with running PG inside of a VPS was that the VPS used a virtual filesystem... basically, a single file that had been formatted and loop mounted so that it looked like a regular hard drive. Unfortunately, it was very slow. The difference between my application and yours is that mine well more than filled the 1GB of RAM that I had allocated. If your data will fit comfortably into RAM then you may be fine. If you really want to know how it will work, try running it yourself. Two projects that make this really easy and free is the colinux project[1] which allows you to run a linux VPS in Windows and the linux-vserver project[2] which is free software that works on pretty much any linux OS. Try it out, tinker with the values and that way you won't have to guess when making your purchase decission. [1] http://www.colinux.org/ Coperative Linux [2] http://linux-vserver.org/ Linux-vserver project -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Large Database Design Help
On 2/9/06, Orion Henry <[EMAIL PROTECTED]> wrote: > > Hello All, > > I've inherited a postgresql database that I would like to refactor. It > was origionally designed for Postgres 7.0 on a PIII 500Mhz and some > design decisions were made that don't make sense any more. Here's the > problem: > > 1) The database is very large, the largest table has 40 million tuples. > > 2) The database needs to import 10's of thousands of tuples each night > quickly. The current method is VERY slow. > > 3) I can't import new records with a COPY or drop my indexes b/c some of > them are new records (INSERTS) and some are altered records (UPDATES) > and the only way I can think of to identify these records is to perform > a select for each record. [snip] > > 3) The current code that bulk loads data into the database is a loop > that looks like this: > > $result = exe("INSERT INTO $table ($name_str) SELECT > $val_str WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)"); > if ($result == 0) > { > $result = exe("UPDATE $table SET $non_keys WHERE > $keys"); > } > > Is there a faster way to bulk load data when it's not known ahead of > time if it's a new record or an updated record? I experimented with something like this and I was able to successively decrease the amount of time needed with an import. The final solution that took my import down from aproximately 24 hours to about 30 min was to use a C#/Java hashtable or a python dictionary. For example, the unique data in one particular table was "User_Agent" so I made it the key in my hashtable. I actually added a method to the hashtable so that when I added a new record to the hashtable it would do the insert into the db. The downside to this is that it used *GOBS* of RAM. Using Python, I was able to dramatically decrease the ram usage by switching to a GDB based dictionary instead of the standard dictionary. It only increased the time by about 50% so the total processing time was about 45 min vs the previous 30 min. I only had about 35 million records and my technique was getting to the point where it was unweldy, so with your 40 million and counting records you would probably want to start with the GDB technique unless you have a ton of available ram. You might interpret this as being a knock against PostgreSQL since I pulled the data out of the db, but it's not; You'd be hard pressed to find anything as fast as the in-memory hashtable or the on disk GDB; however it's usefullness is very limited and for anything more complex than just key=>value lookups moving to PostgreSQL is likely a big win. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 11/16/05, Steinar H. Gunderson <[EMAIL PROTECTED]> wrote: > If you have a cool SAN, it alerts you and removes all data off a disk > _before_ it starts giving hard failures :-) > > /* Steinar */ > -- > Homepage: http://www.sesse.net/ Good point. I have avoided data loss *twice* this year by using SMART hard drive monitoring software. I can't tell you how good it feels to replace a drive that is about to die, as compared to restoring data because a drive died. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware/OS recommendations for large databases (
On 11/16/05, David Boreham <[EMAIL PROTECTED]> wrote: > >Spend a fortune on dual core CPUs and then buy crappy disks... I bet > >for most applications this system will be IO bound, and you will see a > >nice lot of drive failures in the first year of operation with > >consumer grade drives. > > I guess I've never bought into the vendor story that there are > two reliability grades. Why would they bother making two > different kinds of bearing, motor etc ? Seems like it's more > likely an excuse to justify higher prices. In my experience the > expensive SCSI drives I own break frequently while the cheapo > desktop drives just keep chunking along (modulo certain products > that have a specific known reliability problem). > > I'd expect that a larger number of hotter drives will give a less reliable > system than a smaller number of cooler ones. Of all the SCSI and IDE drives I've used, and I've used a lot, there is a definite difference in quality. The SCSI drives primarily use higher quality components that are intended to last longer under 24/7 work loads. I've taken several SCSI and IDE drives apart and you can tell from the guts that the SCSI drives are built with sturdier components. I haven't gotten my hands on the Raptor line of ATA drives yet, but I've heard they share this in common with the SCSI drives - they are built with components made to be used day and night for years straight without ending. That doesn't mean they will last longer than IDE drives, that just means they've been designed to withstand higher amounts of heat and sustained activity. I've got some IDE drives that have lasted years++ and I've got some IDE drives that have lasted months. However, my SCSI drives I've had over the years all lasted longer than the server they were installed in. I will say that in the last 10 years, the MTBF of IDE/ATA drives has improved dramatically, so I regularly use them in servers, however I have also shifted my ideology so that a server should be replaced after 3 years, where before I aimed for 5. It seems to me that the least reliable components in servers these days are the fans. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Help tuning postgres
On 10/13/05, Csaba Nagy <[EMAIL PROTECTED]> wrote: > On Thu, 2005-10-13 at 14:40, Andrew Sullivan wrote: > > On Thu, Oct 13, 2005 at 10:15:03AM +0200, Csaba Nagy wrote: > > > And how would the analyze help in finding this out ? I thought it would > > > only show me additionally the actual timings, not more detail in what > > > was done... > > > > Yes, it shows the actual timings, and the actual number of rows. But > > if the estimates that the planner makes are wildly different than the > > actual results, then you know your statistics are wrong, and that the > > planner is going about things the wrong way. ANALYSE is a big help. > > There's also a verbose option to it, but it's usually less useful in > > production situations. This is the point I was trying to make. I've seen special instances where people have posted an explain annalyze for a select/update to the list and suggestions have arisen allowing major performance improvements. If this task is where your database is performing its worst then it is the best place to start with optimizing, short of the obvious stuff, which it sounds like you've covered. Sometimes, and I think this has often been true for databases that are either very large or very small, statistics can be tweaked to get better performance. One good example is when a sequential scan is being chosen when an index scan may be better; something like this would definately peg your disk i/o. Throwing more hardware at your problem will definately help, but I'm a performance freak and I like to optimize everything to the max. *Sometimes* you can get drastic improvements without adding any hardware. I have seen some truly miraculus turn-arounds by tweaking some non-obvious settings based on suggestions made on this list. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Help tuning postgres
On 10/12/05, Csaba Nagy <[EMAIL PROTECTED]> wrote: > We have adapted our application (originally written for oracle) to > postgres, and switched part of our business to a postgres data base. > The data base has in the main tables around 150 million rows, the whole > data set takes ~ 30G after the initial migration. After ~ a month of > usage that bloated to ~ 100G. We installed autovacuum after ~ 2 weeks. > > The main table is heavily updated during the active periods of usage, > which is coming in bursts. > > Now Oracle on the same hardware has no problems handling it (the load), > but postgres comes to a crawl. Examining the pg_stats_activity table I > see the updates on the main table as being the biggest problem, they are > very slow. The table has a few indexes on it, I wonder if they are > updated too on an update ? The index fields are not changing. In any > case, I can't explain why the updates are so much slower on postgres. I'm not the most experience person on this list, but I've got some big tables I work with. Doing an update on these big tables often involves a sequential scan which can be quite slow. I would suggest posting the explain analyze output for one of your slow updates. I'll bet it is much more revealing and takes out a lot of the guesswork. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[PERFORM] Logarithmic change (decrease) in performance
Something interesting is going on. I wish I could show you the graphs, but I'm sure this will not be a surprise to the seasoned veterans. A particular application server I have has been running for over a year now. I've been logging cpu load since mid-april. It took 8 months or more to fall from excellent performance to "acceptable." Then, over the course of about 5 weeks it fell from "acceptable" to "so-so." Then, in the last four weeks it's gone from "so-so" to alarming. I've been working on this performance drop since Friday but it wasn't until I replied to Arnau's post earlier today that I remembered I'd been logging the server load. I grabbed the data and charted it in Excel and to my surprise, the graph of the server's load average looks kind of like the graph of y=x^2. I've got to make a recomendation for a solution to the PHB and my analysis is showing that as the dataset becomes larger, the amount of time the disk spends seeking is increasing. This causes processes to take longer to finish, which causes more processes to pile up, which cuases processes to take longer to finish, which causes more processes to pile up etc. It is this growing dataset that seems to be the source of the sharp decrease in performance. I knew this day would come, but I'm actually quite surprised that when it came, there was little time between the warning and the grande finale. I guess this message is being sent to the list to serve as a warning to other data warehouse admins that when you reach your capacity, the downward spiral happens rather quickly. Crud... Outlook just froze while composing the PHB memo. I've been working on that for an hour. What a bad day. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Monitoring Postgresql performance
On 9/28/05, Arnau <[EMAIL PROTECTED]> wrote: > Hi all, > >I have been "googling" a bit searching info about a way to monitor > postgresql (CPU & Memory, num processes, ... ) and I haven't found > anything relevant. I'm using munin to monitor others parameters of my > servers and I'd like to include postgresql or have a similar tool. Any > of you is using anything like that? all kind of hints are welcome :-) > > Cheers! > -- > Arnau I have a cronjob that runs every 5 minutes and checks the number of processes. When things get unruly I get a text message sent to my cell phone. It also creates a detailed log entry. I'll paste in an example of one of my scripts that does this below. This is on a dual purpose server and monitors both cpu load average and postgres. You can have the text message sent to multiple email addresses, just put a space separated list of e-mail addresses between quotes in the CONTACTS= line. It's simple, but it works and its always nice to know when there's a problem *before the boss discovers it* ;-) # Create some messages HOSTNAME=`hostname` WARNING_DB="Database connections on $HOSTNAME is rather high" WARNING_CPU="CPU load on $HOSTNAME is rather high" CONTACTS="[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]" WARN=0 #calculate the db load DB_LOAD=`ps -ax | grep postgres | wc -l` if (($DB_LOAD > 150)) then WARN=1 echo "$WARNING_DB ($DB_LOAD) " | mail -s "db_load is high ($DB_LOAD)" $CONTACTS fi #calculate the processor load CPU_LOAD=`cat /proc/loadavg | cut --delimiter=" " -f 2 | cut --delimiter="." -f 1` if (($CPU_LOAD > 8)) then WARN=1 echo "$WARNING_CPU ($CPU_LOAD) " | mail -s "CPU_load is high ($CPU_LOAD)" $CONTACTS fi if (($WARN > 0)) then echo -=-=-=-=-=-=-=-=- W A R N I N G -=-=-=-=-=-=-=-=- >> /tmp/warn.txt NOW=`date` echo -=-=-=-=-=-$NOW-=-=-=-=-=- >> /tmp/warn.txt echo CPU LOAD: $CPU_LOAD DB LOAD: $DB_LOAD >> /tmp/warn.txt echo >> /tmp/warn.txt top -bn 1 >> /tmp/warn.txt echo >> /tmp/warn.txt fi NOW=`date` CPU_LOAD=`cat /proc/loadavg | cut --delimiter=" " -f 1,2,3 --output-delimiter=\|` echo -e $NOW\|$CPU_LOAD\|$DB_LOAD >> ~/LOAD_MONITOR.LOG -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Postgresql Hardware - Recommendations
On 9/5/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > ... The only info I can > give so far is that the database size is about 60GB, and that it will be > frequently accessed by multiple users (about 100 will be connected > during business hours). The applications accessing the database are > mostly reporting tools. Optimizing hardware for mostly selects is different than optimizing for lots of inserts. You will get good responses from this list if you can give a little more details. Here are some questions: How do you get your data into the db? Do you do bullk loads at periodic intervals during the day? Do you do frequent updates/inserts? You say reporting, do you use many stored procedures and calculations on the server side? I've used some reporting apps that simply grab tons of data from the server and then process it on the client side (ODBC apps seem to do this), while other applications formulate the queries and use stored procedures in order to transfer little data. Of your 60GB, how much of that is active? Does your budget allow you to buy enough RAM to get your active data into the disk cache? For reporting, this *might* be your biggest win. Here are some scenarios: S1: Bulk uploads once or twice daily of about 250 MB of data. Few inserts and updates during the day (1-2%). Reporting is largely done on data from the last 5 business days. In this case you have < 2GB of active data and your disk cache will hold all of your active data in RAM (provided your db structure is diskcache friendly). An example of this I have experienced is a sales application that queries current inventory. Telephone agents queried, quieried, queried the instock-inventory. S2: Same as above but reporting is largely done on data covering 200+ business days. Its doubtful that you will get 50GB of RAM in your server, you need to focus on disk speed. An example of this I have experienced was an application that looked at sales trends and performed commission calculations and projected sales forecasts. S3: Lots of inserts/updates throughout the day (15 - 25%) - you need to focus on disk speed. The content management system my employer develops fits this model. > 3) CPUs vs Memory > Would you rather have 4x CPUs and 8GB of memory, or 2x CPUs with 16GB of > memory? Very hard to say without knowing your application. I have limited experience but what I've found is that applications that support multiple db architectures do not fully utilize the database server and CPU utilization is low. Disk and network i/o is high. I don't know if your application supports multiple backeneds, but chances are good your biggest wins will come from RAM, disk and network investments. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] RAID Configuration Sugestion
On 8/30/05, Ron <[EMAIL PROTECTED]> wrote: > >If you still have the budget, I would suggest considering either > >what Ron suggested or possibly using a 4 drive RAID 10 instead. > > IME, with only 4 HDs, it's usually better to split them them into two > RAID 1's (one for the db, one for everything else including the logs) > than it is to put everything on one RAID 10. YMMV. This coresponds to what I have observed as well. Of course, we all know that work loads varry. Just a note for the OP who has only two drives, there are tools for a variety of OSs that monitor the S.M.A.R.T. features of the drive and give an early warning in case it senses impending failure. I've caught two drives before failure with these types of tools. Also note that when reading discussions of this nature you must take into consideration the value of your data. For some people, restoring from a nightly backup is inconvienent, but not life-or-death. Some people even do twice-daily backups so that in case of a failure they can recover with little loss of data. This might be a good way to mitigate the cost of expensive server hardware. If you cannot afford to lose any data then you need to consider it imperitive to use some type of RAID setup (not RAID 0) and to achieve great performance you'll want more than 2 drives. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] RAID Configuration Sugestion
> > >On Tue, Aug 30, 2005 at 09:37:17 -0300, > > > Alvaro Nunes Melo <[EMAIL PROTECTED]> wrote: > > >>The server is a Dual Xeon 3.0 with 2 GB RAM and two SCSI disks. Our main > > >>doubt is what is the best configuration for the disks. We are thinking > > >>about use them in a RAID-0 array. Is this the best option? What do you > > >>suggest on partitioning? Separate partitions for the OS, data and pg_xlog? > > > > > Our main goal is performance speedup. Disk space might not be a problem. > > I've read a lot here about movig pg_xlog to different partitions, and > > we'll surely make tests to see what configuration might be better. > I've set up several servers with a config like this. Its not ideal, but there's no reason you can't enjoy the benefits of a snappy application. The best results I've had involve dedicating one drive to OS, swap, logs, tmp and everything and dedicate one drive to postgres. If you use *nix you can mount the second drive as /var/lib/pgsql (or where ever postgres lives on your server) with noatime as a mount option. In retrospect, you might have saved the money on the second CPU and gotten two more hard drives, but if you're running a dual task server (i.e. LAMP) you may appreciate the second CPU. The beauty of a server like this is that it puts more of the wizardry of creating a fast application into the hands of the app developer, which results in a better db schema, optimized queries and generally *thinking* about the performance of the code. I personally feel that to be a very rewarding aspect of my job. (As a hobby I program microntrollers that run at 4MHz and have only 256 bytes of RAM, so that could just be me.;-) -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Need for speed
On 8/17/05, Ron <[EMAIL PROTECTED]> wrote: > At 05:15 AM 8/17/2005, Ulrich Wisser wrote: > >Hello, > > > >thanks for all your suggestions. > > > >I can see that the Linux system is 90% waiting for disc io. ... > 1= your primary usage is OLTP-like, but you are also expecting to do > reports against the same schema that is supporting your OLTP-like > usage. Bad Idea. Schemas that are optimized for reporting and other > data mining like operation are pessimal for OLTP-like applications > and vice versa. You need two schemas: one optimized for lots of > inserts and deletes (OLTP-like), and one optimized for reporting > (data-mining like). Ulrich, If you meant that your disc/scsi system is already the fastest available *with your current budget* then following Ron's advise I quoted above will be a good step. I have some systems very similar to yours. What I do is import in batches and then immediately pre-process the batch data into tables optimized for quick queries. For example, if your reports frequenly need to find the total number of views per hour for each customer, create a table whose data contains just the totals for each customer for each hour of the day. This will make it a tiny fraction of the size, allowing it to fit largely in RAM for the query and making the indexes more efficient. This is a tricky job, but if you do it right, your company will be a big success and buy you more hardware to work with. Of course, they'll also ask you to create dozens of new reports, but that's par for the course. Even if you have the budget for more hardware, I feel that creating an effective db structure is a much more elegant solution than to throw more hardware. (I admit, sometimes its cheaper to throw more hardware) If you have particular queries that are too slow, posting the explain analyze for each on the list should garner some help. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] faster INSERT with possible pre-existing row?
On 7/26/05, Dan Harris <[EMAIL PROTECTED]> wrote: > I am working on a process that will be inserting tens of million rows > and need this to be as quick as possible. > > The catch is that for each row I could potentially insert, I need to > look and see if the relationship is already there to prevent > multiple entries. Currently I am doing a SELECT before doing the > INSERT, but I recognize the speed penalty in doing to operations. I > wonder if there is some way I can say "insert this record, only if it > doesn't exist already". To see if it exists, I would need to compare > 3 fields instead of just enforcing a primary key. I struggled with this for a while. At first I tried stored procedures and triggers, but it took very long (over 24 hours for my dataset). After several iterations of rewritting it, first into C# then into Python I got the whole process down to under 30 min. My scenario is this: I want to normalize log data. For example, for the IP address in a log entry, I need to look up the unique id of the IP address, or if the IP address is new, insert it and then return the newly created entry. Multiple processes use the data, but only one process, run daily, actually changes it. Because this one process knows that the data is static, it selects the tables into in-memory hash tables (C#) or Dictionaries (Python) and then does the lookups there. It is *super* fast, but it uses a *lot* of ram. ;-) To limit the ram, I wrote a version of the python code that uses gdbm files instead of Dictionaries. This requires a newer version of Python (to allow a gdbm db to work just like a dictionary) but makes life easier in case someone is using my software on a lower end machine. This doubled the time of the lookups from about 15 minutes to 30, bringing the whole process to about 45 minutes. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Projecting currentdb to more users
On 7/12/05, Yves Vindevogel <[EMAIL PROTECTED]> wrote: > Hi, > > We have a couple of database that are identical (one for each customer). > They are all relatively small, ranging from 100k records to 1m records. > There's only one main table with some smaller tables, a lot of indexes > and some functions. > > I would like to make an estimation of the performance, the diskspace > and other related things, > when we have database of for instance 10 million records or 100 million > records. > > Is there any math to be done on that ? Its pretty easy to make a database run fast with only a few thousand records, or even a million records, however things start to slow down non-linearly when the database grows too big to fit in RAM. I'm not a guru, but my attempts to do this have not been very accurate. Maybe (just maybe) you could get an idea by disabling the OS cache on the file system(s) holding the database and then somehow fragmenting the drive severly (maybe by putting each table in it's own disk partition?!?) and measuring performance. On the positive side, there are a lot of wise people on this list who have +++ experience optimzing slow queries on big databases. So queries now that run in 20 ms but slow down to 7 seconds when your tables grow will likely benefit from optimizing. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] optimized counting of web statistics
On 6/29/05, Rudi Starcevic <[EMAIL PROTECTED]> wrote: > Hi, > > >I do my batch processing daily using a python script I've written. I > >found that trying to do it with pl/pgsql took more than 24 hours to > >process 24 hours worth of logs. I then used C# and in memory hash > >tables to drop the time to 2 hours, but I couldn't get mono installed > >on some of my older servers. Python proved the fastest and I can > >process 24 hours worth of logs in about 15 minutes. Common reports run > >in < 1 sec and custom reports run in < 15 seconds (usually). > > > > > > When you say you do your batch processing in a Python script do you mean > a you are using 'plpython' inside > PostgreSQL or using Python to execut select statements and crunch the > data 'outside' PostgreSQL? > > Your reply is very interesting. Sorry for not making that clear... I don't use plpython, I'm using an external python program that makes database connections, creates dictionaries and does the normalization/batch processing in memory. It then saves the changes to a textfile which is copied using psql. I've tried many things and while this is RAM intensive, it is by far the fastest aproach I've found. I've also modified the python program to optionally use disk based dictionaries based on (I think) gdb. This signfincantly increases the time to closer to 25 min. ;-) but drops the memory usage by an order of magnitude. To be fair to C# and .Net, I think that python and C# can do it equally fast, but between the time of creating the C# version and the python version I learned some new optimization techniques. I feel that both are powerful languages. (To be fair to python, I can write the dictionary lookup code in 25% (aprox) fewer lines than similar hash table code in C#. I could go on but I think I'm starting to get off topic.) -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] optimized counting of web statistics
On 6/28/05, Billy extyeightysix <[EMAIL PROTECTED]> wrote: > Hola folks, > > I have a web statistics Pg database (user agent, urls, referrer, etc) > that is part of an online web survey system. All of the data derived > from analyzing web server logs is stored in one large table with each > record representing an analyzed webserver log entry. > > Currently all reports are generated when the logs are being analyzed > and before the data ever goes into the large table I mention above. > Well, the time has come to build an interface that will allow a user > to make ad-hoc queries against the stats and that is why I am emailing > the performance list. Load your data into a big table, then pre-process into additional tables that have data better organized for running your reports. For example, you may want a table that shows a sum of all hits for each site, for each hour of the day. You may want an additional table that shows the sum of all page views, or maybe sessions for each site for each hour of the day. So, if you manage a single site, each day you will add 24 new records to the sum table. You may want the following fields: site (string) atime (timestamptz) hour_of_day (int) day_of_week (int) total_hits (int8) A record may look like this: site | atime | hour_of_day | day_of_week | total_hits 'www.yoursite.com' '2005-06-28 16:00:00 -0400' 18 2 350 Index all of the fields except total_hits (unless you want a report that shows all hours where hits were greater than x or less than x). Doing: select sum(total_hits) as total_hits from summary_table where atime between now() and (now() - '7 days'::interval); should be pretty fast. You can also normalize your data such as referrers, user agents, etc and create similar tables to the above. In case you haven't guessed, I've already done this very thing. I do my batch processing daily using a python script I've written. I found that trying to do it with pl/pgsql took more than 24 hours to process 24 hours worth of logs. I then used C# and in memory hash tables to drop the time to 2 hours, but I couldn't get mono installed on some of my older servers. Python proved the fastest and I can process 24 hours worth of logs in about 15 minutes. Common reports run in < 1 sec and custom reports run in < 15 seconds (usually). -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] Do Views execute underlying query everytime ??
On 6/21/05, PFC <[EMAIL PROTECTED]> wrote: ... > In your case I don't think that is the solution, because you do big > updates. With triggers this would mean issuing one update of your > materialized view per row in your big update. This could be slow. > > In this case you might want to update the cache table in one request > rather than doing an awful lot of updates. > > So you have two solutions : > > 1- Junk it all and rebuild it from scratch (this can be faster than it > seems) > 2- Put the rows to be added in a temporary table, update the cache table > considering the difference between this temporary table and your big > table, then insert the rows in the big table. > > This is the fastest solution but it requires a bit more coding (not THAT > much though). > Amit, I understand your desire to not need any manual intervention... I don't know what OS you use, but here are two practical techniques you can use to achieve the above solution suggested by PFC: a: If you are on a Unix like OS such as Linux of Free BSD you have the beautiful cron program that will run commands nightly. b: If you are on Windows you have to do something else. The simplest solution I've found is called "pycron" (easily locatable by google) and is a service that emulates Unix cron on windows (bypassing a lot of the windows scheduler hassle). Now, using either of those solutions, let's say at 6:00 am you want to do your batch query. 1. Put the queries you want into a text file EXACTLY as you would type them using psql and save the text file. For example, the file may be named "create_mat_view.txt". 2. Test them by doing this from a command prompt: psql dbname < create_mat_view.txt 3. Create a cron entry to run the command once a day, it might look like this: 0 6 * * * /usr/bin/psql dbname < /home/admin/create_mat_view.txt or maybe like this: 0 6 * * * "C:\Program Files\PostgreSQL\8.0\psql.exe" dbname < "C:\create_mat_view.txt" I hope this helps, -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Do Views execute underlying query everytime ??
On 6/21/05, Amit V Shah <[EMAIL PROTECTED]> wrote: > Hi all, ... > I am thinking of having a solution where I create views for each screen, > which are just read only. > > However, I donot know if the query that creates the view is executed > everytime I select something from the view. Because if that is the case, > then I think my queries will again be slow. But if that is the way views > work, then what would be the point in creating them .. > > Any suggestions, helps -- They do get executed every time. I have a similar issue, but my data does not change very frequently, so instead of using a view, I create lookup tables to hold the data. So once a day I do something like this: drop lookup_table_1; create table lookup_table_1 as SELECT ...; In my case, rows are not deleted or updated, so I don't actually do a "drop table..." I merely add new records to the existing table, but if your data changes, the drop table technique can be faster than doing a delete or update. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Help specifying new web server/database machine
On 6/9/05, Rory Campbell-Lange <[EMAIL PROTECTED]> wrote: > Disks: > > I'm somewhat confused here. I've followed the various notes about SATA > vs SCSI and it seems that SCSI is the way to go. On a four-slot 1U > server, would one do a single RAID10 over 4 disks 1rpm U320 disks? > I would run the database in its own partition, separate from the rest of > the OS, possible on LVM. An LSI-Megaraid-2 appears to be the card of > choice. > Can you tell us about your application? How much data will you have, what is your ratio of reads to writes, how tollerant to data loss are you? (for example, some people load their data in batches and if they loose their data its no big deal, others would have heart failure if a few transactions were lost) If your application is 95% writes then people will suggest drastically different hardware than if your application is 95% selects. Here is an example of one of my servers: application is 95+% selects, has 15GB of data (counting indexes), low tollerance for data loss, runs on a 1 GHz P3 Compaq server with mirrored 35 GB IDE disks and 1.6GB of RAM. Application response time is aproximately .1 second to serve a request on a moderately loaded server. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Help specifying new web server/database machine
On 6/8/05, Rory Campbell-Lange <[EMAIL PROTECTED]> wrote: > I'm tasked with specifying a new machine to run a web application > prototype. The machine will be serving web pages with a Postgresql > backend; we will be making extensive use of plpgsql functions. No > database tables are likely to go over a million rows during the > prototype period. ... > 2GB (2x 1024MB) DDR-400 (PC3200) ECC Registered SDRAM (single rank) > 4 Port AMCC/3Ware 9500-4LP PCI SATA RAID Controller > 80GB SATA-150 7200RPM Hard Disk / 8MB Cache > 80GB SATA-150 7200RPM Hard Disk / 8MB Cache > 250GB SATA-150 7200RPM Hard Disk / 8MB Cache > 250GB SATA-150 7200RPM Hard Disk / 8MB Cache If your app is select heavy, especially the types of things that do sequential scans, you will enjoy having enough ram to easily load all of your tables and indexes in ram. If your database will exceed 1GB on disk consider more ram than 2GB. If your database will be write heavy choosing good controllers and disks is essential. Reading through the archives you will see that there are some important disk configurations you can choose for optimizing disk writes such as using the outer portions of the disks exclusively. If data integrity is not an issue, choose a controller that allows caching of writes (usually IDE and cheaper SATA systems cache writes regardless of what you want). If it were my application, and if I had room in the budget, I'd double the RAM. I don't know anything about your application though so use the guidlines above. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] speed up query with max() and odd estimates
On 4/26/05, Steinar H. Gunderson <[EMAIL PROTECTED]> wrote: > On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote: > > Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 > > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1) > > That's a gross misestimation -- four orders of magnitude off! > > Have you considering doing this in two steps, first getting out whatever > comes from the subquery and then doing the query? Well, I don't know if the estimates are correct now or not, but I found that your suggestion of doing it in two steps helped a lot. For the archives, here's what made a drastic improvement: This batch program had an overhead of 25 min to build hash tables using the sql queries. It is now down to about 47 seconds. The biggest improvements (bringing it down to 9 min) were to get rid of all instances of `select max(field) from ...` and replacing them with `select field from ... order by field desc limit 1` Then, to get it down to the final 47 seconds I changed this query: SELECT client,max(atime) as atime from usage_access where atime >= (select atime - '1 hour'::interval from usage_access order by atime desc limit 1) group by client; To these three queries: SELECT atime - '1 hour'::interval from usage_access order by atime desc limit 1; SELECT client, atime into temporary table recent_sessions from usage_access where atime >= '%s'; SELECT client, max(atime) as atime from recent_sessions group by client; Thanks for the help. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] speed up query with max() and odd estimates
On 4/26/05, Steinar H. Gunderson <[EMAIL PROTECTED]> wrote: > On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote: > > Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 > > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1) > > That's a gross misestimation -- four orders of magnitude off! > > Have you considering doing this in two steps, first getting out whatever > comes from the subquery and then doing the query? Have you ANALYZEd recently? > Do you have an index on atime? > Yes, there is an index on atime. I'll re-analyze but I'm pretty certain that runs nightly. Regarding two steps, are you suggesting: begin; select * into temp_table...; select * from temp_table...; drop temp_table; rollback; I have not tried that but will. BTW, I created an index on clients just for the heck of it and there was no improvement. (actually, a slight degradation) -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
[PERFORM] speed up query with max() and odd estimates
I have this query that takes a little over 8 min to run: select client,max(atime) as atime from usage_access where atime >= (select atime - '1 hour'::interval from usage_access order by atime desc limit 1) group by client; I think it can go a lot faster. Any suggestions on improving this? DB is 7.3.4 I think. (There is no index on client because it is very big and this data is used infrequently.) explain ANALYZE select client,max(atime) as atime from usage_access where atime >= (select atime - '1 hour'::interval from usage_access order by atime desc limit 1) group by client; QUERY PLAN Aggregate (cost=3525096.28..3620450.16 rows=1271385 width=20) (actual time=482676.95..482693.69 rows=126 loops=1) InitPlan -> Limit (cost=0.00..0.59 rows=1 width=8) (actual time=0.40..0.41 rows=1 loops=1) -> Index Scan Backward using usage_access_atime on usage_access (cost=0.00..22657796.18 rows=38141552 width=8) (actual time=0.39..0.40 rows=2 loops=1) -> Group (cost=3525096.28..3588665.53 rows=12713851 width=20) (actual time=482676.81..482689.29 rows=3343 loops=1) -> Sort (cost=3525096.28..3556880.90 rows=12713851 width=20) (actual time=482676.79..482679.16 rows=3343 loops=1) Sort Key: client -> Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 width=20) (actual time=482641.57..482659.18 rows=3343 loops=1) Filter: (atime >= $0) Total runtime: 482694.65 msec I'm starting to understand this, which is quite frightening to me. I thought that maybe if I shrink the number of rows down I could improve things a bit, but my first attempt didn't work. I thought I'd replace the "from usage_access" with this query instead: select * from usage_access where atime >= (select atime - '1 hour'::interval from usage_access order by atime desc limit 1); QUERY PLAN Seq Scan on usage_access (cost=0.00..1183396.40 rows=12713851 width=116) (actual time=481796.22..481839.43 rows=3343 loops=1) Filter: (atime >= $0) InitPlan -> Limit (cost=0.00..0.59 rows=1 width=8) (actual time=0.41..0.42 rows=1 loops=1) -> Index Scan Backward using usage_access_atime on usage_access (cost=0.00..22657796.18 rows=38141552 width=8) (actual time=0.40..0.41 rows=2 loops=1) Total runtime: 481842.47 msec It doesn't look like this will help at all. This table is primarily append, however I just recently deleted a few million rows from the table, if that helps anyone. -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] How to improve db performance with $7K?
On 4/14/05, Tom Lane <[EMAIL PROTECTED]> wrote: > > That's basically what it comes down to: SCSI lets the disk drive itself > do the low-level I/O scheduling whereas the ATA spec prevents the drive > from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's > possible for the drive to rearrange reads as well as writes --- which > AFAICS is just not possible in ATA. (Maybe in the newest spec...) > > The reason this is so much more of a win than it was when ATA was > designed is that in modern drives the kernel has very little clue about > the physical geometry of the disk. Variable-size tracks, bad-block > sparing, and stuff like that make for a very hard-to-predict mapping > from linear sector addresses to actual disk locations. Combine that > with the fact that the drive controller can be much smarter than it was > twenty years ago, and you can see that the case for doing I/O scheduling > in the kernel and not in the drive is pretty weak. > > So if you all were going to choose between two hard drives where: drive A has capacity C and spins at 15K rpms, and drive B has capacity 2 x C and spins at 10K rpms and all other features are the same, the price is the same and C is enough disk space which would you choose? I've noticed that on IDE drives, as the capacity increases the data density increases and there is a pereceived (I've not measured it) performance increase. Would the increased data density of the higher capacity drive be of greater benefit than the faster spindle speed of drive A? -- Matthew Nuzum www.bearfruit.org ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] Spend 7K *WHERE*? WAS Intel SRCS16 SATA raid? and How
I think there are many people who feel that $7,000 is a good budget for a database server, me being one. * I agree with the threads that more disks are better. * I also agree that SCSI is better, but can be hard to justify if your budget is tight, and I have great certainty that 2x SATA drives on a good controller is better than x SCSI drives for many work loads. * I also feel that good database design and proper maintenance can be one of the single biggest performance enhancers available. This can be labor intensive, however, and sometimes throwing more hardware at a problem is cheaper than restructuring a db. Either way, having a good hardware platform is an excellent place to start, as much of your tuning will depend on certain aspects of your hardware. So if you need a db server, and you have $7k to spend, I'd say spend it. >From this list, I've gathered that I/O and RAM are your two most important investments. Once you get that figured out, you can still do some performance tuning on your new server using the excellent advice from this mailing list. By the way, for all those who make this list work, I've rarely found such a thorough, helpful and considerate group of people as these on the performance list. -- Matthew Nuzum <[EMAIL PROTECTED]> www.followers.net - Makers of "Elite Content Management System" View samples of Elite CMS in action by visiting http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] How to improve db performance with $7K?
On 4/14/05, Tom Lane <[EMAIL PROTECTED]> wrote: > > That's basically what it comes down to: SCSI lets the disk drive itself > do the low-level I/O scheduling whereas the ATA spec prevents the drive > from doing so (unless it cheats, ie, caches writes). Also, in SCSI it's > possible for the drive to rearrange reads as well as writes --- which > AFAICS is just not possible in ATA. (Maybe in the newest spec...) > > The reason this is so much more of a win than it was when ATA was > designed is that in modern drives the kernel has very little clue about > the physical geometry of the disk. Variable-size tracks, bad-block > sparing, and stuff like that make for a very hard-to-predict mapping > from linear sector addresses to actual disk locations. Combine that > with the fact that the drive controller can be much smarter than it was > twenty years ago, and you can see that the case for doing I/O scheduling > in the kernel and not in the drive is pretty weak. > > So if you all were going to choose between two hard drives where: drive A has capacity C and spins at 15K rpms, and drive B has capacity 2 x C and spins at 10K rpms and all other features are the same, the price is the same and C is enough disk space which would you choose? I've noticed that on IDE drives, as the capacity increases the data density increases and there is a pereceived (I've not measured it) performance increase. Would the increased data density of the higher capacity drive be of greater benefit than the faster spindle speed of drive A? -- Matthew Nuzum <[EMAIL PROTECTED]> www.followers.net - Makers of Elite Content Management System View samples of Elite CMS in action by visiting http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] performance hit for replication
> >I'm eager to hear your thoughts and experiences, > > > > > Well with replicator you are going to take a pretty big hit initially > during the full > sync but then you could use batch replication and only replicate every > 2-3 hours. > > Sincerely, > > Joshua D. Drake > Thanks, I'm looking at your product and will contact you off list for more details soon. Out of curiosity, does batch mode produce a lighter load? Live updating will provide maximum data security, and I'm most interested in how it affects the server. -- Matthew Nuzum <[EMAIL PROTECTED]> www.followers.net - Makers of "Elite Content Management System" Earn a commission of $100 - $750 by recommending Elite CMS. Visit http://www.elitecms.com/Contact_Us.partner for details. ---(end of broadcast)--- TIP 8: explain analyze is your friend
[PERFORM] performance hit for replication
I'd like to create a fail-over server in case of a problem. Ideally, it would be synchronized with our main database server, but I don't see any major problem with having a delay of up to 4 hours between syncs. My database is a little shy of 10 Gigs, with much of that data being in an archived log table. Every day a batch job is run which adds 100,000 records over the course of 3 hours (the batch job does a lot of pre/post processing). Doing a restore of the db backup in vmware takes about 3 hours. I suspect a powerful server with a better disk setup could do it faster, but I don't have servers like that at my disposal, so I need to assume worst-case of 3-4 hours is typical. So, my question is this: My server currently works great, performance wise. I need to add fail-over capability, but I'm afraid that introducing a stressful task such as replication will hurt my server's performance. Is there any foundation to my fears? I don't need to replicate the archived log data because I can easily restore that in a separate step from the nightly backup if disaster occurs. Also, my database load is largely selects. My application works great with PostgreSQL 7.3 and 7.4, but I'm currently using 7.3. I'm eager to hear your thoughts and experiences, -- Matthew Nuzum <[EMAIL PROTECTED]> www.followers.net - Makers of "Elite Content Management System" Earn a commission of $100 - $750 by recommending Elite CMS. Visit http://www.elitecms.com/Contact_Us.partner for details. ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Preventing query from hogging server
> I would strongly suggest doing the min and max calculations together: > > select groupid, min(col), max(col) from ... > > because if you do them in two separate queries 90% of the effort will be > duplicated. > > regards, tom lane Thanks. Other than avoiding using too much sort mem, is there anything else I can do to ensure this query doesn't starve other processes for resources? Doing the explain analyze only increases my server load by 1 and seems to readily relinquish CPU time, but previously when I had been running a test query my server load rose to unacceptable levels. FWIW, the explain was run from psql running on the db server, the test query the other day was run from one of the webservers. Should I run this on the db server to minimize load? -- Matthew Nuzum <[EMAIL PROTECTED]> www.followers.net - Makers of "Elite Content Management System" View samples of Elite CMS in action by visiting http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Preventing query from hogging server
> How many rows in usage_access? Oh, I just got my explain analyze: QUERY PLAN -- Subquery Scan "*SELECT*" (cost=9499707.90..9856491.74 rows=3567838 width=28) (actual time=11443537.58..12470835.17 rows=1198141 loops=1) -> Aggregate (cost=9499707.90..9856491.74 rows=3567838 width=28) (actual time=11443537.56..12466550.25 rows=1198141 loops=1) -> Group (cost=9499707.90..9767295.78 rows=35678384 width=28) (actual time=11443537.10..12408372.26 rows=35678383 loops=1) -> Sort (cost=9499707.90..9588903.86 rows=35678384 width=28) (actual time=11443537.07..12035366.31 rows=35678383 loops=1) Sort Key: accountid, sessionid -> Seq Scan on usage_access (cost=0.00..1018901.84 rows=35678384 width=28) (actual time=8.13..416580.35 rows=35678383 loops=1) Total runtime: 12625498.84 msec (7 rows) -- Matthew Nuzum <[EMAIL PROTECTED]> www.followers.net - Makers of "Elite Content Management System" View samples of Elite CMS in action by visiting http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Preventing query from hogging server
> How many rows in usage_access? How many groups do you expect? > (Approximate answers are fine.) What PG version is this, and > what's your sort_mem setting? > > regards, tom lane I believe there are about 40,000,000 rows, I expect there to be about 10,000,000 groups. PostgreSQL version is 7.3.2 and the sort_mem is at the default setting. (I know that's an old version. We've been testing with 7.4 now and are nearly ready to upgrade.) -- Matthew Nuzum <[EMAIL PROTECTED]> www.followers.net - Makers of "Elite Content Management System" View samples of Elite CMS in action by visiting http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 8: explain analyze is your friend
[PERFORM] Preventing query from hogging server
I've got a report that is starting to take too long to run. I'm going to create a lookup table that should speed up the results, but first I've got to create the lookup table. I honestly don't care how long the query takes to run, I just want to run it without causing a major performance impact on other operations. The query seems to take forever even when I limit the results to just 10, so I don't know if I can get good results by splitting the query into groups of queries (for example, for a years worth of data do 12 queries, one for each month or maybe 365 queries, one for each day) or if there is a psql equivalent to "nice." I've tried `nice psql` in the past and I don't think that had much impact, but I haven't tried it on this query. Here is the query (BTW, there will be a corresponding "max" version of this query as well): INSERT INTO usage_sessions_min (accountid,atime,sessionid) select accountid, min(atime) as atime, sessionid from usage_access group by accountid,sessionid; atime is a timestamptz, accountis is varchar(12) and sessionid is int. I've tried to do an explain analyze of this query, but its been running for hours already and I don't know when it will finish. -- Matthew Nuzum <[EMAIL PROTECTED]> www.followers.net - Makers of "Elite Content Management System" View samples of Elite CMS in action by visiting http://www.elitecms.com/ ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Restricting Postgres
Matt - Very interesting information about squid effectiveness, thanks. Martin, You mean your site had no images? No CSS files? No JavaScript files? Nearly everything is dynamic? I've found that our CMS spends more time sending a 23KB image to a dial up user than it does generating and serving dynamic content. This means that if you have a "light" squid process who caches and serves your images and static content from it's cache then your apache processes can truly focus on only the dynamic data. Case in point: A first time visitor hits your home page. A dynamic page is generated (in about 1 second) and served (taking 2 more seconds) which contains links to 20 additional files (images, styles and etc). Then expensive apache processes are used to serve each of those 20 files, which takes an additional 14 seconds. Your precious application server processes have now spent 14 seconds serving stuff that could have been served by an upstream cache. I am all for using upstream caches and SSL accelerators to take the load off of application servers. My apache children often take 16 or 20MB of RAM each. Why spend all of that on a 1.3KB image? Just food for thought. There are people who use proxying in apache to redirect expensive tasks to other servers that are dedicated to just one heavy challenge. In that case you likely do have 99% dynamic content. Matthew Nuzum | Makers of "Elite Content Management System" www.followers.net | View samples of Elite CMS in action [EMAIL PROTECTED] | http://www.followers.net/portfolio/ -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Martin Foster Matt Clark wrote: > In addition we (as _every_ high load site should) run Squid as an > accelerator, which dramatically increases the number of client connections > that can be handled. Across 2 webservers at peak times we've had 50,000 > concurrently open http & https client connections to Squid, with 150 Apache > children doing the work that squid can't (i.e. all the dynamic stuff), and > PG (on a separate box of course) whipping through nearly 800 mixed selects, > inserts and updates per second - and then had to restart Apache on one of > the servers for a config change... Not a problem :-) > > One little tip - if you run squid on the same machine as apache, and use a > dual-proc box, then because squid is single-threaded it will _never_ take > more than half the CPU - nicely self balancing in a way. > > M > I've heard of the merits of Squid in the use as a reverse proxy. However, well over 99% of my traffic is dynamic, hence why I may be experiencing behavior that people normally do not expect. As I have said before in previous threads, the scripts are completely database driven and at the time the database averaged 65 queries per second under MySQL before a migration, while the webserver was averaging 2 to 4. Martin Foster Creator/Designer Ethereal Realms [EMAIL PROTECTED] ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] How to time several queries?
When I'm using psql and I want to time queries, which is what I've been doing for a little over a day now, I do the following: Select now(); query 1; query 2; query 3; select now(); This works fine unless you're doing selects with a lot of rows which will cause your first timestamp to scroll off the screen. -- Matthew Nuzum + "Man was born free, and everywhere www.bearfruit.org : he is in chains," Rousseau +~~+ "Then you will know the truth, and the TRUTH will set you free," Jesus Christ (John 8:32 NIV) -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] Sent: Monday, October 18, 2004 2:28 PM To: [EMAIL PROTECTED] Subject: [PERFORM] How to time several queries? Hello I posted this on the general list but think it would be more appropriate here. Sorry. I know it is possible to time isolated queries through the settting of the \timing option in psql. This makes PgSQL report the time it took to perform one operation. I would like to know how one can get a time summary of many operations, if it is at all possible. Thank you. Tim ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Caching of Queries
I could spend a week or two tweaking the performance of my database servers and probably make some sizeable improvements, but I'm not going to. Why? Because PostgreSQL screams as it is. I would make sure that if the consensus is to add some sort of caching that it be done only if there is no hit to current performance and stability. That being said, I think that server side caching has major buzz and there's nothing wrong with adding features that sell. I will disagree with 3 points made on the argument against caching. Specifically, the benefit of doing caching on the db server is that the benefits may be reaped by multiple clients where as caching on the client side must be done by each client and may not be as effective. So what if the caching has a slight chance of returning stale results? Just make sure people know about it in advance. There are some things where stale results are no big deal and if I can easily benefit from an aggressive caching system, I will (and I do now with the adodb caching library, but like I said, caching has to be done for each client). In fact, I'm all for using a low-tech cache expiration algorithm to keep complexity down. Finally, if the caching is not likely to help (or may even hurt) simple queries but is likely to help complex queries then fine, make sure people know about it and let them decide if they can benefit. Sorry if I'm beating a dead horse or playing the devil's advocate. Just felt compelled to chime in. -- Matthew Nuzum + "Man was born free, and everywhere www.bearfruit.org : he is in chains," Rousseau +~~+ "Then you will know the truth, and the TRUTH will set you free," Jesus Christ (John 8:32 NIV) -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane Sent: Monday, September 27, 2004 1:19 AM To: Neil Conway Cc: Aaron Werman; Scott Kirkwood; [EMAIL PROTECTED] Subject: Re: [PERFORM] Caching of Queries Neil Conway <[EMAIL PROTECTED]> writes: > I think the conclusion of past discussions about this feature is that > it's a bad idea. Last I checked, MySQL has to clear the *entire* query > cache when a single DML statement modifying the table in question is > issued. Do they actually make a rigorous guarantee that the cached result is still accurate when/if it is returned to the client? (That's an honest question --- I don't know how MySQL implements this.) IIRC, in our past threads on this topic, it was suggested that if you can tolerate not-necessarily-up-to-date results, you should be doing this sort of caching on the client side and not in the DB server at all. I wouldn't try that in a true "client" scenario, but when the DB client is application-server middleware, it would make some sense to cache in the application server. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED]) ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Postgres over Linux NBD or NFS
I just got the BigAdmin newsletter from Sun today... interestingly enough it had a link to an article described as: > Database Performance with NAS: Optimizing Oracle on NFS > This paper discusses the operation of relational databases with network > attached storage (NAS). Management implications and performance > expectations for databases using NAS are presented. The link points to: http://www.sun.com/bigadmin/content/nas/ I read just enough to see if it is relevant. Here is the first part of the summary: > IT departments are increasingly utilizing Network Attached Storage (NAS) > and the Network File System (NFS) to meet the storage needs of mission- > critical relational databases. Reasons for this adoption include improved > storage virtualization, ease of storage deployment, decreased complexity, > and decreased total cost of ownership. This paper directly examines the > performance of databases with NAS. In laboratory tests comparing NFS with > local storage, NFS is shown capable of sustaining the same workload level > as local storage. Under similar workload conditions, NFS does consume an > increased number of CPU cycles; however, the proven benefits of NFS and > NAS outweigh this penalty in most production environments. Matthew Nuzum | ISPs: Make $200 - $5,000 per referral by www.followers.net | recomending Elite CMS to your customers! [EMAIL PROTECTED] | http://www.followers.net/isp ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] postgresql and openmosix migration
Hi Bill, I am more often in the "needing help" category than the "giving help" when it comes to advise about using postgresql. I have found it to be an extremely powerful tool and by far the best performance/price for my work. I think you will get some excellent answers and help to your performance questions if you send the list details about specific queries that are running too slow. If you are willing to throw more/bigger hardware at the problem, let people know that when you ask and they will tell you if your bottleneck can be alleviated through more ram, disks, cpu or whatever. Having been watching this list for some time now, I suspect most of the performance problems can be improved using non-intuitive query or configuration modifications (for example, replacing min()/max() as suggested by Mr. Wolf). The heavy hitters on the list will usually ask for an "explain analyze" of your query. If your query is "select * from foo", then change it to "EXPLAIN ANALYZE select * from foo" and post the output. It will look something like this: QUERY PLAN --- Seq Scan on foo (cost=0.00..1.04 rows=4 width=44) (actual time=8.46..8.47 rows=4 loops=1) Total runtime: 19.63 msec (2 rows) I'm sure your data is confidential; mine is too. The good news is that none of your data is included in the query. Only technical details about what the database is doing. If your problem might involve the application that works with the data, give some details about that. For example, if you're using a Java application, let people know what driver version you use, what jvm and other related info. There are lurkers on this list using just about every programming language imaginable on more platforms than you can shake a stick at (I don't care how good you are at shaking sticks, either). The more details you give the better help you're going to get and you'd be amazed at the results I've seen people get with a judicious amount of tweaking. The other day someone had a query that took hours decrease to less than 10 minutes by using some techniques prescribed by members on the list. Bringing 30 - 60 second queries down to 2-3 seconds is commonplace. You seem to be ready to throw money at the problem by investing in new hardware but I would suggest digging into the performance problems first. Too many times we've seen people on the list say, "I've just spent $x0,000 on a new xyz and I'm still having problems with this query." Often times the true solution is rewriting queries, tweaking config parameters, adding RAM and upgrading disks (in that order I believe). As I found out even today on the SQL list, it's best to ask questions in this form: "I want to do this... I've been trying this... I'm getting this... which is problematic because..." The more clearly you state the abstract goal the more creative answers you'll get with people often suggesting things you'd never considered. I hope this helps and I hope that you achieve your goals of a well performing application. Matthew Nuzum | Makers of "Elite Content Management System" www.followers.net | View samples of Elite CMS in action [EMAIL PROTECTED] | http://www.followers.net/portfolio/ > -Original Message- > From: [EMAIL PROTECTED] [mailto:pgsql-performance- > [EMAIL PROTECTED] On Behalf Of Bill > Sent: Tuesday, June 22, 2004 1:31 PM > To: Josh Berkus > Cc: [EMAIL PROTECTED] > Subject: Re: [PERFORM] postgresql and openmosix migration > > Ok, so maybe someone on this group will have a better idea. We have a > database of financial information, and this has literally millions of > entries. I have installed indicies, but for the rather computationally > demanding processes we like to use, like a select query to find the > commodity with the highest monthly or annual returns, the computer > generally > runs unacceptably slow. So, other than clustring, how could I achieve a > speed increase in these complex queries? Is this better in mysql or > postgresql? > > Thanks. ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] PostgreSQL on VMWare vs Windows vs CoLinux
On Wed, 2004-06-02 at 17:39, Greg Stark wrote: > "Matthew Nuzum" <[EMAIL PROTECTED]> writes: > > > I have colinux running on a Fedora Core 1 image. I have the rhdb 3 (or > > PostgreSQL RedHat Edition 3) on it running. Here are tests with fsync on > > and off: > > FSYNC OFF FSYNC ON RUN > > 136.9142.0 124.5149.1 1 > > 122.1126.7 140.1169.7 2 > > 125.7148.7 147.4180.4 3 > > 103.3136.7 136.8166.3 4 > > 126.5146.1 152.3187.9 5 > > 114.4133.3 144.8176.7 6 > > 124.0146.5 143.3175.0 7 > > 121.7166.8 147.8180.5 8 > > 127.3151.8 146.7180.0 9 > > 124.6143.0 137.2167.5 10 > > -- > > 122.7144.2 142.1173.3 AVG > > > > I hope those numbers' formatting come through all right. > > No, they didn't. You used tabs? Are they four space tabs or 8 space tabs? > I assume 4 space tabs, but then what is the meaning of the four columns? > You have two columns for each fsync setting? One's under Windows and one's > under Vmware? Which is which? > Sorry that wasn't clear. The pgbench program puts out two numbers, can't remember what they are, I think one number included the time to make the connection. Therefore, the first two columns represent the two values presented from pgbench with FSYNC off. The second two columns are those same to figures but with FSYNC ON. The 5th column is the run. I did 10 runs and included the output of all runs so that incase anything significant could be gleaned from the details, the data would be there. The executive summary is this: Tom was curious if colinux might be deceiving the applications that expect the fsync to occur. He suspected that pgbench run with and without fsync enabled might reveal something. Therefore: FSYNC ON: 142.1 FSYNC OFF: 122.7 Having FSYNC off seems to yield faster results. I'd like some input on a more demanding test though, because these tests run so quickly I can't help but be suspicious of their accuracy. When there are two OSs involved, it seems like the momentary activity of a background process could skew these results. > > It occurs to me that the fsync may be performed to the linux filesystem, but > > this filesystem is merely a file on the windows drive. Would Windows cache > > this file? It's 2GB in size, so if it did, it would only be able to cache > > part of it. > > Well VMWare certainly doesn't know that the linux process called fsync. For > all it knows the Linux kernel just schedule the i/o because it felt it was > time. > > So the question is how does VMWare alway handle i/o normally. Does it always > handle i/o from the Guest OS synchronously or does it buffer it via the > Host OS's i/o system. We probably will never know what the internal workings of VMWare are like because it is a closed source program. I'm not slighting them, I have purchased a license of VMWare and use it for my software testing. However, colinux is an open source project and we can easily find out how they handle this. I have little interest in this as I use this merely as a tool to speed up my application development and do not run any critical services what-so-ever. > > I'm actually not sure which it does, it could be doing something strange. But > does seem most likely that it lets Windows buffer the writes, or does so > itself. It might also depend on whether you're using raw disks or a virtual > disk file. Undoable disks would throw another wrench in the works entirely. In these tests I'm using a virtual disk file. This is a 2GB file on the hard drive that linux sees as a disk partition. Colinux does not support undoable disks in the way that vmware does. Their wiky site does not mention anything tricky being done to force disk writes to actually be written; the implication therefore is that it leaves the i/o completely at the discretion of XP. Also note that XP Pro and 2000 Pro both offer different caching options for the user to choose so unless it does something to actually force a write the answer is probably "who knows." > > Note that "caching" isn't really the question. It doesn't have to cache the > entire 2GB file or even very much of it. It just has to store the block that > linux wants to write and report success to linux without waiting for the disk > to report success. Linux will then think the file is sync'd to disk and allow > postgres to continue with the next transaction without actually waiting for > the physical disk to spin around to the right place and the head to seek and &
Re: [PERFORM] PostgreSQL on VMWare vs Windows vs CoLinux
I have colinux running on a Fedora Core 1 image. I have the rhdb 3 (or PostgreSQL RedHat Edition 3) on it running. Here are tests with fsync on and off: FSYNC OFF FSYNC ON RUN 136.9142.0 124.5149.1 1 122.1126.7 140.1169.7 2 125.7148.7 147.4180.4 3 103.3136.7 136.8166.3 4 126.5146.1 152.3187.9 5 114.4133.3 144.8176.7 6 124.0146.5 143.3175.0 7 121.7166.8 147.8180.5 8 127.3151.8 146.7180.0 9 124.6143.0 137.2167.5 10 -- 122.7144.2 142.1173.3 AVG I hope those numbers' formatting come through all right. This computer is an AMD Athlon 900MHz with 448MB Ram running XP Pro SP1 This is using Colinux 0.60 (not the recently released 0.61) and 96MB of RAM allocated to linux. The computer was idle but it was running Putty, Excel and Task Manager during the process. (I prefer to use Putty to SSH into the virtual computer than to run the fltk console) It occurs to me that the fsync may be performed to the linux filesystem, but this filesystem is merely a file on the windows drive. Would Windows cache this file? It's 2GB in size, so if it did, it would only be able to cache part of it. I'd like to run a more difficult test personally. It seems like this test goes too fast to be very useful. If someone would like me to try something more specific, e-mail me right away and I'll do it. I must leave my office at 4:15 EDT and will not return until Friday, although I can do another test on my home computer Thursday. Matthew Nuzum | Makers of "Elite Content Management System" www.followers.net | View samples of Elite CMS in action [EMAIL PROTECTED] | http://www.followers.net/portfolio/ > -Original Message- > From: [EMAIL PROTECTED] [mailto:pgsql-performance- > [EMAIL PROTECTED] On Behalf Of Tom Lane > Sent: Wednesday, June 02, 2004 11:25 AM > To: Greg Stark > Cc: Vitaly Belman; [EMAIL PROTECTED]; Bryan Encina; Matthew > Nuzum > Subject: Re: [PERFORM] PostgreSQL on VMWare vs Windows vs CoLinux > > Greg Stark <[EMAIL PROTECTED]> writes: > > That said, I'm curious why the emulated servers performed better than > the > > Native Windows port. My first thought is that they probably aren't > syncing > > every write to disk so effectively they're defeating the fsyncs, > allowing the > > host OS to buffer disk writes. > > It would be fairly easy to check this by repeating the comparisons with > fsync = off in postgresql.conf. A performance number that doesn't > change much would be a smoking gun ;-). > > The native port hasn't had any performance testing done on it yet, and > I wouldn't be surprised to hear of a gotcha or two. Perhaps with the > recent schedule change there will be some time for performance tuning > before we go beta. > > regards, tom lane > > ---(end of broadcast)--- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to [EMAIL PROTECTED] so that your > message can get through to the mailing list cleanly ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] PostgreSQL caching
> > Hello Josh, > > JB> Not that you can't improve the query, just that it might not fix > JB> the problem. > > Yes, I'm aware it might be slower than the Linux version, but then, as > you said, I still can improve the query (as I did with your help now). > > But true, if there's something awfully wrong with Win32 port > performance, I might be doing some overwork... > > JB> Therefore ... your detailed feedback is appreciated, especially if you > can > JB> compare stuff to the same database running on a Linux, Unix, or BSD > machine. > > I can't easily install Linux right now.. But I am considering using it > through VMWare. Do you think it would suffice as a comprasion? > > From what I saw (e.g > http://usuarios.lycos.es/hernandp/articles/vpcvs.html) the performance > are bad only when it's coming to graphics, otherwise it looks pretty > good. > > Regards, > Vitaly Belman > An interesting alternative that I've been using lately is colinux (http://colinux.sf.net). It lets you run linux in windows and compared to vmware, I find it remarkably faster and when it is idle less resource intensive. I have vmware but if I'm only going to use a console based program, colinux seems to outperform it. Note that it may simply be interactive processes that run better because it has a simpler interface and does not try to emulate the display hardware. (Therefore no X unless you use vmware) It seems though that there is less overhead and if that's the case, then everything should run faster. Also note that getting it installed is a little more work than vmware. If you're running it on a workstation that you use for normal day-to-day tasks though I think you'll like it because you can detach the terminal and let it run in the background. When I do that I often forget it is running because it produces such a low load on the system. If you are going to give it a try, the one trick I used to get things going was to download the newest beta of winpcap and then the networking came up easily. Everything else was a piece of cake. Matthew Nuzum | Makers of "Elite Content Management System" www.followers.net | View samples of Elite CMS in action [EMAIL PROTECTED] | http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [PERFORM] Configuring PostgreSQL to minimize impact of checkpoints
> > Jack Orenstein <[EMAIL PROTECTED]> writes: > > I'm looking at one case in which two successive transactions, each > > updating a handful of records, take 26 and 18 *seconds* (not msec) to > > complete. These transactions normally complete in under 30 msec. ... > None of this is necessarily going to fix matters for an installation > that has no spare I/O capacity, though. And from the numbers you're > quoting I fear you may be in that category. "Buy faster disks" may > be the only answer ... > I had a computer once that had an out-of-the-box hard drive configuration that provided horrible disk performance. I found a tutorial at O'Reilly that explained how to use hdparm to dramatically speed up disk performance on Linux. I've noticed on other computers I've set up recently that hdparm seems to be used by default out of the box to give good performance. Maybe your computer is using all of it's I/O capacity because it's using PIO mode or some other non-optimal method of accessing the disk. Just a suggestion, I hope it helps, Matthew Nuzum | ISPs: Make $200 - $5,000 per referral by www.followers.net | recomending Elite CMS to your customers! [EMAIL PROTECTED] | http://www.followers.net/isp ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Extreme high load averages
> A common problem is a table like this: > > create table test (info text, id int8 primary key); > insert into test values ('ted',1); > .. a few thousand more inserts; > vacuum full; > analyze; > select * from test where id=1; > > will result in a seq scan, always, because the 1 by itself is > autoconverted to int4, which doesn't match int8 automatically. This > query: > > select * from test where id=1::int8 > > will cast the 1 to an int8 so the index can be used. > > Hey Scott, this is a little scary because I probably have a lot of this going on... Is there a way to log something so that after a day or so I can go back and look for things like this that would be good candidates for optimization? I've got fast enough servers that currently the impact of this problem might not be too obvious, but I suspect that after the server gets loaded up the impact will become more of a problem. By the way, I must say that this thread has been very useful. Matthew Nuzum | Makers of "Elite Content Management System" www.followers.net | View samples of Elite CMS in action [EMAIL PROTECTED] | http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> -Original Message- > From: [EMAIL PROTECTED] [mailto:pgsql-performance- > [EMAIL PROTECTED] On Behalf Of Andrew Sullivan > Sent: Monday, July 07, 2003 5:23 AM > To: [EMAIL PROTECTED] > Subject: Re: [PERFORM] Moving postgresql.conf tunables into 2003... > > On Sat, Jul 05, 2003 at 02:12:56PM -0700, Sean Chittenden wrote: > > The SGML docs aren't in the DBA's face and are way out of the way for > > DBAs rolling out a new system or who are tuning the system. SGML == > > Developer, conf == DBA. > > I could not disagree more. I'd say more like, if the dba won't read > the manual, get yourself a real dba. Sorry, but so-called > professionals who won't learn their tools have no home in my shop. > I don' want to come off confrontational, so please don't take this as an attack. Are you willing to say that the PostgreSQL database system should only be used by DBAs? I believe that Postgres is such a good and useful tool that anyone should be able to start using it with little or no barrier to entry. I don't believe I'm alone in this opinion either. As a matter of fact, this philosophy is being adopted by many in the software industry. Note that Linux and many other OSs that act as servers are being made more secure and easier to use __out of the box__ so that a person can simply install from cd and start using the tool with out too much difficulty. Maybe your definition of "dba" is broader than mine and what you mean is, "someone who installs a postgres database". Also, by manual, are you referring to the 213 page Administration guide, or are you talking about the 340 page Reference Manual? Let us rephrase your statement like this: "If the [person who installs a postgres database] won't read the [340 page reference] manual, then that person should go find a different database to use." I think that the postgres installation procedure, .conf files and documentation can be modified in such a way that a newbie (we were all newbies once) can have a good "out of box experience" with little effort. That means they can __quickly__ get a __good performing__ database up and running with __little effort__ and without needing to subscribe to a mailing list or read a book. I have seen software projects that have what I call an "elitist" attitude; meaning they expect you to be an expert or dedicated to their software in order to use it. Invariably this mentality stifles the usefulness of the product. It seems that there is a relative minority of people on this list who feel that you have to be "elite" in order to have a good working postgres installation. I don't feel that should be a requirement or even a consideration. Matthew Nuzum | Makers of "Elite Content Management System" www.followers.net | View samples of Elite CMS in action [EMAIL PROTECTED] | http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> > This sort of narrative belongs in the SGML docs, not in a CONF file. > > In fact, one could argue that we should take *all* commentary out of > > the CONF file in order to force people to read the docs. > > The SGML docs aren't in the DBA's face and are way out of the way for > DBAs rolling out a new system or who are tuning the system. SGML == > Developer, conf == DBA. > > > Database performance tuning will always be a "black art," as it > > necessitates a broad knowledge of PostgreSQL, OS architecture, and > > computer hardware. So I doubt that we can post docs that would > > allow any 10% time DBA to make PostgreSQL "fly", but hopefully over > > the next year we can make enough knowledge public to allow anyone to > > make PostgreSQL "sprint". > > I'm highly resistant to/disappointed in this attitude and firmly > believe that there are well understood algorithms that DBAs use to > diagnose and solve performance problems. It's only a black art > because it hasn't been documented. Performance tuning isn't voodoo, > it's adjusting constraints to align with the execution of applications > and we know what the applications do, therefore the database can mold > to the applications' needs. I agree. We often seem to forget simple lessons in human nature. Expecting someone to spend 20 extra seconds to do something is often too much. In many cases, the only "manual" that a person will see is the .conf files. At the very least, if there is good documentation for these parameters, maybe the conf file should provide a link to this info. About the documentation... The few times I've tried reading these sections of the docs it was like reading a dictionary. Bruce's book is a much better writing style because it starts out with a basic concept and then expands on it, sometimes several times until a thorough (but not exhaustive) example has been given. The exhaustive material in the docs is good when you know what you're looking for, and therefore is a critical piece of reference work. I don't want to belittle the authors of that material in any way. An illustration of this would be to compare the O'Reilly "... Nutshell" book series to something like the [fictitious] book "Learn PostgreSQL in 24 hours". To close this message, I would just like to add that one of the most successful open source projects of all time could be used as an example. The Apache httpd project is one of the few open source projects in wide spread use that holds more market share than all competing products combined. It uses a three phase (if not more) documentation level. The .conf file contains detailed instructions in an easy to read and not-to-jargon-ish structure. The docs provide detailed tutorials and papers that expand on configuration params in an easy to read format. Both of these refer to the thorough reference manual that breaks each possible option down into it's nitty gritty details so that a user can get more information if they so desire. Matthew Nuzum | Makers of "Elite Content Management System" www.followers.net | View samples of Elite CMS in action [EMAIL PROTECTED] | http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> > Brian Suggests: > > I'm curious how many of the configuration values can be determined > > automatically, or with the help of some script. It seem like there > > could be some perl script in contrib that could help figure this out. > > Possibly you are asked a bunch of questions and then the values are > > computed based on that. Something like: > > This would be great! Wanna be in charge of it? > Is there a to-do list for this kind of stuff? Maybe there could be a "help wanted" sign on the website. Seems like there are lot's of good ideas that fly around here but never get followed up on. Additionally, I have an increasingly large production database that I would be willing to do some test-cases on. I don't really know how to do it though... If someone where able to give instructions I could run tests on three different platforms. Matthew Nuzum | Makers of "Elite Content Management System" www.followers.net | View samples of Elite CMS in action [EMAIL PROTECTED] | http://www.followers.net/portfolio/ ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]