Re: [PERFORM] Postgresql in a Virtual Machine
On Tue, 26 Nov 2013, Xenofon Papadopoulos wrote: We have been running several Postgres databases on VMs for the last 9 months. The largest one currently has a few hundred million rows (~1.5T of data, ~100G of frequently queried data) and performs at ~1000 tps. Most of our transactions are part of a 2PC, which effectively results in high I/O as asynchronous commit is disabled. Main benefits so far: - ESXi HA makes high availability completely transparent and reduces the number of failover servers (we're running N+1 clusters) - Our projects' load can often miss our expectations, and it changes over time. Scaling up/down has helped us cope.

how do you add another server without having to do a massive data copy in the process? David Lang

- Live relocation of databases helps with hardware upgrades and spreading of load. Main issues: - We are not overprovisioning at all (using virtualization exclusively for the management benefits), so we don't know its impact on performance. - I/O has often been a bottleneck. We are not certain whether this is due to the impact of virtualization or due to mistakes in our sizing and configuration. So far we have been coping by spreading the load across more spindles and by increasing the memory.

On Tue, Nov 26, 2013 at 1:26 AM, Merlin Moncure wrote: On Mon, Nov 25, 2013 at 4:57 PM, David Lang wrote: On Mon, 25 Nov 2013, Merlin Moncure wrote: On Mon, Nov 25, 2013 at 2:01 PM, Lee Nguyen wrote: Hi, Having attended a few PGCons, I've always heard the remark from a few presenters and attendees that Postgres shouldn't be run inside a VM. That bare metal is the only way to go. Here at work we were entertaining the idea of running our Postgres database on our VM farm alongside our application VMs. We are planning to run a few Postgres synchronous replication nodes. Why shouldn't we run Postgres in a VM? What are the downsides? Does anyone have any metrics or benchmarks with the latest Postgres?
Unfortunately (and it really pains me to say this) we live in an increasingly virtualized world and we just have to go ahead and deal with it. I work at a mid cap company and we have a zero tolerance policy in terms of applications targeting hardware: in short, you can't. VMs have downsides: you get less performance per buck and have another thing to fail, but the administration advantages are compelling, especially for large environments. Furthermore, for any size company it makes less sense to run your own data center with each passing day; the cloud providers are really bringing up their game. This is economic specialization at work.

being pedantic, you can get almost all the management benefits on bare metal, and you can rent bare metal from hosting providers; cloud VMs are not the only option. 'Cloud' makes sense if you have a very predictably spiky load and you can add/remove machines to meet that load, but if you end up needing to have the machines running a significant percentage of the time, dedicated boxes are cheaper (as well as faster)

Well, that depends on how you define 'most'. The thing for me is that for machines around the office (just like with people) about 10% of them do 90% of the work. Being able to slide them around based on that (sometimes changing) need is a tremendous time and cost saver. For application and infrastructure development, dealing with hardware is just a distraction. I'd rather click on some interface and say, 'this application needs 25k iops guaranteed' and then make a cost driven decision on software optimization. It's hard to let go after decades of hardware innovation (the SSD revolution was the final shoe to drop) but for me the time has finally come. As recently as a year ago I was arguing databases needed to be run against metal.
merlin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Postgresql in a Virtual Machine
On Mon, 25 Nov 2013, Merlin Moncure wrote: On Mon, Nov 25, 2013 at 2:01 PM, Lee Nguyen wrote: Hi, Having attended a few PGCons, I've always heard the remark from a few presenters and attendees that Postgres shouldn't be run inside a VM. That bare metal is the only way to go. Here at work we were entertaining the idea of running our Postgres database on our VM farm alongside our application VMs. We are planning to run a few Postgres synchronous replication nodes. Why shouldn't we run Postgres in a VM? What are the downsides? Does anyone have any metrics or benchmarks with the latest Postgres?

Unfortunately (and it really pains me to say this) we live in an increasingly virtualized world and we just have to go ahead and deal with it. I work at a mid cap company and we have a zero tolerance policy in terms of applications targeting hardware: in short, you can't. VMs have downsides: you get less performance per buck and have another thing to fail, but the administration advantages are compelling, especially for large environments. Furthermore, for any size company it makes less sense to run your own data center with each passing day; the cloud providers are really bringing up their game. This is economic specialization at work.

being pedantic, you can get almost all the management benefits on bare metal, and you can rent bare metal from hosting providers; cloud VMs are not the only option. 'Cloud' makes sense if you have a very predictably spiky load and you can add/remove machines to meet that load, but if you end up needing to have the machines running a significant percentage of the time, dedicated boxes are cheaper (as well as faster)

David Lang
Re: [PERFORM] Postgresql Performance on an HP DL385 and
On Tue, 15 Aug 2006 [EMAIL PROTECTED] wrote: This is also wrong. fsck is needed because the file system is broken. nope, the file system *may* be broken. the dirty flag simply indicates that the filesystem needs to be checked to find out whether or not it is broken. Ah, but if we knew it wasn't broken, then fsck wouldn't be needed, now would it? So we assume that it is broken. A little bit of a game, but it is important to me. If I assumed the file system was not broken, I wouldn't run fsck. I run fsck, because I assume it may be broken. If broken, it indicates potential corruption.

note that the ext3, reiserfs, jfs, and xfs developers (at least) consider fsck necessary even for journaling filesystems. they just let you get away without it being mandatory after an unclean shutdown.

David Lang ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Hardware upgraded but performance still ain't good
On Wed, 9 Aug 2006, Joshua D. Drake wrote: Ahh and which companies would these be? As a representative of the most prominent one in the US I can tell you that you are not speaking from a knowledgeable position. note I said many, not all. I am aware that your company does not fall into this category. I know, but I am curious as to *what* companies. Any reputable PostgreSQL company is going to support Linux as a whole except maybe some fringe distros like Gentoo or RedFlag. Not to mention FreeBSD and Solaris.

I'm not going to name names in public, but I will point out that different companies' definitions of what constitutes 'fringe distros' are different. For some, any Linux other than RedHat Enterprise or SuSE is a fringe distro (with SuSE being a relatively recent addition; for a while RedHat were frequently the only supported distro versions). and please note, when I'm talking about support, it's not just postgresql support, but also hardware/driver support that can run into these problems

David Lang
Re: [PERFORM] Hardware upgraded but performance still ain't good
On Wed, 9 Aug 2006, Joshua D. Drake wrote: even many of the companies that offer support for postgres have this problem. the explanation is always that they can't test every distro out there so they pick a few and support those (this is one of the reasons why Ahh and which companies would these be? As a representative of the most prominent one in the US I can tell you that you are not speaking from a knowledgeable position.

note I said many, not all. I am aware that your company does not fall into this category.

David Lang
Re: [PERFORM] Hardware upgraded but performance still ain't good
On Wed, 9 Aug 2006, Stephen Frost wrote: * David Lang ([EMAIL PROTECTED]) wrote: there's a huge difference between 'works on debian' and 'supported on debian'. I do use debian extensively, (along with slackware on my personal machines), so I am comfortable getting things to work. but 'supported' means that when you run into a problem you can call for help without being told 'sorry, switch distros, then call us back'. Have you ever actually had that happen? I haven't, and I've called support for a number of different issues for various commercial software. In the end it might boil down to some distribution-specific issue that they're not willing to fix, but honestly that's pretty rare.

unfortunately I have, repeatedly with different products. if you can manage to get past the first couple of levels of support to people who really understand things rather than just using checklists you are more likely to get help, but even there I've run into people who seem eager to take the easy way out by assuming that it must be a distro thing rather than anything with their product (even in cases where it ended up being a simple config thing)

David Lang
Re: [PERFORM] Hardware upgraded but performance still ain't good
On Tue, 8 Aug 2006, Stephen Frost wrote: * Alex Turner ([EMAIL PROTECTED]) wrote: First off - very few third party tools support debian. Debian is a sure fire way to have an unsupported system. Use RedHat or SuSe (flame me all you want, it doesn't make it less true). Yeah, actually, it does make it less true since, well, it's really not all that true to begin with. What you're probably intending to say is that fewer companies say "Works with Debian!" on their advertising material or list it as "officially supported". I've had *very* few problems running commercial apps on Debian (including things like Oracle and IBM SAN management software). Generally it's just take the rpms and either install them *using* rpm (which is available in Debian...) or use alien to convert them to a tarball and/or deb.

there's a huge difference between 'works on debian' and 'supported on debian'. I do use debian extensively, (along with slackware on my personal machines), so I am comfortable getting things to work. but 'supported' means that when you run into a problem you can call for help without being told 'sorry, switch distros, then call us back'. even many of the companies that offer support for postgres have this problem. the explanation is always that they can't test every distro out there so they pick a few and support those (this is one of the reasons why I am watching ubuntu with great interest; it's debian under the covers, but they're starting to get the recognition from the support groups of companies)

David Lang
Re: [PERFORM] import performance
On Tue, 14 Mar 2006, Chris wrote: The only other thing I can see is the old server is ext2: /dev/hda4 on / type ext2 (rw,errors=remount-ro) the new one is ext3: /dev/hda2 on / type ext3 (rw)

this is actually a fairly significant difference. with ext3 most of your data actually gets written twice, once to the journal and a second time to the spot on the disk where it's actually going to live. in addition there are significant differences in how things are arranged on disk between the two filesystems (overridable at mount, but that only changes future new files). the ext3 layout is supposed to be better for a general purpose filesystem, but I've found common cases (lots of files and directories) where it's significantly slower, and I think postgres will fall into those layouts. try making an xfs filesystem for your postgres data and see what sort of performance you get on it.

David Lang
Re: [PERFORM] x206-x225
On Sat, 11 Mar 2006, Joost Kraaijeveld wrote: On Fri, 2006-03-10 at 23:57 -0800, David Lang wrote: On Sat, 11 Mar 2006, Joost Kraaijeveld wrote: On Fri, 2006-03-10 at 13:40 +, Richard Huxton wrote: Your ATA disk is lying about disk caching being turned off. Assuming each insert is in a separate transaction, then it's not going to do 10,000 / 6 = 1667 transactions/sec - that's faster than its rotational speed. Could you explain the calculation? Why should the number of transactions be related to the rotational speed of the disk, without saying anything about the number of bytes per rotation? each transaction requires a sync to the disk, a sync requires a real write (which you then wait for), so you can only do one transaction per rotation. Not according to a conversation I had with Western Digital about the write performance of my own SATA disks. What I understand from their explanation is that their disks are limited by the MB/sec and not by the number of writes/second, e.g. I could write 50 MB/sec *in 1 bit/write* on my disk. This would suggest that the maximum transactions of my disk (overhead of OS and PostgreSQL ignored) would be 50MB / (transaction size in MB) per second. Or am I missing something (which would not surprise me, as I do not understand the performance of my system at all ;-))?

but if you do a 1 bit write, and wait for it to complete, and then do another 1 bit write that belongs on disk immediately after the first one (and wait for it to complete), you have to wait until the disk rotates back to the point where it can make the write before it's really safe on disk. so you can do one transaction in less than one rotation, but if you do 50 transactions you must wait at least 49 (and a fraction) rotations.

if the disk cache is turned on then you don't have to wait for this, but you also will lose the data if you lose power, so it's really not safe.

David Lang
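A quick back-of-the-envelope sketch of the point above: with the write cache off, each synchronously committed transaction waits for roughly one full platter rotation, so the transaction ceiling is about the rotational rate, regardless of how small the write is. (The RPM figures below are just common drive speeds, not from the thread.)

```python
def max_synced_tps(rpm):
    """Rough upper bound on fsync'd transactions/sec for one disk with
    its write cache disabled: one commit per rotation."""
    return rpm / 60.0  # rotations per second

for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:5d} rpm -> ~{max_synced_tps(rpm):.0f} tps")
```

This is also why the 1667 tps figure quoted earlier in the thread is implausible for a single uncached disk: even a 15,000 rpm drive only rotates 250 times per second.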
Re: [PERFORM] x206-x225
On Sat, 11 Mar 2006, Joost Kraaijeveld wrote: On Fri, 2006-03-10 at 13:40 +, Richard Huxton wrote: Your ATA disk is lying about disk caching being turned off. Assuming each insert is in a separate transaction, then it's not going to do 10,000 / 6 = 1667 transactions/sec - that's faster than its rotational speed. Could you explain the calculation? Why should the number of transactions be related to the rotational speed of the disk, without saying anything about the number of bytes per rotation?

each transaction requires a sync to the disk, a sync requires a real write (which you then wait for), so you can only do one transaction per rotation.

David Lang
Re: [HACKERS] qsort again (was Re: [PERFORM] Strange Create Index
On Thu, 16 Feb 2006, Mark Lewis wrote: On Thu, 2006-02-16 at 17:51 -0500, Greg Stark wrote: Data types which could probably provide a useful function for f would be int2, int4, oid, and possibly int8 and text (at least for SQL_ASCII). How exactly do you imagine doing this for text? I could see doing it for char(n)/varchar(n) where n<=4 in SQL_ASCII though. In SQL_ASCII, just take the first 4 characters (or 8, if using a 64-bit sortKey as elsewhere suggested). The sorting key doesn't need to be a one-to-one mapping.

that would violate your second constraint ( f(a)==f(b) iff (a==b) ). if you could drop that constraint (the cost of which would be extra 'real' compares within a bucket) then a helper function per datatype could work as you are talking about.

David Lang
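A minimal sketch of the prefix key being discussed, and of exactly why it breaks the f(a)==f(b) iff a==b constraint: packing the first 4 bytes of an SQL_ASCII string into a big-endian integer preserves bytewise order, but any two strings sharing a 4-byte prefix collide and would need a real comparison to break the tie. (The function name is mine, not from the thread.)

```python
import struct

def sort_key(s: bytes) -> int:
    """Order-preserving 32-bit key: first 4 bytes, big-endian, NUL-padded.
    Not one-to-one -- strings with a common 4-byte prefix collide."""
    return struct.unpack(">I", s[:4].ljust(4, b"\x00"))[0]

assert sort_key(b"abc") < sort_key(b"abd")         # order preserved
assert sort_key(b"ab") < sort_key(b"abc")          # shorter sorts first
assert sort_key(b"abcdef") == sort_key(b"abcdxy")  # collision: same prefix
```

The collision in the last line is the "extra 'real' compares within a bucket" cost David mentions: the quicksort comparator would fall back to full string comparison whenever two keys are equal.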
Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very
On Sat, 21 Jan 2006, Tom Lane wrote: Ron <[EMAIL PROTECTED]> writes: At 07:23 PM 1/20/2006, Tom Lane wrote: Well, we're trying to split an index page that's gotten full into two index pages, preferably with approximately equal numbers of items in each new page (this isn't a hard requirement though). Maybe we are over thinking this. What happens if we do the obvious and just make a new page and move the "last" n/2 items on the full page to the new page? Search performance will go to hell in a handbasket :-(. We have to make at least some effort to split the page in a way that will allow searches to visit only one of the two child pages rather than both.

does the order of the items within a given page matter? if not, this sounds like a partial quicksort algorithm would work. you don't need to fully sort things, but you do want to make sure that everything on the first page is 'less' than everything on the second page so you can skip passes that don't cross a page boundary.

It's certainly true though that finding the furthest pair is not a necessary component of that. It's reasonable if you try to visualize the problem in 2D or 3D, but I'm not sure that that geometric intuition holds up in such a high-dimensional space as we have here.

I will say that I'm not understanding the problem well enough to understand the multi-dimensional nature of this problem.

David Lang
Re: [PERFORM] How to handle a large DB and simultaneous accesses?
On Tue, 10 Jan 2006, Charles A. Landemaine wrote: Hello, I have to develop a companies search engine (looks like the Yellow pages). We're using PostgreSQL at the company, and the initial DB is 2GB large, as it has companies from the entire world, with a fair amount of information. What reading do you suggest so that we can develop the search engine core, in order that the result pages show up instantly, no matter the heavy load and the DB size. The DB is 2GB but should grow to up to 10GB in 2 years, and there should be 250,000 unique visitors per month by the end of the year. Are there special techniques? Maybe there's a way to sort of cache search results? We're using PHP5 + phpAccelerator. Thanks,

frankly, that is a small enough chunk of data compared to available memory sizes that I think your best bet is to plan to have enough ram that you only do disk I/O to write and on boot. a dual socket Opteron system can hold 16G with 2G memory modules (32G as 4G modules become readily available over the next couple of years). this should be enough to keep your data and indexes in ram at all times. if you find that other system processes push the data out of ram, consider loading the data from disk to a ramfs filesystem; just make sure you don't update the ram-only copy (or if you do, that you have replication set up to replicate from the ram copy to a copy on real disks somewhere). depending on your load you could go with single core or dual core chips (and the CPUs are a small enough cost compared to this much ram that you may as well go with the dual core CPUs). now even with your data in ram you can slow down if your queries, indexes, and other settings are wrong, but if performance is important you should be able to essentially eliminate disks for databases of this size.

David Lang
Re: [PERFORM] help tuning queries on large database
On Fri, 6 Jan 2006, Tom Lane wrote: peter royal <[EMAIL PROTECTED]> writes: So, my question is, is there anything I can do to boost performance with what I've got, or am I in a position where the only 'fix' is more faster disks? I can't think of any schema/index changes that would help, since everything looks pretty optimal from the 'explain analyze' output. I'd like to get a 10x improvement when querying from the 'cold' state. I don't think you have any hope of improving the "cold" state much. The right way to think about this is not to be in the "cold" state. Check your kernel parameters and make sure it's not set to limit the amount of memory used for cache (I'm not actually sure if there is such a limit on Linux, but there definitely is on some other Unixen).

Linux doesn't have any ability to limit the amount of memory used for caching (there are periodically requests for such a feature)

David Lang

Look around and see if you can reduce the memory used by processes, or even better, offload non-database tasks to other machines. Basically you need to get as much of the database as you can to stay in disk cache. regards, tom lane
Re: [PERFORM] Help in avoiding a query 'Warm-Up' period/shared buffer
On Thu, 5 Jan 2006, Mark Liberman wrote: Obviously, I will be testing this - but it might take a few days, as I haven't figured out how to simulate the "period of inactivity" to get the data flushed out of the cache ... so I have to run this each morning.

cat large_file >/dev/null will probably do a pretty good job of this (especially if large_file is noticeably larger than the amount of ram you have)

David Lang
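The cat trick above can be scripted into a cold-cache test harness; here is a minimal sketch (the function name is mine) that streams a file through the page cache the same way, so other cached data gets evicted. It assumes, as David notes, that the file is noticeably larger than RAM. On Linux 2.6.16 and later, writing 3 to /proc/sys/vm/drop_caches as root drops the page cache directly and is a more deterministic alternative.

```python
def evict_cache(path, chunk_size=1 << 20):
    """Read a file end-to-end in 1MB chunks, like `cat large_file >/dev/null`,
    to push previously cached data out of the OS page cache.
    Returns the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```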
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
On Mon, 26 Dec 2005, Alex Turner wrote: Yes, but those blocks in RAID 10 are largely irrelevant as they are to independent disks. In RAID 5 you have to write parity to an 'active' drive that is part of the stripe. (They are irrelevant unless of course you are maxing out your SCSI bus - yet another reason why SATA can be faster than SCSI, particularly in RAID 10, every channel is independent).

I don't understand your 'active' vs 'inactive' drive argument; in raid 1 or 1+0 all drives are active. with good components you need to worry about maxing out your PCI bus as much as any other one (this type of thing is where hardware raid has a definite advantage, since the card handles the extra I/O, not your system)

Sorry - my math for RAID 5 was a bit off - I don't know why I was considering only a three drive situation - which is the worst. It's n+1, you are right.

still, for small arrays that's a big penalty.

Still, there is definitely a penalty, contrary to the assertion of the original poster. I agree totally that the read+parity-calc+write in the worst case is totally bad, which is why I always recommend people should _never ever_ use RAID 5. In this day and age of large capacity chassis, and large capacity SATA drives, RAID 5 is totally inappropriate IMHO for _any_ application, least of all databases. In reality I have yet to benchmark a system where RAID 5 on the same number of drives with 8 drives or less in a single array beat a RAID 10 with the same number of drives. I would definitely be interested in a SCSI card that could actually achieve the theoretical performance of RAID 5, especially under Linux.

but it's not a 'same number of drives' comparison you should be making. if you have an 8 drive RAID5 array you need to compare it with a 14 drive RAID1/10 array.

With RAID 5 you get to watch your system crumble and fail when a drive fails and the array goes into a failed state. It's just not worth it.

speed is worth money (and therefore number of drives) in some cases, but not in all cases. also the speed penalty when you have a raid drive fail varies based on your controller. it's wrong to flatly rule out any RAID configuration; they all have their place, and the important thing is to understand what the advantages and disadvantages are for each of them so you can know when to use each one. for example, I have a situation I am looking at where RAID0 is looking appropriate for a database (a multi-TB array that gets completely reloaded every month or so as data expires and new data is loaded from the authoritative source; adding another 16 drives to get redundancy isn't reasonable)

David Lang

Alex.

On 12/26/05, David Lang <[EMAIL PROTECTED]> wrote: On Mon, 26 Dec 2005, Alex Turner wrote: It's irrelevant what controller, you still have to actually write the parity blocks, which slows down your write speed because you have to write n+n/2 blocks instead of just n blocks, making the system write 50% more data. RAID 5 must write 50% more data to disk therefore it will always be slower. raid5 writes n+1 blocks, not n+n/2 (unless n=2, for a 3-disk raid). you can have a 15+1 disk raid5 array, for example. however raid1 (and raid10) have to write 2*n blocks to disk. so if you are talking about pure I/O needed, raid5 wins hands down. (the same 16 drives would be an 8+8 array) what slows down raid 5 is that to modify a block you have to read blocks from all your drives to re-calculate the parity. this interleaving of reads and writes when all you are logically doing is writes can really hurt. (this is why I asked the question that got us off on this tangent: when doing new writes to an array you don't have to read the blocks, as they are blank, assuming your caching is enough so that you can write blocksize*n before the system starts actually writing the data)

David Lang

Alex.

On 12/25/05, Michael Stone <[EMAIL PROTECTED]> wrote: On Sat, Dec 24, 2005 at 05:45:20PM -0500, Ron wrote: Caches help, and the bigger the cache the better, but once you are doing enough writes fast enough (and that doesn't take much even with a few GBs of cache) the recalculate-checksums-and-write-new-ones overhead will decrease the write speed of real data. Bear in mind that the HD's _raw_ write speed hasn't been decreased. Those HD's are pounding away as fast as they can for you. Your _effective_ or _data level_ write speed is what decreases due to overhead. You're overgeneralizing. Assuming a large cache and a sequential write, there need be no penalty for raid 5. (For random writes you may need to read unrelated blocks in order to calculate parity, but for large sequential writes the parity blocks should all be read from cache.) A modern cpu can calculate parity for raid 5 on the order of gigabytes per second, and even crummy embedded processors can do hundreds of megabytes per second.
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
On Mon, 26 Dec 2005, Alex Turner wrote: It's irrelevant what controller, you still have to actually write the parity blocks, which slows down your write speed because you have to write n+n/2 blocks instead of just n blocks, making the system write 50% more data. RAID 5 must write 50% more data to disk therefore it will always be slower.

raid5 writes n+1 blocks, not n+n/2 (unless n=2, for a 3-disk raid). you can have a 15+1 disk raid5 array, for example. however raid1 (and raid10) have to write 2*n blocks to disk. so if you are talking about pure I/O needed, raid5 wins hands down. (the same 16 drives would be an 8+8 array) what slows down raid 5 is that to modify a block you have to read blocks from all your drives to re-calculate the parity. this interleaving of reads and writes when all you are logically doing is writes can really hurt. (this is why I asked the question that got us off on this tangent: when doing new writes to an array you don't have to read the blocks, as they are blank, assuming your caching is enough so that you can write blocksize*n before the system starts actually writing the data)

David Lang

Alex.

On 12/25/05, Michael Stone <[EMAIL PROTECTED]> wrote: On Sat, Dec 24, 2005 at 05:45:20PM -0500, Ron wrote: Caches help, and the bigger the cache the better, but once you are doing enough writes fast enough (and that doesn't take much even with a few GBs of cache) the recalculate-checksums-and-write-new-ones overhead will decrease the write speed of real data. Bear in mind that the HD's _raw_ write speed hasn't been decreased. Those HD's are pounding away as fast as they can for you. Your _effective_ or _data level_ write speed is what decreases due to overhead. You're overgeneralizing. Assuming a large cache and a sequential write, there need be no penalty for raid 5. (For random writes you may need to read unrelated blocks in order to calculate parity, but for large sequential writes the parity blocks should all be read from cache.)

A modern cpu can calculate parity for raid 5 on the order of gigabytes per second, and even crummy embedded processors can do hundreds of megabytes per second. You may have run into some lousy implementations, but you should be much more specific about what hardware you're talking about instead of making sweeping generalizations. Side Note: people often forget the other big reason to use RAID 10 over RAID 5. RAID 5 is always only 2 HD failures from data loss. RAID 10 can lose up to 1/2 the HD's in the array w/o data loss unless you get unlucky and lose both members of a RAID 1 set. IOW, your RAID 10 is only 2 HD failures from data loss also. If that's an issue you need to go with RAID 6 or add another disk to each mirror.

Mike Stone
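David's block-count correction above can be sketched as a small calculation, under the same assumption he makes: full-stripe streaming writes with no read-modify-write. Per stripe, RAID 5 writes n data blocks plus one parity block; RAID 1/10 writes every block twice.

```python
def blocks_written(n_data_blocks, layout):
    """Physical blocks written to commit n logical blocks, assuming
    full-stripe (streaming) writes with no read-modify-write."""
    if layout == "raid5":
        return n_data_blocks + 1   # one parity block per stripe
    if layout in ("raid1", "raid10"):
        return 2 * n_data_blocks   # every block mirrored
    raise ValueError(f"unknown layout: {layout}")

# The same 16 drives: a 15+1 RAID 5 stripe vs an 8+8 RAID 10
print(blocks_written(15, "raid5"))   # 16 physical writes for 15 logical blocks
print(blocks_written(8, "raid10"))   # 16 physical writes for only 8 logical blocks
```

Note the n=2 special case from the thread: a 3-disk RAID 5 writes 3 blocks for 2 logical ones, which is the "n+n/2" 50% overhead Alex describes, but that ratio shrinks as the stripe widens.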
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
On Sat, 24 Dec 2005, Luke Lonergan wrote: David, now hot-swap may not be supported on all interface types, that may be what you have run into, but with SCSI or SATA you should be able to hot-swap with the right controller. That's actually the problem - Linux hot swap is virtually non-functional for SCSI. You can write into the proper places in /proc, then remove and rescan to get a new drive up, but I've found that the resulting OS state is flaky. This is true of the latest 2.6 kernels and LSI and Adaptec SCSI controllers. The problems I've seen are with Linux, not the controllers.

Thanks for the clarification. I knew that PATA didn't do hotswap, and I've seen discussions on the linux-kernel list about SATA hotswap being worked on, but I thought that scsi handled it. how recent a kernel have you had problems with?

David Lang
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
On Sat, 24 Dec 2005, Luke Lonergan wrote: Recently, I helped a company named DeepData to improve their dbms performance, which was a combination of moving them to software RAID50 on Linux and getting them onto Bizgres. The disk subsystem sped up on the same hardware (minus the HW RAID card) by over a factor of 10. The downside is that SW RAID is a pain in the neck for management - you have to shut down the Linux host when a disk fails to replace it.

Luke, you should not need to shut down the linux host when a disk fails. you should be able to use mdadm to mark the drive as failed, then remove it from the system and replace it, then use mdadm to add the drive to the array. I'm fighting through a double disk failure on my system at home, and when I hit a bad spot on a drive (failing it from the array) I can just re-add it without having to restart everything (if it's the second drive I will have to stop and restart the array, but that's because the entire array has failed at that point). now hot-swap may not be supported on all interface types, that may be what you have run into, but with SCSI or SATA you should be able to hot-swap with the right controller.

David Lang
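The replacement procedure David describes can be sketched as the mdadm command sequence it boils down to. The device names below are hypothetical placeholders, and the helper is mine; the --fail/--remove/--add manage-mode options are standard mdadm usage.

```python
def replacement_steps(array="/dev/md0", bad="/dev/sdb1", new="/dev/sdb1"):
    """mdadm command sequence for swapping a failed member without a reboot
    (hypothetical device names; run the steps by hand or via subprocess)."""
    return [
        ["mdadm", array, "--fail", bad],    # mark the member as failed
        ["mdadm", array, "--remove", bad],  # detach it from the array
        # ...physically swap the drive (needs a hot-swap capable controller)...
        ["mdadm", array, "--add", new],     # add the replacement; resync begins
    ]

for cmd in replacement_steps():
    print(" ".join(cmd))
```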
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
On Sat, 24 Dec 2005, Ron wrote: At 02:50 PM 12/24/2005, Frank Wiles wrote: Juan Casero <[EMAIL PROTECTED]> wrote: > Sorry folks. I had a couple of glasses of wine as I wrote this. > Anyway I originally wanted the box to have more than two drives so I > could do RAID 5 but that is going to cost too much. Also, contrary > to my statement below it seems to me I should run the 32 bit > postgresql server on the 64 bit kernel. Would you agree this will > probably yield the best performance? I know it depends a lot on the > system but for now this database is about 20 gigabytes. Not too large > right now but it may grow 5x in the next year. You definitely DO NOT want to do RAID 5 on a database server. That is probably the worst setup you could have, I've seen it have lower performance than just a single hard disk. RAID 1 and RAID 1+0 are optimal, but you want to stay far away from RAID 5. IMHO RAID 5 is only useful on near-line backup servers or Samba file servers where space is more important than speed. That's a bit misleading. RAID 5 excels when you want read speed but don't care as much about write speed. Writes are typically ~2/3 the speed of reads on a typical decent RAID 5 setup. So if you have tables that are read often and written to rarely or not at all, putting them on RAID 5 is optimal. In both data mining like and OLTP like apps there are usually at least some such tables. RAID 5 is bad for random writes as you state, but how does it do for sequential writes (for example data mining where you do a large import at one time, but seldom do other updates)? I'm assuming a controller with a reasonable amount of battery-backed cache. David Lang
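The read/write asymmetry Ron describes comes from the RAID 5 small-write penalty: a random write of one block costs four disk I/Os (read old data, read old parity, write both back), while a full-stripe sequential write can compute parity from the new data alone. A back-of-the-envelope sketch (the disk counts below are illustrative, not figures from the thread):

```python
# Rough model of RAID 5 I/O costs (illustrative numbers only).

def raid5_small_write_ios():
    # Read old data block + old parity, then write new data + new parity.
    return 4

def raid5_full_stripe_write_ios(data_disks):
    # Parity is computed from the new data, so each disk is written once.
    return data_disks + 1  # data writes plus one parity write

# With 5 data disks + 1 parity disk, a full-stripe write lands 5 blocks of
# data in 6 I/Os, while 5 independent small random writes cost 5 * 4 = 20.
print(raid5_full_stripe_write_ios(5))   # 6
print(5 * raid5_small_write_ios())      # 20
```

This is why a large sequential import behaves so much better on RAID 5 than scattered updates do, especially with a battery-backed cache coalescing the stripe writes.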
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
On Thu, 22 Dec 2005, Juan Casero wrote: Ok thanks. I think I will go with 64 bit everything on the box. If I can get the Sun Fire V20Z then I will stick with Solaris 10 x86 and download the 64 bit PostgreSQL 8.1 binaries from blastwave.org. I develop the PHP code to my DSS system on my Windows XP laptop. Normally, I test the code on this laptop but let it hit the live database when I want to run some tests. Well just this afternoon I installed PostgreSQL 8.1.1 on my Windows laptop and rebuilt the entire live database instance there from a pg_dump archive. I am blown away by the performance increase in PostgreSQL 8.1.x. Has anyone else had a chance to test it? All the queries I run against it are remarkably fast, but more importantly I can see that the two cores of my Hyper-Threaded P4 are being used. One of the questions I posted on this list was whether PostgreSQL could make use of the large number of cores available on the Ultrasparc T1000/T2000 cores. I am beginning to think that with PostgreSQL 8.1.x the buffer manager could indeed use all those cores. This could make running a DSS or OLTP on an Ultrasparc T1000/T2000 with PostgreSQL a much better bargain than on an Intel system. Any thoughts? If you have enough simultaneous transactions, and your I/O systems (disk and memory interfaces) can keep up with your needs, then Postgres can use quite a few cores. There are some limits that will show up with more cores, but I don't think it's well known where they are (this will also be very dependent on your workload). There was a discussion within the last month or two that hit the Postgres Weekly News where more attention is being paid to the locking mechanisms used, so this is an area under active development (note especially that some locking strategies that work well with multiple full cores can be crippling with virtual cores (Intel HT etc.)).
But it boils down to the fact that there just isn't enough experience with the new Sun systems to know how well they will work. They could end up being fabulous speed demons, or dogs (and it could even be both, depending on your workload). David Lang Thanks, Juan On Thursday 22 December 2005 22:12, David Lang wrote: On Wed, 21 Dec 2005, Juan Casero wrote: Date: Wed, 21 Dec 2005 22:31:54 -0500 From: Juan Casero <[EMAIL PROTECTED]> To: pgsql-performance@postgresql.org Subject: Re: [PERFORM] What's the best hardver for PostgreSQL 8.1? Sorry folks. I had a couple of glasses of wine as I wrote this. Anyway I originally wanted the box to have more than two drives so I could do RAID 5 but that is going to cost too much. Also, contrary to my statement below it seems to me I should run the 32 bit postgresql server on the 64 bit kernel. Would you agree this will probably yield the best performance? You definitely need a 64 bit kernel to address as much ram as you will need. The question of 32 bit vs 64 bit postgres needs to be benchmarked, but my inclination is that you probably do want 64 bit for that as well. 64 bit binaries are slightly larger than 32 bit ones (less so on x86/AMD64 than on any other mixed platform though), but the 64 bit version also has access to twice as many registers as a 32 bit one, and the Opteron chips have some other features that become available (or more useful) in 64 bit mode. Like everything else this needs benchmarks to prove with your workload (I'm trying to get some started, but haven't had a chance yet). David Lang I know it depends a lot on the system but for now this database is about 20 gigabytes. Not too large right now but it may grow 5x in the next year. Thanks, Juan On Wednesday 21 December 2005 22:09, Juan Casero wrote: I just sent my boss an email asking him for a Sun v20z with dual 2.2 Ghz opterons, 2 Gigs of RAM and RAID 1. I would have liked a better server capable of RAID but that seems to be out of his budget right now.
Ok so I assume I get this Sun box. Most likely I will go with Linux since it is a fair bet he doesn't want to pay for the Solaris 10 x86 license. Although I kind of like the idea of using Solaris 10 x86 for this. I will assume I need to install the x64 kernel that comes with say Fedora Core 4. Should I run the Postgresql 8.x binaries in 32 bit mode or 64 bit mode? My instinct tells me 64 bit mode is most efficient for our database size about 20 gigs right now but may grow to 100 gigs in a year or so. I just finished loading a 20 gig database on a dual 900 Mhz Ultrasparc III system with 2 gigs of ram and about 768 megs of shared memory available for the postgresql server running Solaris 10. The load has smoked a P4 3.2 Ghz system I am using also with 2 gigs of ram running postgresql 8.0.3. I mean I started the sparc load after the P4 load. The sparc load has finished already rebuilding the database from a pg_dump file but the P4 system is still going. The p4 has 1.3 Gigs of shared memory
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
On Wed, 21 Dec 2005, Juan Casero wrote: Date: Wed, 21 Dec 2005 22:31:54 -0500 From: Juan Casero <[EMAIL PROTECTED]> To: pgsql-performance@postgresql.org Subject: Re: [PERFORM] What's the best hardver for PostgreSQL 8.1? Sorry folks. I had a couple of glasses of wine as I wrote this. Anyway I originally wanted the box to have more than two drives so I could do RAID 5 but that is going to cost too much. Also, contrary to my statement below it seems to me I should run the 32 bit postgresql server on the 64 bit kernel. Would you agree this will probably yield the best performance? You definitely need a 64 bit kernel to address as much ram as you will need. The question of 32 bit vs 64 bit postgres needs to be benchmarked, but my inclination is that you probably do want 64 bit for that as well. 64 bit binaries are slightly larger than 32 bit ones (less so on x86/AMD64 than on any other mixed platform though), but the 64 bit version also has access to twice as many registers as a 32 bit one, and the Opteron chips have some other features that become available (or more useful) in 64 bit mode. Like everything else this needs benchmarks to prove with your workload (I'm trying to get some started, but haven't had a chance yet). David Lang I know it depends a lot on the system but for now this database is about 20 gigabytes. Not too large right now but it may grow 5x in the next year. Thanks, Juan On Wednesday 21 December 2005 22:09, Juan Casero wrote: I just sent my boss an email asking him for a Sun v20z with dual 2.2 Ghz opterons, 2 Gigs of RAM and RAID 1. I would have liked a better server capable of RAID but that seems to be out of his budget right now. Ok so I assume I get this Sun box. Most likely I will go with Linux since it is a fair bet he doesn't want to pay for the Solaris 10 x86 license. Although I kind of like the idea of using Solaris 10 x86 for this. I will assume I need to install the x64 kernel that comes with say Fedora Core 4.
Should I run the Postgresql 8.x binaries in 32 bit mode or 64 bit mode? My instinct tells me 64 bit mode is most efficient for our database size about 20 gigs right now but may grow to 100 gigs in a year or so. I just finished loading a 20 gig database on a dual 900 Mhz Ultrasparc III system with 2 gigs of ram and about 768 megs of shared memory available for the postgresql server running Solaris 10. The load has smoked a P4 3.2 Ghz system I am using also with 2 gigs of ram running postgresql 8.0.3. I mean I started the sparc load after the P4 load. The sparc load has finished already rebuilding the database from a pg_dump file but the P4 system is still going. The p4 has 1.3 Gigs of shared memory allocated to postgresql. How about them apples? Thanks, Juan On Wednesday 21 December 2005 18:57, William Yu wrote: Juan Casero wrote: Can you elaborate on the reasons the opteron is better than the Xeon when it comes to disk io? I have a PostgreSQL 7.4.8 box running a DSS. Opterons have a 64-bit IOMMU -- Xeons don't. That means in 64-bit mode, for transfers to > 4GB, the OS must allocate the memory < 4GB, DMA to that block, and then the CPU must do extra work in copying the memory to > 4GB. Versus on the Opteron, it's done by the IO adaptor using DMA in the background.
Re: [PERFORM] What's the best hardver for PostgreSQL 8.1?
On Tue, 20 Dec 2005, Juan Casero wrote: Date: Tue, 20 Dec 2005 19:50:47 -0500 From: Juan Casero <[EMAIL PROTECTED]> To: pgsql-performance@postgresql.org Subject: Re: [PERFORM] What's the best hardver for PostgreSQL 8.1? Can you elaborate on the reasons the opteron is better than the Xeon when it comes to disk io? The opteron is cheaper so you have more money to spend on disks :-) Also when you go into multi-cpu systems the front-side-bus design of the Xeons can easily become your system bottleneck, so that you can't take advantage of all the CPUs because they stall waiting for memory accesses; Opteron systems have a memory bus per socket, so the more CPUs you have the more memory bandwidth you have. The database itself is about 20 gigs but I want it to scale to 100 gigs. How large is the working set? In your tests you ran into swapping on your 1.2G system; buying a dual opteron with 16 gigs of ram will allow you to work with much larger sets of data, and you can go beyond that if needed. David Lang
Re: [PERFORM] PostgreSQL and Ultrasparc T1
On Tue, 20 Dec 2005, Alan Stange wrote: David Lang wrote: On Tue, 20 Dec 2005, Alan Stange wrote: Jignesh K. Shah wrote: I guess it depends on what you term as your metric for measurement. If it is just one query execution time... it may not be the best on UltraSPARC T1. But if you have more than 8 complex queries running simultaneously, UltraSPARC T1 can do comparatively well, provided the application can scale along with it. I just want to clarify one issue here. It's my understanding that the 8-core, 4 hardware threads (known as strands) system is seen as a 32 cpu system by Solaris. So, one could have up to 32 postgresql processes running in parallel on the current systems (assuming the application can scale). Note that like hyperthreading, the strands aren't full processors; their efficiency depends on how much the other threads sharing the core stall waiting for external things. Exactly. Until we have a machine in hand (and substantial technical documentation) we won't know all the limitations. By the way, when you do get your hands on it I would be interested to hear how Linux compares to Solaris on the same hardware. Given how new the hardware is, it's also likely that Linux won't identify the hardware properly (either seeing it as 32 true processors or just as 8 without being able to use the strands), so the initial tests may not reflect the Linux performance in a release or two. David Lang
Re: [PERFORM] PostgreSQL and Ultrasparc T1
On Tue, 20 Dec 2005, Alan Stange wrote: Jignesh K. Shah wrote: I guess it depends on what you term as your metric for measurement. If it is just one query execution time... it may not be the best on UltraSPARC T1. But if you have more than 8 complex queries running simultaneously, UltraSPARC T1 can do comparatively well, provided the application can scale along with it. I just want to clarify one issue here. It's my understanding that the 8-core, 4 hardware threads (known as strands) system is seen as a 32 cpu system by Solaris. So, one could have up to 32 postgresql processes running in parallel on the current systems (assuming the application can scale). Note that like hyperthreading, the strands aren't full processors; their efficiency depends on how much the other threads sharing the core stall waiting for external things. David Lang
Re: [PERFORM] separate drives for WAL or pgdata files
On Mon, 19 Dec 2005, David Lang wrote: this is getting dangerously close to being able to fit in ram. I saw an article over the weekend that Samsung is starting to produce 8G DIMMs, that can go 8 to a controller (instead of 4 per as is currently done); when motherboards come out that support this you can have 64G of ram per opteron socket. It will be pricey, but the performance... A message on another mailing list got me to thinking: there is the Horus project that is aiming to put together 16-socket Opteron systems within a year (they claim sooner, but I'm being pessimistic ;-). Combine this with these 8G DIMMs and you can have a SINGLE system with 1TB of ram on it (right at the limits of the Opteron's 40 bit external memory addressing) _wow_. And the thing is that it won't take much change in the software stack to deal with this. Linux is already running on machines with 1TB of ram (and 512 CPUs) so it will run very well. Postgres probably needs some attention to its locks, but it is getting that attention now (and it will get more with the Sun Niagara chips being able to run 8 processes simultaneously). Just think of the possibilities (if you have the money to afford the super machine :-). David Lang
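The 1TB figure lines up exactly with the Opteron's 40-bit physical address limit; a quick sanity check (the 16-socket and 8-DIMMs-per-controller numbers are the ones from the post above):

```python
# The Opteron's 40-bit external addressing caps a single system at 2**40 bytes.
addr_limit = 2 ** 40

dimm = 8 * 2 ** 30          # one 8G DIMM
per_socket = 8 * dimm       # 8 DIMMs per controller -> 64G per socket
system = 16 * per_socket    # a 16-socket Opteron machine

print(system == addr_limit)  # True: 16 sockets * 64G is exactly 1TB
```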
Re: [PERFORM] separate drives for WAL or pgdata files
On Mon, 19 Dec 2005, Anjan Dave wrote: I am not sure if there's an obvious answer to this... If there's a choice of an external RAID10 (Fiber Channel 6 or 8 15Krpm drives) enabled drives, what is more beneficial to store on it, the WAL, or the database files? One or the other would go on the local RAID10 (4 drives, 15Krpm) along with the OS. The WAL is small compared to the data, and it's mostly sequential access, so it doesn't need many spindles; it just needs them more-or-less dedicated to the WAL and not distracted by other things. The data is large (by comparison), and is accessed randomly, so the more spindles that you can throw at it the better. In your place I would consider making the server's internal drives into two raid1 pairs (one for the OS, one for the WAL), and then going with raid10 on the external drives for your data. This is a very busy database with high concurrent connections, random reads and writes. Checkpoint segments are 300 and interval is 6 mins. Database size is less than 50GB. This is getting dangerously close to being able to fit in ram. I saw an article over the weekend that Samsung is starting to produce 8G DIMMs, that can go 8 to a controller (instead of 4 per as is currently done); when motherboards come out that support this you can have 64G of ram per opteron socket. It will be pricey, but the performance... In the meantime you can already go 4G/slot * 4 slots/socket and get 64G on a 4-socket system. It won't be cheap, but the performance will blow away any disk-based system.
For persistent storage you can replicate from your ram-based system to a disk-based system, and as long as your replication messages hit disk quickly you can allow the disk-based version to lag behind in its updates during your peak periods (as long as it is able to catch up with the writes overnight), and as the disk-based version won't have to do the seeks for the reads it will be considerably faster than if it was doing all the work (especially if you have good, large battery-backed disk caches to go with those drives to consolidate the writes). It has become a bit more confusing because I am trying to allot shared storage across several hosts, and want to be careful not to overload one of the 2 storage processors. There's danger here: if you share spindles with other apps you run the risk of slowing down your database significantly. You may be better off with fewer, but dedicated, drives rather than more, but shared, drives. David Lang
Re: [PERFORM] Overriding the optimizer
On Fri, 16 Dec 2005, Mark Kirkwood wrote: Craig A. James wrote: What would be cool would be some way the developer could alter the plan, but the way of doing so would strongly encourage the developer to send the information to this mailing list. Postgres would essentially say, "Ok, you can do that, but we want to know why!" Yeah it would - an implementation I have seen that I like is where the developer can supply the *entire* execution plan with a query. This is complex enough to make casual use unlikely :-), but provides the ability to try out other plans, and also fix that vital query that must run today. Hmm, I wonder if this option would have uses beyond the production hacks that are being discussed. Specifically, developers working on the optimizer (or related things like clustered databases) could use the same hooks to develop and modify the 'optimizer' externally to postgres (doing an explain would let them find the costs that postgres thinks each option has, along with its recommendation, but the developer could try different execution plans without having to recompile postgres between runs; and for clustered databases where the data is split between machines, this would be a hook that the cluster engine could use to put its own plan into place without having to modify and recompile). David Lang
Re: [PERFORM] Overriding the optimizer
On Thu, 15 Dec 2005, Craig A. James wrote: The example I raised in a previous thread, of irregular usage, is the same: I have a particular query that I *always* want to be fast even if it's only used rarely, but the system swaps its tables out of the file-system cache, based on "low usage", even though the "high usage" queries are low priority. How can Postgres know such things when there's no way for me to tell it? Actually, postgres doesn't manage the file-system cache; it deliberately leaves that up to the OS it is running on to do that job. One (extremely ugly) method that you could use would be to look up what files are used to store your high priority tables and then write a trivial program to keep those files in memory (it may be as simple as mmaping the files and then going to sleep, or you may have to read various points through the file to keep them current in the cache; it WILL vary depending on your OS and filesystem in use). Oracle goes to extremes with this sort of control; I'm actually mildly surprised that they still run on a host OS and haven't completely taken over the machine (I guess they don't want to have to write device drivers; that's about the only OS code they really want to use, as they do their own memory management, filesystem, and user systems). By avoiding areas like this, postgres sacrifices a bit of performance but gains a much broader set of platforms (hardware and OS) that it can run on, and this by itself can result in significant wins (does Oracle support Opteron CPUs in 64 bit mode yet? As of this summer it just wasn't an option). David Lang
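The "extremely ugly" cache-warming trick described above can be sketched in a few lines of Python. This is purely an illustration: finding which files back which tables (via relfilenode) is not shown, and the OS remains free to evict these pages under memory pressure, so this nudges the page cache rather than truly pinning anything.

```python
import mmap
import os
import time

def touch_file(path):
    """Map a file and read one byte per page so the OS faults it into cache."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for off in range(0, size, mmap.PAGESIZE):
                m[off]  # touching one byte pulls in the whole page

def keep_warm(paths, interval=60):
    """Re-touch the files forever; eviction can still happen in between."""
    while True:
        for p in paths:
            touch_file(p)
        time.sleep(interval)
```

A background process running keep_warm() against the files backing the high-priority tables is roughly the "mmap and go to sleep, or re-read periodically" idea from the post.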
Re: [PERFORM] Simple Join
On Fri, 16 Dec 2005, Mark Kirkwood wrote: Right on. Some of these "coerced" plans may perform much better. If so, we can look at tweaking your runtime config: e.g. effective_cache_size random_page_cost default_statistics_target to see if said plans can be chosen "naturally". Mark, I've seen these config options listed as tweaking targets fairly frequently; has anyone put any thought or effort into creating a test program that could analyze the actual system and set the defaults based on the measured performance? David Lang
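A crude sketch of what such a probe might measure, in Python. Everything here is illustrative: the block size, sample count, and especially the caveat that the OS page cache will skew the numbers unless the test file is much larger than RAM (or the cache is dropped first). The resulting ratio is roughly the quantity random_page_cost tries to capture.

```python
import os
import random
import time

def random_vs_seq_ratio(path, block=8192, samples=200):
    """Rough per-block cost of random reads relative to a sequential pass."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = time.perf_counter()
        while f.read(1 << 20):          # sequential pass, 1 MiB at a time
            pass
        seq_total = time.perf_counter() - start

        start = time.perf_counter()
        for _ in range(samples):        # scattered block-sized reads
            f.seek(random.randrange(max(size - block, 1)))
            f.read(block)
        rnd_total = time.perf_counter() - start

    seq_per_block = seq_total / max(size // block, 1)
    rnd_per_block = rnd_total / samples
    return rnd_per_block / max(seq_per_block, 1e-9)
```

On a warm cache this will report close to 1.0; on a cold spinning disk it reports the seek-dominated penalty that the planner's default of 4 is meant to approximate.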
Re: [PERFORM] Overriding the optimizer
On Thu, 15 Dec 2005, Craig A. James wrote: Mark Kirkwood wrote: I hear what you are saying, but to use this fine example - I don't know what the best plan is - these experiments are part of an investigation to find *if* there is a better plan, and if so, why Postgres is not finding it. There isn't a database in the world that is as smart as a developer, or that can have insight into things that only a developer can possibly know. That is often true - but the aim is to get Postgres's optimizer closer to developer smartness. What would be cool would be some way the developer could alter the plan, but the way of doing so would strongly encourage the developer to send the information to this mailing list. Postgres would essentially say, "Ok, you can do that, but we want to know why!" At the risk of sounding flippant (which is NOT what I intend) I will point out that with the source you can change the optimizer any way you need to :-) That being said, in your example the issue is the cost of the user created function and the fact that postgres doesn't know its cost. Would a reasonable answer be to give postgres a way to learn how expensive the call is? A couple of ways I could see to do this: 1. store some stats automagically when the function is called and update the optimization plan when you do an ANALYZE 2. provide a way for a user to explicitly set a cost factor for a function (with a default value that's sane for fairly trivial functions so that it would only have to be set for unusually expensive functions) Now, neither of these will work all the time if a given function is sometimes cheap and sometimes expensive (depending on its parameters), but in that case I would say that if the application knows that a function will be unusually expensive under some conditions (and knows what those conditions will be) it may be a reasonable answer to duplicate the function: one copy that it uses most of the time, and a second copy that it uses when it expects it to be expensive.
(At this point the cost of the function can be set via either of the methods listed above.) After years of using several other database products (some supporting hint type constructs and some not), I have come to believe that hinting (or similar) actually *hinders* the development of a great optimizer. I agree. It takes the pressure off the optimizer gurus. If the users can just work around every problem, then the optimizer can suck and the system is still usable. Lest anyone think I'm an all-out advocate of overriding the optimizer, I know from first-hand experience what a catastrophe it can be. An Oracle hint I used worked fine on my test schema, but the customer's "table" turned out to be a view, and Oracle's optimizer worked well on the view whereas my hint was horrible. Unfortunately, without the hint, Oracle sucked when working on an ordinary table. Hints are dangerous, and I consider them a last resort. I've been on the linux-kernel mailing list for the last 9 years, and have seen a similar debate rage during that entire time about kernel memory management. Overall both of these tend to be conflicts between short-term and long-term benefits. In the short term the application user wants to be able to override the system to get the best performance _now_; in the long run the system designers don't trust the application programmers to get the hints right and want to figure out the right optimizer plan, even if it takes a lot longer to do so.
The key to this balance seems to be to work towards as few controls as possible, because the user will get them wrong far more frequently than they get them right; but when you hit a point where there's absolutely no way for the system to figure things out (and it's a drastic difference), provide the application with a way to hint to the system that things are unusual, but always keep looking for patterns that will let the system detect the need itself. Even the existing defaults are wrong as frequently as they are right (they were set when hardware was very different than it is today), so some way to gather real-world stats and set the system defaults based on actual hardware performance is really the right way to go (even for things like sequential scan speed that are set in the config file today). David Lang
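The first option floated above (gathering stats automagically as the function runs) amounts to keeping a running average of observed call times that an ANALYZE-style pass could fold into the planner's estimates. A toy application-side version, purely for illustration (PostgreSQL 8.1 had no such hook; later releases did add a static per-function COST attribute, which is closer to the second option):

```python
import time

class FunctionCostTracker:
    """Running mean of observed call times for one function.

    Illustration only: the kind of per-function statistic a planner
    could consult instead of assuming every function call is cheap.
    """

    def __init__(self):
        self.calls = 0
        self.total_seconds = 0.0

    def observe(self, fn, *args, **kwargs):
        # Time one call and fold it into the running totals.
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.total_seconds += time.perf_counter() - start
        self.calls += 1
        return result

    def mean_cost(self):
        return self.total_seconds / self.calls if self.calls else 0.0
```

As the thread notes, a single mean breaks down for functions whose cost swings with their parameters, which is exactly why the duplicate-the-function workaround was suggested.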
Re: [PERFORM] opinion on disk speed
On Thu, 8 Dec 2005, Vivek Khera wrote: I have a choice to make on a RAID enclosure: 14x 36GB 15kRPM ultra 320 SCSI drives OR 12x 72GB 10kRPM ultra 320 SCSI drives both would be configured into RAID 10 over two SCSI channels using a megaraid 320-2x card. My goal is speed. Either would provide more disk space than I would need over the next two years. The database does a good number of write transactions, and a decent number of sequential scans over the whole DB (about 60GB including indexes) for large reports. My only concern is the 10kRPM vs 15kRPM. The advantage of the 10k disks is that it would come from the same vendor as the systems to which it will be connected, making procurement easier. If space isn't an issue then you fall back to the old standby rules of thumb: more spindles are better (more disk heads that can move around independently), and faster drives are better (less time to read or write a track), so the 15k drive option is better. One other note: you probably don't want to use all the disks in a raid10 array; you probably want to split a pair of them off into a separate raid1 array and put your WAL on it. David Lang
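The rotational-speed half of that rule of thumb is easy to quantify: on average the head waits half a revolution for the target sector to come around, so 15k drives shave a millisecond off every random access (differences in seek time come on top of that):

```python
def avg_rotational_latency_ms(rpm):
    # One revolution takes 60000/rpm milliseconds; on average the target
    # sector is half a revolution away when the head arrives on track.
    return 0.5 * 60_000 / rpm

print(avg_rotational_latency_ms(10_000))  # 3.0 (ms)
print(avg_rotational_latency_ms(15_000))  # 2.0 (ms)
```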
Re: [PERFORM] two disks - best way to use them?
On Tue, 6 Dec 2005, Thomas Harold wrote: Ron wrote: For accuracy's sake, which exact config did you finally use? How did you choose the config you finally used? Did you test the three options or just pick one? (Note: I'm not the original poster.) I just picked the option of putting the data/pg_xlog directory (WAL) on a 2nd set of spindles. That was the easiest thing for me to change on this test box. The test server is simply a Gentoo box running software RAID and LVM2. The primary disk set is 2x7200RPM 300GB drives and the secondary disk set is 2x5400RPM 300GB drives. Brand new install of PGSQL 8.1, with mostly default settings (I changed FSM pages to be a higher value, max_fsm_pages = 15). PGSQL was given its own ext3 32GB LVM volume on the primary disk set (2x7200RPM). Originally, all files were on the primary disk. The WAL is more sensitive to drive speeds than the data is, so you may pick up a little more performance by switching the WAL to the 7200 rpm drives instead of the 5400 rpm drives. If you see a noticeable difference with this, consider buying a pair of smaller, but faster, drives (10k or 15k rpm drives, or a solid-state drive). You can test this (with significant data risk) by putting the WAL on a ramdisk and seeing what your performance looks like. David Lang
Re: [PERFORM] BLCKSZ
On Tue, 6 Dec 2005, Steinar H. Gunderson wrote: On Tue, Dec 06, 2005 at 01:40:47PM +0300, Olleg wrote: I can't understand why "bigger is better". For instance in search by index: the index points to a page and I need to load the page to get one row. Thus I load 8kb from disk for every row, and keep it then in cache. You recommend 64kb. With your recommendation I'll get 8 times more IO throughput, 8 times more head seeks on disk, and 8 times more memory cache (OS cache and postgresql) becomes busy. Hopefully, you won't have eight times the seeking; a single block ought to be in one chunk on disk. You're of course at your filesystem's mercy, though. In fact it would usually mean 1/8 as many seeks: since the 64k chunk would be created all at once, it's probably going to be one chunk on disk as Steinar points out, and that means that you do one seek per 64k instead of one seek per 8k. With current disks it's getting to the point where it's the same cost to read 8k as it is to read 64k (i.e. almost free; you could read substantially more than 64k and not notice it in I/O speed); it's the seeks that are expensive. Yes it will eat up more ram, but assuming that you are likely to need other things nearby, it's likely to be a win. As processor speed keeps climbing compared to memory and disk speed, true random access is really not the correct way to think about I/O anymore. It's frequently more appropriate to think of your memory and disks as if they were tape drives (seek then read, repeat). Even for memory access, what you really do is seek to the beginning of a block (expensive), then read that block into cache (cheap, you get the entire cacheline of 64-128 bytes whether you need it all or not), and then you can access that block fairly quickly. With memory on SMP machines it's a constant cost to seek anywhere in memory; with NUMA machines (including multi-socket Opterons) the cost of the seek and fetch depends on where in memory you are seeking to and what cpu you are running on.
It also becomes very expensive for multiple CPUs to write to memory addresses that are in the same block (cacheline) of memory. For disks it's even more dramatic: the seek is incredibly expensive compared to the read/write, and the cost of the seek varies based on how far you need to seek, but once you are on a track you can read the entire track in for about the same cost as a single block (in fact the drive usually does read the entire track before sending the one block on to you). RAID complicates this because you have a block size per drive, and reading larger than that block size involves multiple drives. Most of the work in dealing with these issues and optimizing for them is the job of the OS. Some other databases work very hard to take over this work from the OS; Postgres instead tries to let the OS do it, but we still need to keep it in mind when configuring things, because it's possible to make it much easier or much harder for the OS to optimize things. David Lang
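The "reads are almost free, seeks are expensive" point is easy to check with rough numbers for a drive of that era; the 8 ms average seek and 60 MB/s sustained transfer rate below are assumptions for illustration, not figures from the thread:

```python
def read_time_ms(nbytes, seek_ms=8.0, mb_per_s=60.0):
    """Time for one random read: one seek plus the data transfer."""
    transfer_ms = nbytes / (mb_per_s * 1e6) * 1000.0
    return seek_ms + transfer_ms

t8k = read_time_ms(8 * 1024)     # ~8.14 ms, almost all of it the seek
t64k = read_time_ms(64 * 1024)   # ~9.09 ms: 8x the data...
print(round(t64k / t8k, 2))      # ...for only ~12% more time
```

Under these assumptions an 8x larger block costs about 12% more wall time per read, which is the argument for larger I/O chunks when nearby data is likely to be wanted anyway.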
Re: [PERFORM] two disks - best way to use them?
On Mon, 5 Dec 2005, Thomas Harold wrote: Yeah, I don't think I was clear about the config. It's (4) disks set up as a pair of RAID1 sets. My original config was pgsql on the first RAID set (data and WAL). I'm now experimenting with putting the data/pg_xlog folder on the 2nd set of disks. Under the old setup (everything on the original RAID1 set, in a dedicated 32GB LVM volume), I was seeing 80-90% wait percentages in "top". My understanding is that this is an indicator of an overloaded / bottlenecked disk system. This was while doing massive inserts into a test table (millions of narrow rows). I'm waiting to see what happens once I have data/pg_xlog on the 2nd disk set.

in that case you logically have two disks, so see the post from Ron earlier in this thread.

David Lang ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] two disks - best way to use them?
On Mon, 5 Dec 2005, Thomas Harold wrote: (noob question incoming) Section 26.4 WAL Internals http://www.postgresql.org/docs/8.1/interactive/wal-internals.html This seems to be the applicable chapter. They talk about creating a symlink for the data/pg_xlog folder to point at another disk set. If I have (2) RAID1 sets with LVM2, can I instead create a logical volume on the 2nd disk set and just mount data/pg_xlog to point at the logical volume on the 2nd disk set? For example, I have an LVM on my primary mirror called 'pgsql'. And I've created a 2nd LVM on my secondary mirror called 'pgxlog'. These are mounted as: /dev/vgraida/pgsql on /var/lib/postgresql type ext3 (rw,noatime) /dev/vgraidb/pgxlog on /var/lib/postgresql/data/pg_xlog type ext3 (rw,noatime) From the application's P.O.V., it's the same thing, right? (It seems to be working, I'm just trying to double-check that I'm not missing something.)

the application can't tell the difference, but the reason for separating them isn't for the application, it's so that different pieces of hardware can work on different things without having to bounce back and forth between them. using the same drives with LVM doesn't achieve this goal.

the problem is that the WAL does a LOT of writes, and postgres waits until each write is completed before going on to the next thing (for safety). if a disk is dedicated to the WAL then the head doesn't move much. if the disk is used for other things as well, then the heads have to move across the disk surface between the WAL and where the data is. this drastically slows down the number of items that can go into the WAL, and therefore slows down the entire system.
this slowdown isn't even something as simple as cutting your speed in half (half the time spent working on the WAL, half spent on the data itself); it's more like 10% spent on the WAL, 10% spent on the data, and 80% moving back and forth between them (I am probably wrong on the exact numbers, but it is something similarly drastic). this is also the reason why it's so good to have a filesystem journal on a different drive.

David Lang ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Database restore speed
On Sat, 3 Dec 2005, Luke Lonergan wrote: Tom, On 12/3/05 12:32 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote: "Luke Lonergan" <[EMAIL PROTECTED]> writes: Last I looked at the Postgres binary dump format, it was not portable or efficient enough to suit the need. The efficiency problem with it was that there was descriptive information attached to each individual data item, as compared to the approach where that information is specified once for the data group as a template for input.

Are you complaining about the length words? Get real...

Hmm - <length word><data> repeat; the efficiency is 1/2 that of <data> repeat. I think that's worth complaining about.

but how does it compare to the ASCII representation of that int? (remember to include your separator characters as well.) yes it seems less efficient, and it may be better to do something like sending a record description header that gives the sizes of each item and then sending the records following it without the size words, but either way should still be an advantage over the existing ASCII messages. also, how large is the length word in the message?

there are other optimizations that can be done as well, but if there's still a question about whether it's worth doing the parsing on the client, then a first implementation should be done without making too many changes, to test things. also, some of the optimizations need measurements done to see if they are worth it (even something that seems as obvious as separating the sizes from the data, as you suggest above, has a penalty: it spreads the data that needs to be accessed to process a line across different cache lines, so in some cases it won't be worth it)

David Lang ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
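The size comparison being argued here can be sketched with toy encodings; neither function below is the actual Postgres wire format, they just model "length word plus data" per field against tab-separated ASCII:

```python
import struct

def binary_with_length(values):
    # <int32 length><int32 value> per field: the length word doubles the
    # size of each int32, which is the 1/2 efficiency complaint above
    return b"".join(struct.pack("!ii", 4, v) for v in values)

def ascii_with_sep(values):
    # the ASCII representation, separator characters included
    return "\t".join(str(v) for v in values).encode() + b"\n"

vals = [1234567890, -42, 0, 999999]
print(len(binary_with_length(vals)), len(ascii_with_sep(vals)))
```

Which encoding is smaller depends on the magnitude of the values: small integers render shorter in ASCII, large ones shorter in binary, which is exactly why the question "how does it compare to the ASCII representation?" needs measuring rather than asserting.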
Re: [PERFORM] Database restore speed
On Fri, 2 Dec 2005, Luke Lonergan wrote: And how do we compose the binary data on the client? Do we trust that the client encoding conversion logic is identical to the backend's? If there is a difference, what happens if the same file loaded from different client machines has different results? Key conflicts when loading a restore from one machine and not from another? - Luke

the same way you deal with text data that could be in different encodings: you tag your message with the format version you are using, and throw an error if you get a format you don't understand how to deal with. if a client claims to be using one format but is instead doing something different, you will be in deep trouble anyway. remember, we aren't talking about random application code here, we are talking about postgres client code and libraries; if the library is incorrect then it's a bug, and parsing bugs could happen in the server as well. (in fact, the server could parse things to the intermediate format and then convert them; this sounds expensive, but given the high clock multipliers in use, it may not end up being measurable)

David Lang ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Database restore speed
On Fri, 2 Dec 2005, Luke Lonergan wrote: Michael, On 12/2/05 1:46 PM, "Michael Stone" <[EMAIL PROTECTED]> wrote: Not necessarily; you may be betting that it's more *efficient* to do the parsing on a bunch of lightly loaded clients than your server. Even if you're using the same code this may be a big win.

If it were possible in light of the issues on client parse / convert, then we should analyze whether it's a performance win. In the restore case, where we've got a dedicated server with a dedicated client machine, I don't see why there would be a speed benefit from running the same parse / convert code on the client versus running it on the server. Imagine a pipeline where there is a bottleneck; moving the bottleneck to a different machine doesn't make it less of a bottleneck.

your database server needs to use its CPU for other things besides the parsing. you could buy a bigger machine, but it's usually far cheaper to buy two dual-proc machines than one quad-proc machine (and if your load is such that you already have an 8-proc machine as the database, swallow hard when you ask for the price of a 16-proc machine), and in addition there is a substantial efficiency loss on multi-proc machines (some software, some hardware) that may give you more available work cycles on the multiple small machines.

if you can remove almost all the parsing load (CPU cycles, memory footprint, and cache thrashing effects) then that box can do the rest of its work more efficiently, while the client uses what would otherwise be idle CPU to do the parsing. if you only have a 1-1 relationship it's a good question whether it's a win (it depends on how much other work each box is doing to support this), but if you allow for multiple clients it easily becomes a win.
David Lang ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Database restore speed
On Fri, 2 Dec 2005, Luke Lonergan wrote: Stephen, On 12/2/05 1:19 PM, "Stephen Frost" <[EMAIL PROTECTED]> wrote: I've used the binary mode stuff before, sure, Postgres may have to convert some things but I have a hard time believing it'd be more expensive to do a network_encoding -> host_encoding (or toasting, or whatever) than to do the ascii -> binary change.

From a performance standpoint no argument, although you're betting that you can do parsing / conversion faster than the COPY core in the backend can (I know *we* can :-). It's a matter of safety and generality - in general you can't be sure that client machines / OSes will render the same conversions that the backend does in all cases IMO. One more thing - this is really about the lack of a cross-platform binary input standard for Postgres IMO. If there were such a thing, it *would* be safe to do this. The current binary spec is not cross-platform AFAICS; it embeds native representations of the DATUMs and does not specify a universal binary representation of same. For instance - when representing a float, is it an IEEE 32-bit floating point number in little endian byte ordering? Or is it IEEE 64-bit? With libpq, we could do something like an XDR implementation, but the machinery isn't there AFAICS.

This makes sense. however, it then raises the question of how much effort it would take to define such a standard and implement the shim layer needed to accept the connections, versus how much of a speedup it would result in (the gain could probably be approximated with just a little hacking to use the existing binary format between two machines of the same type).

as for the standards, standard network byte order is big endian, so that should be the standard used (in spite of the quantity of x86 machines out there). for the size of the data elements, using the largest size of each will probably still be a win in size compared to ASCII.
converting between binary formats is usually a matter of a few AND and shift opcodes (and with the core so much faster than its memory, you can afford to do quite a few of these on each chunk of data without it being measurable in your overall time)

an alternative would be to add a 1-byte data type before each data element to specify its type, but then the server side code would have to be smarter to deal with the additional possibilities.

David Lang ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
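As a sketch of how cheap that byte-order conversion is, here is a big-endian (network order) int32 decoded with nothing but shifts, masks, and a subtract; this is illustrative, not libpq code:

```python
import struct

def be_to_int32(buf):
    # Reassemble a network-order int32 from 4 bytes: the "few AND and
    # shift opcodes" described above, plus a two's-complement adjustment.
    u = (buf[0] << 24) | (buf[1] << 16) | (buf[2] << 8) | buf[3]
    return u - (1 << 32) if buf[0] & 0x80 else u

wire = struct.pack("!i", -123456)   # what a big-endian sender would emit
print(be_to_int32(wire))            # -123456
```

On a little-endian x86 receiver the same work is a single byte-swap instruction, which is why the conversion cost disappears into memory-access time.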
Re: [PERFORM] Database restore speed
On Fri, 2 Dec 2005, Michael Stone wrote: On Fri, Dec 02, 2005 at 01:24:31PM -0800, Luke Lonergan wrote: From a performance standpoint no argument, although you're betting that you can do parsing / conversion faster than the COPY core in the backend can

Not necessarily; you may be betting that it's more *efficient* to do the parsing on a bunch of lightly loaded clients than your server. Even if you're using the same code this may be a big win.

it's a lot easier to throw hardware at the problem by splitting your incoming data between multiple machines and having them all work in parallel throwing the data at one database than it is to throw more hardware at the database server to speed it up (and yes, assuming that MPP splits the parsing costs as well, it can be an answer for some types of systems)

David Lang ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Database restore speed
On Fri, 2 Dec 2005, Luke Lonergan wrote: Stephen, On 12/2/05 12:18 PM, "Stephen Frost" <[EMAIL PROTECTED]> wrote: Just a thought, but couldn't psql be made to use the binary mode of libpq and do at least some of the conversion on the client side? Or does binary mode not work with copy (that wouldn't surprise me, but perhaps copy could be made to support it)?

Yes - I think this idea is implicit in what David suggested, and my response as well. The problem is that the way the client does conversions can potentially differ from the way the backend does. Some of the types in Postgres are machine intrinsic and the encoding conversions use on-machine libraries, each of which precludes the use of client conversion methods (without a lot of restructuring). We'd tackled this problem in the past and concluded that the parse / convert stage really belongs in the backend.

I'll bet this parsing cost varies greatly with the data types used. I'm also willing to bet that for the data types that have different encodings on different systems there could be an intermediate encoding that is far faster to parse than ASCII text is. for example (and I know nothing about the data storage itself, so this is just an example), if the issue was storing numeric values on big endian and little endian systems (and 32 bit vs 64 bit systems, to end up with 4 ways of holding the data), you have a substantial cost in parsing the ASCII and converting it to a binary value, and the client can't (and shouldn't) know which endianness and word size the server has. but it could create a big endian multi-precision encoding that would then be very cheap for the server to split and flip as necessary. yes this means more work is done overall, but it's split between different machines, and the binary representation of the data will probably reduce your network traffic as a side effect.
and for things like dates, which get parsed in multiple ways until one is found that seems sane, there's a significant amount of work that the server could avoid.

David Lang

The other thought, of course, is that you could use PITR for your backups instead of pgdump...

Totally - great idea. if this is actually a backup / restore, then PITR plus filesystem copy (tarball) is hugely faster than dump / restore. - Luke ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] filesystem performance with lots of files
On Fri, 2 Dec 2005, Qingqing Zhou wrote: I don't have all the numbers readily available (and I didn't do all the tests on every filesystem), but I found that even with only 1000 files/directory ext3 had some problems, and if you enabled dir_hash some functions would speed up, but writing lots of files would just collapse (that was the 80 min run)

Interesting. I would suggest that testing a small number of bigger files would be better if the target is a database performance comparison. By small number, I mean 10^2 - 10^3; by bigger, I mean file sizes from 8k to 1G (a PostgreSQL data file is at most this size under a normal installation).

I agree. that round of tests was done on my system at home, and was in response to a friend who had rsync over a local lan take > 10 hours for <10G of data. but even so it generated some interesting info. I need to make a more controlled run at it though.

Let's take TPCC as an example: if we get a TPCC database of 500 files, each one at most 1G (PostgreSQL has this feature/limit in an ordinary installation), then this will give us a 500G database, which is big enough for your current configuration. Regards, Qingqing ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Database restore speed
On Fri, 2 Dec 2005, Luke Lonergan wrote: Steve, When we restore, the postmaster process tries to use 100% of the CPU. The questions we have are: 1) What is postmaster doing that it needs so much CPU? Parsing mostly, and attribute conversion from text to DBMS native formats. 2) How can we get our system to go faster? Use Postgres 8.1 or Bizgres. Get a faster CPU. These two points are based on our work to improve COPY speed, which led to a near doubling in Bizgres, and in the 8.1 version it's about 60-70% faster than in Postgres 8.0. There are currently two main bottlenecks in COPY: one is parsing + attribute conversion (if the postgres CPU is nailed at 100%, that's what your limit is) and the other is the write speed through the WAL. You can roughly divide the write speed of your disk by 3 to get that limit, e.g. if your disk can write 8k blocks at 100MB/s, then your COPY speed might be limited to 33MB/s. You can tell which of these limits you've hit using "vmstat 1" on Linux or iostat on Solaris, watching the blocks input/output on your disk while you watch your CPU.

Luke, would it help to have one machine read the file and connect to postgres on a different machine when doing the copy? (I'm thinking that the first machine may be able to do a lot of the parsing and conversion, leaving the second machine to just worry about doing the writes)

David Lang ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
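Luke's divide-by-three rule of thumb is simple enough to capture directly; the factor of 3 is his stated estimate for WAL plus table writes, not something measured here:

```python
def copy_speed_limit_mb_s(disk_write_mb_s, wal_factor=3.0):
    # Rule of thumb from the post above: COPY throughput is limited to
    # roughly the disk's sequential write speed divided by ~3, because
    # the data is written through the WAL as well as to the table.
    return disk_write_mb_s / wal_factor

print(f"{copy_speed_limit_mb_s(100):.0f} MB/s")  # the 100MB/s -> ~33MB/s example
```

If the postgres backend's CPU is pegged at 100% before this limit is reached, the bottleneck is parsing/conversion instead, per the same post.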
Re: [PERFORM] 15,000 tables
On Thu, 1 Dec 2005, Craig A. James wrote: So say I need 10,000 tables, but I can create tablespaces. Wouldn't that solve the performance problem caused by Linux's (or ext2/3's) problems with large directories? For example, if each user creates (say) 10 tables, and I have 1000 users, I could create 100 tablespaces, and assign groups of 10 users to each tablespace. This would limit each tablespace to 100 tables, and keep the ext2/3 file-system directories manageable. Would this work? Would there be other problems?

This would definitely help. however, there's still the question of how large the tables get, and how many total files are needed to hold the 100 tables. you still have the problem of having to seek around to deal with all these different files (and tablespaces just spread them further apart). you can't solve this, but a large write-back journal (as opposed to metadata-only) would mask the problem. it would be a trade-off: you would end up writing all your data twice, so the throughput would be lower, but since the data is safe as soon as it hits the journal, the latency for any one request would be lower, which would allow the system to use the CPU more and overlap it with your seeking.

David Lang ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
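The grouping scheme described above (1000 users, 10 users per tablespace, 100 tablespaces) is easy to sketch; the `ts_NNN` naming is made up for illustration:

```python
def tablespace_for(user_id, users_per_space=10):
    # Map each user to a tablespace in fixed-size groups: with 1000 users
    # and 10 users (~100 tables) per tablespace this yields 100
    # tablespaces, keeping each backing directory small.
    return f"ts_{user_id // users_per_space:03d}"

print(tablespace_for(0), tablespace_for(42), tablespace_for(999))
```

Each `CREATE TABLE` for a given user would then name the tablespace this function returns, so no single filesystem directory ever accumulates more than ~100 tables' worth of files.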
Re: [PERFORM] Open request for benchmarking input (fwd)
here are the suggestions from the MySQL folks; what additional tests should I do? I'd like to see some tests submitted that map out when not to use a particular database engine, so if you have a test that you know a particular database chokes on, let me know (bonus credibility if you include tests that your own database has trouble with :)

David Lang

-- Forwarded message -- Date: Thu, 01 Dec 2005 16:14:25

David, The choice of benchmark depends on what kind of application you would like to see performance for. When someone speaks of one database being faster than another in general, it makes me smile. That would be the same as claiming one car could win every competition from Formula-1 to off-road racing. There are certain well known cases where MySQL will be faster - for example, the in-memory storage engine is hard to beat for point selects, or bulk inserts into MyISAM (no transactional overhead). There are certain known cases where MySQL would not perform well - it is easy to build a query using subqueries which would be horribly slow on MySQL but decent on postgresql... but when writing an application for MySQL you would not write such a query. I think the most database-agnostic way would be to select the "workload" from the user's point of view and have it implemented in the most efficient way for the database in question - for example you may find TPC-C implementations by different vendors are a lot different.

For my own interests, I would like to at least cover the following bases: 32 bit vs 64 bit vs 64 bit kernel + 32 bit user-space; data warehouse type tests (data >> memory); and web prefs test (active data < RAM)

You may grab the Dell DVD store: http://linux.dell.com/dvdstore/ for a web benchmark. It does not have PostgreSQL built in, but there are some implementations available on the Internet. DBT2 by OSDL is another good candidate - it supports PostgreSQL and MySQL natively.
If you want some raw performance numbers such as selects/sec, you may use SysBench - http://sysbench.sourceforge.net For DataWarehouse workloads you could grab the TPC-H or DBT3 implementation by OSDL - we run this successfully with MySQL. You also could take a look at http://benchw.sourceforge.net/ ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] filesystem performance with lots of files
On Thu, 1 Dec 2005, Qingqing Zhou wrote: "David Lang" <[EMAIL PROTECTED]> wrote: a few weeks ago I did a series of tests to compare different filesystems. the test was for a different purpose so the particulars are not what I would do for testing aimed at postgres, but I think the data is relevant, and I saw major differences between different filesystems. I'll see about re-running the tests to get a complete set of benchmarks in the next few days. My tests had their times vary from 4 min to 80 min depending on the filesystem in use (ext3 with hash_dir posted the worst case). what testing have other people done with different filesystems?

That's good ... what benchmarks did you use?

I was doing testing in the context of a requirement to sync over a million small files from one machine to another (rsync would take >10 hours to do this over a 100Mb network, so I started with the question 'how long would it take to do a tar-ftp-untar cycle with no smarts'). so I created 1m x 1K files in a three-deep directory tree (10d/10d/10d/1000files) and was timing simple things: 'time to copy the tree', 'time to create the tar', 'time to extract from the tar', 'time to copy the tarfile' (a 1.6G file). I flushed the memory between each test with cat largefile >/dev/null (I know now that I should have unmounted and remounted between each test), with source and destination on different IDE controllers.

I don't have all the numbers readily available (and I didn't do all the tests on every filesystem), but I found that even with only 1000 files/directory ext3 had some problems, and if you enabled dir_hash some functions would speed up, but writing lots of files would just collapse (that was the 80 min run)

I'll have to script it and re-do the tests (and when I do this I'll also set it to do a test with far fewer, far larger files as well)

David Lang ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
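A scaled-down sketch of that benchmark harness (build a directory tree of many small files, then time a tar of it) might look like the following; the tree here is deliberately tiny so it runs in moments, unlike the original 1M-file 10d/10d/10d/1000files layout:

```python
import os
import shutil
import tarfile
import tempfile
import time

def build_tree(root, depth=2, fanout=4, files_per_leaf=10, size=1024):
    # Create a fanout^depth directory tree with small files in each leaf,
    # a miniature version of the 10/10/10/1000-files layout above.
    def recurse(path, level):
        if level == 0:
            for i in range(files_per_leaf):
                with open(os.path.join(path, f"f{i}"), "wb") as f:
                    f.write(b"x" * size)
            return
        for i in range(fanout):
            sub = os.path.join(path, f"d{i}")
            os.mkdir(sub)
            recurse(sub, level - 1)
    recurse(root, depth)

root = tempfile.mkdtemp()
t0 = time.time()
build_tree(root)                       # 'time to create the tree'
create_s = time.time() - t0

tar_path = os.path.join(tempfile.gettempdir(), "tree.tar")
t0 = time.time()
with tarfile.open(tar_path, "w") as tar:  # 'time to create the tar'
    tar.add(root, arcname="tree")
tar_s = time.time() - t0

print(f"create: {create_s:.3f}s  tar: {tar_s:.3f}s")
shutil.rmtree(root)
```

Scaling `depth`, `fanout`, and `files_per_leaf` back up (and unmounting between runs, as noted above) would reproduce the original test; the timings here are only meaningful at full scale.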
Re: [PERFORM] About the relation between fragmentation of file and
On Thu, 1 Dec 2005, Richard Huxton wrote: Tatsumi Abe wrote: Question is about the relation between fragmentation of file and VACUUM performance. OS: RedHat Enterprise Linux AS Release 3 (Taroon Update 6) Kernel 2.4.21-37.ELsmp on an i686 Filesystem Type ext3 Filesystem features: has_journal filetype needs_recovery sparse_super large_file

try different filesystems; ext2/3 do a very poor job when you have lots of files in a directory (and 7000+ files is a lot). you can also try mounting the filesystem with noatime,nodiratime to reduce the seeks when reading, and try mounting it with oldalloc (which changes how the files are arranged on disk when writing and extending them). I've seen drastic speed differences between ext2 and ext3 based on this option (ext2 defaults to oldalloc, ext3 defaults to orlov, which is faster in many cases)

CPU: Intel(R) Xeon(TM) CPU 2.80GHz stepping 01 Memory: 2.0GB HDD: 80GB (S-ATA) SATA max UDMA/133 PostgreSQL: 7.3.8

1. Approx. there are 3500 tables in the DB. When the performance of inserting data was measured in the above-mentioned environment, it takes six minutes to write 1 lines after 4/5 days the measurement had begun. While searching for the bottleneck by executing the iostat command, it became clear that disk I/O was the problem, as %iowait was almost 100% at that time. On the very first day the processing time of VACUUM is not a problem, but as the days progress its processing time keeps increasing. Then I examined the fragmentation of the database area (pgsql/data/base) using the following tool. Disk Allocation Viewer http://sourceforge.net/projects/davtools/ The fragmentation rate is 28% before defragging.

I'd guess the root of your problem is the number of tables (3500), which if each has one index represents at least 7000 files. That means a lot of your I/O time will probably be spent moving the disk heads between the different files.
depending on the size of the tables it can actually be a lot worse than this (remember Postgres splits the tables into fixed size chunks). when postgres adds data it will eventually spill over into additional files. when you do a vacuum, does it re-write the tables into a smaller number of files, or just rewrite the individual files (making each of them smaller, but keeping the same number of files)?

speaking of this, the selection of the size of these chunks is a compromise between the time needed to seek in an individual file and the number of files that are created. is there an easy way to tinker with this? (I am sure the default is not correct for all filesystems; filesystems' handling of large and/or many files differs drastically)

You say you can't stop the server, so there's no point in thinking about a quick hardware upgrade to help you. Also a version upgrade is not doable for you.

there's a difference between stopping the server once for an upgrade (hardware or software) and having to stop it every few days to defrag things forever after.

David Lang

I can only think of two other options: 1. Change the database schema to reduce the number of tables involved. I'm assuming that of the 3500 tables most hold the same data but for different clients (or something similar). This might not be practical either. 2. Re-order how you access the database. ANALYSE the updated tables regularly, but only VACUUM them after deletions. Group your inserts so that all the inserts for table1 go together, then all the inserts for table2 go together and so on. This should help with the fragmentation by making sure the files get extended in larger chunks. Are you sure it's not possible to spend 15 mins offline to solve this? -- Richard Huxton Archonet Ltd ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[PERFORM] filesystem performance with lots of files
this subject has come up a couple times just today (and it looks like one that keeps popping up).

under linux, ext2/3 have two known weaknesses (or rather one weakness with two manifestations). searching through large objects on disk is slow; this applies to both directories (creating, opening, or deleting files if there are, or have been, lots of files in a directory) and files (seeking to the right place in a file).

the rule of thumb that I have used for years is that if files get over a few tens of megs, or directories get over a couple thousand entries, you will start slowing down. common places you can see this (outside of postgres):

1. directories: mail or news storage. if you let your /var/spool/mqueue directory get large (for example on a server that can't send mail for a while, or where mail gets misconfigured), there may only be a few files in there after it gets fixed, but if the directory was once large, just doing an ls on it will be slow. news servers that store each message as a separate file suffer from this as well; they work around it by using multiple layers of nested directories so that no directory has too many files in it (navigating the layers of directories costs as well, it's all about the tradeoffs). Mail servers that use maildir (and Cyrus, which uses a similar scheme) have the same problem. to fix this you have to create a new directory, move the files to it, and then rename the new to the old. ext3 has an option to make searching directories faster (htree), but enabling it kills performance when you create files. And this doesn't help with large files.

2. files: mbox formatted mail files and log files. as these files get large, the process of appending to them takes more time. syslog makes this very easy to test.
On a box that does synchronous syslog writing (the default for most systems using standard syslog; on linux make sure there is not a - in front of the logfile name), time how long it takes to write a bunch of syslog messages, then make the log file large and time it again.

a few weeks ago I did a series of tests to compare different filesystems. the test was for a different purpose so the particulars are not what I would do for testing aimed at postgres, but I think the data is relevant, and I saw major differences between different filesystems. I'll see about re-running the tests to get a complete set of benchmarks in the next few days. My tests had their times vary from 4 min to 80 min depending on the filesystem in use (ext3 with hash_dir posted the worst case). what testing have other people done with different filesystems?

David Lang ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
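The nested-directory workaround described above for news and maildir-style stores can be sketched as a hash-based path layout; the hashing scheme, depth, and paths here are illustrative, not what any particular server actually uses:

```python
import hashlib
import os

def nested_path(root, name, levels=2, width=2):
    # Spread files across nested subdirectories so no single directory
    # grows past a couple thousand entries: hash the filename and use
    # the leading hex digits as intermediate directory names.
    h = hashlib.md5(name.encode()).hexdigest()
    parts = [h[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, name)

p = nested_path("/var/spool/store", "message-12345")
print(p)  # e.g. /var/spool/store/<xx>/<yy>/message-12345
```

With two levels of 2 hex digits each, a million files spread over 65,536 leaf directories averages about 15 files per directory, at the cost of the extra directory lookups noted in the tradeoff above.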
Re: [PERFORM] 15,000 tables
On Thu, 1 Dec 2005, Michael Riess wrote: Hi, we are currently running a postgres server (upgraded to 8.1) which has one large database with approx. 15,000 tables. Unfortunately performance suffers from that, because the internal tables (especially the one which holds the attribute info) get too large.

is it because the internal tables get large, or is it a problem with disk I/O? with 15,000 tables you are talking about a LOT of files to hold these (30,000 files with one index each, and each table being small enough to not need more than one file to hold it). on linux ext2/3 this many files in one directory will slow you down horribly. try different filesystems (from my testing and from other posts it looks like XFS is a leading contender), and also play around with the tablespaces feature in 8.1 to move things out of the main data directory into multiple directories. if you do an ls -l on the parent directory you will see that the size of the directory itself is large if it's ever had lots of files in it; the only way to shrink it is to mv the old directory to a new name, create a new directory, and move the files from the old directory to the new one.

David Lang ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Mon, 28 Nov 2005, Brendan Duddridge wrote: Forgive my ignorance, but what is MPP? Is that part of Bizgres? Is it possible to upgrade from Postgres 8.1 to Bizgres?

MPP is the Greenplum proprietary extension to postgres that spreads the data over multiple machines (raid, but with entire machines, not just drives, complete with data replication within the cluster to survive a machine failing). for some types of queries they can definitely scale linearly with the number of machines (other queries are far more difficult, and the overhead of coordinating the machines shows more; this is one of the key things that the new version they recently announced the beta for is supposed to be drastically improving). early in the year when I first looked at them their prices were exorbitant, but Luke says I'm wildly mistaken about their current prices, so call them for details. it uses the same interfaces as postgres, so it should be a drop-in replacement to replace a single server with a cluster. it's fascinating technology to read about.

I seem to remember reading that one of the other postgres companies is also producing a clustered version of postgres, but I don't remember who, and I know nothing about them.

David Lang ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Mon, 28 Nov 2005, Brendan Duddridge wrote: Hi David, Thanks for your reply. So how is that different than something like Slony2 or pgcluster with multi-master replication? Is it similar technology? We're currently looking for a good clustering solution that will work on our Apple Xserves and Xserve RAIDs. MPP doesn't just split up the data, it splits up the processing as well. if you have a 5 machine cluster, each machine holds 1/5 of your data (plus a backup for one of the other machines), and when you do a query MPP slices and dices it to send a subset of the query to each machine, then gets the responses from all the machines and combines them. if you have to do a full table scan, for example, each machine only has to go through 20% of the data. a Slony or pgcluster setup has each machine with a full copy of all the data: only one machine can work on a given query at a time, and if you have to do a full table scan one machine needs to read 100% of the data. in many ways this is the holy grail of databases. almost all other areas of computing can now be scaled by throwing more machines at the problem in a cluster, with each machine just working on its piece of the problem, but databases have had serious trouble doing the same and so have been ruled by the 'big monster machine'. Oracle has been selling Oracle RAC for a few years, and reports from people who have used it range drastically (from 'it works great' to 'it's a total disaster'), in part depending on the types of queries being made.
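The scatter/gather idea described above can be illustrated with a toy shell sketch (plain files and grep standing in for tables and scans; this is in no way Greenplum's actual implementation):

```shell
# Toy MPP-style scan: partition a "table" across 5 "machines", let each
# scan only its shard in parallel, then have a coordinator combine results.
tmp=$(mktemp -d)
seq 1 100000 > "$tmp/table"                        # the full table
# scatter: round-robin the rows into 5 shard files
awk -v n=5 -v d="$tmp" '{ print > (d "/shard." NR % n) }' "$tmp/table"
for s in "$tmp"/shard.[0-4]; do                    # each "machine"...
  grep -c '7$' "$s" > "$s.count" &                 # ...scans only 20% of the data
done
wait
# gather: the coordinator sums the partial counts
total=$(awk '{ s += $1 } END { print s }' "$tmp"/shard.*.count)
echo "rows matching: $total"
```

The combined answer matches a single full scan of the table, but each worker touched a fifth of the rows, which is the whole appeal.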
Greenplum thinks that they have licked the problems for the more general case (and that commodity networks are now fast enough to match disk speeds in processing the data). if they are right, then when they hit full release with the new version they should be cracking a lot of the price/performance records on the big database benchmarks (TPC and similar), and if their pricing is reasonable, they may be breaking them by an order of magnitude or more. it's not unusual for the top machines to spend more than $1,000,000 on just their disk arrays for those systems; MPP could conceivably put together a cluster of $5K machines that runs rings around them (and probably will for at least some of the subtests; the big question is whether they can sweep the board and take the top spots outright). they have more details (and marketing stuff) on their site at http://www.greenplum.com/prod_deepgreen_cluster.html don't get me wrong, I am very impressed with their stuff, but (having ranted a little here on the list about them) I think MPP and its performance is a bit off topic for the postgres performance list (at least until the postgres project itself starts implementing similar features :-) David Lang Thanks, Brendan Duddridge | CTO | 403-277-5591 x24 | [EMAIL PROTECTED] ClickSpace Interactive Inc. Suite L100, 239 - 10th Ave. SE Calgary, AB T2G 0V9 http://www.clickspace.com On Nov 27, 2005, at 8:09 PM, David Lang wrote: On Mon, 28 Nov 2005, Brendan Duddridge wrote: Forgive my ignorance, but what is MPP? Is that part of Bizgres? Is it possible to upgrade from Postgres 8.1 to Bizgres?
Re: [PERFORM] Open request for benchmarking input
On Sun, 27 Nov 2005, Andreas Pflug wrote: David Lang wrote: Postgres needs to work on the low end stuff as well as the high end stuff, or people will write their apps to work with things that DO run on low end hardware and spend much more money than is needed scaling the hardware up rather than re-writing their apps. I agree that pgsql runs on low end stuff, but a dual Opteron with 2x15kSCSI isn't low end, is it? The CPU/IO performance isn't balanced for the total cost; you could probably get a single-CPU/6x15kRPM machine for the same price delivering better TP performance in most scenarios. Benchmarks should deliver results that are somewhat comparable. If performed on machines that don't deliver a good CPU/IO power balance for the type of DB load being tested, they're misleading and hardly usable for comparison purposes, and even less for learning how to configure a decent server, since you might have to tweak some parameters in an unusual way. a couple things to note: first, when running benchmarks there is a need for client machines to stress the database, and these machines are what's available to act as clients as well as servers. second, the smaller machines are actually about what I would spec out for a reasonably small high-performance database. a couple of the boxes have 144G drives; if they are set up as RAID 1, the boxes would be reasonable for a database up to 50G or larger (assuming you need space on the DB server to dump the database; up to 100G or so if you don't). David Lang
Re: [PERFORM] Hardware/OS recommendations for large databases (
On Sun, 27 Nov 2005, Luke Lonergan wrote: For data warehousing it's pretty well open and shut. To use all CPUs and I/O channels on each query you will need MPP. Has anyone done the math on the original post? 5TB takes how long to scan once? If you want to wait less than a couple of days just for a seq scan, you'd better be in the multi-GB per second range. if you truly need to scan the entire database then you are right, however indexes should be able to cut the amount you need to scan drastically. David Lang
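The back-of-the-envelope math is easy to run; using the sequential-scan rates quoted later in this thread, one pass over 5TB works out to:

```shell
# Time to sequentially scan 5 TB at various aggregate scan rates (MB/s).
mb=$(( 5 * 1024 * 1024 ))          # 5 TB expressed in MB
for rate in 100 200 1000 3200; do
  secs=$(( mb / rate ))
  echo "${rate} MB/s: $(( secs / 3600 ))h $(( secs % 3600 / 60 ))m"
done
secs200=$(( mb / 200 ))            # the thread's 200 MB/s SMP figure
```

At 200 MB/s a single pass is on the order of seven hours, which is exactly why cutting the scanned volume with indexes (or multiplying the aggregate scan rate with a cluster) matters at this scale.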
Re: [PERFORM] Hardware/OS recommendations for large databases (
Another thought - I priced out a maxed out machine with 16 cores and 128GB of RAM and 1.5TB of usable disk - $71,000. You could instead buy 8 machines that total 16 cores, 128GB RAM and 28TB of disk for $48,000, and it would be 16 times faster in scan rate, which is the most important factor for large databases. The size would be 16 rack units instead of 5, and you'd have to add a GigE switch for $1500. Scan rate for above SMP: 200MB/s Scan rate for above cluster: 3,200MB/s You could even go dual core and double the memory on the cluster and you'd about match the price of the "god box". - Luke Luke, I assume you are talking about using the Greenplum MPP for this (otherwise I don't know how you are combining all the different systems). If you are, then you are overlooking one very significant factor: the cost of the MPP software. at $10K/cpu the cluster has an extra $160K in software costs, more than three times the hardware cost. if money is no object then go for it, but if it is, then your comparison would be (ignoring software maintenance costs) the 16 core 128G RAM system vs ~3 small systems totaling 6 cores and 48G RAM. yes, if scan speed is the bottleneck you still win with the small systems, but for most other uses the large system would win easily. and in any case it's not the open and shut case that you keep presenting it as. David Lang
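The cost comparison above can be tallied directly (the ~$10K/CPU license figure is an estimate from the thread, not a quoted price):

```shell
# Tally the thread's numbers: SMP "god box" vs cluster hardware + MPP licenses.
smp=71000                          # 16-core / 128GB / 1.5TB SMP box
cluster=$(( 48000 + 1500 ))        # 8 machines plus the GigE switch
mpp=$(( 16 * 10000 ))              # assumed ~$10K/CPU licensing across 16 cores
total=$(( cluster + mpp ))
echo "SMP: \$$smp  cluster hw: \$$cluster  cluster hw+sw: \$$total"
```

With licensing included, the cluster's total comes out well above the SMP box, which is the point being made: the scan-rate win is real, but it is not free.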
[PERFORM] Open request for benchmarking input
These boxes don't look like they were designed for a DB server. The first are very CPU bound, and the third may be a good choice for very large amounts of streamed data, but not optimal for TP random access. I don't know what you mean when you say that the first ones are CPU bound; they have far more CPU than they do disk I/O. however, I will agree that they were not designed to be DB servers - they weren't. they happen to be the machines that I have available. they only have a pair of disks each, which would not be reasonable for most production DB uses, and they have far more CPU than is normally recommended. So I'll have to run RAID 0 instead of 0+1 (or not use RAID), which would be unacceptable in a production environment, but can still give some useful info. the 5th box _was_ purchased to be a DB server, but one to store and analyse large amounts of log data, so large amounts of data storage were more important than raw DB performance (although we did max out the RAM at 16G to try and make up for it). it was a deliberate price/performance tradeoff. this machine ran ~$20K, but a similar capacity with SCSI drives would have been FAR more expensive (IIRC by a multiple of 4x or more). Hopefully, when publicly visible benchmarks are performed, machines are used that comply with common engineering knowledge, ignoring those guys who still believe that sequential performance is the most important issue on disk subsystems for DBMS. are you saying that I shouldn't do any benchmarks because the machines aren't what you would consider good enough? if so I disagree with you, and think that benchmarks should be done on even worse machines, but should also be done on better machines. (are you volunteering to provide time on better machines for benchmarks?) not everyone will buy a lot of high-end hardware before they start using a database.
in fact most companies will start with a database on lower end hardware and then, as their requirements grow, move to better hardware. I'm willing to bet that what I have available is better than the starting point for most places. Postgres needs to work on the low end stuff as well as the high end stuff, or people will write their apps to work with things that DO run on low end hardware and spend much more money than is needed scaling the hardware up rather than re-writing their apps. Part of the reason that I made the post on /. to start this was the hope that a reasonable set of benchmarks could be hammered out, and then more people than just me could run them to get a wider range of results. David Lang
Re: [PERFORM] Open request for benchmarking input
by the way, this is the discussion that prompted me to start this project http://lwn.net/Articles/161323/ David Lang
Re: [PERFORM] Open request for benchmarking input
Ok, I've subscribed (hopefully list volume won't kill me :-) I'm covering several things in this message since I didn't receive the prior messages in the thread. first off, these benchmarks are not being sponsored by my employer; they need the machines burned in, and so I'm going to use them for the tests while burning them in. I can spend a little official time on this, justifying it as learning the proper config/tuning settings for our project, but not too much. and I'm deliberately not using my work e-mail and am not mentioning the company name, so please respect this and keep the two separate (some of you have dealt with us formally, others will over the next few months). this means no remote access for people (but I am willing to run tests and send configs around); in fact the machines will not have Internet access for the duration of the tests. it also means I'm doing almost all the configuration work for this on my own time (nights and weekends). the machines will not be moved to production for a couple of months. this should mean that we can go back and forth with questions and answers (albeit somewhat slowly, with me checking in every night) while whatever tests are done happen during the day. once we get past the system tuning and start doing different tests, it would probably be helpful if people can send me scripts to run that I can just let loose.
I don't have any money to pay for benchmark suites, so if things like the TPC benchmarks cost money to run I won't be able to do them. to clarify the hardware: I have 5 machines total to work with, and this includes the client machines that make the queries (I may be able to get hold of 2-3 more, but they are similar configs). none of these have dual-core processors; the CPUs are 246 or 252 Opterons (I'll have to double check which is in which machine - I think the large disk machine has 246s and the others 252s). I have access to a gig-E switch on a fairly idle network to connect these machines. the large-disk machine has 3ware 9500 series 8-port SATA controllers with battery backup. in our official dealings with Greenplum we attempted to do a set of benchmarks on that machine, but had horrible timing - I was too busy when they worked with us on this, and we never did figure out the best settings to use for that machine. Part of the reason I posted this to /. rather than contacting you and the MySQL folks directly is that I would like to see a reasonable set of benchmarks agreed to, and have people with different hardware than mine run the same tests. I know the tuning will be different for different hardware, but if we can have a bunch of people run similar tests we should learn a lot. David Lang