Re: [PERFORM] Weird XFS WAL problem
Greg Smith wrote:
> Kevin Grittner wrote:
>> I don't know at the protocol level; I just know that write barriers
>> do *something* which causes our controllers to wait for actual disk
>> platter persistence, while fsync does not.
>
> It's in the docs now:
> http://www.postgresql.org/docs/9.0/static/wal-reliability.html
>
> FLUSH CACHE EXT is the ATAPI-6 call that filesystems use to enforce
> barriers on that type of drive. [...] SAS systems have a similar call
> named SYNCHRONIZE CACHE.

Great information!  I have added the attached documentation patch to
explain the write-barrier/BBU interaction.  This will appear in the 9.0
documentation.

--
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + None of us is going to be here forever. +

Index: doc/src/sgml/wal.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.66
diff -c -c -r1.66 wal.sgml
*** doc/src/sgml/wal.sgml	13 Apr 2010 14:15:25 -0000	1.66
--- doc/src/sgml/wal.sgml	7 Jul 2010 13:55:58 -0000
***************
*** 48,68 ****
      some later time. Such caches can be a reliability hazard because the
      memory in the disk controller cache is volatile, and will lose its
      contents in a power failure.  Better controller cards have
!     <firstterm>battery-backed</> caches, meaning the card has a battery that
      maintains power to the cache in case of system power loss.  After power
      is restored the data will be written to the disk drives.
     </para>

     <para>
      And finally, most disk drives have caches. Some are write-through
!     while some are write-back, and the
!     same concerns about data loss exist for write-back drive caches as
!     exist for disk controller caches.  Consumer-grade IDE and SATA drives are
!     particularly likely to have write-back caches that will not survive a
!     power failure, though <acronym>ATAPI-6</> introduced a drive cache
!     flush command (FLUSH CACHE EXT) that some file systems use, e.g. <acronym>ZFS</>.
!     Many solid-state drives (SSD) also have volatile write-back
!     caches, and many do not honor cache flush commands by default.
      To check write caching on <productname>Linux</> use <command>hdparm -I</>;
      it is enabled if there is a <literal>*</> next to <literal>Write cache</>;
      <command>hdparm -W</> to turn off
--- 48,74 ----
      some later time. Such caches can be a reliability hazard because the
      memory in the disk controller cache is volatile, and will lose its
      contents in a power failure.  Better controller cards have
!     <firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
!     the card has a battery that
      maintains power to the cache in case of system power loss.  After power
      is restored the data will be written to the disk drives.
     </para>

     <para>
      And finally, most disk drives have caches. Some are write-through
!     while some are write-back, and the same concerns about data loss
!     exist for write-back drive caches as exist for disk controller
!     caches.  Consumer-grade IDE and SATA drives are particularly likely
!     to have write-back caches that will not survive a power failure,
!     though <acronym>ATAPI-6</> introduced a drive cache flush command
!     (<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
!     <acronym>ZFS</>, <acronym>ext4</>.  (The SCSI command
!     <command>SYNCHRONIZE CACHE</> has long been available.) Many
!     solid-state drives (SSD) also have volatile write-back caches, and
!     many do not honor cache flush commands by default.
!    </para>
!
!    <para>
      To check write caching on <productname>Linux</> use <command>hdparm -I</>;
      it is enabled if there is a <literal>*</> next to <literal>Write cache</>;
      <command>hdparm -W</> to turn off
***************
*** 83,88 ****
--- 89,113 ----
     </para>

     <para>
+     Many file systems that use write barriers (e.g. <acronym>ZFS</>,
+     <acronym>ext4</>) internally
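(A minimal sketch of the drive-cache check described in the patch text,
assuming a Linux host; /dev/sda is only an example device:)

  # Write caching is enabled if the "Write cache" feature is marked with
  # a leading "*" in the drive's identification output.
  hdparm -I /dev/sda | grep -i "write cache"

  # Turn the drive's volatile write cache off (0) or back on (1).
  hdparm -W 0 /dev/sda
  hdparm -W 1 /dev/sda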
Re: [PERFORM] Weird XFS WAL problem
Kevin Grittner wrote:
> I don't know at the protocol level; I just know that write barriers
> do *something* which causes our controllers to wait for actual disk
> platter persistence, while fsync does not.

It's in the docs now:
http://www.postgresql.org/docs/9.0/static/wal-reliability.html

FLUSH CACHE EXT is the ATAPI-6 call that filesystems use to enforce
barriers on that type of drive.  Here's what the relevant portion of the
ATAPI spec says:

  "This command is used by the host to request the device to flush the
  write cache.  If there is data in the write cache, that data shall be
  written to the media.  The BSY bit shall remain set to one until all
  data has been successfully written or an error occurs."

SAS systems have a similar call named SYNCHRONIZE CACHE.

The improvement I actually expect to arrive here first is a reliable
implementation of O_SYNC/O_DSYNC writes.  Both SAS and SATA drives that
are capable of doing Native Command Queueing support a write type called
Force Unit Access, which is essentially just like a direct write that
cannot be cached.  When we get more kernels with reliable sync writing
that maps under the hood to FUA, and can change wal_sync_method to use
them, the need to constantly call fsync for every write to the WAL will
go away.  Then the "blow out the RAID cache when barriers are on"
behavior will only show up during checkpoint fsyncs, which will make
things a lot better (albeit still not ideal).

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
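(For reference, the wal_sync_method setting Greg mentions is chosen in
postgresql.conf; a sketch of the relevant lines, using setting names
that exist in PostgreSQL of this vintage:)

  # postgresql.conf -- how PostgreSQL forces WAL writes out to disk.
  # fsync/fdatasync issue an explicit flush per commit; the open_*
  # methods use O_SYNC/O_DSYNC writes, which is where FUA would matter.
  wal_sync_method = fdatasync    # or fsync, fsync_writethrough,
                                 # open_datasync, open_sync
  fsync = on                     # never disable on data you care about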
Re: [PERFORM] Weird XFS WAL problem
On Thu, 3 Jun 2010, Greg Smith wrote:
> And it's also quite reasonable for a RAID controller to respond to
> that "flush the whole cache" call by flushing its cache.

Remember that the RAID controller is presenting itself to the OS as a
large disc, and hiding the individual discs from the OS.  Why should the
OS care what has actually happened to the individual discs' caches, as
long as that "flush the whole cache" command guarantees that the data is
persistent?  Taking the RAID array as a whole, that happens when the
data hits the write-back cache.

The only circumstance where you actually need to flush the data to the
individual discs is when you need to take that disc away somewhere else
and read it on another system.  That's quite a rare use case for a RAID
array (http://thedailywtf.com/Articles/RAIDing_Disks.aspx
notwithstanding).

> If the controller had some logic that said "it's OK to not flush the
> cache when that call comes in if my battery is working fine", that
> would make this whole problem go away.

The only place this can be properly sorted is the RAID controller.
Anywhere else would be crazy.

Matthew

--
To err is human; to really louse things up requires root privileges.
                          -- Alexander Pope, slightly paraphrased
Re: [PERFORM] Weird XFS WAL problem
Greg Smith wrote:
> Kevin Grittner wrote:
>> I've seen this, too (with xfs).  Our RAID controller, in spite of
>> having BBU cache configured for writeback, waits for actual
>> persistence on disk for write barriers (unlike for fsync).  This
>> does strike me as surprising to the point of bordering on qualifying
>> as a bug.
>
> Completely intentional, and documented at
> http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F
>
> The issue is that XFS will actually send the full "flush your cache"
> call to the controller, rather than just the usual fsync call, and
> that eliminates the benefit of having a write cache there in the
> first place.  Good controllers respect that and flush their whole
> write cache out.  And ext4 has adopted the same mechanism.
>
> [...] It does mean that everyone with a non-volatile battery-backed
> cache, via RAID card typically, needs to turn barriers off manually.
>
> I've already warned on this list that PostgreSQL commit performance
> on ext4 is going to appear really terrible to many people.  If you
> benchmark and don't recognize ext3 wasn't operating in a reliable
> mode before, the performance drop now that ext4 is doing the right
> thing with barriers looks impossibly bad.

Well, this is depressing.  Now that we finally have common
battery-backed cache RAID controller cards, the file system developers
have thrown down another roadblock in ext4 and xfs.  Do we need to
document this?

On another topic, I am a little unclear on how things behave when the
drive is write-back.  If the RAID controller card writes to the drive,
but the data isn't on the platters, how does it know when it can
discard that information from the BBU RAID cache?

--
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
Re: [PERFORM] Weird XFS WAL problem
Bruce Momjian <br...@momjian.us> wrote:
> On another topic, I am a little unclear on how things behave when the
> drive is write-back.  If the RAID controller card writes to the
> drive, but the data isn't on the platters, how does it know when it
> can discard that information from the BBU RAID cache?

The controller waits for the drive to tell it that it has made it to
the platter before it discards it.  What made you think otherwise?

-Kevin
Re: [PERFORM] Weird XFS WAL problem
Kevin Grittner wrote:
> The controller waits for the drive to tell it that it has made it to
> the platter before it discards it.  What made you think otherwise?

Because a write-back drive cache says it is on the drive before it hits
the platters, which I think is the default for SATA drives.  Is that
inaccurate?

--
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
Re: [PERFORM] Weird XFS WAL problem
Bruce Momjian <br...@momjian.us> wrote:
> Because a write-back drive cache says it is on the drive before it
> hits the platters, which I think is the default for SATA drives.  Is
> that inaccurate?

Any decent RAID controller will ensure that the drives themselves
aren't using write-back caching.  When we've mentioned write-back
versus write-through on this thread we've been talking about the
behavior of the *controller*.  We have our controllers configured to
use write-back through the BBU cache as long as the battery is good,
but to automatically switch to write-through if the battery goes bad.

-Kevin
Re: [PERFORM] Weird XFS WAL problem
Kevin Grittner wrote:
> Any decent RAID controller will ensure that the drives themselves
> aren't using write-back caching.  When we've mentioned write-back
> versus write-through on this thread we've been talking about the
> behavior of the *controller*.  We have our controllers configured to
> use write-back through the BBU cache as long as the battery is good,
> but to automatically switch to write-through if the battery goes bad.

OK, good, but why would a BBU RAID controller flush stuff to disk with
a flush-all command?  I thought the whole goal of BBU was to avoid such
flushes.  What is unique about the command ext4/xfs is sending?

--
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
Re: [PERFORM] Weird XFS WAL problem
Bruce Momjian <br...@momjian.us> wrote:
> OK, good, but why would a BBU RAID controller flush stuff to disk
> with a flush-all command?  I thought the whole goal of BBU was to
> avoid such flushes.

That has been *precisely* my point.  I don't know at the protocol
level; I just know that write barriers do *something* which causes our
controllers to wait for actual disk platter persistence, while fsync
does not.  The write barrier concept seems good to me, and I wish it
could be used at the OS level without killing performance.  I blame the
controller, for not treating it the same as fsync (i.e., as long as
it's in write-back mode it should treat data as persisted as soon as
it's in BBU cache).

-Kevin
Re: [PERFORM] Weird XFS WAL problem
Kevin Grittner wrote:
> That has been *precisely* my point.  I don't know at the protocol
> level; I just know that write barriers do *something* which causes
> our controllers to wait for actual disk platter persistence, while
> fsync does not.  The write barrier concept seems good to me, and I
> wish it could be used at the OS level without killing performance.  I
> blame the controller, for not treating it the same as fsync (i.e., as
> long as it's in write-back mode it should treat data as persisted as
> soon as it's in BBU cache).

Yeah.  I wonder if it honors the cache flush because it might think it
is replacing disks or something odd.  I think we are going to have to
document this in 9.0 because obviously you have seen it already.

Is this an issue with SAS cards/drives as well?

--
  Bruce Momjian  <br...@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com
Re: [PERFORM] Weird XFS WAL problem
On Wed, Jun 2, 2010 at 7:30 PM, Craig James <craig_ja...@emolecules.com> wrote:
> I'm testing/tuning a new midsize server and ran into an inexplicable
> problem.  With an RAID10 drive, when I move the WAL to a separate
> RAID1 drive, TPS drops from over 1200 to less than 90!  I've checked
> everything and can't find a reason.
>
> [... hardware details, bonnie++ and pgbench output snipped ...]
>
> So ... anyone have any idea at all how TPS drops to below 90 when I
> move the WAL to a separate RAID1 disk?  Does this make any sense at
> all?  It's repeatable.  It happens for both ext4 and xfs.  It's weird.

*) Is your RAID1 configured with write-back cache on the controller?

*) Have you tried changing wal_sync_method to fdatasync?

merlin
Re: [PERFORM] Weird XFS WAL problem
Craig James wrote:
> I'm testing/tuning a new midsize server and ran into an inexplicable
> problem.  With an RAID10 drive, when I move the WAL to a separate
> RAID1 drive, TPS drops from over 1200 to less than 90!

Normally 100 TPS means that the write cache on the WAL drive volume is
disabled (or set to write-through instead of write-back).  When things
in this area get fishy, I will usually download sysbench and have it
specifically test how many fsync calls can happen per second.
http://projects.2ndquadrant.com/talks , "Database Hardware
Benchmarking", page 28 has an example of the right incantation for
that.

Also, make sure you run 3ware's utilities and confirm all the disks
have finished their initialization and verification stages.  If you
just adjusted disk layout and immediately launched into benchmarks,
those are useless until the background cleanup is done.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
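(A rough sketch of the kind of sysbench fsync-rate test mentioned
above; the exact flags differ between sysbench versions, so treat this
as an assumption-laden example rather than the incantation from the
referenced talk:)

  # Create a small test file, then do random writes with an fsync after
  # every write; the reported requests/sec approximates commits/sec.
  sysbench --test=fileio --file-num=1 --file-total-size=16384 prepare
  sysbench --test=fileio --file-num=1 --file-total-size=16384 \
           --file-test-mode=rndwr --file-fsync-freq=1 run
  sysbench --test=fileio --file-num=1 cleanup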
Re: [PERFORM] Weird XFS WAL problem
On 6/2/10 4:40 PM, Mark Kirkwood wrote:
> On 03/06/10 11:30, Craig James wrote:
>> I'm testing/tuning a new midsize server and ran into an inexplicable
>> problem.  With an RAID10 drive, when I move the WAL to a separate
>> RAID1 drive, TPS drops from over 1200 to less than 90!  I've checked
>> everything and can't find a reason.
>
> Are the 2 new RAID1 disks the same make and model as the 12 RAID10
> ones?

Yes.

> Also, are barriers *on* on the RAID1 mount and off on the RAID10 one?

It was the barriers.  barrier=1 isn't just a bad idea on ext4, it's a
disaster.

  pgbench -i -s 100 -U test
  pgbench -c 10 -t 1 -U test

Change WAL to barrier=0
  tps = 1463.264981 (including connections establishing)
  tps = 1463.725687 (excluding connections establishing)

Change WAL to noatime, nodiratime, barrier=0
  tps = 1479.331476 (including connections establishing)
  tps = 1479.810545 (excluding connections establishing)

Change WAL to barrier=1
  tps = 82.325446 (including connections establishing)
  tps = 82.326874 (excluding connections establishing)

This is really hard to believe, because the bonnie++ numbers and dd(1)
numbers look good (see my original post).  But it's totally repeatable.
It must be some really unfortunate "just missed the next sector going
by the write head" problem.  So with ext4, bonnie++ and dd aren't the
whole story.

BTW, I also learned that if you edit /etc/fstab and use "mount -o
remount" it WON'T change barrier=0/1 unless it is explicit in the fstab
file.  That is, if you put barrier=0 into /etc/fstab and use the
remount, it will change it to no barriers.  But if you then remove it
from /etc/fstab, it won't change it back to the default.  You have to
actually put barrier=1 if you want to get it back to the default.  This
seems like a bug to me, and it made it really hard to track this down.
"mount -o remount" is not the same as umount/mount!

Craig
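(To make the remount gotcha above concrete, a hypothetical ext4 WAL
mount; the device and mount point here are invented for illustration:)

  # /etc/fstab entry for the WAL volume with barriers explicitly off:
  /dev/sdc1  /var/lib/pgsql/pg_xlog  ext4  noatime,nodiratime,barrier=0  0 0

  mount -o remount /var/lib/pgsql/pg_xlog      # picks up barrier=0
  # Removing barrier=0 from fstab and remounting does NOT restore the
  # default; state barrier=1 explicitly (or umount/mount), then verify:
  mount -o remount,barrier=1 /var/lib/pgsql/pg_xlog
  grep pg_xlog /proc/mounts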
Re: [PERFORM] Weird XFS WAL problem
On Thu, 3 Jun 2010, Craig James wrote:
>> Also, are barriers *on* on the RAID1 mount and off on the RAID10
>> one?
>
> It was the barriers.  barrier=1 isn't just a bad idea on ext4, it's a
> disaster.

This worries me a little.  Does your array have a battery-backed cache?
If so, then it should be fast regardless of barriers (although barriers
may make a small difference).  If it does not, then it is likely that
the fast speed you are seeing with barriers off is unsafe.

There should be no "just missed the sector going past for write"
problem ever with a battery-backed cache.

Matthew

--
There once was a limerick .sig
that really was not very big
It was going quite fine
Till it reached the fourth line
Re: [PERFORM] Weird XFS WAL problem
Matthew Wakeling <matt...@flymine.org> wrote:
> This worries me a little.  Does your array have a battery-backed
> cache?  If so, then it should be fast regardless of barriers
> (although barriers may make a small difference).  If it does not,
> then it is likely that the fast speed you are seeing with barriers
> off is unsafe.

I've seen this, too (with xfs).  Our RAID controller, in spite of
having BBU cache configured for writeback, waits for actual persistence
on disk for write barriers (unlike for fsync).  This does strike me as
surprising to the point of bordering on qualifying as a bug.  It means
that you can't take advantage of the BBU cache and get the benefit of
write barriers in OS cache behavior.  :-(

-Kevin
Re: [PERFORM] Weird XFS WAL problem
Kevin Grittner wrote:
> I've seen this, too (with xfs).  Our RAID controller, in spite of
> having BBU cache configured for writeback, waits for actual
> persistence on disk for write barriers (unlike for fsync).  This does
> strike me as surprising to the point of bordering on qualifying as a
> bug.

Completely intentional, and documented at
http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F

The issue is that XFS will actually send the full "flush your cache"
call to the controller, rather than just the usual fsync call, and that
eliminates the benefit of having a write cache there in the first
place.  Good controllers respect that and flush their whole write cache
out.  And ext4 has adopted the same mechanism.

This is very much a good thing from the perspective of database
reliability for people with regular, cheap hard drives that don't have
a useful write cache in front of them.  It allows them to keep the
disk's write cache on for other things, while still getting the proper
cache flushes when the database commits demand them.  It does mean that
everyone with a non-volatile battery-backed cache, via RAID card
typically, needs to turn barriers off manually.

I've already warned on this list that PostgreSQL commit performance on
ext4 is going to appear really terrible to many people.  If you
benchmark and don't recognize ext3 wasn't operating in a reliable mode
before, the performance drop now that ext4 is doing the right thing
with barriers looks impossibly bad.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
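(A sketch of what turning barriers off manually looks like at mount
time, assuming a trustworthy battery-backed write-back cache; the mount
points are examples only:)

  # xfs: barriers are on by default, disable with nobarrier
  mount -o remount,nobarrier /data

  # ext4: barriers are on by default, disable with barrier=0
  mount -o remount,barrier=0 /pg_xlog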
Re: [PERFORM] Weird XFS WAL problem
Craig James wrote:
> This is really hard to believe, because the bonnie++ numbers and
> dd(1) numbers look good (see my original post).  But it's totally
> repeatable.  It must be some really unfortunate "just missed the next
> sector going by the write head" problem.

Commit performance is a separate number to measure, one that is not
reflected in any benchmark that tests sequential performance.  I
consider it the fourth axis of disk system performance (seq read, seq
write, random IOPS, commit rate), and directly measure it with the
sysbench fsync test I recommended already.  (You can do it with the
right custom pgbench script too.)

You only get one commit per rotation on a drive, which is exactly what
you're seeing: a bit under the 120 spins/second @ 7200 RPM.  Attempts
to time things just right to catch more than one sector per spin are
extremely difficult to accomplish; I spent a week on that once without
making any good progress.  You can easily get 100MB/s on reads and
writes but only manage 100 commits/second.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
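(One possible version of the "custom pgbench script" approach alluded
to above; the table and file names are hypothetical.  Each transaction
does a single tiny insert, so the reported TPS is essentially the WAL
commit rate of the device:)

  psql -U test -c "CREATE TABLE commit_test (t timestamptz);"
  echo "INSERT INTO commit_test VALUES (now());" > commit_test.sql
  pgbench -n -c 1 -t 5000 -f commit_test.sql -U test test
  # Expected ceiling when every commit reaches the platters:
  # 7200 RPM / 60 = 120 rotations/sec, i.e. at most ~120 commits/sec,
  # consistent with the ~82 TPS observed on the RAID1 WAL volume.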
Re: [PERFORM] Weird XFS WAL problem
Greg Smith <g...@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>> I've seen this, too (with xfs).  Our RAID controller, in spite of
>> having BBU cache configured for writeback, waits for actual
>> persistence on disk for write barriers (unlike for fsync).  This
>> does strike me as surprising to the point of bordering on
>> qualifying as a bug.
>
> Completely intentional, and documented at
> http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F

Yeah, I read that long ago and I've disabled write barriers because of
it; however, it still seems wrong that the RAID controller insists on
flushing to the drives in write-back mode.  Here are my reasons for
wishing it was otherwise:

(1)  We've had batteries on our RAID controllers fail occasionally.
The controller automatically degrades to write-through, and we get an
email from the server and schedule a tech to travel to the site and
replace the battery; but until we take action we are now exposed to
possible database corruption.  Barriers don't automatically come on
when the controller flips to write-through mode.

(2)  It precludes any possibility of moving from fsync techniques to
write barrier techniques for ensuring database integrity.  If the OS
respected write barriers and the controller considered the write
satisfied when it hit BBU cache, write barrier techniques would work,
and checkpoints could be made smoother.  Think how nicely that would
inter-operate with point (1).

So, while I understand it's Working As Designed, I think the design is
surprising and sub-optimal.

-Kevin
Re: [PERFORM] Weird XFS WAL problem
On Thu, Jun 3, 2010 at 12:40 PM, Kevin Grittner
<kevin.gritt...@wicourts.gov> wrote:
> Yeah, I read that long ago and I've disabled write barriers because
> of it; however, it still seems wrong that the RAID controller insists
> on flushing to the drives in write-back mode.  Here are my reasons
> for wishing it was otherwise:

I think it's a case of the quickest, simplest answer to semi-new tech.
Not sure what to do with barriers?  Just flush the whole cache.  I'm
guessing that this will get optimized in the future.

BTW, I'll have LSI Megaraid latest and greatest to test on in a month,
and older Areca 1680s as well.  I'll be updating the firmware on the
Arecas, and I'll run some tests on the whole barrier behaviour to see
if it's gotten any better lately.
Re: [PERFORM] Weird XFS WAL problem
Scott Marlowe <scott.marl...@gmail.com> wrote:
> I think it's a case of the quickest, simplest answer to semi-new
> tech.  Not sure what to do with barriers?  Just flush the whole
> cache.  I'm guessing that this will get optimized in the future.

Let's hope so.

That reminds me, the write barrier concept is at least on the horizon
as a viable technology; does anyone know if the "asynchronous graphs"
concept in this (one page) paper ever came to anything?  (I haven't
heard anything about it lately.)

http://www.usenix.org/events/fast05/wips/burnett.pdf

-Kevin
Re: [PERFORM] Weird XFS WAL problem
Scott Marlowe wrote:
> I think it's a case of the quickest, simplest answer to semi-new
> tech.  Not sure what to do with barriers?  Just flush the whole
> cache.

Well, that really is the only useful thing you can do with regular SATA
drives; the ATA command set isn't any finer grained than that in a way
that's useful for this context.  And it's also quite reasonable for a
RAID controller to respond to that "flush the whole cache" call by
flushing its cache.  So it's not just the simplest first answer, I
believe it's the only answer until a better ATA command set becomes
available.

I think this can only be resolved usefully for all of us at the RAID
firmware level.  If the controller had some logic that said "it's OK to
not flush the cache when that call comes in if my battery is working
fine", that would make this whole problem go away.  I don't expect it's
possible to work around the exact set of concerns Kevin listed any
other way, because as he pointed out the right thing to do is very
dependent on the battery health, which the OS also doesn't know (again,
that would require some new command set verbiage).

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us
Re: [PERFORM] Weird XFS WAL problem
On Thu, Jun 3, 2010 at 1:31 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> I think this can only be resolved usefully for all of us at the RAID
> firmware level.  If the controller had some logic that said "it's OK
> to not flush the cache when that call comes in if my battery is
> working fine", that would make this whole problem go away.

That's what already happens for fsync on a BBU controller, so I don't
think the code to do so would be something fancy and new, just a simple
change of logic on which code path to take.
Re: [PERFORM] Weird XFS WAL problem
Greg Smith <g...@2ndquadrant.com> wrote:
> I think this can only be resolved usefully for all of us at the RAID
> firmware level.  If the controller had some logic that said "it's OK
> to not flush the cache when that call comes in if my battery is
> working fine", that would make this whole problem go away.

That is exactly what I've been trying to suggest.  Sorry for not being
more clear about it.

-Kevin
[PERFORM] Weird XFS WAL problem
I'm testing/tuning a new midsize server and ran into an inexplicable
problem.  With an RAID10 drive, when I move the WAL to a separate RAID1
drive, TPS drops from over 1200 to less than 90!  I've checked
everything and can't find a reason.  Here are the details.

  8 cores (2x4 Intel Nehalem 2 GHz)
  12 GB memory
  12 x 7200 SATA 500 GB disks
  3WARE 9650SE-12ML RAID controller with BBU
    2 disks: RAID1 500GB ext4 blocksize=4096
    8 disks: RAID10 2TB, stripe size 64K, blocksize=4096
             (ext4 or xfs - see below)
    2 disks: hot swap
  Ubuntu 10.04 LTS (Lucid)

With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench
results (this one is for xfs):

Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
argon        24064M 70491  99 288158  25 129918  16 65296  97 428210  23 558.9   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 23283  81     + +++ 13775  56 20143  74     + +++ 15152  54
argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+,+++,13775,56,20143,74,+,+++,15152,54

  pgbench -i -s 100 -U test
  pgbench -c 10 -t 1 -U test
  scaling factor: 100
  query mode: simple
  number of clients: 10
  number of transactions per client: 1
  number of transactions actually processed: 10/10
  tps = 1046.104635 (including connections establishing)
  tps = 1046.337276 (excluding connections establishing)

Now the mystery: I moved the pg_xlog directory to a RAID1 array (same
3WARE controller, two more SATA 7200 disks).  Run the same tests
and ...

  tps = 82.325446 (including connections establishing)
  tps = 82.326874 (excluding connections establishing)

I thought I'd made a mistake, like maybe I moved the whole database to
the RAID1 array, but I checked and double checked.  I even watched the
lights blink - the WAL was definitely on the RAID1 and the rest of
Postgres on the RAID10.  So I moved the WAL back to the RAID10 array,
and performance jumped right back up to the 1200 TPS range.

Next I checked the RAID1 itself:

  dd if=/dev/zero of=./bigfile bs=8192 count=200

which yielded 98.8 MB/sec - not bad.  bonnie++ on the RAID1 pair showed
good performance too:

Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
argon        24064M 68601  99 110057  18 46534   6 59883  90 123053   7 471.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16     + +++     + +++     + +++     + +++     + +++     + +++
argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+,+++,+,+++,+,+++,+,+++,+,+++,+,+++

So ... anyone have any idea at all how TPS drops to below 90 when I
move the WAL to a separate RAID1 disk?  Does this make any sense at
all?  It's repeatable.  It happens for both ext4 and xfs.  It's weird.

You can even watch the disk lights and see it: the RAID10 disks are on
almost constantly when the WAL is on the RAID10, but when you move the
WAL over to the RAID1, its lights are dim and flicker a lot, like it's
barely getting any data, and the RAID10 disk's lights barely go on at
all.

Thanks,
Craig
Re: [PERFORM] Weird XFS WAL problem
On 03/06/10 11:30, Craig James wrote:
> I'm testing/tuning a new midsize server and ran into an inexplicable
> problem.  With an RAID10 drive, when I move the WAL to a separate
> RAID1 drive, TPS drops from over 1200 to less than 90!  I've checked
> everything and can't find a reason.

Are the 2 new RAID1 disks the same make and model as the 12 RAID10
ones?

Also, are barriers *on* on the RAID1 mount and off on the RAID10 one?

Cheers

Mark