Re: [PERFORM] Weird XFS WAL problem

2010-07-07 Thread Bruce Momjian
Greg Smith wrote:
 Kevin Grittner wrote:
  I don't know at the protocol level; I just know that write barriers
  do *something* which causes our controllers to wait for actual disk
  platter persistence, while fsync does not
 
 It's in the docs now:  
 http://www.postgresql.org/docs/9.0/static/wal-reliability.html
 
 FLUSH CACHE EXT is the ATAPI-6 call that filesystems use to enforce 
 barriers on that type of drive.  Here's what the relevant portion of the 
 ATAPI spec says:
 
 This command is used by the host to request the device to flush the 
 write cache. If there is data in the write
 cache, that data shall be written to the media. The BSY bit shall remain 
 set to one until all data has been
 successfully written or an error occurs.
 
 SAS systems have a similar call named SYNCHRONIZE CACHE.
 
 The improvement I actually expect to arrive here first is a reliable 
 implementation of O_SYNC/O_DSYNC writes.  Both SAS and SATA drives that 
 are capable of doing Native Command Queueing support a write type called 
 Force Unit Access, which is essentially just like a direct write that 
 cannot be cached.  When we get more kernels with reliable sync writing 
 that maps under the hood to FUA, and can change wal_sync_method to use 
 them, the need to constantly call fsync for every write to the WAL will 
 go away.  Then the "blow out the RAID cache when barriers are on" 
 behavior will only show up during checkpoint fsyncs, which will make 
 things a lot better (albeit still not ideal).

Great information!  I have added the attached documentation patch to
explain the write-barrier/BBU interaction.  This will appear in the 9.0
documentation.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +
Index: doc/src/sgml/wal.sgml
===
RCS file: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v
retrieving revision 1.66
diff -c -c -r1.66 wal.sgml
*** doc/src/sgml/wal.sgml	13 Apr 2010 14:15:25 -	1.66
--- doc/src/sgml/wal.sgml	7 Jul 2010 13:55:58 -
***************
*** 48,68 ****
      some later time. Such caches can be a reliability hazard because the
      memory in the disk controller cache is volatile, and will lose its
      contents in a power failure.  Better controller cards have
!     <firstterm>battery-backed</> caches, meaning the card has a battery that
      maintains power to the cache in case of system power loss.  After power
      is restored the data will be written to the disk drives.
     </para>
  
     <para>
      And finally, most disk drives have caches. Some are write-through
!     while some are write-back, and the
!     same concerns about data loss exist for write-back drive caches as
!     exist for disk controller caches.  Consumer-grade IDE and SATA drives are
!     particularly likely to have write-back caches that will not survive a
!     power failure, though <acronym>ATAPI-6</> introduced a drive cache
!     flush command (FLUSH CACHE EXT) that some file systems use, e.g. <acronym>ZFS</>.
!     Many solid-state drives (SSD) also have volatile write-back
!     caches, and many do not honor cache flush commands by default.
      To check write caching on <productname>Linux</> use
      <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
      to <literal>Write cache</>; <command>hdparm -W</> to turn off
--- 48,74 ----
      some later time. Such caches can be a reliability hazard because the
      memory in the disk controller cache is volatile, and will lose its
      contents in a power failure.  Better controller cards have
!     <firstterm>battery-backed unit</> (<acronym>BBU</>) caches, meaning
!     the card has a battery that
      maintains power to the cache in case of system power loss.  After power
      is restored the data will be written to the disk drives.
     </para>
  
     <para>
      And finally, most disk drives have caches. Some are write-through
!     while some are write-back, and the same concerns about data loss
!     exist for write-back drive caches as exist for disk controller
!     caches.  Consumer-grade IDE and SATA drives are particularly likely
!     to have write-back caches that will not survive a power failure,
!     though <acronym>ATAPI-6</> introduced a drive cache flush command
!     (<command>FLUSH CACHE EXT</>) that some file systems use, e.g.
!     <acronym>ZFS</>, <acronym>ext4</>.  (The SCSI command
!     <command>SYNCHRONIZE CACHE</> has long been available.) Many
!     solid-state drives (SSD) also have volatile write-back caches, and
!     many do not honor cache flush commands by default.
!    </para>
! 
!    <para>
      To check write caching on <productname>Linux</> use
      <command>hdparm -I</>;  it is enabled if there is a <literal>*</> next
      to <literal>Write cache</>; <command>hdparm -W</> to turn off
***************
*** 83,88 ****
--- 89,113 ----
     </para>
  
     <para>
+     Many file systems that use write barriers (e.g.  <acronym>ZFS</>,
+     <acronym>ext4</>) internally 

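As a rough illustration of the hdparm checks described in the patch text
above (the device name /dev/sda is just a placeholder):

  # write caching is enabled if a '*' appears next to "Write cache"
  hdparm -I /dev/sda | grep -i "write cache"

  # turn the drive's write cache off (0) or back on (1)
  hdparm -W 0 /dev/sda
  hdparm -W 1 /dev/sda
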
Re: [PERFORM] Weird XFS WAL problem

2010-06-05 Thread Greg Smith

Kevin Grittner wrote:

I don't know at the protocol level; I just know that write barriers
do *something* which causes our controllers to wait for actual disk
platter persistence, while fsync does not


It's in the docs now:  
http://www.postgresql.org/docs/9.0/static/wal-reliability.html


FLUSH CACHE EXT is the ATAPI-6 call that filesystems use to enforce 
barriers on that type of drive.  Here's what the relevant portion of the 
ATAPI spec says:


This command is used by the host to request the device to flush the 
write cache. If there is data in the write
cache, that data shall be written to the media. The BSY bit shall remain 
set to one until all data has been

successfully written or an error occurs.

SAS systems have a similar call named SYNCHRONIZE CACHE.

The improvement I actually expect to arrive here first is a reliable 
implementation of O_SYNC/O_DSYNC writes.  Both SAS and SATA drives that 
are capable of doing Native Command Queueing support a write type called 
Force Unit Access, which is essentially just like a direct write that 
cannot be cached.  When we get more kernels with reliable sync writing 
that maps under the hood to FUA, and can change wal_sync_method to use 
them, the need to constantly call fsync for every write to the WAL will 
go away.  Then the "blow out the RAID cache when barriers are on" 
behavior will only show up during checkpoint fsyncs, which will make 
things a lot better (albeit still not ideal).
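
A quick sketch of looking at and changing that setting today (the
open_datasync value is just an example; which methods are available and
actually safe depends on the platform and drive setup):

  psql -c "SHOW wal_sync_method;"

  # in postgresql.conf, then reload (the data directory path is a placeholder):
  #   wal_sync_method = open_datasync    # or fdatasync, fsync, open_sync
  pg_ctl reload -D /path/to/data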


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] Weird XFS WAL problem

2010-06-04 Thread Matthew Wakeling

On Thu, 3 Jun 2010, Greg Smith wrote:
And it's also quite reasonable for a RAID controller to respond to that 
"flush the whole cache" call by flushing its cache.


Remember that the RAID controller is presenting itself to the OS as a 
large disc, and hiding the individual discs from the OS. Why should the OS 
care what has actually happened to the individual discs' caches, as long 
as that "flush the whole cache" command guarantees that the data is 
persistent? Taking the RAID array as a whole, that happens when the data 
hits the write-back cache.


The only circumstance where you actually need to flush the data to the 
individual discs is when you need to take that disc away somewhere else 
and read it on another system. That's quite a rare use case for a RAID 
array (http://thedailywtf.com/Articles/RAIDing_Disks.aspx 
notwithstanding).


If the controller had some logic that said "it's OK to not flush the 
cache when that call comes in if my battery is working fine", that would 
make this whole problem go away.


The only place this can be properly sorted is the RAID controller. 
Anywhere else would be crazy.


Matthew

--
To err is human; to really louse things up requires root
privileges. -- Alexander Pope, slightly paraphrased



Re: [PERFORM] Weird XFS WAL problem

2010-06-04 Thread Bruce Momjian
Greg Smith wrote:
 Kevin Grittner wrote:
  I've seen this, too (with xfs).  Our RAID controller, in spite of
  having BBU cache configured for writeback, waits for actual
  persistence on disk for write barriers (unlike for fsync).  This
  does strike me as surprising to the point of bordering on qualifying
  as a bug.
 Completely intentional, and documented at 
 http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F
 
 The issue is that XFS will actually send the full "flush your cache" 
 call to the controller, rather than just the usual fsync call, and that 
 eliminates the benefit of having a write cache there in the first 
 place.  Good controllers respect that and flush their whole write cache 
 out.  And ext4 has adopted the same mechanism.  This is very much a good 
 thing from the perspective of database reliability for people with 
 regular hard drives who don't have a useful write cache on their cheap 
 hard drives.  It allows them to keep the disk's write cache on for other 
 things, while still getting the proper cache flushes when the database 
 commits demand them.  It does mean that everyone with a non-volatile 
 battery backed cache, via RAID card typically, needs to turn barriers 
 off manually.
 
 I've already warned on this list that PostgreSQL commit performance on 
 ext4 is going to appear really terrible to many people.  If you 
 benchmark and don't recognize ext3 wasn't operating in a reliable mode 
 before, the performance drop now that ext4 is doing the right thing with 
 barriers looks impossibly bad.

Well, this is depressing.  Now that we finally have common
battery-backed cache RAID controller cards, the file system developers
have thrown down another roadblock in ext4 and xfs.  Do we need to
document this?

On another topic, I am a little unclear on how things behave when the
drive is write-back. If the RAID controller card writes to the drive,
but the data isn't on the platters, how does it know when it can discard
that information from the BBU RAID cache?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +



Re: [PERFORM] Weird XFS WAL problem

2010-06-04 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 
 On another topic, I am a little unclear on how things behave when
 the drive is write-back. If the RAID controller card writes to the
 drive, but the data isn't on the platters, how does it know when it
 can discard that information from the BBU RAID cache?
 
The controller waits for the drive to tell it that it has made it to
the platter before it discards it.  What made you think otherwise?
 
-Kevin



Re: [PERFORM] Weird XFS WAL problem

2010-06-04 Thread Bruce Momjian
Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  
  On another topic, I am a little unclear on how things behave when
  the drive is write-back. If the RAID controller card writes to the
  drive, but the data isn't on the platters, how does it know when it
  can discard that information from the BBU RAID cache?
  
 The controller waits for the drive to tell it that it has made it to
 the platter before it discards it.  What made you think otherwise?

Because a write-back drive cache says it is on the drive before it hits
the platters, which I think is the default for SATA drives.  Is that
inaccurate?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +



Re: [PERFORM] Weird XFS WAL problem

2010-06-04 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 Kevin Grittner wrote:
 
 The controller waits for the drive to tell it that it has made it
 to the platter before it discards it.  What made you think
 otherwise?
 
 Because a write-back drive cache says it is on the drive before it
 hits the platters, which I think is the default for SATA drives.
 Is that inaccurate?
 
Any decent RAID controller will ensure that the drives themselves
aren't using write-back caching.  When we've mentioned write-back
versus write-through on this thread we've been talking about the
behavior of the *controller*.  We have our controllers configured to
use write-back through the BBU cache as long as the battery is good,
but to automatically switch to write-through if the battery goes
bad.
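
For instance, on an LSI MegaRAID that policy combination can usually be
expressed with MegaCli along these lines (the logical drive and adapter
selectors, and the exact option spelling, vary between MegaCli versions,
so treat this as a sketch rather than a recipe):

  MegaCli -LDGetProp -Cache -LAll -aAll          # show current cache policy
  MegaCli -LDSetProp WB -LAll -aAll              # write-back while the BBU is good
  MegaCli -LDSetProp NoCachedBadBBU -LAll -aAll  # drop to write-through if it goes bad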
 
-Kevin



Re: [PERFORM] Weird XFS WAL problem

2010-06-04 Thread Bruce Momjian
Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  Kevin Grittner wrote:
  
  The controller waits for the drive to tell it that it has made it
  to the platter before it discards it.  What made you think
  otherwise?
  
  Because a write-back drive cache says it is on the drive before it
  hits the platters, which I think is the default for SATA drives.
  Is that inaccurate?
  
 Any decent RAID controller will ensure that the drives themselves
 aren't using write-back caching.  When we've mentioned write-back
 versus write-through on this thread we've been talking about the
 behavior of the *controller*.  We have our controllers configured to
 use write-back through the BBU cache as long as the battery is good,
 but to automatically switch to write-through if the battery goes
 bad.

OK, good, but why would a BBU RAID controller flush stuff to disk
with a flush-all command?  I thought the whole goal of BBU was to avoid
such flushes.  What is unique about the command ext4/xfs is sending?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +



Re: [PERFORM] Weird XFS WAL problem

2010-06-04 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 Kevin Grittner wrote:
 
 Any decent RAID controller will ensure that the drives themselves
 aren't using write-back caching.  When we've mentioned write-back
 versus write-through on this thread we've been talking about the
 behavior of the *controller*.  We have our controllers configured
 to use write-back through the BBU cache as long as the battery is
 good, but to automatically switch to write-through if the battery
 goes bad.
 
 OK, good, but why would a BBU RAID controller flush stuff to
 disk with a flush-all command?  I thought the whole goal of BBU
 was to avoid such flushes.
 
That has been *precisely* my point.
 
I don't know at the protocol level; I just know that write barriers
do *something* which causes our controllers to wait for actual disk
platter persistence, while fsync does not.
 
The write barrier concept seems good to me, and I wish it could be
used at the OS level without killing performance.  I blame the
controller, for not treating it the same as fsync (i.e., as long as
it's in write-back mode it should treat data as persisted as soon as
it's in BBU cache).
 
-Kevin



Re: [PERFORM] Weird XFS WAL problem

2010-06-04 Thread Bruce Momjian
Kevin Grittner wrote:
 Bruce Momjian br...@momjian.us wrote:
  Kevin Grittner wrote:
  
  Any decent RAID controller will ensure that the drives themselves
  aren't using write-back caching.  When we've mentioned write-back
  versus write-through on this thread we've been talking about the
  behavior of the *controller*.  We have our controllers configured
  to use write-back through the BBU cache as long as the battery is
  good, but to automatically switch to write-through if the battery
  goes bad.
  
  OK, good, but why would a BBU RAID controller flush stuff to
  disk with a flush-all command?  I thought the whole goal of BBU
  was to avoid such flushes.
  
 That has been *precisely* my point.
  
 I don't know at the protocol level; I just know that write barriers
 do *something* which causes our controllers to wait for actual disk
 platter persistence, while fsync does not.
  
 The write barrier concept seems good to me, and I wish it could be
 used at the OS level without killing performance.  I blame the
 controller, for not treating it the same as fsync (i.e., as long as
 it's in write-back mode it should treat data as persisted as soon as
 it's in BBU cache).

Yeah.  I wonder if it honors the cache flush because it might think it
is replacing disks or something odd.  I think we are going to have to
document this in 9.0 because obviously you have seen it already.

Is this an issue with SAS cards/drives as well?

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +



Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Merlin Moncure
On Wed, Jun 2, 2010 at 7:30 PM, Craig James craig_ja...@emolecules.com wrote:
 I'm testing/tuning a new midsize server and ran into an inexplicable
 problem.  With a RAID10 drive, when I move the WAL to a separate RAID1
 drive, TPS drops from over 1200 to less than 90!   I've checked everything
 and can't find a reason.

 Here are the details.

 8 cores (2x4 Intel Nehalem 2 GHz)
 12 GB memory
 12 x 7200 SATA 500 GB disks
 3WARE 9650SE-12ML RAID controller with bbu
  2 disks: RAID1  500GB ext4  blocksize=4096
  8 disks: RAID10 2TB, stripe size 64K, blocksize=4096 (ext4 or xfs - see
 below)
  2 disks: hot swap
 Ubuntu 10.04 LTS (Lucid)

 With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench results
 (this one is for xfs):

 Version 1.03e       --Sequential Output-- --Sequential Input-
 --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
 --Seeks--
 Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec
 %CP
 argon        24064M 70491  99 288158  25 129918  16 65296  97 428210  23
 558.9   1
                    --Sequential Create-- Random
 Create
                    -Create-- --Read--- -Delete-- -Create-- --Read---
 -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec
 %CP
                 16 23283  81 + +++ 13775  56 20143  74 + +++ 15152
  54
 argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+,+++,13775,56,20143\
 ,74,+,+++,15152,54

 pgbench -i -s 100 -U test
 pgbench -c 10 -t 1 -U test
    scaling factor: 100
    query mode: simple
    number of clients: 10
    number of transactions per client: 1
    number of transactions actually processed: 10/10
    tps = 1046.104635 (including connections establishing)
    tps = 1046.337276 (excluding connections establishing)

 Now the mystery: I moved the pg_xlog directory to a RAID1 array (same 3WARE
 controller, two more SATA 7200 disks).  Run the same tests and ...

    tps = 82.325446 (including connections establishing)
    tps = 82.326874 (excluding connections establishing)

 I thought I'd made a mistake, like maybe I moved the whole database to the
 RAID1 array, but I checked and double checked.  I even watched the lights
 blink - the WAL was definitely on the RAID1 and the rest of Postgres on the
 RAID10.

 So I moved the WAL back to the RAID10 array, and performance jumped right
 back up to the 1200 TPS range.

 Next I check the RAID1 itself:

  dd if=/dev/zero of=./bigfile bs=8192 count=200

 which yielded 98.8 MB/sec - not bad.  bonnie++ on the RAID1 pair showed good
 performance too:

 Version 1.03e       --Sequential Output-- --Sequential Input-
 --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
 --Seeks--
 Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec
 %CP
 argon        24064M 68601  99 110057  18 46534   6 59883  90 123053   7
 471.3   1
                    --Sequential Create-- Random
 Create
                    -Create-- --Read--- -Delete-- -Create-- --Read---
 -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec
 %CP
                 16 + +++ + +++ + +++ + +++ + +++ +
 +++
 argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+,+++,+,+++,+,+++,+,\
 +++,+,+++,+,+++

 So ... anyone have any idea at all how TPS drops to below 90 when I move the
 WAL to a separate RAID1 disk?  Does this make any sense at all?  It's
 repeatable. It happens for both ext4 and xfs. It's weird.

 You can even watch the disk lights and see it: the RAID10 disks are on
 almost constantly when the WAL is on the RAID10, but when you move the WAL
 over to the RAID1, its lights are dim and flicker a lot, like it's barely
 getting any data, and the RAID10 disk's lights barely go on at all.

*) Is your RAID1 configured with write-back cache on the controller?
*) Have you tried changing wal_sync_method to fdatasync?  (Rough examples
   of both below.)
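
Purely as illustration (the 3ware unit number /c0/u1 and the data
directory path are placeholders, not taken from your setup):

  # 3ware CLI: show the RAID1 unit's settings, including its write cache state
  tw_cli /c0/u1 show all
  # enable the controller write cache on that unit (only sensible with a good BBU)
  tw_cli /c0/u1 set cache=on

  # in postgresql.conf, then reload:
  #   wal_sync_method = fdatasync
  pg_ctl reload -D /path/to/data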

merlin



Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Greg Smith

Craig James wrote:
I'm testing/tuning a new midsize server and ran into an inexplicable 
problem.  With a RAID10 drive, when I move the WAL to a separate 
RAID1 drive, TPS drops from over 1200 to less than 90!


Normally 100 TPS means that the write cache on the WAL drive volume is 
disabled (or set to write-through instead of write-back).  When things 
in this area get fishy, I will usually download sysbench and have it 
specifically test how many fsync calls can happen per second.  
http://projects.2ndquadrant.com/talks , "Database Hardware 
Benchmarking", page 28 has an example of the right incantation for that.
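
Something along these lines (sysbench 0.4 fileio syntax; the small file
size is arbitrary) reports a requests/sec figure that, with an fsync after
every write, is essentially an fsync-per-second rate:

  sysbench --test=fileio --file-num=1 --file-total-size=16384 \
    --file-test-mode=rndwr --file-fsync-freq=1 prepare
  sysbench --test=fileio --file-num=1 --file-total-size=16384 \
    --file-test-mode=rndwr --file-fsync-freq=1 run
  sysbench --test=fileio --file-num=1 --file-total-size=16384 \
    --file-test-mode=rndwr --file-fsync-freq=1 cleanup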


Also, make sure you run 3ware's utilities and confirm all the disks have 
finished their initialization and verification stages.  If you just 
adjusted the disk layout and immediately launched into benchmarks, those 
results are useless until the background cleanup is done.
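
For example (the controller number /c0 is a placeholder):

  # unit status should read OK rather than INITIALIZING/VERIFYING/REBUILDING
  tw_cli /c0 show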


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Craig James

On 6/2/10 4:40 PM, Mark Kirkwood wrote:

On 03/06/10 11:30, Craig James wrote:

I'm testing/tuning a new midsize server and ran into an inexplicable
problem. With a RAID10 drive, when I move the WAL to a separate RAID1 
drive, TPS drops from over 1200 to less than 90! I've checked
everything and can't find a reason.


Are the 2 new RAID1 disks the same make and model as the 12 RAID10 ones?


Yes.


Also, are barriers *on* on the RAID1 mount and off on the RAID10 one?


It was the barriers.  barrier=1 isn't just a bad idea on ext4, it's a 
disaster.

pgbench -i -s 100 -U test
pgbench -c 10 -t 1 -U test

Change WAL to barrier=0

tps = 1463.264981 (including connections establishing)
tps = 1463.725687 (excluding connections establishing)

Change WAL to noatime, nodiratime, barrier=0

tps = 1479.331476 (including connections establishing)
tps = 1479.810545 (excluding connections establishing)

Change WAL to barrier=1

tps = 82.325446 (including connections establishing)
tps = 82.326874 (excluding connections establishing)

This is really hard to believe, because the bonnie++ numbers and dd(1) numbers look good 
(see my original post).  But it's totally repeatable.  It must be some really unfortunate 
"just missed the next sector going by the write head" problem.

So with ext4, bonnie++ and dd aren't the whole story.

BTW, I also learned that if you edit /etc/fstab and use "mount -o remount", it WON'T change barrier=0/1 
unless the option is explicit in the fstab file.  That is, if you put barrier=0 into /etc/fstab and remount, it will 
switch to no barriers.  But if you then remove the option from /etc/fstab, a remount won't change it back to 
the default; you have to actually put barrier=1 in /etc/fstab to get back to the default.  This seems like a bug 
to me, and it made this really hard to track down. "mount -o remount" is not the same as umount/mount!
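
A minimal sketch of the explicit-option behaviour described above (the
mount point /mnt/wal and the device name are placeholders):

  # /etc/fstab entry for the WAL volume, with the barrier option spelled out:
  #   /dev/sdb1  /mnt/wal  ext4  noatime,nodiratime,barrier=0  0 0

  mount -o remount /mnt/wal              # picks up the options from fstab
  mount -o remount,barrier=1 /mnt/wal    # or force a setting on the command line
  grep /mnt/wal /proc/mounts             # what the kernel thinks now (shown options vary by kernel)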

Craig



Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Matthew Wakeling

On Thu, 3 Jun 2010, Craig James wrote:

Also, are barriers *on* on the RAID1 mount and off on the RAID10 one?


It was the barriers.  barrier=1 isn't just a bad idea on ext4, it's a 
disaster.


This worries me a little. Does your array have a battery-backed cache? If 
so, then it should be fast regardless of barriers (although barriers may 
make a small difference). If it does not, then it is likely that the fast 
speed you are seeing with barriers off is unsafe.


There should never be a "just missed the sector going past" problem for 
writes with a battery-backed cache.


Matthew

--
There once was a limerick .sig
that really was not very big
It was going quite fine
Till it reached the fourth line



Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Kevin Grittner
Matthew Wakeling matt...@flymine.org wrote:
 On Thu, 3 Jun 2010, Craig James wrote:
 Also, are barriers *on* on the RAID1 mount and off on the RAID10
one?

 It was the barriers.  barrier=1 isn't just a bad idea on ext4,
 it's a disaster.
 
 This worries me a little. Does your array have a battery-backed
 cache? If so, then it should be fast regardless of barriers
 (although barriers may make a small difference). If it does not,
 then it is likely that the fast speed you are seeing with barriers
 off is unsafe.
 
I've seen this, too (with xfs).  Our RAID controller, in spite of
having BBU cache configured for writeback, waits for actual
persistence on disk for write barriers (unlike for fsync).  This
does strike me as surprising to the point of bordering on qualifying
as a bug.  It means that you can't take advantage of the BBU cache
and get the benefit of write barriers in OS cache behavior.  :-(
 
-Kevin



Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Greg Smith

Kevin Grittner wrote:

I've seen this, too (with xfs).  Our RAID controller, in spite of
having BBU cache configured for writeback, waits for actual
persistence on disk for write barriers (unlike for fsync).  This
does strike me as surprising to the point of bordering on qualifying
as a bug.
Completely intentional, and documented at 
http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F


The issue is that XFS will actually send the full "flush your cache" 
call to the controller, rather than just the usual fsync call, and that 
eliminates the benefit of having a write cache there in the first 
place.  Good controllers respect that and flush their whole write cache 
out.  And ext4 has adopted the same mechanism.  This is very much a good 
thing from the perspective of database reliability for people with 
regular hard drives who don't have a useful write cache on their cheap 
hard drives.  It allows them to keep the disk's write cache on for other 
things, while still getting the proper cache flushes when the database 
commits demand them.  It does mean that everyone with a non-volatile 
battery backed cache, via RAID card typically, needs to turn barriers 
off manually.


I've already warned on this list that PostgreSQL commit performance on 
ext4 is going to appear really terrible to many people.  If you 
benchmark and don't recognize ext3 wasn't operating in a reliable mode 
before, the performance drop now that ext4 is doing the right thing with 
barriers looks impossibly bad.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Greg Smith

Craig James wrote:
This is really hard to believe, because the bonnie++ numbers and dd(1) 
numbers look good (see my original post).  But it's totally 
repeatable.  It must be some really unfortunate "just missed the next 
sector going by the write head" problem.


Commit performance is a separate number to measure that is not reflected 
in any benchmark that tests sequential performance.  I consider it the 
fourth axis of disk system performance (seq read, seq write, random 
IOPS, commit rate), and directly measure it with the sysbench fsync test 
I recommended already.  (You can do it with the right custom pgbench 
script too).


You only get one commit per rotation on a drive, which is exactly what 
you're seeing:  a bit under the 120 spins/second @ 7200 RPM.  Attempts 
to time things just right to catch more than one sector per spin are 
extremely difficult to accomplish, I spent a week on that once without 
making any good progress.  You can easily get 100MB/s on reads and 
writes but only manage 100 commits/second.
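
A sketch of the custom pgbench variant (the database name, script name and
hard-coded row range are illustrative; it assumes a database already
initialized with pgbench -i):

  commit_rate.sql:
    \setrandom aid 1 100000
    UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;

  pgbench -n -f commit_rate.sql -c 1 -T 30 test

With a single client and one tiny write per transaction, the reported TPS
is dominated by how fast commits reach stable storage rather than by data
volume.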


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote:
 Kevin Grittner wrote:
 I've seen this, too (with xfs).  Our RAID controller, in spite of
 having BBU cache configured for writeback, waits for actual
 persistence on disk for write barriers (unlike for fsync).  This
 does strike me as surprising to the point of bordering on
 qualifying as a bug.
 Completely intentional, and documented at 

http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F
 
Yeah, I read that long ago and I've disabled write barriers because
of it; however, it still seems wrong that the RAID controller
insists on flushing to the drives in write-back mode.  Here are my
reasons for wishing it was otherwise:
 
(1)  We've had batteries on our RAID controllers fail occasionally. 
The controller automatically degrades to write-through, and we get
an email from the server and schedule a tech to travel to the site
and replace the battery; but until we take action we are now exposed
to possible database corruption.  Barriers don't automatically come
on when the controller flips to write-through mode.
 
(2)  It precludes any possibility of moving from fsync techniques to
write barrier techniques for ensuring database integrity.  If the OS
respected write barriers and the controller considered the write
satisfied when it hit BBU cache, write barrier techniques would
work, and checkpoints could be made smoother.  Think how nicely that
would inter-operate with point (1).
 
So, while I understand it's Working As Designed, I think the design
is surprising and sub-optimal.
 
-Kevin



Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Scott Marlowe
On Thu, Jun 3, 2010 at 12:40 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:

 Yeah, I read that long ago and I've disabled write barriers because
 of it; however, it still seems wrong that the RAID controller
 insists on flushing to the drives in write-back mode.  Here are my
 reasons for wishing it was otherwise:

I think it's a case of the quickest, simplest answer to semi-new tech:
"Not sure what to do with barriers?  Just flush the whole cache."

I'm guessing that this will get optimized in the future.

BTW, I'll have the latest and greatest LSI MegaRAID to test on in a month,
and older Areca 1680s as well. I'll be updating the firmware on the
Arecas, and I'll run some tests on the whole barrier behaviour to see
if it's gotten any better lately.



Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Kevin Grittner
Scott Marlowe scott.marl...@gmail.com wrote:
 
 I think it's a case of the quickest, simplest answer to semi-new
 tech: "Not sure what to do with barriers?  Just flush the whole
 cache."
 
 I'm guessing that this will get optimized in the future.
 
Let's hope so.
 
That reminds me, the write barrier concept is at least on the
horizon as a viable technology; does anyone know if the "asynchronous
graphs" concept in this (one-page) paper ever came to anything?  (I
haven't heard anything about it lately.)
 
http://www.usenix.org/events/fast05/wips/burnett.pdf
 
-Kevin



Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Greg Smith

Scott Marlowe wrote:

I think it's a case of the quickest, simplest answer to semi-new tech:
 "Not sure what to do with barriers?  Just flush the whole cache."
  


Well, that really is the only useful thing you can do with regular SATA 
drives; the ATA command set isn't any finer grained than that in a way 
that's useful for this context.  And it's also quite reasonable for a 
RAID controller to respond to that "flush the whole cache" call by 
flushing its cache.  So it's not just the simplest first answer, I 
believe it's the only answer until a better ATA command set becomes 
available.


I think this can only be resolved usefully for all of us at the RAID 
firmware level.  If the controller had some logic that said "it's OK to 
not flush the cache when that call comes in if my battery is working 
fine", that would make this whole problem go away.  I don't expect it's 
possible to work around the exact set of concerns Kevin listed any other 
way, because as he pointed out the right thing to do is very dependent 
on the battery health, which the OS also doesn't know (again, that would 
require some new command-set verbiage).


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Scott Marlowe
On Thu, Jun 3, 2010 at 1:31 PM, Greg Smith g...@2ndquadrant.com wrote:
 Scott Marlowe wrote:

  I think it's a case of the quickest, simplest answer to semi-new tech:
   "Not sure what to do with barriers?  Just flush the whole cache."


 Well, that really is the only useful thing you can do with regular SATA
 drives; the ATA command set isn't any finer grained than that in a way
 that's useful for this context.  And it's also quite reasonable for a RAID
  controller to respond to that "flush the whole cache" call by flushing its
 cache.  So it's not just the simplest first answer, I believe it's the only
 answer until a better ATA command set becomes available.

 I think this can only be resolved usefully for all of us at the RAID
 firmware level.  If the controller had some logic that said "it's OK to not
 flush the cache when that call comes in if my battery is working fine",

That's what already happens for fsync on a BBU controller, so I don't
think the code to do so would be something fancy and new, just a
simple change of logic on which code path to take.



Re: [PERFORM] Weird XFS WAL problem

2010-06-03 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote:
 
 I think this can only be resolved usefully for all of us at the
 RAID firmware level.  If the controller had some logic that said
 "it's OK to not flush the cache when that call comes in if my
 battery is working fine", that would make this whole problem go
 away.
 
That is exactly what I've been trying to suggest.  Sorry for not
being more clear about it.
 
-Kevin



[PERFORM] Weird XFS WAL problem

2010-06-02 Thread Craig James

I'm testing/tuning a new midsize server and ran into an inexplicable problem.  
With a RAID10 drive, when I move the WAL to a separate RAID1 drive, TPS drops 
from over 1200 to less than 90!   I've checked everything and can't find a 
reason.

Here are the details.

8 cores (2x4 Intel Nehalem 2 GHz)
12 GB memory
12 x 7200 SATA 500 GB disks
3WARE 9650SE-12ML RAID controller with bbu
  2 disks: RAID1  500GB ext4  blocksize=4096
  8 disks: RAID10 2TB, stripe size 64K, blocksize=4096 (ext4 or xfs - see below)
  2 disks: hot swap
Ubuntu 10.04 LTS (Lucid)

With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench results (this 
one is for xfs):

Version 1.03e   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
argon24064M 70491  99 288158  25 129918  16 65296  97 428210  23 558.9  
 1
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16 23283  81 + +++ 13775  56 20143  74 + +++ 15152  54
argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+,+++,13775,56,20143\
,74,+,+++,15152,54

pgbench -i -s 100 -U test
pgbench -c 10 -t 1 -U test
scaling factor: 100
query mode: simple
number of clients: 10
number of transactions per client: 1
number of transactions actually processed: 10/10
tps = 1046.104635 (including connections establishing)
tps = 1046.337276 (excluding connections establishing)

Now the mystery: I moved the pg_xlog directory to a RAID1 array (same 3WARE 
controller, two more SATA 7200 disks).  Run the same tests and ...

tps = 82.325446 (including connections establishing)
tps = 82.326874 (excluding connections establishing)

I thought I'd made a mistake, like maybe I moved the whole database to the 
RAID1 array, but I checked and double checked.  I even watched the lights blink 
- the WAL was definitely on the RAID1 and the rest of Postgres on the RAID10.

So I moved the WAL back to the RAID10 array, and performance jumped right back up 
to the 1200 TPS range.

Next I check the RAID1 itself:

  dd if=/dev/zero of=./bigfile bs=8192 count=200

which yielded 98.8 MB/sec - not bad.  bonnie++ on the RAID1 pair showed good 
performance too:

Version 1.03e   --Sequential Output-- --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
argon24064M 68601  99 110057  18 46534   6 59883  90 123053   7 471.3   
1
--Sequential Create-- Random Create
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
  files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 16 + +++ + +++ + +++ + +++ + +++ + +++
argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+,+++,+,+++,+,+++,+,\
+++,+,+++,+,+++

So ... anyone have any idea at all how TPS drops to below 90 when I move the 
WAL to a separate RAID1 disk?  Does this make any sense at all?  It's 
repeatable. It happens for both ext4 and xfs. It's weird.

You can even watch the disk lights and see it: the RAID10 disks are on almost 
constantly when the WAL is on the RAID10, but when you move the WAL over to the 
RAID1, its lights are dim and flicker a lot, like it's barely getting any data, 
and the RAID10 disk's lights barely go on at all.

Thanks,
Craig




Re: [PERFORM] Weird XFS WAL problem

2010-06-02 Thread Mark Kirkwood

On 03/06/10 11:30, Craig James wrote:
I'm testing/tuning a new midsize server and ran into an inexplicable 
problem.  With a RAID10 drive, when I move the WAL to a separate 
RAID1 drive, TPS drops from over 1200 to less than 90!   I've checked 
everything and can't find a reason.

Are the 2 new RAID1 disks the same make and model as the 12 RAID10 ones?

Also, are barriers *on* on the RAID1 mount and off on the RAID10 one?

Cheers

Mark

