I'm testing/tuning a new midsize server and ran into an inexplicable problem.  
With the database on a RAID10 array, when I move the WAL to a separate RAID1 
array, TPS drops from over 1200 to less than 90!   I've checked everything and 
can't find a reason.

Here are the details.

8 cores (2x4 Intel Nehalem 2 GHz)
12 GB memory
12 x 7200 RPM SATA 500 GB disks
3WARE 9650SE-12ML RAID controller with BBU
  2 disks: RAID1  500GB ext4  blocksize=4096
  8 disks: RAID10 2TB, stripe size 64K, blocksize=4096 (ext4 or xfs - see below)
  2 disks: hot swap
Ubuntu 10.04 LTS (Lucid)

With xfs or ext4 on the RAID10 I got decent bonnie++ and pgbench results (this 
one is for xfs):

Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
argon        24064M 70491  99 288158  25 129918  16 65296  97 428210  23 558.9   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 23283  81 +++++ +++ 13775  56 20143  74 +++++ +++ 15152  54
argon,24064M,70491,99,288158,25,129918,16,65296,97,428210,23,558.9,1,16,23283,81,+++++,+++,13775,56,20143,74,+++++,+++,15152,54

pgbench -i -s 100 -U test
pgbench -c 10 -t 10000 -U test
    scaling factor: 100
    query mode: simple
    number of clients: 10
    number of transactions per client: 10000
    number of transactions actually processed: 100000/100000
    tps = 1046.104635 (including connections establishing)
    tps = 1046.337276 (excluding connections establishing)

Now the mystery: I moved the pg_xlog directory to a RAID1 array (same 3WARE 
controller, two more 7200 RPM SATA disks).  I ran the same tests and ...

    tps = 82.325446 (including connections establishing)
    tps = 82.326874 (excluding connections establishing)
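
To put a number on the slowdown, the two pgbench runs can be compared directly. 
This is only a minimal sketch: tps_excluding and tps_ratio are hypothetical 
helper names, and the awk pattern assumes the pgbench output format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read pgbench output on stdin and print the
# "excluding connections establishing" tps figure.
tps_excluding() {
    awk '/^ *tps = .*excluding/ { print $3 }'
}

# Hypothetical helper: print how many times faster run A was than run B.
tps_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Example with the two results quoted in this message:
tps_ratio 1046.337276 82.326874
```

For a live run the output could be piped straight in, e.g. 
pgbench -c 10 -t 10000 -U test | tps_excluding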

I thought I'd made a mistake, like maybe I moved the whole database to the 
RAID1 array, but I checked and double checked.  I even watched the lights blink 
- the WAL was definitely on the RAID1 and the rest of Postgres on the RAID10.
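
The placement check can also be scripted rather than eyeballed. This is a rough 
sketch, assuming pg_xlog was relocated with a symlink inside the data directory 
and that GNU readlink is available; wal_location is a hypothetical helper name, 
and the data directory path is passed in rather than guessed.

```shell
#!/bin/sh
# Hypothetical helper: given a PostgreSQL data directory, print the real
# path of pg_xlog and then the filesystem that path lives on.
wal_location() {
    datadir=$1
    # Resolve any symlink so the true location of the WAL directory is shown.
    resolved=$(readlink -f "$datadir/pg_xlog") || return 1
    echo "$resolved"
    # df -P prints one line per filesystem; the last line is the one
    # holding the resolved path (device and mount point).
    df -P "$resolved" | tail -n 1
}
```

Calling wal_location with the data directory (for example "$PGDATA", if that 
variable is set) should show the RAID1 mount point when the WAL has been moved.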

So I moved the WAL back to the RAID10 array, and performance jumped right back up 
to the >1200 TPS range.

Next I checked the RAID1 itself:

  dd if=/dev/zero of=./bigfile bs=8192 count=2000000

which yielded 98.8 MB/sec - not bad.  bonnie++ on the RAID1 pair showed good 
performance too:

Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
argon        24064M 68601  99 110057  18 46534   6 59883  90 123053   7 471.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
argon,24064M,68601,99,110057,18,46534,6,59883,90,123053,7,471.3,1,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
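
The dd and bonnie++ runs above mostly measure buffered sequential throughput, 
whereas WAL writes at commit time are small and flushed synchronously. A 
complementary check is to time synchronous 8K writes on each array. This is 
only a sketch, not a diagnosis: it assumes GNU dd (for oflag=dsync), and 
sync_write_test is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: write COUNT 8K blocks into DIR with O_DSYNC, so each
# write must reach stable storage (or the controller cache) before the next
# one starts. Prints dd's summary line, then removes the test file.
sync_write_test() {
    dir=$1
    count=${2:-1000}
    testfile="$dir/sync_write_test.$$"
    # dd reports its statistics on stderr; keep only the final summary line.
    dd if=/dev/zero of="$testfile" bs=8192 count="$count" oflag=dsync 2>&1 | tail -n 1
    rm -f "$testfile"
}
```

Running it once in a directory on the RAID1 and once on the RAID10 would show 
whether the two arrays behave differently for synchronous writes.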

So ... anyone have any idea at all how TPS drops to below 90 when I move the 
WAL to a separate RAID1 disk?  Does this make any sense at all?  It's 
repeatable. It happens for both ext4 and xfs. It's weird.

You can even watch the disk lights and see it: the RAID10 disks are on almost 
constantly when the WAL is on the RAID10, but when you move the WAL over to the 
RAID1, its lights are dim and flicker a lot, like it's barely getting any data, 
and the RAID10 disks' lights barely go on at all.

Thanks,
Craig

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance