Thank you so much for the input, and the detail provided. 

I'll be making the configuration changes, probably over the course of the week, 
checking the effect after each one (as Scott Marlowe reminded me). I was pushed 
to put the new version into production over the weekend, which should at least 
give me some accurate feedback, so I'll watch how it behaves for a bit before 
addressing the disk/drive layout. 

-Midge

  ----- Original Message ----- 
  From: Greg Smith 
  To: pgsql-performance@postgresql.org 
  Sent: Sunday, August 21, 2011 12:20 PM
  Subject: Re: [PERFORM] settings input for upgrade


  On 08/18/2011 05:55 PM, Midge Brown wrote:


    DB1 is 10GB and consists of multiple tables that I've spread out so that 
the 3 most used have their data and indexes on 6 separate RAID1 drives, the 3 
next busiest have data & index on 3 drives, and the remaining tables and 
indexes are on the RAID10 drive. The WAL for all is on a separate RAID1 drive. 
    DB2 is 25GB with data, index, and WAL all on separate RAID1 drives.
    DB3 is 15GB with data, index, and WAL on separate RAID1 drives.

  Anytime you have a set of disks and a set of databases/tables to lay out onto 
them, there are two main options to consider:

  -Put all of them into a single RAID10 array.  Performance will be high no 
matter what subset of the database is being used.  But if one particular part 
of a database is really busy, it can divert resources away from the rest.

  -Break the database into fine-grained pieces and carefully lay out each of 
them on disk.  Performance of any individual chunk will be steady here.  But if 
only a subset of the data is being used, the rest of the server's resources 
will sit idle: every disk that doesn't hold the data being accessed goes unused.

  Consider two configurations following these ideas:

  1) 12 disks are placed into a large RAID10 array.  Peak transfer rate will be 
about 600MB/s on sequential scans.

  2) 6 RAID1 arrays are created and the database is manually laid out onto 
those disks.  Peak transfer rate from any one section will be closer to 100MB/s.
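
  (For what it's worth, the manual layout in (2) is what tablespaces are for.  
A minimal sketch, where the mount points, tablespace names, and table are just 
placeholders for whatever your real layout looks like:)

    -- one tablespace per RAID1 pair; the paths here are hypothetical
    CREATE TABLESPACE busy_data  LOCATION '/mnt/raid1_a/pgdata';
    CREATE TABLESPACE busy_index LOCATION '/mnt/raid1_b/pgdata';

    -- pin a hot table and one of its indexes to specific arrays
    CREATE TABLE orders (id bigint PRIMARY KEY, placed_at timestamptz)
        TABLESPACE busy_data;
    CREATE INDEX orders_placed_at_idx ON orders (placed_at)
        TABLESPACE busy_index;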

  Each of these is optimizing for a different use scenario.  Here's the best 
case for each:

  -One user is active, and they're hitting one of the database sections.  In 
setup (1) they might get 600MB/s, the case where it shows the most benefit.  In 
setup (2), they'd only get 100MB/s.

  -10 users are pounding one section of the database; 1 user is hitting a 
different section.  In setup (2), all 10 users will be fighting over access to 
one section of the disks, each getting at best 10MB/s of transfer.  The nature 
of random I/O means it will likely be much worse for them.  Meanwhile, the user 
hitting the other database section will still be merrily chugging along at 
their 100MB/s.  Had setup (1) been used, you'd have 11 users fighting over 
600MB/s, so at best about 55MB/s each.  And with the random mix, it could be 
much worse.

  Which of these is better?  Well, (1) is guaranteed to use your hardware to 
its fullest capability.  There are some situations where contention over the 
shared array will make performance lower for some people than it would be in an 
isolated environment split up more like (2).  But the rest of the time, (2) 
will have taken a large number of disks and left them idle.  The second example 
shows this really well.  The sheer aggregate speed available means the big 
array doesn't necessarily suffer that badly under a heavy load; it has 6X as 
much capacity to absorb it.  You really need a greater than 6:1 imbalance in 
access before the carefully laid out version pulls ahead.  In every other case, 
the big array wins.

  You can defend (2) as the better choice if you have really compelling, hard 
data proving that access to the various parts of the data is split quite evenly 
across the expected workload.  If you have response time latency targets that 
require dividing resources evenly among the various types of users, it can also 
make sense there.  I don't know whether the data you've been collecting from 
your older version is good enough to say that for sure.

  In every other case, you'd be better off just dumping the whole pile into a 
single, large array and letting the array and operating system figure out how 
to schedule things.  That's why this is the normal practice for building 
PostgreSQL systems.  The one exception is that splitting out the pg_xlog 
filesystem can usually be justified even with a larger array.  Because its I/O 
is always sequential, mixing its writes with the rest of the server's traffic 
doesn't work as well as giving it a dedicated pair of drives, where it never 
has to stop and seek somewhere else.
    
     wal_buffers = 32MB

  This might as well drop to 16MB.  And you've already gotten some warnings 
about work_mem.  Switching to a connection pooler would help with that, too.
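
  If you want to confirm what the server actually ended up with after each 
change, a quick look at pg_settings shows the live values; for example:

    -- check the running values after a reload or restart
    SELECT name, setting, unit, source
      FROM pg_settings
     WHERE name IN ('wal_buffers', 'work_mem');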



     autovacuum_analyze_threshold = 250
     autovacuum_naptime = 10min
     autovacuum_vacuum_threshold = 250
     vacuum_cost_delay = 10ms

  This strikes me as more customization than you should be doing to 
autovacuum, particularly if you haven't yet run on a recent version of 
PostgreSQL.  You shouldn't ever need to touch the thresholds, for example.  
They only matter on very small tables; once a table gets big enough to matter, 
the threshold part is tiny compared to the scale factor part.  And the defaults 
are picked partly so that cleanup of the system catalog tables happens 
frequently enough.  Raising the thresholds this much slows that cleanup down, 
and that's not a great idea.
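
  To put rough numbers on that: autovacuum triggers on a table once changes 
exceed about threshold + scale_factor * rows.  With your 250 threshold and the 
default 0.2 scale factor, something like this (purely illustrative) shows how 
little the threshold contributes on big tables:

    -- approximate rows of churn needed before autovacuum touches each table
    -- 250 = your autovacuum_vacuum_threshold, 0.2 = default scale factor
    SELECT relname,
           reltuples::bigint AS approx_rows,
           (250 + 0.2 * reltuples)::bigint AS vacuum_trigger_rows
      FROM pg_class
     WHERE relkind = 'r'
     ORDER BY reltuples DESC
     LIMIT 10;

  On a 10 million row table that works out to 2,000,250 rows of changes; the 
250 is noise.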

  For similar reasons, you really shouldn't be touching autovacuum_naptime 
unless there's really good evidence it's necessary for your environment.

  Changing things so that regular vacuums executed at the command line happen 
with a cost delay like this should be fine, though.  With 10ms against the 20ms 
autovacuum default, they'll use roughly twice the resources of the autovacuum 
workers, while no longer running completely unthrottled the way a manual VACUUM 
does by default.
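
  If you'd rather apply that throttle only when you choose to, it can also be 
set per session; a small sketch, with a placeholder table name:

    -- throttle just this session's manual vacuum
    -- (the autovacuum workers default to a 20ms delay)
    SET vacuum_cost_delay = 10;   -- milliseconds
    VACUUM ANALYZE big_table;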


     deadlock_timeout = 3s 

  You probably don't want to increase this.  When you reach the point where you 
want to find slow lock issues by turning on log_lock_waits, you're just going 
to put it right back to the default again--or lower it.
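
  When that day comes, the usual combination looks something like this (values 
are illustrative, and these normally go in postgresql.conf rather than being 
set per session):

    -- log any lock wait that lasts longer than deadlock_timeout
    -- setting these with SET requires superuser
    SET log_lock_waits = on;
    SET deadlock_timeout = '1s';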


-- 
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
