Re: [PERFORM] Quad processor options - summary

2004-05-14 Thread Mark Kirkwood
I would recommend trying out several stripe sizes, and making your own 
measurements.

A while ago I was involved in building a data warehouse system (Oracle, 
DB2) and after several file and db benchmark exercises we used 256K 
stripes, as these gave the best overall performance results for both 
systems.

I am not saying "1M is wrong", but I am saying "1M may not be right" :-)
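If you are measuring with Linux software RAID, note that the chunk size
is fixed when the array is created, so each candidate size means
re-creating the array and re-running your benchmark. A minimal sketch
(device names are hypothetical):

  # create a 4-disk RAID0 with a 256K chunk and take a quick reading
  mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
  time dd if=/dev/md0 of=/dev/null bs=1M count=4096
  # tear down and repeat with --chunk=4, 64, 1024, ...
  mdadm --stop /dev/md0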
regards
Mark
Bjoern Metzdorf wrote:
1. Get many drives and stripe them into a RAID0 with a stripe width of 
1MB. I am not quite sure whether this stripe width is to be controlled 
at the application level (does postgres support this?) or whether e.g. 
the "chunk size" of the Linux software RAID driver is meant. Normally a 
chunk size of 4KB is recommended, so 1MB sounds fairly large.




Re: [PERFORM] Quad processor options - summary

2004-05-13 Thread Paul Tuckfield
One big caveat re. the "SAME" striping strategy is that readahead can 
really hurt an OLTP workload.

Mind you, if you're going from a few disks to a caching array with many 
disks, it'll be hard not to see a big improvement.

But if you push the envelope of the array with a "SAME" configuration, 
readahead will hurt.  Readahead is good for sequential reads but bad 
for random reads, because the various caches (array and filesystem) get 
flooded with all the blocks that happen to come after whatever random 
blocks you're reading.  Because they're random reads, these extra 
blocks are generally *not* read by subsequent queries if the database 
is much larger than the cache itself.  Of course, the readahead blocks 
are good if you're doing sequential scans, but you're not doing 
sequential scans because it's an OLTP database, right?

So this'll probably incite flames, but: in an OLTP environment of 
decent size, readahead is bad.  The ideal would be to adjust it 
dynamically until optimal (likely no readahead at all) if the array 
allows it, but most people are fooled by the good performance of 
readahead in simple single-threaded or small-dataset tests, and get 
bitten by it under concurrent loads or large datasets.
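At least the kernel-level readahead is easy to experiment with on 
Linux: it can be changed per block device at runtime (values below are 
illustrative, and whether a caching array honors its own readahead 
settings is a separate question):

  # show the current readahead, in 512-byte sectors
  blockdev --getra /dev/sda
  # effectively disable readahead for a random-I/O test run
  blockdev --setra 0 /dev/sda
  # restore a modest value afterwards
  blockdev --setra 256 /dev/sda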

James Thornton wrote:

This is what I consider the ultimate platform for PostgreSQL:
Hardware:
Tyan Thunder K8QS board
2-4 x Opteron 848 in NUMA mode
4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
LSI Megaraid 320-2 with 256 MB cache ram and battery backup
6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over 
both channels (3 + 4) for pgdata including indexes and wal.
You might also consider configuring the Postgres data drives for a 
RAID 10 SAME configuration as described in the Oracle paper "Optimal 
Storage Configuration Made Easy" 
(http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has 
anyone delved into this before?
OK, if I understand it correctly, the paper recommends the following:
1. Get many drives and stripe them into a RAID0 with a stripe width of 
1MB. I am not quite sure whether this stripe width is to be controlled 
at the application level (does postgres support this?) or whether e.g. 
the "chunk size" of the Linux software RAID driver is meant. Normally a 
chunk size of 4KB is recommended, so 1MB sounds fairly large.

2. Mirror your RAID0 and get a RAID10.
3. Use primarily the fast, outer regions of your disks. In practice 
this might be achieved by putting only half of the disk (the outer 
half) into your stripe set. E.g. put only the outer 18GB of your 36GB 
disks into the stripe set. Btw, is it common for all drives that the 
outer region is on the higher block numbers? Or is it sometimes on the 
lower block numbers?

4. Subset data by partition, not disk. If you have 8 disks, don't make 
a 4-disk RAID10 for data and another for log or indexes; instead make a 
global 8-drive RAID10 and partition it so that data and log + indexes 
are spread across all drives.

Interestingly, and quite contrary to what is normally recommended, they 
say it is as good or better to have one big stripe set over all 
available disks than to put log + indexes on a separate stripe set. 
Having one big stripe set means its full speed is available to all 
data. In practice this setup is as fast as or even faster than the 
"old" approach.


Bottom line for a normal, less-than-10-disk setup:
Get many disks (8 + spare), create a RAID0 with 4 disks and mirror it 
to the other 4 disks for a RAID10. Make sure to create the RAID on the 
outer half of the disks (setup may depend on the disk model and RAID 
controller used), leaving the inner half empty.
Use a logical volume manager (LVM), which always helps when adding 
disk space, and create 2 partitions on your RAID10: one for data and 
one for log + indexes. It should look like this:

+---+ +---+ +---+ +---+
| 1 | | 1 | | 1 | | 1 |
+---+ +---+ +---+ +---+  <- outer, faster half of the disk,
| 2 | | 2 | | 2 | | 2 |     part of the RAID10
+---+ +---+ +---+ +---+
|   | |   | |   | |   |
|   | |   | |   | |   |  <- inner, slower half of the disk,
|   | |   | |   | |   |     not used at all
+---+ +---+ +---+ +---+
Partition 1 for data, partition 2 for log + indexes. All mirrored to 
the other 4 disks not shown.

If you take 36GB disks, this should end up like this:
RAID10 has size of 36 / 2 * 4 = 72GB
Partition 1 is 36 GB
Partition 2 is 36 GB
If 36GB is not enough for your pgdata set, you might consider moving 
to 72GB disks, or (even better) making a 16-drive RAID10 out of 36GB 
disks; either way you end up with 72GB for your data, but the 16-drive 
version will be faster.

Any comments?
Regards,
Bjoern

Re: [PERFORM] Quad processor options - summary

2004-05-13 Thread James Thornton
Bjoern Metzdorf wrote:
You might also consider configuring the Postgres data drives for a 
RAID 10 SAME configuration as described in the Oracle paper "Optimal 
Storage Configuration Made Easy" 
(http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has 
anyone delved into this before?
OK, if I understand it correctly, the paper recommends the following:
1. Get many drives and stripe them into a RAID0 with a stripe width of 
1MB. I am not quite sure whether this stripe width is to be controlled 
at the application level (does postgres support this?) or whether e.g. 
the "chunk size" of the Linux software RAID driver is meant. Normally a 
chunk size of 4KB is recommended, so 1MB sounds fairly large.

2. Mirror your RAID0 and get a RAID10.
Don't use RAID 0+1 -- use RAID 1+0 instead. Performance is the same, but 
if a disk fails in a RAID 0+1 configuration, you are left with a RAID 0 
array. In a RAID 1+0 configuration, multiple disks can fail (one per 
mirrored pair) without losing the array.
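To make the distinction concrete: with Linux software RAID you would 
build 1+0 as a stripe over mirrored pairs, roughly like this (a sketch 
only, device names hypothetical):

  # mirror the pairs first...
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
  # ...then stripe across the mirrors: RAID 1+0
  mdadm --create /dev/md2 --level=0 --chunk=256 --raid-devices=2 \
        /dev/md0 /dev/md1

Reversing the order (striping first, then mirroring the two stripes) 
gives you RAID 0+1, which is exactly what you want to avoid.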

A few weeks ago I called LSI asking about the Dell PERC4-Di card, which 
is actually an LSI Megaraid 320-2. Dell's documentation said that its 
support for RAID 10 was in the form of RAID-1 concatenated, but LSI said 
that this is incorrect and that it supports RAID 10 proper.

3. Use primarily the fast, outer regions of your disks. In practice this 
might be achieved by putting only half of the disk (the outer half) into 
your stripe set. E.g. put only the outer 18GB of your 36GB disks into 
the stripe set. 
You can still use the inner half of the drives; just relegate it to 
less-frequently accessed data.

You also need to consider the filesystem.
SGI and IBM did a detailed study on Linux filesystem performance, which 
included XFS, ext2, ext3 (various modes), ReiserFS, and JFS, and the 
results are presented in a paper entitled "Filesystem Performance and 
Scalability in Linux 2.4.17" 
(http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf).

Scaling and load are key factors when selecting a filesystem. Because 
Postgres data is stored in large files, ReiserFS is not the ideal 
choice, as it has been optimized for small files. XFS is probably the 
best choice for a database server running on a quad processor box.
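If you do go with XFS on a striped array, you can tell the filesystem 
the stripe geometry at mkfs time so it can align its allocations. A 
sketch, assuming a 4-data-disk array with 256 KB chunks (sunit and 
swidth are given in 512-byte sectors; device name hypothetical):

  # 256 KB chunk = 512 sectors; 4 data disks -> swidth = 4 x sunit
  mkfs.xfs -d sunit=512,swidth=2048 /dev/md2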

However, Dr. Bert Scalzo of Quest argues that general file system 
benchmarks aren't ideal for benchmarking a filesystem for a database 
server. In a paper entitled "Tuning an Oracle8i Database running Linux" 
(http://otn.oracle.com/oramag/webcolumns/2002/techarticles/scalzo_linux02.html), 
 he says, "The trouble with these tests-for example, Bonnie, Bonnie++, 
Dbench, Iobench, Iozone, Mongo, and Postmark-is that they are basic file 
system throughput tests, so their results generally do not pertain in 
any meaningful fashion to the way relational database systems access 
data files." Instead he suggests using these two well-known and widely 
accepted database benchmarks:

* AS3AP: a scalable, portable ANSI SQL relational database benchmark 
that provides a comprehensive set of tests of database-processing power; 
has built-in scalability and portability for testing a broad range of 
systems; minimizes human effort in implementing and running benchmark 
tests; and provides a uniform metric and a straightforward 
interpretation of the results.

* TPC-C: an online transaction processing (OLTP) benchmark that involves 
a mix of five concurrent transactions of various types, either executed 
completely online or queued for deferred execution. The database 
comprises nine types of tables with a wide range of record and 
population sizes. The benchmark measures throughput in transactions 
per minute (tpmC).
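Short of a full TPC-C kit, a rough database-level test for Postgres is 
contrib's pgbench, which runs a TPC-B-like OLTP mix; it is not TPC-C, 
but at least it hits the data files the way a database does. An 
illustrative invocation (scale and client counts are arbitrary):

  createdb bench
  pgbench -i -s 100 bench       # initialize with scale factor 100
  pgbench -c 50 -t 1000 bench   # 50 concurrent clients, 1000 xacts each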

In the paper, Scalzo benchmarks ext2, ext3, ReiserFS, and JFS, but not 
XFS. Surprisingly, ext3 won, though Scalzo didn't address scaling or 
load. The result is surprising because most people think of ext3 as 
just ext2 with journaling, and therefore expect it to carry extra 
journaling overhead.

If you read papers on ext3, you'll discover that it has some optimizations 
that reduce disk head movement. For example, Daniel Robbins' "Advanced 
filesystem implementor's guide, Part 7: Introducing ext3" 
(http://www-106.ibm.com/developerworks/library/l-fs7/) says:

"The approach that the [ext3 Journaling Block Device layer API] uses is 
called physical journaling, which means that the JBD uses complete 
physical blocks as the underlying currency for implementing the 
journal...the use of full blocks allows ext3 to perform some additional 
optimizations, such as "squishing" multiple pending IO operations within 
a single block into the same in-memory data structure. This, in turn, 
allows ext3 to write these multiple changes to disk in a single write 
operation, rather than many. In addition, because the literal block data 
is stored in memory, little or no massaging of the in-memory data is 
required before writing it to disk, greatly reducing CPU overhead."
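As an aside for anyone reproducing such tests: ext3's journaling 
behavior is selected per mount, and the mode can matter as much as the 
filesystem choice itself (device and mount point are hypothetical):

  mount -t ext3 -o data=journal   /dev/md2 /pgdata  # data + metadata journaled
  mount -t ext3 -o data=ordered   /dev/md2 /pgdata  # default: metadata only, ordered data writes
  mount -t ext3 -o data=writeback /dev/md2 /pgdata  # metadata only, no ordering guarantee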

I suspect that fewer writes may be the key factor in ext3 winning 
Scalzo's DB benchmark. But as I said, Scalzo didn't benchmark XFS and 

Re: [PERFORM] Quad processor options - summary

2004-05-13 Thread Bjoern Metzdorf
James Thornton wrote:
This is what I consider the ultimate platform for PostgreSQL:
Hardware:
Tyan Thunder K8QS board
2-4 x Opteron 848 in NUMA mode
4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
LSI Megaraid 320-2 with 256 MB cache ram and battery backup
6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both 
channels (3 + 4) for pgdata including indexes and wal.
You might also consider configuring the Postgres data drives for a RAID 
10 SAME configuration as described in the Oracle paper "Optimal Storage 
Configuration Made Easy" 
(http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has 
anyone delved into this before?
OK, if I understand it correctly, the paper recommends the following:
1. Get many drives and stripe them into a RAID0 with a stripe width of 
1MB. I am not quite sure whether this stripe width is to be controlled 
at the application level (does postgres support this?) or whether e.g. 
the "chunk size" of the Linux software RAID driver is meant. Normally a 
chunk size of 4KB is recommended, so 1MB sounds fairly large.

2. Mirror your RAID0 and get a RAID10.
3. Use primarily the fast, outer regions of your disks. In practice this 
might be achieved by putting only half of the disk (the outer half) into 
your stripe set. E.g. put only the outer 18GB of your 36GB disks into 
the stripe set. Btw, is it common for all drives that the outer region 
is on the higher block numbers? Or is it sometimes on the lower block 
numbers?
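One crude way to check a given drive is to time raw reads at both ends 
of the device; on most drives block 0 sits on the outer, faster edge, 
so the first read should be noticeably quicker (device name 
hypothetical, reads only):

  # sequential read near the start of the disk...
  dd if=/dev/sda of=/dev/null bs=1M count=512
  # ...and near the end; skip counts 1 MB blocks, adjust to disk size
  dd if=/dev/sda of=/dev/null bs=1M count=512 skip=33000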

4. Subset data by partition, not disk. If you have 8 disks, don't make 
a 4-disk RAID10 for data and another for log or indexes; instead make a 
global 8-drive RAID10 and partition it so that data and log + indexes 
are spread across all drives.

Interestingly, and quite contrary to what is normally recommended, they 
say it is as good or better to have one big stripe set over all 
available disks than to put log + indexes on a separate stripe set. 
Having one big stripe set means its full speed is available to all 
data. In practice this setup is as fast as or even faster than the 
"old" approach.


Bottom line for a normal, less-than-10-disk setup:
Get many disks (8 + spare), create a RAID0 with 4 disks and mirror it 
to the other 4 disks for a RAID10. Make sure to create the RAID on the 
outer half of the disks (setup may depend on the disk model and RAID 
controller used), leaving the inner half empty.
Use a logical volume manager (LVM), which always helps when adding 
disk space, and create 2 partitions on your RAID10: one for data and 
one for log + indexes. It should look like this:

+---+ +---+ +---+ +---+
| 1 | | 1 | | 1 | | 1 |
+---+ +---+ +---+ +---+  <- outer, faster half of the disk,
| 2 | | 2 | | 2 | | 2 |     part of the RAID10
+---+ +---+ +---+ +---+
|   | |   | |   | |   |
|   | |   | |   | |   |  <- inner, slower half of the disk,
|   | |   | |   | |   |     not used at all
+---+ +---+ +---+ +---+
Partition 1 for data, partition 2 for log + indexes. All mirrored to the 
other 4 disks not shown.
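In LVM terms, the picture above might be built roughly like this (a 
sketch only; device and volume names are hypothetical, with md0 being 
the mirrored stripe set on the outer halves):

  pvcreate /dev/md0
  vgcreate pgvg /dev/md0
  lvcreate -L 36G -n data pgvg     # partition 1: pgdata
  lvcreate -L 36G -n walidx pgvg   # partition 2: log + indexes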

If you take 36GB disks, this should end up like this:
RAID10 has size of 36 / 2 * 4 = 72GB
Partition 1 is 36 GB
Partition 2 is 36 GB
If 36GB is not enough for your pgdata set, you might consider moving 
to 72GB disks, or (even better) making a 16-drive RAID10 out of 36GB 
disks; either way you end up with 72GB for your data, but the 16-drive 
version will be faster.

Any comments?
Regards,
Bjoern


Re: [PERFORM] Quad processor options - summary

2004-05-13 Thread James Thornton
Bjoern Metzdorf wrote:
Hi,
first of all, many thanks for your valuable replies. On my quest for the 
ultimate hardware platform I'll try to summarize the things I learned.

-

This is what I consider the ultimate platform for PostgreSQL:
Hardware:
Tyan Thunder K8QS board
2-4 x Opteron 848 in NUMA mode
4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
LSI Megaraid 320-2 with 256 MB cache ram and battery backup
6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both 
channels (3 + 4) for pgdata including indexes and wal.
You might also consider configuring the Postgres data drives for a RAID 
10 SAME configuration as described in the Oracle paper "Optimal Storage 
Configuration Made Easy" 
(http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf). Has 
anyone delved into this before?

--
 James Thornton
__
Internet Business Consultant, http://jamesthornton.com


Off Topic - Re: [PERFORM] Quad processor options - summary

2004-05-13 Thread Greg Spiegelberg
This is something I wish more of us did on the lists.  The list archives
have solutions and workarounds for every variety of problem, but very few
summary emails exist.  A good example of this practice is the
sun-managers mailing list, where the original poster sends a "SUMMARY"
reply to the list with the original problem included and all the
solutions found. It also makes searching the list archives easier.
Simply a suggestion for us all, myself included.
Greg
Bjoern Metzdorf wrote:
Hi,
first of all, many thanks for your valuable replies. On my quest for the 
ultimate hardware platform I'll try to summarize the things I learned.

-
This is our current setup:
Hardware:
Dual Xeon DP 2.4 on a TYAN S2722-533 with HT enabled
3 GB Ram (2 x 1 GB + 2 x 512 MB)
Mylex Extremeraid Controller U160 running RAID 10 with 4 x 18 GB SCSI 
10K RPM, no other drives involved (system, pgdata and wal are all on the 
same volume).

Software:
Debian 3.0 Woody
Postgresql 7.4.1 (selfcompiled, no special optimizations)
Kernel 2.4.22 + fixes
Database specs:
Size of a gzipped -9 full dump is roughly 1 GB
70-80% selects, 20-30% updates (roughly estimated)
up to 700-800 connections during peak times
kernel.shmall = 805306368
kernel.shmmax = 805306368
max_connections = 900
shared_buffers = 2
sort_mem = 16384
checkpoint_segments = 6
statistics collector is enabled (for pg_autovacuum)
Loads:
We are experiencing average CPU loads of up to 70% during peak hours. As 
Paul Tuckfield correctly pointed out, my vmstat output didn't support 
this. That output was not taken during peak times; it was freshly 
grabbed when I wrote my initial mail and resembles perhaps 50-60% of 
peak-time load (30% CPU usage). iostat does not report disk usage; I 
don't know exactly why, but the blk_read/wrtn columns are just empty 
(perhaps due to the Mylex rd driver).

-
Suggestions and solutions given:
Anjan Dave reported that he is pretty confident in his Quad Xeon 
setups, which will cost less than $20K at Dell with a reasonable 
hardware setup (Dell 6650 with 2.0GHz/1MB cache/8GB memory, 5 internal 
drives (4 in RAID 10, 1 spare) on U320, 128MB cache on the PERC 
controller).

Scott Marlowe pointed out that one should consider more than 4 drives 
(6 to 8; 10K RPM is enough, 15K is a rip-off) for a RAID 10 setup, 
because that can boost performance quite a lot. One should also use a 
battery-backed RAID controller. Scott has good experiences with the LSI 
Megaraid single-channel controller, which is reasonably priced at 
~$500. He also noted that 20-30% writes on a database is quite a lot.

Next Rob Sell told us about his research on more-than-2-way Intel-based 
systems. The memory bandwidth on the Xeon platform is always shared 
between the CPUs. While a 2-way Xeon may perform quite well, a 4-way 
system will suffer due to the reduced memory bandwidth available to 
each processor.

J. Andrew Roberts supports this. He said that 4-way Opteron systems 
scale much better than 4-way Xeon systems; scaling limits begin at 6-8 
CPUs on the Opteron platform. He also says that a fully equipped 
dual-channel LSI Megaraid 320 with 256MB cache RAM will be less than 
$1K, and a complete 4-way Opteron system will run $10K-$12K.

Paul Tuckfield then suggested bumping up my shared_buffers. With a 3GB 
memory system, I could happily be using 1GB for shared buffers 
(125000). This was questioned by Andrew McMillan, Manfred Kolzar and 
Halford Dace, who say that common tuning advice limits reasonable 
settings to 1-2 shared buffers, because the OS is better at caching 
than the database.
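For concreteness, the two positions differ in a single postgresql.conf 
line. Purely as an illustration (the 125000 figure is Paul's 1 GB 
suggestion; the smaller value is a representative conservative setting, 
not a quote from anyone):

  # Paul's suggestion: ~1 GB of shared buffers on the 3 GB box
  shared_buffers = 125000     # 125000 x 8 KB buffers = ~1 GB
  # conservative 7.4-era advice: keep it modest, let the OS cache
  #shared_buffers = 20000     # ~160 MB, illustrative value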

-
Conclusion:
After having read some comparisons between n-way Xeon and Opteron systems:
http://www.anandtech.com/IT/showdoc.html?i=1982
http://www.aceshardware.com/read.jsp?id=6275
I got the impression that an Opteron system is the way to go.
This is what I consider the ultimate platform for PostgreSQL:
Hardware:
Tyan Thunder K8QS board
2-4 x Opteron 848 in NUMA mode
4-8 GB RAM (DDR400 ECC Registered 1 GB modules, 2 for each processor)
LSI Megaraid 320-2 with 256 MB cache ram and battery backup
6 x 36GB SCSI 10K drives + 1 spare running in RAID 10, split over both 
channels (3 + 4) for pgdata including indexes and wal.
2 x 80 GB S-ATA IDE for system, running Linux software RAID 1 or the 
available onboard hardware RAID (perhaps also 2 x 36 GB SCSI)

Software:
Debian Woody in amd64 biarch mode, or perhaps a Red Hat/SuSE Enterprise 
64-bit distribution.
Kernel 2.6
Postgres 7.4.2 in 64bit mode
shared_buffers = 2
a bumped-up effective_cache_size
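For reference, effective_cache_size is measured in 8 KB pages, so 
"bumped up" on an 8 GB box that leaves, say, 6 GB to the OS cache would 
mean something like (illustrative arithmetic, not a recommendation):

  effective_cache_size = 786432   # 6 GB / 8 KB = 786432 pages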

Now the only problem left (besides my budget) is the availability of 
such a system.

I have found some vendors which ship similar systems, so I will have to 
talk to them about my dream configuration. I will not self build thi