Greetings,
in contrast to some of the performance tips regarding small-file *read*
performance, I want to share these results. The test is rather simple,
but it yields some remarkable results: a 400% improvement in read
performance, achieved simply by dropping some of the so-called
performance translators!
Please note that this test resembles a simplified version of our
workload, which is more or less sequential, read-only small-file serving
with an average of 100 concurrent clients. (We use GlusterFS as a
flat-file backend for a cluster of webservers; it is hit only after
missing some caches in a more sophisticated caching infrastructure on
top of it.)
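To give an idea of the access pattern, here is a rough sketch of that
kind of workload. It is *not* our actual test script; the path and the
client count are only placeholders:
-
# ~100 concurrent readers, each walking the tree and reading every
# (small) file sequentially:
for i in $(seq 1 100); do
    ( find /mnt/glusterfs/test -type f -print0 \
        | xargs -0 cat > /dev/null ) &
done
wait
-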
The test setup is a 3-node AFR cluster with server+client on each node,
using the single-process model (one volfile; the local volume is attached
within the same process to save overhead), connected via 1 Gbit Ethernet.
This way each node can continue to operate on its own, even if the whole
internal network for GlusterFS is down.
We used commodity hardware for the test. Each node is identical:
- Intel Core i7
- 12G RAM
- 500GB filesystem
- 1 Gbit NIC dedicated for GlusterFS
Software:
- Linux 2.6.32.8
- GlusterFS 3.0.2
- FUSE inited with protocol versions: glusterfs 7.13 kernel 7.13
- Filesystem / Storage Backend:
- LVM2 on top of software RAID 1
- ext4 with noatime
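For completeness: the backend could be recreated roughly as follows. The
RAID member devices are assumptions; the volume group "data" and logical
volume "test" match the fstab entry below:
-
# Software RAID 1 -> LVM2 -> ext4 (member devices are just examples)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
pvcreate /dev/md0
vgcreate data /dev/md0
lvcreate -L 500G -n test data
mkfs.ext4 /dev/data/test
-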
I will paste the configurations inline, so people can comment on them.
/etc/fstab:
-
/dev/data/test /mnt/brick/test ext4 noatime 0 2
/etc/glusterfs/test.vol /mnt/glusterfs/test glusterfs
noauto,noatime,log-level=NORMAL,log-file=/var/log/glusterfs/test.log 0 0
-
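Since the GlusterFS entry is marked noauto, it is mounted explicitly once
the brick is up, which also starts the single glusterfs server+client
process described above:
-
mount /mnt/brick/test       # ext4 brick (normally mounted at boot)
mount /mnt/glusterfs/test   # starts the combined server+client process
-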
***
Please note: this is the final configuration with the best results. All
translators are numbered to make the explanation easier later on. Unused
translators are commented out...
The volume spec is identical on all nodes, except that the bind-address
option in the server volume [*4*] is adjusted (see the example right
after this note).
***
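For example, on node2 only the bind-address line in the server volume
[*4*] changes:
-
# node2, server volume [*4*]:
option transport.socket.bind-address 10.1.0.2
-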
/etc/glusterfs/test.vol
-
# Sat Feb 27 16:53:00 CET 2010 John Feuerstein j...@feurix.com
#
# Single Process Model with AFR (Automatic File Replication).
##
## Storage backend
##
#
# POSIX STORAGE [*1*]
#
volume posix
type storage/posix
option directory /mnt/brick/test/glusterfs
end-volume
#
# POSIX LOCKS [*2*]
#
#volume locks
volume brick
type features/locks
subvolumes posix
end-volume
##
## Performance translators (server side)
##
#
# IO-Threads [*3*]
#
#volume brick
# type performance/io-threads
# subvolumes locks
# option thread-count 8
#end-volume
### End of performance translators
#
# TCP/IP server [*4*]
#
volume server
type protocol/server
subvolumes brick
option transport-type tcp
option transport.socket.bind-address 10.1.0.1 # FIXME
option transport.socket.listen-port 820
option transport.socket.nodelay on
option auth.addr.brick.allow 127.0.0.1,10.1.0.1,10.1.0.2,10.1.0.3
end-volume
#
# TCP/IP clients [*5*]
#
volume node1
type protocol/client
option remote-subvolume brick
option transport-type tcp/client
option remote-host 10.1.0.1
option remote-port 820
option transport.socket.nodelay on
end-volume
volume node2
type protocol/client
option remote-subvolume brick
option transport-type tcp/client
option remote-host 10.1.0.2
option remote-port 820
option transport.socket.nodelay on
end-volume
volume node3
type protocol/client
option remote-subvolume brick
option transport-type tcp/client
option remote-host 10.1.0.3
option remote-port 820
option transport.socket.nodelay on
end-volume
#
# Automatic File Replication Translator (AFR) [*6*]
#
# NOTE: node3 is the primary metadata node, so this one *must*
# be listed first in all volume specs! Also, node3 is the global
# favorite-child with the definitive file version if a conflict
# arises during self-healing...
#
volume afr
type cluster/replicate
subvolumes node3 node1 node2
option read-subvolume node2
option favorite-child node3
end-volume
##
## Performance translators (client side)
##
#
# IO-Threads [*7*]
#
#volume client-threads-1
# type performance/io-threads
# subvolumes afr
# option thread-count 8
#end-volume
#
# Write-Behind [*8*]
#
volume wb
type performance/write-behind
subvolumes afr
option cache-size 4MB
end-volume
#
# Read-Ahead [*9*]
#
#volume ra
# type performance/read-ahead
# subvolumes wb
# option page-count 2
#end-volume
#
# IO-Cache [*10*]
#
volume cache
type performance/io-cache
subvolumes wb
option cache-size 1024MB
option cache-timeout 60
end-volume
#
# Quick-Read for small files [*11*]
#
#volume qr
# type performance/quick-read
# subvolumes cache
# option cache-timeout 60
#end-volume
#
# Metadata prefetch [*12*]
#
#volume sp
# type performance/stat-prefetch
# subvolumes qr
#end-volume
#
# IO-Threads [*13*]
#
#volume