Eugen Leitl wrote:
On Wed, Sep 16, 2009 at 08:02:35PM +0300, Markus Kovero wrote:
It's possible to do 3-way (or more) mirrors too, so you may achieve better
redundancy than raidz2/3
I understand there's almost no additional performance penalty to raidz3
over raidz2 in terms of CPU load. Is that correct?
As far as I understand the z3 algorithms, the performance penalty is
very slightly higher than z2. I think it's reasonable to treat z1, z2,
and z3 as equal in terms of CPU load.
So SSDs for ZIL/L2ARC don't bring that much when used with raidz2/raidz3,
if I write a lot, at least, and don't access the cache very much, according
to some recent posts on this list.
Not true.
Remember: ZIL = write cache
L2ARC = read cache
So, if you have a write-heavy workload whose reads are infrequent or mostly
large and sequential, an L2ARC SSD doesn't make much sense. Main RAM should
suffice for the read cache.
Random reads aren't fast on RAIDZ, so a read cache is a good thing if
you are doing that kind of I/O. Similarly, random writes (particularly
small random writes) suck hard on RAIDZ, so a write cache is a
fabulous idea there.
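To put rough numbers on that (a back-of-the-envelope sketch only, using the
usual rule of thumb that each top-level vdev delivers roughly one disk's
worth of small random IOPS; the 80 IOPS per SATA drive figure is assumed,
purely illustrative):

    # Small random-I/O capacity scales with the number of top-level
    # vdevs, not the number of disks (rule-of-thumb sketch only).
    DISK_IOPS = 80            # assumed for a 7200rpm SATA drive
    disks = 24

    mirror_vdevs = disks // 2   # 12 x 2-way mirror
    raidz_vdevs  = disks // 8   # 3 x 8-drive raidz

    print("stripe of mirrors:", mirror_vdevs * DISK_IOPS, "random IOPS")  # ~960
    print("stripe of raidz:  ", raidz_vdevs * DISK_IOPS, "random IOPS")   # ~240

That gap is exactly why the read and write caches matter so much more on
RAIDZ pools.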
If you are doing very large sequential writes to a RAIDZ (any sort)
pool, then a write cache will likely be much less helpful. But
remember, very large means that you frequently exceed the size of the
SSD you've allocated for the ZIL. I'd have to run the numbers, but you
should still see a major performance improvement by using an SSD for the
ZIL, up to the point where your typical write load exceeds 10% of the size
of the SSD. Naturally, write-heavy workloads will be murder on an MLC or
hybrid SSD's life expectancy, though a large sequential-write-heavy load
will let the SSD perform better and last longer than a small random-write
load will.
A write SSD will help you up until you try to write to the SSD faster
than it can flush out its contents to actual disk. So, you need to take
into consideration exactly how much data is coming in, and the write
speed of your (non-SSD) disks. If you are continuously (and constantly)
exceeding the speed of your disks with incoming data, then SSDs won't
really help. You'll see some benefit up until the SSD fills up, then
performance will drop to what it would be if the SSD didn't exist.
Doing [very] rough calculations, let's say your SSD has a combined
read/write throughput of 200MB/s, and is 100GB in size. If your hard
drives can only do 50MB/s, then you can write up to 150MB/s to the SSD,
read 50MB/s back from the SSD, and write that 50MB/s to the disks. This
means that each second you add 100MB of data to the SSD that can't be
flushed out fast enough. At 100MB/s, it takes 1,000 seconds to fill
100GB. So, in about 17 minutes, you've completely filled the SSD, and
performance drops like a rock. There is a similar cliff problem around
IOPS.
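The same arithmetic as a tiny script (a sketch only; the 200MB/s, 100GB
and 50MB/s figures are just the illustrative ones above):

    # Rough model of the "SSD fills up" cliff described above.
    ssd_bw   = 200.0          # MB/s, combined read+write throughput of the SSD
    ssd_size = 100 * 1000.0   # MB (100GB)
    disk_bw  = 50.0           # MB/s the spinning disks behind it can absorb

    # Flushing consumes disk_bw of the SSD's bandwidth (reads), leaving
    # the rest for incoming writes; the difference never drains.
    incoming = ssd_bw - disk_bw      # 150 MB/s written to the SSD
    backlog  = incoming - disk_bw    # 100 MB/s piling up

    seconds = ssd_size / backlog
    print("SSD full after %.0f s (~%.0f minutes)" % (seconds, seconds / 60.0))
    # -> 1000 s, about 17 minutes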
How much drive space am I losing with mirrored pools versus raidz3? IIRC
in RAID 10 it's only 10% over RAID 6, which is why I went for RAID 10 in
my 14-drive SATA (WD RE4) setup.
Basic math says for N disks, you get N-3 disks' worth of space for a
RAIDZ3, and N/2 for a stripe of 2-way mirrors. N-3 > N/2 for all N > 6
(at N = 6 they're equal). But, remember, you'll generally need at least
one hot spare for a mirror setup, so really, the equation looks like this:
N-3 > (N/2) - 1 which means RAIDZ3 gives you more space for N > 4
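Or, tabulated quickly (same simplifications as above; the mirror-with-spare
column uses the N/2 - 1 figure):

    # Usable-disk counts from the formulas above.
    for n in range(4, 13):
        raidz3       = n - 3        # one raidz3 vdev
        mirror       = n // 2       # 2-way mirrors, no spare
        mirror_spare = n // 2 - 1   # 2-way mirrors, one hot spare
        print("N=%2d  raidz3=%2d  mirror=%2d  mirror+spare=%2d"
              % (n, raidz3, mirror, mirror_spare))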
Let's assume I want to fill a 24-drive Supermicro chassis with 1 TByte
WD Caviar Black or 2 TByte RE4 drives, and use 4x X25-M 80 GByte
2nd gen Intel consumer drives, mirrored, each pair as ZIL/L2ARC
for the 24 SATA drives behind them. Let's assume CPU is not an issue,
with dual-socket Nehalems and 24 GByte RAM or more. There are applications
packaged in Solaris containers running on the same box, however.
Remember to take a look at Richard's spreadsheet about drive errors and
the amount of time you can expect to go without serious issue. He's
also got good stuff about optimizing for speed vs space.
http://blogs.sun.com/relling/
Quick math for a 24-drive setup:
Scenario A: stripe of mirrors, plus global spares.
11 x 2-way mirror = 11 disks of data, plus 2
additional hot spares
Scenario B: stripe of raidz3, no global spares
3 x 8-drive RAIDZ3 (5 data + 3 parity drives) =
3 x 5 = 15 data drives, with a total of 9 internal "spares"
Thus, A gives you about 30% less disk space than B.
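Checking that arithmetic (just the two layouts above, nothing assumed
beyond them):

    disks = 24

    # Scenario A: 11 x 2-way mirror + 2 global hot spares
    a_usable = 11
    assert 11 * 2 + 2 == disks

    # Scenario B: 3 x 8-drive raidz3 (5 data + 3 parity each)
    b_usable = 3 * 5
    assert 3 * 8 == disks

    print("A usable disks:", a_usable)    # 11
    print("B usable disks:", b_usable)    # 15
    print("A is %.0f%% smaller than B" % (100.0 * (1 - float(a_usable) / b_usable)))
    # -> ~27%, i.e. "about 30%"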
Let's say the workload is mostly multiple streams (hundreds to thousands
simultaneously, some continuous, some bursty) each writing data
to the storage system. However, some few clients will be using database-like
queries to read, potentially on the entire data store.
With the above workload, is raidz2/raidz3 right out, and will I need mirrored
pools?
The database queries will definitely benefit from an L2ARC SSD - the size
of that SSD depends on exactly how much data the query has to check. If
it's just checking metadata (mod times, file sizes, permissions, etc.)
of lots of files, then you're probably good with a smaller SSD. If you
have to actually read large amounts of the streams, then you're pretty
well hosed, as your data set is far larger than any cache can hold.
How would you lay out the pools for above workload, assuming 24 SATA
drives/chassis (24-48 TBytes raw storage), and 80 GByte SSD each for ZIL/L2ARC
(is that too little? Would 160 GByte work better?)
Thanks lots.
I can't make recommendations about SSD size without much more specific
numbers about the actual workload.
Look at Richard's Raid Optimizer output for a 48-disk Thumper/Thor.
http://blogs.sun.com/relling/entry/sample_raidoptimizer_output
It should give you a good idea about IOPS and read/write speeds for
various configs.
Reading/Writing a large stream is a sequential operation. Bursty
read/write of a stream looks like random I/O. But more important is
the relative size of each stream's I/O: whether continuous or bursty,
the key characteristic is HOW MUCH data needs to be
written/read at once. Anything under 100k is definitely "random", and
anything over 10MB is "sequential" (as far as general performance
goes). Sizes in between make it depend on how much other stuff is
going on (i.e. having 100,000 streams each trying to write 1MB has a
different impact than 1,000 streams trying to write the same 1MB each).
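As a toy illustration of that rule of thumb (the 100k and 10MB thresholds
are just the ones quoted above; everything in between is a judgment call):

    def classify_io(size_bytes, active_streams=1):
        # Rough random-vs-sequential bucketing per the thresholds above.
        if size_bytes < 100 * 1024:
            return "random"
        if size_bytes > 10 * 1024 * 1024:
            return "sequential"
        # In-between sizes depend on how much else is going on.
        return "depends (%d streams active)" % active_streams

    print(classify_io(64 * 1024))                            # random
    print(classify_io(32 * 1024 * 1024))                     # sequential
    print(classify_io(1024 * 1024, active_streams=100000))   # depends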
Personally, I hate to use any form of SATA drive for a heavy random
write or read workload, even with an SSD. SAS disks perform so much
better.
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss