Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-03 Thread Vincent Fox
Thanks for your observations.

HOWEVER, I didn't pose the question

"How do I architect the HA and storage and everything for an email system?"

Our site, like many other data centers, has HA standards, politics, and all 
this other baggage that may lead a design to a certain point.  Thus our answer 
will be different from yours.  You can poke holes in my designs, I can poke 
holes in yours, and this could go on all day.

Considering I am adding a new server to a group of existing servers of similar 
design, we are not going to make radical ground-up redesign decisions at this 
time.  I can fiddle around in the margins with things like stripe-size.

I will point out AS I HAVE BEFORE that ZFS is not yet completely 
enterprise-ready in our view.  For example, in one commonly proposed amateurish 
(IMO) scenario, we would have 2 big JBOD units and mirror the drives between 
arrays.  This works fine if a drive fails or even if an array goes down.  BUT 
you are then left with a storage pool that must be serviced immediately, or a 
single additional drive failure will destroy the pool.  Or take a simple drive 
failure: which spare rolls in?  The one from the same array, or one from the 
other?  It seems a coin toss.  When it's a terabyte of email and 10K+ users, 
that's a big deal for some people, and we did our HA design such that multiple 
failures can occur with no service impact.  The performance may not be ideal, 
and the design may not seem ELEGANT to everyone.  Mixing HW controller RAID and 
ZFS mirroring is admittedly an odd hybrid design.  Our answer works for us, and 
that is all that matters.

So if someone has an idea of what stripe-size will work best for us that would 
be helpful.

Thanks!
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-02 Thread Al Hopper
On Fri, 30 Nov 2007, Vincent Fox wrote:

... reformatted ...
> We will be using Cyrus to store mail on 2540 arrays.
>
> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are 
> both connected to same host, and mirror and stripe the LUNs.  So a 
> ZFS RAID-10 set composed of 4 LUNs.  Multi-pathing also in use for 
> redundancy.
>
> My question is any guidance on best choice in CAM for stripe size in the LUNs?

[after reading the entire thread, where details of the storage-related 
application are presented piecemeal, and piecing together the details] I 
can't give you an answer or a recommendation, because the question 
does not make sense IMHO.

IOW: This is like saying: "I want to get from Dallas to LA as quickly 
as possible and have already decided that a bicycle would be the best 
mode of transport to use; can you tell me how I should configure the 
bicycle?"  The problem is that it's very unlikely that the bicycle is 
the correct solution, so recommending which bicycle config is correct 
is likely to provide very bad advice - and also to validate the 
supposition that the solution utilizing the bicycle is, indeed, the 
correct solution.

> Default is 128K right now, can go up to 512K, should we go higher?
>
> Cyrus stores mail messages as many small files, not big mbox files. 
> But there are so many layers in action here it's hard to know what 
> is best choice.

[again based on reading the entire thread and not an answer to the 
above paragraph]

It appears that the chosen solution is to use a stripe of two hardware 
RAID5 luns presented by a 2540 (please correct me if this is 
incorrect).  There are several issues with this proposal:

a) You're mixing solutions: Hardware RAID5 and ZFS.  Why?  All this 
does is introduce needless complexity and make it very difficult to 
troubleshoot issues with the storage subsystem - especially if the 
issue is performance related.  Also - how do you localize a fault 
condition that is caused by a 2540 RAID firmware bug?  How do you 
isolate performance issues caused by the interaction between the 
hardware RAID5 luns and ZFS?

b) You've chosen a stripe - despite Richard Elling's best advice 
(something like "friends don't let friends use stripes").  See 
Richard's blogs for a comparison of the reliability rates for different 
storage configurations.

c) For a mail storage subsystem a stripe seems totally wrong. 
Generally speaking, an email store consists of many small files - with 
occasional medium-sized files (due to attachments) and, less commonly, 
some large files - usually limited by the max message size defined by 
the MTA (a typical value is 10 MB - what is it in your case?).

d) ZFS, with its built-in volume manager, relies on having direct 
access to individual disks (JBOD).  A hardware RAID engine placed 
between ZFS and the actual disks is a "black box" as far as the ZFS 
volume manager is concerned - and ZFS can't possibly "understand" how 
various storage providers' "black boxes" will behave, especially when 
ZFS tells the "disk" to do something and the hardware RAID LUN lies to 
ZFS (for example, on synchronous writes).

e) You've presented no data in terms of typical iostat -xcnz 5 output 
- generalized over various times of the day where particular user data 
access patterns are known.  This information would allow us to give 
you some basic recommendations.  IOW - we need to know the basic 
requirements in terms of IOPS and average I/O transfer sizes.  BTW: 
Brendan Gregg's DTrace scripts will allow you to gather very detailed 
I/O usage data on the production system with no risk.

f) You have not provided any details of the 2540 config - except for 
the fact that it is "fully loaded" IIRC.  SAS disks?  10,000 RPM 
drives or 15K RPM drives?  Disk drive size?

g) You've provided no details of how the host is configured.  If you 
decide to deploy a ZFS-based system, the amount of installed RAM on 
the mailserver will have a *huge* impact on the actual load placed on 
the I/O subsystem.  In this regard, ZFS is your friend, as it'll cache 
almost _everything_, given enough RAM.  And DDR2 RAM is (arguably) 
less than $40 a gigabyte today - with 2 GB DIMMs having reached price 
parity with 2 * 1 GB DIMMs.

For example: if an end-user MUA is configured to poll the mailserver 
every 30 seconds to check whether new mail has arrived, and the 
mailserver has sufficient (cache) memory, then only the first request 
will require disk access; a large number of subsequent requests will 
be handled out of (cache) memory.
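
A quick back-of-envelope sketch (Python; the poll interval and cache hit 
rate are just assumed round numbers - substitute your own measurements):

  # Rough sketch with assumed numbers: how much of the MUA poll load a
  # large RAM cache can absorb on a 10k-user mailserver.
  users = 10_000
  poll_interval_s = 30        # assumed MUA polling interval
  cache_hit_rate = 0.95       # assumed, given "enough RAM" for caching

  polls_per_s = users / poll_interval_s
  disk_ios_per_s = polls_per_s * (1 - cache_hit_rate)
  print(f"{polls_per_s:.0f} polls/s, ~{disk_ios_per_s:.0f} hit disk")
  # -> 333 polls/s, ~17 hit disk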

h) Another observation: You've commented on the importance of system 
reliability because there are 10k users on the mailserver.  Whether 
you have 10 users or 10k users or 100k users is of no importance if 
you are considering system reliability (aka failure rates).  IOW - a 
system that is configured to a certain reliability requirement will be 
the same, regardless of the number of end users that rely on that 
system.  The number of concurren

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
> > That depends upon exactly what effect turning off the
> > ZFS cache-flush mechanism has.
> 
> The only difference is that ZFS won't send a
> SYNCHRONIZE CACHE command at the end of a transaction
> group (or ZIL write). It doesn't change the actual
> read or write commands (which are always sent as
> ordinary writes -- for the ZIL, I suspect that
> setting the FUA bit on writes rather than flushing
> the whole cache might provide better performance in
> some cases, but I'm not sure, since it probably
> depends what other I/O might be outstanding.)

It's a bit difficult to imagine a situation where flushing the entire cache 
unnecessarily just to force the ZIL would be preferable - especially if ZFS 
makes any attempt to cluster small transaction groups together into larger 
aggregates (in which case you'd like to let them continue to accumulate until 
the aggregate is large enough to be worth forcing to disk in a single I/O).

> 
> > Of course, if that's true then disabling cache-flush
> > should have no noticeable effect on performance (the
> > controller just answers "Done" as soon as it receives
> > a cache-flush request, because there's no applicable
> > cache to flush), so you might as well just leave it
> > enabled.
> 
> The problem with SYNCHRONIZE CACHE is that its
> semantics aren't quite defined as precisely as one
> would want (until a fairly recent update). Some
> controllers interpret it as "push all data to disk"
> even if they have battery-backed NVRAM.

That seems silly, given that for most other situations they consider that data 
in NVRAM is equivalent to data on the platter.  But silly or not, if that's the 
way some arrays interpret the command, then it does have performance 
implications (and the other reply I just wrote would be unduly alarmist in such 
cases).

Thanks for adding some actual experience with the hardware to what had been a 
purely theoretical discussion.

- bill
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
> Bill, you have a long-winded way of saying "I don't
> know".  But thanks for elucidating the possibilities.

Hmmm - I didn't mean to be *quite* as noncommittal as that suggests:  I was 
trying to say (without intending to offend) "FOR GOD'S SAKE, MAN:  TURN IT BACK 
ON!", and explaining why (i.e., that either disabling it made no difference and 
thus it might as well be enabled, or that if indeed it made a difference that 
indicated that it was very likely dangerous).

- bill
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread Anton B. Rang
> That depends upon exactly what effect turning off the
> ZFS cache-flush mechanism has.

The only difference is that ZFS won't send a SYNCHRONIZE CACHE command at the 
end of a transaction group (or ZIL write). It doesn't change the actual read or 
write commands (which are always sent as ordinary writes -- for the ZIL, I 
suspect that setting the FUA bit on writes rather than flushing the whole cache 
might provide better performance in some cases, but I'm not sure, since it 
probably depends what other I/O might be outstanding.)

> Of course, if that's true then disabling cache-flush
> should have no noticeable effect on performance (the
> controller just answers "Done" as soon as it receives
> a cache-flush request, because there's no applicable
> cache to flush), so you might as well just leave it
> enabled.

The problem with SYNCHRONIZE CACHE is that its semantics aren't quite defined 
as precisely as one would want (until a fairly recent update). Some controllers 
interpret it as "push all data to disk" even if they have battery-backed NVRAM. 
In this case, you lose quite a lot of performance, and you gain only a modicum 
of reliability (at least in the case of larger RAID systems, which will 
generally use their battery to flush NVRAM to disk if power is lost).

There's a bit defined now that can be used to say "Only flush volatile caches, 
it's OK if data is in non-volatile cache." But not many controllers support 
this yet, and Solaris didn't as of last year -- not sure if it's been added yet.
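
For reference, a rough sketch of what that bit looks like in the CDB - 
assuming the SBC-2 layout, where SYNC_NV is bit 2 of byte 1 of 
SYNCHRONIZE CACHE(10) (check the spec and your controller docs before 
relying on this):

  # Sketch only: build a SYNCHRONIZE CACHE(10) CDB with SYNC_NV set,
  # assuming opcode 0x35 and SYNC_NV = bit 2 of byte 1 (per SBC-2).
  SYNC_CACHE_10 = 0x35
  SYNC_NV = 0x04   # hint: flushing volatile caches is sufficient
  IMMED   = 0x02   # return status before the flush completes

  def sync_cache_cdb(lba=0, blocks=0, sync_nv=True, immed=False):
      flags = (SYNC_NV if sync_nv else 0) | (IMMED if immed else 0)
      return bytes([SYNC_CACHE_10, flags,
                    (lba >> 24) & 0xFF, (lba >> 16) & 0xFF,
                    (lba >> 8) & 0xFF, lba & 0xFF,
                    0x00,                          # group number
                    (blocks >> 8) & 0xFF, blocks & 0xFF,
                    0x00])                         # control byte

  print(sync_cache_cdb().hex())   # lba=0, blocks=0 -> whole medium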

-- Anton
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread Vincent Fox
Bill, you have a long-winded way of saying "I don't know".  But thanks for 
elucidating the possibilities.
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
> I think the point of dual battery-backed controllers is
> that data should never be lost.  Am I wrong?

That depends upon exactly what effect turning off the ZFS cache-flush mechanism 
has.  If all data is still sent to the controllers as 'normal' disk writes, and 
they have no concept of, say, using *volatile* RAM to store data when higher 
levels enable the "disk's" write-back cache, nor any inclination to pass such 
requests along blithely to their underlying disks (which of course would subvert 
any controller-level guarantees, since the disks can evict data from their own 
write-back caches as soon as the disk write request completes), then presumably, 
as long as they get the data, they guarantee that it will eventually get to the 
platters and the ZFS cache-flush mechanism is a no-op.

Of course, if that's true then disabling cache-flush should have no noticeable 
effect on performance (the controller just answers "Done" as soon as it 
receives a cache-flush request, because there's no applicable cache to flush), 
so you might as well just leave it enabled.  Conversely, if you found that 
disabling it *did* improve performance, then it probably opened up a 
significant reliability hole.

- bill
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread Vincent Fox
> From Neil's comment in the blog entry that you
> referenced, that sounds *very* dicey (at least by
> comparison with the level of redundancy that you've
> built into the rest of your system) - even if you
> have rock-solid UPSs (which have still been known to
> fail).  Allowing a disk to lie to higher levels of
> the system (if indeed that's what you did by 'turning
> off cache flush') by saying that it's completed a
> write when it really hasn't is usually a very bad
> idea, because those higher levels really *do* make
> important assumptions based on that information.

I think the point of dual battery-backed controllers is that
data should never be lost.  Perhaps I don't know enough.
Is it that bad?
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
> We are running Solaris 10u4 -- is the log option in
> there?

Someone more familiar with the specifics of the ZFS releases will have to 
answer that.

> 
> If this ZIL disk also goes dead, what is the failure
> mode and recovery option then?

The ZIL should at a minimum be mirrored.  But since that won't give you as much 
redundancy as your main pool has, perhaps you should create a small 5-disk 
RAID-0 LUN sharing the disks of each RAID-5 LUN and mirror the log to all four 
of them:  even if one entire array box is lost, the other will still have a 
mirrored ZIL and all the RAID-5 LUNs will be the same size (not that I'd expect 
a small variation in size between the two pairs of LUNs to be a problem that 
ZFS couldn't handle:  can't it handle multiple disk sizes in a mirrored pool as 
long as each individual *pair* of disks matches?).

Having 4 copies of the ZIL on disks shared with the RAID-5 activity will 
compromise the log's performance, since each log write won't complete until the 
slowest copy finishes (i.e., congestion in either of the RAID-5 pairs could 
delay it).  It still should usually be faster than just throwing the log in 
with the rest of the RAID-5 data, though.

Then again, I see from your later comment that you have the same questions that 
I had about whether the results reported in 
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on suggest that having 
a ZIL may not help much anyway (at least for your specific workload:  I can 
imagine circumstances in which performance of small, synchronous writes might 
be more critical than other performance, in which case separating them out 
could be useful).

> 
> We did get the 2540 fully populated with 15K 146-gig
> drives.  With 12 disks, and wanting to have at least
> ONE hot global spare in each array, and needing to
> keep LUNs the same size, you end up doing 2 5-disk
> RAID-5 LUNs and 2 hot spares in each array.  Not that
> I really need 2 spares I just didn't see any way to
> make good use of an extra disk in each array.  If we
> wanted to dedicate them instead to this ZIL need,
> what is best way to go about that?

As I noted above, you might not want to have less redundancy in the ZIL than 
you have in the main pool:  while the data in the ZIL is only temporary (until 
it gets written back to the main pool), there's a good chance that there will 
*always* be *some* data in it, so if you lost one array box entirely at least 
that small amount of data would be at the mercy of any failure on the log disk 
that made any portion of the log unreadable.

Now, if you could dedicate all four spare disks to the log (mirroring it 4 
ways) and make each box understand that it was OK to steal one of them to use 
as a hot spare should the need arise, that might give you reasonable protection 
(since then any increased exposure would only exist until the failed disk was 
manually replaced - and normally the other box would still hold two copies as 
well).  But I have no idea whether the box provides anything like that level of 
configurability.

...

> Hundreds of POP and IMAP user processes coming and
> going from users reading their mail.  Hundreds more
> LMTP processes from mail being delivered to the Cyrus
> mail-store.

And with 10K or more users a *lot* of parallelism in the workload - which is 
what I assumed given that you had over 1 TB of net email storage space (but I 
probably should have made that assumption more explicit, just in case it was 
incorrect).

> Sometimes writes predominate over reads,
> depends on time of day whether backups are running,
> etc.  The servers are T2000 with 16 gigs RAM so no
> shortage of room for ARC cache. I have turned off
> cache flush also pursuing performance.

From Neil's comment in the blog entry that you referenced, that sounds *very* 
dicey (at least by comparison with the level of redundancy that you've built 
into the rest of your system) - even if you have rock-solid UPSs (which have 
still been known to fail).  Allowing a disk to lie to higher levels of the 
system (if indeed that's what you did by 'turning off cache flush') by saying 
that it's completed a write when it really hasn't is usually a very bad idea, 
because those higher levels really *do* make important assumptions based on 
that information.

- bill
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread Vincent Fox
> Sounds good so far:  lots of small files in a largish
> system with presumably significant access parallelism
> makes RAID-Z a non-starter, but RAID-5 should be OK,
> especially if the workload is read-dominated.  ZFS
> might aggregate small writes such that their
> performance would be good as well if Cyrus doesn't
> force them to be performed synchronously (and ZFS
> doesn't force them to disk synchronously on file
> close); even synchronous small writes could perform
> well if you mirror the ZFS small-update log:  flash -
> at least the kind with decent write performance -
> might be ideal for this, but if you want to steer
> clear of a specialized configuration just carving one
> small LUN for mirroring out of each array (you could
> use a RAID-0 stripe on each array if you were
> compulsive about keeping usage balanced; it would be
> nice to be able to 'center' it on the disks, but
> probably not worth the management overhead unless the
> array makes it easy to do so) should still offer a
> noticeable improvement over just placing the ZIL on
> the RAID-5 LUNs.

I'm not sure I understand you here.  I suppose I need to read
up on the ZIL option.  We are running Solaris 10u4, not OpenSolaris.

Can I set up a disk in each 2540 array for this ZIL, and then mirror them 
such that if one array goes down I'm not dead?  If this ZIL disk also goes 
dead, what is the failure mode and recovery option then?

We did get the 2540 fully populated.  With 12 disks, and wanting to have at 
least ONE hot global spare in each array, and needing to keep LUNs the same 
size, you end up doing 2 5-disk RAID-5 LUNs and 2 hot spares in each array.  
Not that I really need 2 spares; I just didn't see any way to make good use of 
an extra disk in each array.  If we wanted to dedicate them instead to this ZIL 
need, what is the best way to go about that?  Our current setup, to be specific:

{cyrus3-1:vf5:136} zpool status
  pool: ms11
 state: ONLINE
 scrub: none requested
config:

NAME   STATE READ WRITE CKSUM
ms11   ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c6t600A0B800038ACA002AB47504368d0  ONLINE   0 0 0
c6t600A0B800038A0440251475045D1d0  ONLINE   0 0 0
  mirror   ONLINE   0 0 0
c6t600A0B800038A1CF02994750442Fd0  ONLINE   0 0 0
c6t600A0B800038A3C4028447504628d0  ONLINE   0 0 0

errors: No known data errors

> By 'stripe size' do you mean the size of the entire
> stripe (i.e., your default above reflects 32 KB on
> each data disk, plus a 32 KB parity segment) or the
> amount of contiguous data on each disk (i.e., your
> default above reflects 128 KB on each data disk for a
> total of 512 KB in the entire stripe, exclusive of
> the 128 KB parity segment)?

I'm going from the pulldown menu choices in CAM 6.0 for
the 2540 arrays, which are currently 128K and only go
up to 512K.  I'll have to pull up the interface again when I am
at work, but I think it was called stripe size, and it referred to values
the 2540 firmware was assigning to the 5-disk RAID-5 sets.

> If the former, by all means increase it to 512 KB:
> this will keep the largest ZFS block on a single
> disk (assuming that ZFS aligns them on 'natural'
> boundaries) and help read-access parallelism
> significantly in large-block cases (I'm guessing
> that ZFS would use small blocks for small files but
> still quite possibly use large blocks for its
> metadata).  Given ZFS's attitude toward multi-block
> on-disk contiguity there might not be much benefit
> in going to even larger stripe sizes, though it
> probably wouldn't hurt noticeably either as long as
> the entire stripe (ignoring parity) didn't exceed 4
> - 16 MB in size (all the above numbers assume the 4
>  + 1 stripe configuration that you described).
> 
> In general, having less than 1 MB per-disk stripe
> segments doesn't make sense for *any* workload:  it
> only takes 10 - 20 milliseconds to transfer 1 MB from
> a contemporary SATA drive (the analysis for
> high-performance SCSI/FC/SAS drives is similar, since
> both bandwidth and latency performance improve),
> which is comparable to the 12 - 13 ms. that it takes
> on average just to position to it - and you can still
> stream data at high bandwidths in parallel from the
> disks in an array as long as you have a client buffer
> as large in MB as the number of disks you need to
> stream from to reach the required bandwidth (you want
> 1 GB/sec?  no problem:  just use a 10 - 20 MB buffer
> and stream from 10 - 20 disks in parallel).  Of
> course, this assumes that higher software layers
> organize data storage to provide that level of
> contiguity to leverage...

Hundreds of POP and IMAP user processes coming and going from users reading 
their mail. 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread Vincent Fox
> On Dec 1, 2007 7:15 AM, Vincent Fox
> 
> Any reason why you are using a mirror of raid-5
> lun's?
> 
> I can understand that perhaps you want ZFS to be in
> control of
> rebuilding broken vdev's, if anything should go wrong
> ... but
> rebuilding RAID-5's seems a little over the top.

Because the decision of our technical leads was that a straight
ZFS RAID-10 set made up of individual disks from the 2540 was
more risky.  A double-disk failure in a mirror pair would hose the
pool, and when the pool contains email for >10K people that was not
acceptable.  Another possibility: one of the arrays goes offline, so
you are now running a RAID-0 stripe set; then a single disk fails
and you are again dead.

The setup we have can survive multiple failures,
and we have seen enough weird events in our careers that
we decided to do this.  YMMV.  Let's move on; I just wanted
to describe our setup, not start an argument about it.


> How about running a ZFS mirror over RAID-0 luns? Then
> again, the
> downside is that you need intervention to fix a LUN
> after a disk goes
> boom! But you don't waste all that space :)
> 
> PS: It would be nice to know what the LSI firmware
> does (after 15
> years of evolution) to writes into the controller...
> it might have
> been better to buy JOBD's ... I see Sun will be
> releasing some soon
> (rumour?)

A guy in our group exported the disks as LUNs, by the way,
and ran Bonnie++; the results were a little better for a
straight RAID-10 set of all disks, but not enough better
to tip the balance towards it.  Perhaps not the best test,
but it's what we had time to do.
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
> Hi Bill,

...

> >   lots of small files in a largish system with presumably
> > significant access parallelism makes RAID-Z a non-starter,
>
> Why does "lots of small files in a largish system with
> presumably significant access parallelism makes RAID-Z a
> non-starter"?
> thanks,
> max

Every ZFS block in a RAID-Z system is split across the N + 1 disks in a stripe 
- so not only do N + 1 disks get written for every block update, but N disks 
get *read* on every block *read*.

Normally, small files can be read in a single I/O request to one disk (even in 
conventional parity-RAID implementations).  RAID-Z requires N I/O requests 
spread across N disks, so for parallel-access reads to small files RAID-Z 
provides only about 1/Nth the throughput of conventional implementations unless 
the disks are sufficiently lightly loaded that they can absorb the additional 
load that RAID-Z places on them without reducing throughput commensurately.
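
To put rough numbers on it (Python sketch; the per-disk IOPS figure is just 
an assumed value for a 15K RPM drive, and the geometry is the 4 + 1 layout 
discussed in this thread):

  # Rough model with assumed numbers: small-file random reads on RAID-Z
  # vs. a conventional layout where one small read touches one disk.
  disks_per_group = 5        # 4 data + 1 parity
  per_disk_iops = 250        # assumed for a 15K RPM drive

  # Conventional RAID-5 / mirror: one small read = one I/O on one disk.
  conventional_reads = disks_per_group * per_disk_iops

  # RAID-Z: every block read touches all N data disks in the stripe.
  data_disks = disks_per_group - 1
  raidz_reads = disks_per_group * per_disk_iops / data_disks

  print(conventional_reads, round(raidz_reads))   # 1250 vs 312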

- bill
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread [EMAIL PROTECTED]
Hi Bill,
can you guess? wrote:
>> We will be using Cyrus to store mail on 2540 arrays.
>>
>> We have chosen to build 5-disk RAID-5 LUNs in 2
>> arrays which are both connected to same host, and
>> mirror and stripe the LUNs.  So a ZFS RAID-10 set
>> composed of 4 LUNs.  Multi-pathing also in use for
>> redundancy.
>> 
>
> Sounds good so far:  lots of small files in a largish system with presumably 
> significant access parallelism makes RAID-Z a non-starter,
Why does "lots of small files in a largish system with presumably 
significant access parallelism makes RAID-Z a non-starter"?
thanks,
max



Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
> Any reason why you are using a mirror of raid-5
> lun's?

Some people aren't willing to run the risk of a double failure - especially 
when recovery from a single failure may take a long time.  E.g., if you've 
created a disaster-tolerant configuration that separates your two arrays and a 
fire completely destroys one of them, you'd really like to be able to run the 
survivor without worrying too much until you can replace its twin (hence each 
must be robust in its own right).

The above situation is probably one reason why 'RAID-6' and similar approaches 
(like 'RAID-Z2') haven't generated more interest:  if continuous on-line access 
to your data is sufficiently critical to consider them, then it's also probably 
sufficiently critical to require such a disaster-tolerant approach (which 
dual-parity RAIDs can't address).

It would still be nice to be able to recover from a bad sector on the single 
surviving site, of course, but you don't necessarily need full-blown RAID-6 for 
that:  you can quite probably get by with using large blocks and appending a 
private parity sector to them (maybe two private sectors just to accommodate a 
situation where a defect hits both the last sector in the block and the parity 
sector that immediately follows it; it would also be nice to know that the 
block size is significantly smaller than a disk track size, for similar 
reasons).  This would, however, tend to require file-system involvement such 
that all data was organized into such large blocks:  otherwise, all writes for 
smaller blocks would turn into read/modify/writes.
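
A minimal sketch of the private-parity-sector idea (purely illustrative 
Python - a single XOR parity sector appended to a large block, so one bad 
sector within the block can be rebuilt; a real implementation would need 
rather more than this):

  # Illustrative only: append one XOR parity sector to a large block.
  SECTOR = 512

  def add_parity(block: bytes) -> bytes:
      parity = bytearray(SECTOR)
      for i in range(0, len(block), SECTOR):
          for j in range(SECTOR):
              parity[j] ^= block[i + j]
      return block + bytes(parity)

  def recover_sector(block_plus_parity: bytes, bad: int) -> bytes:
      # XOR of all the *other* sectors (parity included) rebuilds the bad one.
      out = bytearray(SECTOR)
      for k in range(len(block_plus_parity) // SECTOR):
          if k != bad:
              s = block_plus_parity[k * SECTOR:(k + 1) * SECTOR]
              for j in range(SECTOR):
                  out[j] ^= s[j]
      return bytes(out)

  blk = bytes(range(256)) * 16                     # a 4 KB "large block"
  assert recover_sector(add_parity(blk), 3) == blk[3 * SECTOR:4 * SECTOR]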

Panasas (I always tend to put an extra 's' into that name, and to judge from 
Google so do a hell of a lot of other people:  is it because of the resemblance 
to 'parnassas'?) has been crowing about something that it calls 'tiered parity' 
recently, and it may be something like the above.

...

> How about running a ZFS mirror over RAID-0 luns? Then
> again, the
> downside is that you need intervention to fix a LUN
> after a disk goes
> boom! But you don't waste all that space :)

'Wasting' 20% of your disk space (in the current example) doesn't seem all that 
alarming - especially since you're getting more for that expense than just 
faster and more automated recovery if a disk (or even just a sector) fails.

- bill
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-12-01 Thread can you guess?
> We will be using Cyrus to store mail on 2540 arrays.
> 
> We have chosen to build 5-disk RAID-5 LUNs in 2
> arrays which are both connected to same host, and
> mirror and stripe the LUNs.  So a ZFS RAID-10 set
> composed of 4 LUNs.  Multi-pathing also in use for
> redundancy.

Sounds good so far:  lots of small files in a largish system with presumably 
significant access parallelism makes RAID-Z a non-starter, but RAID-5 should be 
OK, especially if the workload is read-dominated.  ZFS might aggregate small 
writes such that their performance would be good as well if Cyrus doesn't force 
them to be performed synchronously (and ZFS doesn't force them to disk 
synchronously on file close); even synchronous small writes could perform well 
if you mirror the ZFS small-update log:  flash - at least the kind with decent 
write performance - might be ideal for this, but if you want to steer clear of 
a specialized configuration just carving one small LUN for mirroring out of 
each array (you could use a RAID-0 stripe on each array if you were compulsive 
about keeping usage balanced; it would be nice to be able to 'center' it on the 
disks, but probably not worth the management overhead unless the array makes it 
easy to do so) should still offer a noticeable improvement over just placing 
the ZIL on the RAID-5 LUNs.

> 
> My question is any guidance on best choice in CAM for
> stripe size in the LUNs?
> 
> Default is 128K right now, can go up to 512K, should
> we go higher?

By 'stripe size' do you mean the size of the entire stripe (i.e., your default 
above reflects 32 KB on each data disk, plus a 32 KB parity segment) or the 
amount of contiguous data on each disk (i.e., your default above reflects 128 
KB on each data disk for a total of 512 KB in the entire stripe, exclusive of 
the 128 KB parity segment)?

If the former, by all means increase it to 512 KB:  this will keep the largest 
ZFS block on a single disk (assuming that ZFS aligns them on 'natural' 
boundaries) and help read-access parallelism significantly in large-block cases 
(I'm guessing that ZFS would use small blocks for small files but still quite 
possibly use large blocks for its metadata).  Given ZFS's attitude toward 
multi-block on-disk contiguity there might not be much benefit in going to even 
larger stripe sizes, though it probably wouldn't hurt noticeably either as long 
as the entire stripe (ignoring parity) didn't exceed 4 - 16 MB in size (all the 
above numbers assume the 4 + 1 stripe configuration that you described).
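
To make the two readings concrete (Python sketch for the 4 + 1 LUNs; 
"segment" here means the contiguous chunk on one data disk):

  # What the CAM default of 128K means under the two possible readings,
  # for a 4 + 1 RAID-5 LUN and a 128 KB ZFS block (sketch only).
  data_disks = 4
  zfs_block_kb = 128

  # Reading 1: "stripe size" = the whole data stripe.
  seg_at_128 = 128 // data_disks     # 32 KB per disk
  seg_at_512 = 512 // data_disks     # 128 KB per disk
  print(zfs_block_kb <= seg_at_128)  # False: a 128 KB block spans 4 disks
  print(zfs_block_kb <= seg_at_512)  # True: raising to 512K keeps it on one

  # Reading 2: "stripe size" = the per-disk segment; 128K already holds a
  # whole ZFS block, and the full data stripe is 128 * 4 = 512 KB.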

In general, having less than 1 MB per-disk stripe segments doesn't make sense 
for *any* workload:  it only takes 10 - 20 milliseconds to transfer 1 MB from a 
contemporary SATA drive (the analysis for high-performance SCSI/FC/SAS drives 
is similar, since both bandwidth and latency performance improve), which is 
comparable to the 12 - 13 ms. that it takes on average just to position to it - 
and you can still stream data at high bandwidths in parallel from the disks in 
an array as long as you have a client buffer as large in MB as the number of 
disks you need to stream from to reach the required bandwidth (you want 1 
GB/sec?  no problem:  just use a 10 - 20 MB buffer and stream from 10 - 20 
disks in parallel).  Of course, this assumes that higher software layers 
organize data storage to provide that level of contiguity to leverage...
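
The arithmetic behind those numbers (Python; the 50 - 100 MB/s sustained 
rate and ~12.5 ms average positioning time are the assumed drive figures 
from the paragraph above):

  # Back-of-envelope: 1 MB per-disk chunks vs. positioning cost, and how
  # many disks (and how big a client buffer) 1 GB/s of streaming needs.
  seek_ms = 12.5                       # assumed average positioning time
  for mb_per_s in (50, 100):           # assumed sustained transfer rate
      transfer_ms = 1000 / mb_per_s    # time to move 1 MB
      disks = 1000 / mb_per_s          # disks needed for 1 GB/s
      print(f"{mb_per_s} MB/s: 1 MB takes {transfer_ms:.0f} ms "
            f"(vs {seek_ms} ms to position), {disks:.0f} disks and a "
            f"{disks:.0f} MB buffer for 1 GB/s")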

- bill
 
 

Re: [zfs-discuss] Best stripe-size in array for ZFS mail storage?

2007-11-30 Thread Louwtjie Burger
On Dec 1, 2007 7:15 AM, Vincent Fox <[EMAIL PROTECTED]> wrote:
> We will be using Cyrus to store mail on 2540 arrays.
>
> We have chosen to build 5-disk RAID-5 LUNs in 2 arrays which are both 
> connected to same host, and mirror and stripe the LUNs.  So a ZFS RAID-10 set 
> composed of 4 LUNs.  Multi-pathing also in use for redundancy.

Any reason why you are using a mirror of RAID-5 LUNs?

I can understand that perhaps you want ZFS to be in control of
rebuilding broken vdev's, if anything should go wrong ... but
rebuilding RAID-5's seems a little over the top.

How about running a ZFS mirror over RAID-0 luns? Then again, the
downside is that you need intervention to fix a LUN after a disk goes
boom! But you don't waste all that space :)

PS: It would be nice to know what the LSI firmware does (after 15
years of evolution) to writes into the controller... it might have
been better to buy JBODs ... I see Sun will be releasing some soon
(rumour?)