Re: Filesystem, RAID Question

2008-10-30 Thread Matthew Seaman

Jeremy Chadwick wrote:

> The RAID card itself may have a BBU, so during loss of power the card
> will attempt to flush any cached data *on the card* to disk... except
> the PC (including hard disks -- unless they're powered from some other
> source) is already down/offline by this point.  And let's not forget
> that the OS/kernel is also gone, which means any writes which were
> sitting in cached memory in the kernel are lost as well.
>
> For some reason people think that a H/W RAID card with a BBU guarantees
> data integrity (keyword: guarantees).  I'm still trying to understand
> why people think that.


Pending writes in the BBU-backed RAM will be completed when the machine
reboots.  The assumption is that the machine will be rebooted within a
few hours, before the battery runs down. Given that limitation, data
in BBU-backed cache can be regarded as committed to permanent (or more
accurately: /persistent/) storage.

Data in OS-level caches will be lost, yes.  But this is the point of
softupdates.  It reorders the instructions for writing data/metadata
to disk so that the data on disk remains consistent -- you may lose
some data in transit, but your disk contents will still be consistent.
Journalling achieves a similar effect in a different way -- recording
a changelog onto non-volatile storage which can later be played out to
bring the stored data into the correct state.
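
To make the ordering idea concrete, here is a toy sketch in Python.  It
has nothing to do with the real UFS code: the block names ("inode",
"dirent") and both helper functions are invented for illustration.  It
only shows that writing blocks in dependency order leaves the disk
consistent whenever the crash happens, while the same writes in the
wrong order do not.

```python
# Toy model (not FreeBSD code): the "disk" is the set of blocks written
# so far.  deps maps each block to the block it depends on, or None.

def consistent(on_disk, deps):
    """True if every block on disk has its dependency on disk too."""
    return all(deps[b] is None or deps[b] in on_disk for b in on_disk)

def survives_any_crash(write_order, deps):
    """Check consistency after a crash at every possible point."""
    on_disk = set()
    for block in write_order:
        on_disk.add(block)          # this write reaches the platter
        if not consistent(on_disk, deps):
            return False            # a crash right here leaves damage
    return True

# A directory entry must not reach disk before the inode it points to.
deps = {"inode": None, "dirent": "inode"}

print(survives_any_crash(["inode", "dirent"], deps))  # True
print(survives_any_crash(["dirent", "inode"], deps))  # False
```

The second call is the case people worry about when something below
the filesystem shuffles the order of writes.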

However, given that any access to rotating magnetic media is going to
take several orders of magnitude longer than access to main RAM, there
are always going to be uncommitted filesystem changes in RAM that will
be lost if the machine suddenly loses power or otherwise fails.  Until
there is a persistent storage medium with something like the timing
characteristics of RAM, that effect is simply unavoidable.  The only
strategy you can employ is to provide uninterruptible power supplies,
and choose hardware and OS wisely, so that unexpected system failures
are minimized.  Sometimes you can benefit from having multiple machines
and multiple copies of your data, but this is not always possible.


Cheers,

Matthew

--
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
                                                  Kent, CT11 9PW





Re: Filesystem, RAID Question

2008-10-30 Thread Jeremy Chadwick
On Thu, Oct 30, 2008 at 10:05:43PM -0500, Rich Winkel wrote:
> On Thu, Oct 30, 2008 at 07:33:47PM -0700, Jeremy Chadwick wrote:
> > > One of the main functions of softupdates is to order disk updates in such
> > > a way that the fs organizational integrity is maintained at all times.
> > 
> > And we've recently found that this is simply not the case.  The benefits
> > of SU are applicable to very specific environments; desktop PCs are the
> > main ones, offering great performance improvements there.
> 
> Thanks for pointing that out.  Is this an acknowledged bug in SU?  Is it
> still a problem in 7.0?

It's a problem in every release.  I believe it's more of an engineering
oversight; I don't know if it's truly fixable.  I guess there are some
kinds of filesystem errors which can't safely be fixed automatically.

There's no harm in background_fsck="no", but the reason that's not the
default is that most people want their system back up and working
immediately after a crash (don't want to wait for fsck to finish).

It's a personal choice: I would prefer the system stay down longer due
to a thorough fsck than have it come back up and still have some
underlying corruption that's being silenced.

The thread is below.  It is quite long and complex, so be sure to have
coffee or water on hand.

http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/thread.html#45211

I kind of consider all of this "water under the bridge" now that ZFS is
available, and addresses all of these problems quite effectively.

> > > Of course this doesn't protect against actual sector corruption, but if
> > > the disk is between writes at the time it loses power, the fs structure
> > > is supposed to still be internally consistent.  At least that's my
> > > understanding of it.
> > 
> > Yep, that's how I understand it as well.  But this is a different topic
> > than what we were discussing 2-3 replies ago, talking about how a RAID
> > controller with cache + BBU is sufficient to guarantee data
> > integrity even when power is lost -- that's incorrect.
> 
> The reason I brought it up is that it occurred to me that if the hardware
> raid card reorders disk i/o it would mess with SU's ordering.  I wonder
> whether this was happening in the previous thread you referred to
> concerning fsck?

Quite honestly, I don't understand the technical details of RAID card
I/O re-ordering vs. softupdates to be able to state "yeah, that's a
problem".  Someone much more familiar with the intricacies will have
to comment on this, and I believe freebsd-fs would be a better group
for that discussion, not -questions.

But I assume that if it was a problem, we'd be seeing a *very* large
number of business customers (making the assumption they're the ones
using hardware RAID cards) complaining regularly and loudly.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Filesystem, RAID Question

2008-10-30 Thread Rich Winkel
On Thu, Oct 30, 2008 at 07:33:47PM -0700, Jeremy Chadwick wrote:
> > One of the main functions of softupdates is to order disk updates in such
> > a way that the fs organizational integrity is maintained at all times.
> 
> And we've recently found that this is simply not the case.  The benefits
> of SU are applicable to very specific environments; desktop PCs are the
> main ones, offering great performance improvements there.

Thanks for pointing that out.  Is this an acknowledged bug in SU?  Is it
still a problem in 7.0?

> > Of course this doesn't protect against actual sector corruption, but if
> > the disk is between writes at the time it loses power, the fs structure
> > is supposed to still be internally consistent.  At least that's my
> > understanding of it.
> 
> Yep, that's how I understand it as well.  But this is a different topic
> than what we were discussing 2-3 replies ago, talking about how a RAID
> controller with cache + BBU is sufficient to guarantee data
> integrity even when power is lost -- that's incorrect.

The reason I brought it up is that it occurred to me that if the hardware
raid card reorders disk i/o it would mess with SU's ordering.  I wonder
whether this was happening in the previous thread you referred to
concerning fsck?

Rich



Re: Filesystem, RAID Question

2008-10-30 Thread Jeremy Chadwick
On Thu, Oct 30, 2008 at 08:41:59PM -0500, Rich Winkel wrote:
> On Thu, Oct 30, 2008 at 04:38:49PM -0700, Jeremy Chadwick wrote:
> > On Thu, Oct 30, 2008 at 06:12:07PM -0500, Rich Winkel wrote:
> > > Doesn't hw.ata.wc affect only card-level caching?
> > 
> > hw.ata.wc causes the ata(4) subsystem to disable write caching on all
> > disks attached to the subsystem.  It does not disable card features.
> 
> I mean, the individual disks are invisible to the OS unless the
> card's driver (and the card itself) specifically supports it.

Correct.

With regard to ATA: ata(4) has support for pass-through on some RAID
cards, such as Promise.  FreeBSD will see the individual disks (e.g.
ad4, ad6, etc.) as well as the array (e.g. ar0).

With regard to SCSI: pass(4) provides this capability.  I don't think
in the case of SCSI that the disks will appear in FreeBSD (e.g. da0)
though.  Instead, pass(4) can be used to query individual disks on an
array, for example via smartctl's -d flag (-d 3ware, -d marvell, etc.).

In both cases (ATA and SCSI), the card itself has to support
pass-through, *and* the FreeBSD driver has to have code to allow for
such, otherwise no go.

> > There's also the below PR, which extends atacontrol to permit disabling
> > and enabling write caching on a per-disk basis.
> > 
> > http://www.freebsd.org/cgi/query-pr.cgi?pr=127717
> 
> But not on disks which are behind hardware raid cards, correct?

Correct.  For FreeBSD to be able to disable write caching on disks
behind a RAID controller, one of two things is needed:

1) Pass-through support (see above),
2) A native CLI program that interfaces with the card directly (usually
   written by the vendor).

Sadly, #2 appears to be the most common choice when a RAID card is used.
I say "sadly" because many vendors do not support FreeBSD, and only
offer Linux CLI programs -- requiring an administrator to install Linux
emulation, Linux libraries, etc., and *hoping* that it works.

If neither of the above options is available, then your only choice
is to go into the RAID card's BIOS and disable write caching in there,
assuming the option exists (on many cards it does).

> > What gives you the impression that during a power outage your data is
> > going to be intact?
> 
> One of the main functions of softupdates is to order disk updates in such
> a way that the fs organizational integrity is maintained at all times.

And we've recently found that this is simply not the case.  The benefits
of SU are applicable to very specific environments; desktop PCs are the
main ones, offering great performance improvements there.

But there's a known problem with the "background fsck" feature of
FreeBSD, which is only applicable to filesystems which use SU; sometimes
fsck does not correct all errors, causing the filesystem to be marked
clean, even though there are actual problems with it.  There's a thread
from about a month ago discussing why background_fsck="no" is highly
recommended when using SU.

> Of course this doesn't protect against actual sector corruption, but if
> the disk is between writes at the time it loses power, the fs structure
> is supposed to still be internally consistent.  At least that's my
> understanding of it.

Yep, that's how I understand it as well.  But this is a different topic
than what we were discussing 2-3 replies ago, talking about how a RAID
controller with cache + BBU is sufficient to guarantee data
integrity even when power is lost -- that's incorrect.

Back to write caching:

Disabling write caching on disks does not guarantee integrity in the
case of such failures either.  By removing one layer of caching you
reduce the risk, but only by a nominal amount.

I personally believe disabling write caching is not a plausible option
for users; the performance hit is major (I have done tests) -- write
speeds drop to 12% of total capability.  Meaning: 70MB/sec with WC
enabled, 8.4MB/sec with WC disabled.  This is *without* a controller
that does caching of any kind.

Essentially you can use this to benchmark which is faster: write
caching disabled on disks + caching enabled on a controller, or
write caching enabled on disks + caching disabled on a controller.
It would be interesting to see benchmark comparisons of different
controllers.
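
As a quick sanity check of the figures above, here is the arithmetic as
a small Python sketch.  The two rates are the ones quoted in this
message; the 1000 MB file size and the function name are made up purely
to give a feel for the difference.

```python
# Figures quoted above: 70 MB/sec with write caching, 8.4 MB/sec without.
WC_ON_MB_S = 70.0
WC_OFF_MB_S = 8.4

def seconds_to_write(size_mb, rate_mb_s):
    """Time to stream size_mb megabytes at a sustained rate."""
    return size_mb / rate_mb_s

ratio = WC_OFF_MB_S / WC_ON_MB_S
print(f"WC off is {ratio:.0%} of WC on")

# Hypothetical 1000 MB file, chosen only as a round number.
for label, rate in (("on", WC_ON_MB_S), ("off", WC_OFF_MB_S)):
    print(f"WC {label}: {seconds_to_write(1000, rate):.0f} s")
```

The same two-line comparison works for any pair of measured rates, for
example disk cache off plus controller cache on versus the reverse.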



Re: Filesystem, RAID Question

2008-10-30 Thread Rich Winkel
On Thu, Oct 30, 2008 at 04:38:49PM -0700, Jeremy Chadwick wrote:
> On Thu, Oct 30, 2008 at 06:12:07PM -0500, Rich Winkel wrote:
> > Doesn't hw.ata.wc affect only card-level caching?
> 
> hw.ata.wc causes the ata(4) subsystem to disable write caching on all
> disks attached to the subsystem.  It does not disable card features.

I mean, the individual disks are invisible to the OS unless the
card's driver (and the card itself) specifically supports it.

> There's also the below PR, which extends atacontrol to permit disabling
> and enabling write caching on a per-disk basis.
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=127717

But not on disks which are behind hardware raid cards, correct?

> What gives you the impression that during a power outage your data is
> going to be intact?

One of the main functions of softupdates is to order disk updates in such
a way that the fs organizational integrity is maintained at all times.
Of course this doesn't protect against actual sector corruption, but if
the disk is between writes at the time it loses power, the fs structure
is supposed to still be internally consistent.  At least that's my
understanding of it.

Rich



Re: Filesystem, RAID Question

2008-10-30 Thread Jeremy Chadwick
On Thu, Oct 30, 2008 at 04:38:49PM -0700, Jeremy Chadwick wrote:
> ...
> In this scenario, write caching on the disks is usually done by the
> controller itself (through a BIOS option), and not by FreeBSD.

This should have read: "... usually enabled/disabled by the controller
itself".  :-)  Sorry if that confused anyone.



Re: Filesystem, RAID Question

2008-10-30 Thread Jeremy Chadwick
On Thu, Oct 30, 2008 at 06:12:07PM -0500, Rich Winkel wrote:
> On Wed, Oct 29, 2008 at 07:49:00PM +, Matthew Seaman wrote:
> > Given that you don't have a BBU, what is the status of write caching
> > on the individual hard drives?  You'll have to use 3dm2 or the CLI 
> > equivalent to investigate this, as the RAID controller tends to hide 
> > that level of information from the OS.  However, this setting is the
> > same thing as controlled by the hw.ata.wc sysctl -- and like that 
> > it has a major effect on disk IO performance.  Turning write caching 
> > off is the safe, conservative thing to do for maximum data security.  
> 
> Doesn't hw.ata.wc affect only card-level caching?

hw.ata.wc causes the ata(4) subsystem to disable write caching on all
disks attached to the subsystem.  It does not disable card features.

There's also the below PR, which extends atacontrol to permit disabling
and enabling write caching on a per-disk basis.

http://www.freebsd.org/cgi/query-pr.cgi?pr=127717

> It seems likely that the softupdates queuing order might be scrambled
> by card-level caching if it juggles pending writes around to minimize
> seek times.  If so, it would be disastrous for data integrity in
> the event of a power outage.  Disk-level caching might be safe
> though ...  Someone needs to ask 3ware whether the card reorders
> updates and if so, if there's a setting to keep them in order.

What gives you the impression that during a power outage your data is
going to be intact?

The RAID card itself may have a BBU, so during loss of power the card
will attempt to flush any cached data *on the card* to disk... except
the PC (including hard disks -- unless they're powered from some other
source) is already down/offline by this point.  And let's not forget
that the OS/kernel is also gone, which means any writes which were
sitting in cached memory in the kernel are lost as well.

Even disabling write caching on the disks themselves won't help,
although it might help with actual I/O performance (using 2 levels of
caching: RAID controller, and OS/kernel).  In this scenario, write
caching on the disks is usually done by the controller itself (through
a BIOS option), and not by FreeBSD.

For some reason people think that a H/W RAID card with a BBU guarantees
data integrity (keyword: guarantees).  I'm still trying to understand
why people think that.



Re: Filesystem, RAID Question

2008-10-30 Thread Rich Winkel
On Wed, Oct 29, 2008 at 07:49:00PM +, Matthew Seaman wrote:
> Given that you don't have a BBU, what is the status of write caching
> on the individual hard drives?  You'll have to use 3dm2 or the CLI 
> equivalent to investigate this, as the RAID controller tends to hide 
> that level of information from the OS.  However, this setting is the
> same thing as controlled by the hw.ata.wc sysctl -- and like that 
> it has a major effect on disk IO performance.  Turning write caching 
> off is the safe, conservative thing to do for maximum data security.  

Doesn't hw.ata.wc affect only card-level caching?

It seems likely that the softupdates queuing order might be scrambled
by card-level caching if it juggles pending writes around to minimize
seek times.  If so, it would be disastrous for data integrity in
the event of a power outage.  Disk-level caching might be safe
though ...  Someone needs to ask 3ware whether the card reorders
updates and if so, if there's a setting to keep them in order.

Rich



Re: Filesystem, RAID Question

2008-10-29 Thread Josh Paetzel

Matthew Seaman wrote:
> Rich Fairbanks wrote:
> 
>> Now, this is how I set up the array. I installed the card, popped in the
>> drives. The card bios found the drives and allowed me to setup in RAID 5.
>> Then, FreeBSD booted and found the "disk" as da0. I want the entire
>> array to
>> be one big chunk of space. In other words, I don't need a bunch of
>> slices or
>> partitions (or DO I? I'm still very new to the whole slice vs. partition
>> concept)
> 

newfs /dev/da0 gives you a filesystem with softupdates turned off.
You'll want to enable them.  Either reinitialize the filesystem with
newfs -U or use tunefs to turn softupdates on.

3ware recently released new firmware for the 9650 and 9690 cards that
has given me some impressive jumps in application level performance.
You can flash the card from within the OS using tw_cli.


--
Thanks,

Josh Paetzel

PGP: 8A48 EF36 5E9F 4EDA 5ABC 11B4 26F9 01F1 27AF AECB


Re: Filesystem, RAID Question

2008-10-29 Thread Matthew Seaman

Rich Fairbanks wrote:

> Now, this is how I set up the array. I installed the card, popped in the
> drives. The card bios found the drives and allowed me to setup in RAID 5.
> Then, FreeBSD booted and found the "disk" as da0. I want the entire array to
> be one big chunk of space. In other words, I don't need a bunch of slices or
> partitions (or DO I? I'm still very new to the whole slice vs. partition
> concept)


The default settings should actually work just about right for a
general purpose file system with reasonably sized files.  A RAID5
across 3x1TB drives will give you 2+ TB usable space -- that's within
the capabilities of UFS2, so you should be OK there.  However a 3 disk
RAID5 is the worst performing RAID5 setup you can create.  A larger
number of smaller disks would probably have served you better.



> I typed newfs /dev/da0 . A ton of numbers went across the screen, then I
> mounted /dev/da0 at /usr/home/storage. It works, but perhaps I missed a step
> that would have made things easier/perform better, etc.


The sort of changes you can make at newfs time mostly affect how
efficient the storage is -- i.e. tuning the system for particularly
large or small files.  While newfs and tunefs can affect performance,
they aren't the first thing to look at here.


> Besides creating the file system a different way, what would be an optimum
> stripe size for the array? I will be using this for storing, basically, a TON
> of word documents and email messages, and a few large .pst files. So, the
> average file size will be in the 25-100K range, but a few 1-2GB files.


Just take the default stripe size the array controller presents you
with -- it will be appropriate for this sort of mix of file sizes.


The first thing to consider is what sort of IO caching strategy your
hardware is using.  Does your RAID controller have a battery backup
unit?  Probably not, as that tends to add a large whack onto the price.

If not, then your array controller will not report an IO operation as
complete to the OS until the bits have been written to the disk[*].
With the BBU, the controller can report the operation as complete as
soon as the data is stored in (battery backed) RAM on the controller.
These modes are called 'write through' and 'write back' in some
controllers, but I can't for the life of me remember which is which.
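
For reference, the usual naming is that 'write through' means the
controller acknowledges a write only once it is on disk, and 'write
back' means it acknowledges as soon as the data is in its cache.  A toy
Python model of just that difference follows.  The Controller class and
its methods are invented for illustration and are not any vendor's API.

```python
# Toy model, not a driver: shows only *when* the controller acknowledges.

class Controller:
    def __init__(self, write_back):
        self.write_back = write_back  # True is only safe with a BBU
        self.cache = []               # data held in controller RAM
        self.disk = []                # data that reached the platters

    def write(self, block):
        """Return where the data lived when the OS was told 'done'."""
        if self.write_back:
            self.cache.append(block)  # acknowledged from cache
            return "cache"
        self.disk.append(block)       # acknowledged only once on disk
        return "disk"

    def flush(self):
        """Drain cached blocks to disk (later, or after a reboot)."""
        self.disk.extend(self.cache)
        self.cache.clear()

wt = Controller(write_back=False)
wb = Controller(write_back=True)
print(wt.write("A"), wt.disk)   # disk ['A']
print(wb.write("A"), wb.disk)   # cache []
wb.flush()
print(wb.disk)                  # ['A']
```

In the write back case the block exists only in controller RAM until
flush() runs, which is exactly the window a BBU is meant to cover.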


Given that you don't have a BBU, what is the status of write caching
on the individual hard drives?  You'll have to use 3dm2 or the CLI 
equivalent to investigate this, as the RAID controller tends to hide 
that level of information from the OS.  However, this setting is the
same thing as controlled by the hw.ata.wc sysctl -- and like that 
it has a major effect on disk IO performance.  Turning write caching 
off is the safe, conservative thing to do for maximum data security.  

Turning write caching on is the only way to get decent performance out
of ordinary hard drives, but it leaves you open to data loss if the
machine should crash or lose power suddenly.  Most systems with ATA
or ordinary SATA drives default to using write caching.  SCSI and fast
SAS drives can be configured either way.

You'd always turn disk level write caching off if you've got a BBU,
because it's made redundant in that case by the controller memory
cache.


If fiddling with write caching can't make things any better, then I'd
reconsider using RAID5.  Unfortunately 3 disks don't leave you with
many options.  Add another drive of the same size and you can make a 4
disk RAID10 with 2TB usable space.  Or you can configure the RAID
controller to act as a JBOD and try out ZFS -- the RAID-Z mode is
the moral equivalent of RAID5 but quite different in operation.
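
The capacity figures in this message follow from simple arithmetic,
sketched here in Python.  The function names are made up for this
example, and the results ignore filesystem overhead and the TB/TiB
distinction.

```python
def raid5_usable(disks, size_tb):
    """RAID5 spends one disk's worth of space on parity."""
    if disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disks - 1) * size_tb

def raid10_usable(disks, size_tb):
    """RAID10 mirrors everything, so half the raw space is usable."""
    if disks < 4 or disks % 2:
        raise ValueError("RAID10 needs an even number of disks, at least 4")
    return disks // 2 * size_tb

print(raid5_usable(3, 1))    # 2
print(raid10_usable(4, 1))   # 2
```

So the fourth drive buys no extra space under RAID10; what it buys is
a layout with no parity calculation on writes.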


Cheers,

Matthew

[*] Some disks have been known to lie about completing IO transactions
even when set to the most conservative mode.  IMHO they aren't fit for
purpose and should you be landed with such things you'd be entitled
to a refund from the vendor.

