Re: [ANNOUNCE] block device interfaces changes

2000-01-11 Thread Richard Gooch


PLEASE TAKE ME OFF THE CC LIST.

BTW: I'm on holidays and won't be replying to email for a while.

Richard B. Johnson writes:
 On Sun, 9 Jan 2000, Alexander Viro wrote:
 [SNIPPED]
 
   that we provide source to the end-user, they required that we supply a
   "current" distribution of Linux if the end-user requests it.
  
  Oh. My. God. They are requiring you to do WHAT??? Do you mean that you
  really ship 2.3.x to your customers? Arrggh. "Source" == "source of what
  we are shipping". And not "anything that was written by other guys who
  started from the same source". It's utter nonsense. _No_ license can
  oblige you to include the modifications done by somebody else. Otherwise
  you'd have those drivers in the main tree, BTW - _that_ much should be
  clear even to your LD.
  
  [snip]
  
   The obvious solution, given these constraints, is that we just ignore
   all changes until shipping time, then attempt to compile with the latest
   distribution, fixing all the problems at once. However, we then end up
   shipping untested software which ends up being another problem. Checking
   to see if it "runs" isn't testing software in the cold cruel world of
   industry.
  
  You do realize that stability of the system doesn't exceed that of the
  weakest link, don't you? You _are_ shipping untested software if you are
  shipping 2.3.whatever + your drivers. It's called unstable for a good
  reason. Ouch... OK, what if Linus puts a
  pre-patch-2.3.39-dont-even-think-of-using-it-anywhere-near-your-data-3.gz
  on ftp.kernel.org tomorrow? Will your LD require you to ship _that_? No?
  Is the notion of 'untested software' completely alien to them?
  
  BTW, you could point them to Debian or RH - neither of them ships 2.3.x
  in released versions, _and_ what they do ship isn't even the latest 2.2.x. Hell,
  Debian 2.1 ships with 2.0 - the switch to 2.2 is in potato (== Debian 2.2
  to be). RH uses 2.2.12, AFAICS (with a lot of patches). And all of them
  have darn good reasons to do so - stability being the first one. Is there
  any chance to get your legal folks talking with RH lawyers? Or Caldera, or
  Corel ones...
  
   So, presently, I have 13 drivers I have to keep "current". Yesterday
   they all got broken again. A week before, half of them were broken
   because somebody didn't like a variable name!
  
  Which might mean that a repository of pointers to 3rd-party drivers (along
  with contact info) would be a Good Thing(tm).
  
  I would suggest the following: keep this information in DNS (an RBL-like
  scheme; i.e. driver_name.author_or_company_name.drivers.linux.org
  having a TXT record with the URL and kernel version(s) in the body). Then all
  you need is (a) a standard address (e.g. [EMAIL PROTECTED]) aliased
  to a script; (b) said script verifying (PGP, GPG, whatever) the source
  of mail and updating the record. IOW, all it really takes is somebody with
  a nameserver, a clue and decent connectivity. Any takers?
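  
  For instance (a sketch only -- the zone name, driver name and URL below
  are invented for illustration), such a registry entry could look like:
  
      $ORIGIN drivers.linux.org.
      ; one TXT record per driver: URL of the source, then the kernel
      ; version(s) it is known to build against
      mydriver.example-corp  IN TXT "ftp://ftp.example.com/mydriver-1.4.tar.gz 2.2.14 2.3.36"
  
  and anybody could query it with something like
  
      dig mydriver.example-corp.drivers.linux.org TXT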
  
 
 Again, since there was so much mail on this, I will answer this one
 only with cc to linux-kernel.
 
 The idea is that once something gets "released", it gets built with
 whatever distribution is available at that time. That distribution is
 shipped (if required). It's just like DOS/Windows/SunOS/Solaris, etc.
 You ship with the "current" distribution. Certainly a customer would
 be really pissed to get a new product with a two-year-old version of
 software.
 
 Once the product is shipped, we can make "service-pack" updates just
 like everybody else, which are not free.
 
 What this means for us is to keep our development software sufficiently
 up-to-date so that there are no radical changes required once a release
 is imminent. In no case do we intend to ship using "development"
 kernels. But, to keep up-to-date for a pending release we need to be
 using development kernels in engineering. I never attempt to use any
 of the "pre-NN" intermediate stuff anyway.
 
 A lot of people are going to have to do the same as Linux works its
 way from being just a desktop OS to something being used to replace
 Sun software and Windows in industrial applications. The embedded
 market doesn't have to worry about releases and version numbers
 because the customer usually couldn't "upgrade" anyway. I have
 a "platinum" project which uses some old stable version of Linux
 that I happen to like. It's an embedded system that runs VXI bus
 instruments. It works just fine. That kernel will never have to
 be changed, even if bugs are found. They can be fixed with the
 reset button. 
 
 There are two major interfaces that are critical to success:
 
 (1) The POSIX/C-interface API.
 (2) The module API.
 
 Much care has been taken to assure that (1) stays stable. We need
 some such care on (2). In particular, if major changes are made,
 they should be terminating, i.e., made in such a way that they
 are unlikely to ever have to be made again.
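 
 As a sketch of what (2) means from a driver's point of view (roughly
 2.2-era prototypes, simplified and partly from memory; the exact layout
 differs between kernel versions, which is exactly the problem):
 
     #include <linux/module.h>
     #include <linux/fs.h>
 
     /* A module fills in a table of entry points and hands it to the
      * kernel.  Any change to this structure's layout, or to the
      * prototypes it contains, breaks every out-of-tree driver.      */
     static ssize_t my_read(struct file *f, char *buf,
                            size_t count, loff_t *ppos)
     {
             return 0;               /* nothing to read in this stub */
     }
 
     static int my_open(struct inode *inode, struct file *f)
     {
             return 0;
     }
 
     static struct file_operations my_fops = {
             /* GNU-style named initializers, common in drivers then */
             read:  my_read,
             open:  my_open,
     };
 
     #define MY_MAJOR 240            /* invented; pick a free major  */
 
     int init_module(void)
     {
             return register_chrdev(MY_MAJOR, "mydrv", &my_fops);
     }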
 
 For example, open(), read(), write(), ioctl(), in the kernel
 seem to 

Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Benno Senoner

"Stephen C. Tweedie" wrote:

(...)


 3) The soft-raid background rebuild code reads and writes through the
buffer cache with no synchronisation at all with other fs activity.
After a crash, this background rebuild code will kill the
write-ordering attempts of any journalling filesystem.

This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

 Interaction 3) needs a bit more work from the raid core to fix, but it's
 still not that hard to do.

 So, can any of these problems affect other, non-journaled filesystems
 too?  Yes, 1) can: throughout the kernel there are places where buffers
 are modified before the dirty bits are set.  In such places we will
 always mark the buffers dirty soon, so the window in which an incorrect
 parity can be calculated is _very_ narrow (almost non-existent on
 non-SMP machines), and the window in which it will persist on disk is
 also very small.

 This is not a problem.  It is just another example of a race window
 which exists already with _all_ non-battery-backed RAID-5 systems (both
 software and hardware): even with perfect parity calculations, it is
 simply impossible to guarantee that an entire stripe update on RAID-5
 completes in a single, atomic operation.  If you write a single data
 block and its parity block to the RAID array, then on an unexpected
 reboot you will always have some risk that the parity will have been
 written, but not the data.  On a reboot, if you lose a disk then you can
 reconstruct it incorrectly due to the bogus parity.

 THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
 only way you can get bitten by this failure mode is to have a system
 failure and a disk failure at the same time.



 --Stephen

thank you very much for these clear explanations,

Last doubt: :-)
Assume all the RAID code / FS interaction problems get fixed:
since a Linux soft-RAID5 box has no battery backup,
does this mean that we will lose data
ONLY if there is a power failure AND a subsequent disk failure?
If we lose the power and after reboot all disks remain intact,
can the RAID layer reconstruct all information in a safe way?

The problem is that power outages are unpredictable even in the presence
of UPSes; therefore it is important to have some protection against
power losses.

regards,
Benno.






[FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

This is a FAQ: I've answered it several times, but in different places,
so here's a definitive answer which will be my last one: future
questions will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 then raid can miscalculate parity by assuming that the buffer matches
 what is on disk, and that can actually cause damage to other data
 than the data being written if a disk dies and we have to start using
 parity for that stripe.

 do you know if using soft RAID5 + regular ext2 causes the same sort of
 damage, or if the corruption chances are lower when using a non-journaled
 FS?

Sort of.  See below.

 is the potential corruption caused by the RAID layer or by the FS
 layer?  (does the FS code or the RAID code need to be fixed?)

It is caused by neither: it is an interaction effect.

 if it's caused by the FS layer, how do XFS (not here yet ;-) ) or
 ReiserFS behave in this case?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with
respect to write ordering.  There is no policy to guide what gets
written and when: the writeback caching can trickle to disk at any time,
and other system components such as filesystems and the VM can force a
write-back of data to disk at any time.

Journaling imposes write ordering constraints which insist that data in
the buffer cache *MUST NOT* be written to disk unless the filesystem
explicitly says so.
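
As a rough illustration only (the function names are invented; this is
not actual ext3 or reiserfs code), the ordering a journalling filesystem
depends on looks something like:

    /* Committing one transaction: the logged copies and the commit
     * record must be safely on disk before the in-place buffers may
     * be written back to their real locations.                       */
    write_log_blocks(txn);          /* copies of the modified buffers */
    wait_on_log_io(txn);
    write_commit_record(txn);       /* transaction now recoverable    */
    wait_on_commit_io(txn);
    release_pinned_buffers(txn);    /* only NOW may the buffer cache
                                       write the real locations       */

If anything else -- the VM, bdflush, or the RAID code -- pushes those
pinned buffers to their final locations before the commit record hits
the disk, the journal no longer describes a recoverable state.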

RAID-5 needs to interact directly with the buffer cache in order to be
able to improve performance.

There are three nasty interactions which result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the data
   in a stripe gets written to disk at once.  For RAID-5, this is very
   much faster than dribbling the stripe back one disk at a time.
   Unfortunately, this can result in dirty buffers being written to disk
   earlier than the filesystem expected, with the result that on a
   crash, the filesystem journal may not be entirely consistent.

   This interaction hits ext3, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit set.

2) RAID-5 peeks into the buffer cache to look for buffer contents in
   order to calculate parity without reading all of the disks in a
   stripe.  If a journaling system tries to prevent modified data from
   being flushed to disk by deferring the setting of the buffer dirty
   flag, then RAID-5 will think that the buffer, being clean, matches
   the state of the disk and so it will calculate parity which doesn't
   actually match what is on disk.  If we crash and one disk fails on
   reboot, wrong parity may prevent recovery of the lost data.

   This interaction hits reiserfs, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit clear.

Both interactions 1) and 2) can be solved by making RAID-5 completely
avoid buffers which have an incremented b_count reference count, and
making sure that the filesystems all hold that count raised when the
buffers are in an inconsistent or pinned state.

3) The soft-raid background rebuild code reads and writes through the
   buffer cache with no synchronisation at all with other fs activity.
   After a crash, this background rebuild code will kill the
   write-ordering attempts of any journalling filesystem.  

   This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

Interaction 3) needs a bit more work from the raid core to fix, but it's
still not that hard to do.
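
To sketch that fix for interactions 1) and 2) (pseudo-code with invented
helper names, not actual raid5 or ext3 source):

    /* Filesystem side: keep the buffer pinned while it belongs to an
     * uncommitted transaction, so nobody may treat it as stable.     */
    bh->b_count++;
    /* ... modify bh->b_data, log the change, wait for the commit ... */
    bh->b_count--;

    /* RAID-5 side: when hunting the buffer cache for data to compute
     * parity from, refuse any buffer somebody else still holds (the
     * exact threshold depends on how many references the raid code
     * itself takes) and read the block from disk instead.            */
    if (bh->b_count > 1)
            read_block_from_disk(bh);   /* cached copy may be stale   */
    else
            compute_parity_using(bh);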


So, can any of these problems affect other, non-journaled filesystems
too?  Yes, 1) can: throughout the kernel there are places where buffers
are modified before the dirty bits are set.  In such places we will
always mark the buffers dirty soon, so the window in which an incorrect
parity can be calculated is _very_ narrow (almost non-existent on
non-SMP machines), and the window in which it will persist on disk is
also very small.

This is not a problem.  It is just another example of a race window
which exists already with _all_ non-battery-backed RAID-5 systems (both
software and hardware): even with perfect parity calculations, it is
simply impossible to guarantee that an entire stripe update on RAID-5
completes in a single, atomic operation.  If you write a single data
block and its parity block to the RAID array, then on an unexpected
reboot you will always have some risk that the parity will have been
written, but not the data.  On a reboot, if you lose a disk then you can
reconstruct it incorrectly due to the bogus parity.
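
A toy user-space illustration of that window (plain C, nothing to do
with the real raid code; the byte values are arbitrary):

    #include <stdio.h>

    int main(void)
    {
            unsigned char d0 = 0xAA, d1 = 0x55;  /* two data blocks   */
            unsigned char new_d0 = 0x0F;         /* update for d0     */
            unsigned char disk_d0, disk_d1, disk_p, rebuilt_d1;

            disk_d0 = d0;                        /* the array starts  */
            disk_d1 = d1;                        /* out consistent    */
            disk_p  = d0 ^ d1;

            /* Update d0: the parity write completes ...              */
            disk_p = new_d0 ^ d1;
            /* ... but we crash before the data write lands, so
             * disk_d0 still holds the old contents.                  */

            /* On reboot the disk holding d1 is dead; rebuild it from
             * the surviving data and the (now bogus) parity.         */
            rebuilt_d1 = disk_d0 ^ disk_p;

            printf("real d1 = 0x%02x, rebuilt d1 = 0x%02x\n",
                   disk_d1, rebuilt_d1);         /* 0x55 vs 0xf0      */
            return 0;
    }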

THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
only way you can get bitten by this failure mode is to have a system
failure and a disk failure at the same time.


--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread Stephen C. Tweedie

Hi,

On Tue, 11 Jan 2000 20:17:22 +0100, Benno Senoner [EMAIL PROTECTED]
said:

 Assume all the RAID code / FS interaction problems get fixed: since a
 Linux soft-RAID5 box has no battery backup, does this mean that we
 will lose data ONLY if there is a power failure AND a subsequent disk
 failure?  If we lose the power and after reboot all disks
 remain intact, can the RAID layer reconstruct all information in a safe
 way?

Yes.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-11 Thread mauelsha

"Stephen C. Tweedie" wrote:
 
 Hi,
 
 This is a FAQ: I've answered it several times, but in different places,

SNIP

 THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
 only way you can get bitten by this failure mode is to have a system
 failure and a disk failure at the same time.
 

To try to avoid this kind of problem, some brands have additional logging
in place (to disk, which is certainly slow, or to NVRAM). This enables them
at least to recognize the fault and avoid reconstructing invalid data, or
even to recover the data, by keeping redundant copies of it in NVRAM plus
logging information about what was actually written to the disks and what
was not.
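
A rough sketch of that idea (invented names; this is not any particular
vendor's firmware nor a real Linux interface):

    /* Write-intent log kept in NVRAM: record which stripe is about to
     * be updated before touching the disks, clear the entry once both
     * the data and the parity writes have completed.                 */
    struct nv_log_entry {
            unsigned long stripe;      /* which stripe is in flight   */
            int           valid;       /* non-zero: update incomplete */
    };

    /* Stubs standing in for the real I/O paths. */
    static void write_data_blocks(unsigned long stripe)  { (void)stripe; }
    static void write_parity_block(unsigned long stripe) { (void)stripe; }

    void update_stripe(struct nv_log_entry *nv, unsigned long stripe)
    {
            nv->stripe = stripe;
            nv->valid  = 1;               /* 1. log the intent (NVRAM)  */
            write_data_blocks(stripe);    /* 2. update the disks        */
            write_parity_block(stripe);
            nv->valid  = 0;               /* 3. stripe consistent again */
    }

    /* After an unclean shutdown, any entry still marked valid names a
     * stripe whose data and parity may disagree: recompute the parity
     * (or restore the blocks from NVRAM copies) before trusting that
     * stripe for degraded-mode reconstruction.                        */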

Heinz