Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 07:21:17 -0500 (EST), Ingo Molnar <[EMAIL PROTECTED]>
said:

> On Wed, 12 Jan 2000, Gadi Oxman wrote:

>> As far as I know, we took care not to poke into the buffer cache to
>> find clean buffers -- in raid5.c, the only code which does a find_buffer()
>> is:

> yep, this is still the case.

OK, that's good to know.

> Especially the reconstruction code is a rathole. Unfortunately
> blocking reconstruction until b_count == 0 is not acceptable because
> several filesystems (such as ext2fs) keep metadata caches around
> (eg. the block group descriptors in the ext2fs case) which have
> b_count == 1 for a longer time.

That's not a problem: we don't need reconstruction to interact with the
buffer cache at all.

Ideally, what I'd like to see the reconstruction code do is to:

* lock a stripe
* read a new copy of that stripe locally
* recalc parity and write back whatever disks are necessary for the stripe
* unlock the stripe

so that the data never goes through the buffer cache at all, but that
the stripe is locked with respect to other IOs going on below the level
of ll_rw_block (remember there may be IOs coming in to ll_rw_block which
are not from the buffer cache, eg. swap or journal IOs).
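
A rough sketch of that flow in C (every name here, stripe_lock(),
stripe_read_members(), struct stripe_ctx, xor_block(), is hypothetical,
chosen to show the shape of the idea rather than anything that exists in
raid5.c today):

    /* Resync one stripe entirely in private buffers, never via the
     * buffer cache.  The stripe lock is what serialises us against
     * ll_rw_block() traffic, whatever its origin (buffer cache, swap,
     * journal). */
    static void resync_one_stripe(struct stripe_ctx *sc)
    {
            int i, pd = sc->parity_disk;

            stripe_lock(sc);
            stripe_read_members(sc);        /* fresh copies from the disks */

            memset(sc->buf[pd], 0, sc->chunk_size);
            for (i = 0; i < sc->nr_disks; i++)
                    if (i != pd)
                            xor_block(sc->buf[pd], sc->buf[i], sc->chunk_size);

            stripe_write(sc, pd);           /* write back recomputed parity */
            stripe_unlock(sc);
    }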

> We are '100% journal-safe' if power fails during resync. 

Except for the fact that resync isn't remotely journal-safe in the first
place, yes.  :-)

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Tue, 11 Jan 2000 16:41:55 -0600, "Mark Ferrell"
<[EMAIL PROTECTED]> said:

>   Perhaps I am confused.  How is it that a power outage while attached
> to the UPS becomes "unpredictable"?  

One of the most common ways to get an outage while on a UPS is somebody
tripping over, or otherwise removing, the cable between the UPS and the
computer.  How exactly is that predictable?

Just because you reduce the risk of unexpected power outage doesn't mean
we can ignore the possibility.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power fai

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 11:28:28 MET-1, "Petr Vandrovec"
<[EMAIL PROTECTED]> said:

>   I did not follow this thread (on -fsdevel) too closely (and I never
> looked into the RAID code, so I should shut up), but... can you
> confirm that after a buffer with data is finally marked dirty, parity
> is recomputed anyway? So that window is really small, and the same
> problem occurs every time you have written data but not yet written
> parity?

Yes, that's what I said.

--Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Benno Senoner

"Stephen C. Tweedie" wrote:

> Ideally, what I'd like to see the reconstruction code do is to:
>
> * lock a stripe
> * read a new copy of that stripe locally
> * recalc parity and write back whatever disks are necessary for the stripe
> * unlock the stripe
>
> so that the data never goes through the buffer cache at all, but that
> the stripe is locked with respect to other IOs going on below the level
> of ll_rw_block (remember there may be IOs coming in to ll_rw_block which
> are not from the buffer cache, eg. swap or journal IOs).
>
> > We are '100% journal-safe' if power fails during resync.
>
> Except for the fact that resync isn't remotely journal-safe in the first
> place, yes.  :-)
>
> --Stephen

Sorry for my ignorance, but I got a little confused by this post:

Ingo said we are 100% journal-safe; you said the contrary.

Can you or Ingo please explain to us in which situations (power loss) we
risk a corrupted filesystem when running linux-raid + a journaled FS ?

I am interested in what happens if the power goes down while you write
heavily to an ext3/reiserfs (journaled FS) on a soft-raid5 array.

After the reboot, if all disks remain physically intact,
will we only lose the data that was being written, or is there a possibility
of ending up with a corrupted filesystem which could cause more damage in
the future ?

(or do we need to wait for the raid code in 2.3 ?)

sorry for re-asking that question, but I am still confused.

regards,
Benno.





Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Ingo Molnar


On Wed, 12 Jan 2000, Gadi Oxman wrote:

> As far as I know, we took care not to poke into the buffer cache to
> find clean buffers -- in raid5.c, the only code which does a find_buffer()
> is:

yep, this is still the case. (Sorry Stephen, my bad.) We will have these
problems once we try to eliminate the current copying overhead.
Nevertheless there are bad (illegal) interactions between the RAID code
and the buffer cache; I'm cleaning this up for 2.3 right now. Especially
the reconstruction code is a rathole. Unfortunately blocking
reconstruction until b_count == 0 is not acceptable because several
filesystems (such as ext2fs) keep metadata caches around (eg. the block
group descriptors in the ext2fs case) which have b_count == 1 for a longer
time.
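
For reference, the ext2 pattern being referred to looks roughly like this
(a simplified sketch loosely modelled on the 2.2-era ext2_read_super();
variable names are approximate): the group descriptor buffers are read
with bread() at mount time and only released at unmount, so their b_count
never drops to zero while the filesystem is mounted.

    /* Mount time: read and pin the group descriptor blocks. */
    for (i = 0; i < db_count; i++) {
            sb->u.ext2_sb.s_group_desc[i] =
                    bread(dev, desc_block + i, sb->s_blocksize);
            /* No brelse() here: each buffer stays referenced
             * (b_count >= 1) until ext2_put_super() releases it. */
    }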

If both power and a disk fail at once then we still might get local
corruption for partially written RAID5 stripes. If either power or a disk
fails, then the Linux RAID5 code is safe wrt. journalling, because it
behaves like an ordinary disk. We are '100% journal-safe' if power fails
during resync. We are also 100% journal-safe if power fails during
reconstruction of a failed disk or in degraded mode.

The 2.3 buffer-cache enhancements I wrote ensure that 'cache snooping' and
adding to the buffer-cache can be done safely by 'external' cache
managers. I also added means to do atomic IO operations which are in fact
several underlying IO operations, without the need to allocate a
separate bh. The RAID code uses these facilities now.

Ingo



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power fai

2000-01-12 Thread Petr Vandrovec

On 11 Jan 00 at 22:24, Stephen C. Tweedie wrote:
> The race I'm concerned about could occur when the raid driver wants to
> compute parity for a stripe and finds some of the blocks are present,
> and clean, in the buffer cache.  Raid assumes that those buffers
> represent what is on disk, naturally enough.  So, it uses them to
> calculate parity without rereading all of the disk blocks in the stripe.
> The trouble is that the standard practice in the kernel, when modifying
> a buffer, is to make the change and _then_ mark the buffer dirty.  If
> you hit that window, then the raid driver will find a buffer which
> doesn't match what is on disk, and will compute parity from that buffer
> rather than from the on-disk contents.
Hi Stephen,
  I did not follow this thread (on -fsdevel) too closely (and I never
looked into the RAID code, so I should shut up), but... can you
confirm that after a buffer with data is finally marked dirty, parity
is recomputed anyway? So that window is really small, and the same
problem occurs every time you have written data but not yet written
parity?
Thanks,
Petr Vandrovec
[EMAIL PROTECTED]




Re: file system size limits

2000-01-12 Thread Matti Aarnio

On Mon, Jan 10, 2000 at 05:14:29PM +0100, Manfred Spraul wrote:
> 2^10  kilo
> 2^20  mega
> 2^30  giga
> 2^40  terra
> 
> ---> 2^^41== 2 terrabyte.

Sorry Manfred, the multiplier is 'TERA', not 'TERRA'; the latter, rather
confusing, spelling is the one M$ uses to market their Terraserver
(satellite earth imagery service).

Pick a dictionary to see why M$ chose the word 'Terra' for that use..

And powers of two have this strange suggestion of adding 'i' after
the primary multiplier:
KiB, MiB, GiB, TiB, EiB, ...
(E = exa)

> --
>   Manfred

/Matti Aarnio



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Mark Ferrell

  Perhaps I am confused.  How is it that a power outage while attached
to the UPS becomes "unpredictable"?  

  We run a Dell PowerEdge 2300/400 using Linux software raid and the
system monitors its own UPS.  When a power failure occurs the system
will bring itself down to a minimal state (runlevel 1) after the
batteries are below 50% .. and once below 15% it will shut down, which
turns off the UPS.  When power comes back on the UPS fires up and the
system resumes as normal.

  Admittedly this won't prevent issues like god reaching out and slapping
my system via lightning or something, nor will it resolve issues where
someone decides to grab the power cable and swing around on it, severing
the connection from the UPS to the system .. but for the most part it
has thus far proven to be a fairly decent configuration.

Benno Senoner wrote:
> 
> "Stephen C. Tweedie" wrote:
> 
> (...)
> 
> >
> > 3) The soft-raid backround rebuild code reads and writes through the
> >buffer cache with no synchronisation at all with other fs activity.
> >After a crash, this background rebuild code will kill the
> >write-ordering attempts of any journalling filesystem.
> >
> >This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.
> >
> > Interaction 3) needs a bit more work from the raid core to fix, but it's
> > still not that hard to do.
> >
> > So, can any of these problems affect other, non-journaled filesystems
> > too?  Yes, 1) can: throughout the kernel there are places where buffers
> > are modified before the dirty bits are set.  In such places we will
> > always mark the buffers dirty soon, so the window in which an incorrect
> > parity can be calculated is _very_ narrow (almost non-existent on
> > non-SMP machines), and the window in which it will persist on disk is
> > also very small.
> >
> > This is not a problem.  It is just another example of a race window
> > which exists already with _all_ non-battery-backed RAID-5 systems (both
> > software and hardware): even with perfect parity calculations, it is
> > simply impossible to guarantee that an entire stripe update on RAID-5
> > completes in a single, atomic operation.  If you write a single data
> > block and its parity block to the RAID array, then on an unexpected
> > reboot you will always have some risk that the parity will have been
> > written, but not the data.  On a reboot, if you lose a disk then you can
> > reconstruct it incorrectly due to the bogus parity.
> >
> > THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
> > only way you can get bitten by this failure mode is to have a system
> > failure and a disk failure at the same time.
> >
> 
> >
> > --Stephen
> 
> thank you very much for these clear explanations,
> 
> Last doubt: :-)
> Assume all RAID code / FS interaction problems get fixed;
> since a Linux soft-RAID5 box has no battery backup,
> does this mean that we will lose data
> ONLY if there is a power failure AND a subsequent disk failure ?
> If we lose the power, and after the reboot all disks remain intact,
> can the RAID layer reconstruct all information in a safe way ?
> 
> The problem is that power outages are unpredictable even in the presence
> of UPSes, therefore it is important to have some protection against
> power losses.
> 
> regards,
> Benno.



Re: UMSDOS under i386 and PPC

2000-01-12 Thread Matija Nalis

On 8 Jan 2000 11:44:28 +0100, Yair Itzhaki <[EMAIL PROTECTED]> wrote:
>I've found a cross-platform incompatibility when passing UMSDOS formatted
>media (a FLASH disk) between i386 and PowerPC. Media created under i386
>cannot be read using a PPC platform, and vice-versa.
>
>I've traced it to the fact the some kernel elements (__kernel_dev_t,
>__kernel_uid_t, __kernel_gid_t ...) are defined as 'unsigned short' under
>i386, and 'unsigned int' under ppc (see /include/asm-XXX/posix_types.h). 

Hmmm, does anybody on the list have an idea about handling this?

As current maintainer, I could:

1) change __kernel_uid_t, __kernel_gid_t and others to be __u16 (big/little
   endian handling has been in place for eons, it is just that neither I
   nor probably the original author had multiple platforms to test it).
   This would break all current UMSDOS big-endian users.
   Also, what happens if uid_t or gid_t is > 16 bits ?

2) same as 1), but to __u32, and break all current UMSDOS IA32 users.
   This would also require bumping up the compatibility level, so new
   tools etc. would be required.

3) make an optional #define to do 1) (undef by default), so users who need
   it can have IA32-compatible UMSDOS filesystems (with all the problems of
   overflowing uid_t and gid_t if they are too big); a rough sketch of this
   option is at the end of this message.

4) just leave it as it is, and wait for the big VFS changes which will
   enable clean stackable filesystems, so UMSDOS can be reinvented over
   any fs ?


>This is causing the 'umsdos_dirent' structure to be sized differently, and
>functions that read this directly from the EMD file get wrong data by
>accessing wrong offsets in the structure ("umsdos_emd_dir_read" for
>example).

To me, 4 or 3 look most realistic.

Any opinions on how it should be done correctly (if it can be) ?
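
For concreteness, a rough sketch of what option 3 might look like (the
structure, field names and layout below are purely illustrative, not the
real umsdos_dirent or EMD format):

    /* Hypothetical fixed-width on-disk id record: the same size and
     * byte order on every architecture, regardless of how wide the
     * kernel's uid_t/gid_t happen to be. */
    struct umsdos_ondisk_ids {
            __u16 uid;          /* stored little-endian */
            __u16 gid;
            __u16 mode;
    };

    /* Reading it back into the in-core inode: a no-op conversion on
     * IA32, a byte swap on big-endian machines. */
    inode->i_uid  = le16_to_cpu(rec.uid);
    inode->i_gid  = le16_to_cpu(rec.gid);
    inode->i_mode = le16_to_cpu(rec.mode);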

-- 
Opinions above are GNU-copylefted.



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread Stephen C. Tweedie

Hi,

On Wed, 12 Jan 2000 00:12:55 +0200 (IST), Gadi Oxman
<[EMAIL PROTECTED]> said:

> Stephen, I'm afraid that there are some misconceptions about the
> RAID-5 code.

I don't think so --- I've been through this with Ingo --- but I
appreciate your feedback since I'm getting inconsistent advice here!
Please let me explain...

> In an early pre-release version of the RAID code (more than two years
> ago?), which didn't protect against that race, we indeed saw locked
> buffers changing under us from the point in which we computed the
> parity till the point in which they were actually written to the disk,
> leading to a corrupted parity.

That is not the race.  The race has nothing at all to do with buffers
changing while they are being used for parity: that's a different
problem, long ago fixed by copying the buffers.

The race I'm concerned about could occur when the raid driver wants to
compute parity for a stripe and finds some of the blocks are present,
and clean, in the buffer cache.  Raid assumes that those buffers
represent what is on disk, naturally enough.  So, it uses them to
calculate parity without rereading all of the disk blocks in the stripe.

The trouble is that the standard practice in the kernel, when modifying
a buffer, is to make the change and _then_ mark the buffer dirty.  If
you hit that window, then the raid driver will find a buffer which
doesn't match what is on disk, and will compute parity from that buffer
rather than from the on-disk contents.
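
In code, the window is the gap in the usual idiom (a minimal sketch;
mark_buffer_dirty() is shown with the 2.2-era two-argument form):

    static void update_block(kdev_t dev, int block, int size,
                             int offset, const char *src, int n)
    {
            struct buffer_head *bh = bread(dev, block, size);

            if (!bh)
                    return;
            memcpy(bh->b_data + offset, src, n); /* buffer no longer matches
                                                    the on-disk contents... */
            /* A parity calculation trusting this "clean" buffer at this
             * point would be based on data that is not yet on disk. */
            mark_buffer_dirty(bh, 0);            /* ...dirty only from here */
            brelse(bh);
    }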

> 1. n dirty blocks are scheduled for a stripe write.

That's not the race.  The problem occurs when only one single dirty
block is scheduled for a write, and we need to find the contents of the
rest of the stripe to compute parity.
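
Concretely, for a single-block write the new parity has to be derived from
bytes that really are on disk, either from the old data plus the old
parity (read-modify-write) or from all the other data blocks in the stripe
(reconstruct-write).  A minimal sketch of the read-modify-write form:

    /* new parity = old parity XOR old data XOR new data, byte by byte */
    for (i = 0; i < blocksize; i++)
            new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];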

> Point (2) is also incorrect; we have taken care *not* to peek into
> the buffer cache to find clean buffers and use them for parity
> calculations. We make no such assumptions.

Not according to Ingo --- can we get a definitive answer on this,
please?

Many thanks,
  Stephen



Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?

2000-01-12 Thread mauelsha

"Stephen C. Tweedie" wrote:
> 
> Hi,
> 
> On Tue, 11 Jan 2000 15:03:03 +0100, mauelsha
> <[EMAIL PROTECTED]> said:
> 
> >> THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
> >> only way you can get bitten by this failure mode is to have a system
> >> failure and a disk failure at the same time.
> 
> > To try to avoid this kind of problem some brands do have additional
> > logging (to disk which is slow for sure or to NVRAM) in place, which
> > enables them to at least recognize the fault to avoid the
> > reconstruction of invalid data or even enables them to recover the
> > data by using redundant copies of it in NVRAM + logging information
> > what could be written to the disks and what not.
> 
> Absolutely: the only way to avoid it is to make the data+parity updates
> atomic, either in NVRAM or via transactions.  I'm not aware of any
> software RAID solutions which do such logging at the moment: do you know
> of any?
> 

AFAIK Veritas only does the first part of what I mentioned above
(invalid on-disk data recognition).

They do logging by default for RAID5 volumes and optionally also for
RAID1 volumes.

In the RAID5 (with logging) case they can figure out if an n-1 disk
write took place and can rebuild the data. In case an n-m (1 < m < n)
write took place they can therefore at least recognize the disaster ;-)

In the RAID1 (with logging) scenario they are able to recognize which
of the n mirrors have up-to-date data and which ones don't, so they can
deliver the up-to-date data to the user and try to make the other
mirrors consistent.

But because it's a software solution without any NVRAM support they
can't handle the data redundancy case.
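
As an illustration of the logging idea described above (purely a sketch of
the concept, not Veritas code or anything in the Linux md driver): a
record of the member writes about to be issued is logged before the stripe
update and cleared once they have all completed, so that after a crash any
record still present flags a stripe whose parity cannot be trusted.

    /* Hypothetical intent-log record, flushed to the log device before
     * the member writes of a stripe update are issued and cleared after
     * the last of them completes. */
    struct stripe_intent_rec {
            __u32 stripe_nr;     /* which stripe is being updated           */
            __u32 pending_mask;  /* one bit per member disk still in flight */
    };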

Heinz