Re: mismatch_cnt questions

2007-03-13 Thread H. Peter Anvin

Andre Noll wrote:

On 00:21, H. Peter Anvin wrote:

I have just updated the paper at:

http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

... with this information (in slightly different notation and with a bit 
more detail.)


There's a typo in the new section:

s/By assumption, X_z != D_n/By assumption, X_z != D_z/



Thanks, fixed.

-hpa


Re: mismatch_cnt questions

2007-03-13 Thread Andre Noll
On 00:21, H. Peter Anvin wrote:
> I have just updated the paper at:
> 
> http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
> 
> ... with this information (in slightly different notation and with a bit 
> more detail.)

There's a typo in the new section:

s/By assumption, X_z != D_n/By assumption, X_z != D_z/

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe




Re: mismatch_cnt questions - how about raid10?

2007-03-12 Thread Peter Rabbitson

Neil Brown wrote:

On Tuesday March 6, [EMAIL PROTECTED] wrote:

>

Though it is less likely, a regular filesystem could still (I think)
genuinely write different data to different devices in a raid1/10.

So relying on mismatch_cnt for early problem detection probably isn't
really workable.

And I think that if a drive is returning bad data without signalling
an error, then you are very much into the 'late' side of problem
detection.


I agree with the latter, but my concern is not so much with the cause as 
with the effect. From past discussion on the list I gather that no special 
effort is made to determine which chunk to take as 'valid', even though 
more than 2 logically identical chunks might be present (raid1/10). And 
you also seem to think that the DMA syndrome might even apply to plain 
fast-changing filesystems, let alone something with multiple layers (fs on 
lvm on raid).

So here is my question: how (theoretically) safe is it to use a raid1/10 
array for something very disk intensive, e.g. a mail spool? How likely is 
it that the effect you described above will creep different blocks onto 
the disks and subsequently return the wrong data to the kernel? Should I 
look into raid5/6 for this kind of activity, in case both uptime and data 
integrity are my number one priorities and I am willing to sacrifice 
performance?


Thank you


Re: mismatch_cnt questions - how about raid10?

2007-03-11 Thread Neil Brown
On Tuesday March 6, [EMAIL PROTECTED] wrote:
> > 
> 
> I see. So basically for those of us who want to run swap on raid 1 or 
> 10, and at the same time want to rely on mismatch_cnt for early problem 
> detection, the only option is to create a separate md device just for 
> the swap. Is this about right?

Though it is less likely, a regular filesystem could still (I think)
genuinely write different data to different devices in a raid1/10.

So relying on mismatch_cnt for early problem detection probably isn't
really workable.

And I think that if a drive is returning bad data without signalling
an error, then you are very much into the 'late' side of problem
detection.

I see the 'check' and 'repair' functions mostly as valuable for the
fact that they read every block and will trigger sleeping bad blocks
early.  If they ever find a discrepancy, then it is either perfectly
normal, or something seriously wrong that could have been wrong for a
while ...

NeilBrown


Re: mismatch_cnt questions

2007-03-08 Thread Bill Davidsen

H. Peter Anvin wrote:

Bill Davidsen wrote:


When last I looked at Hamming codes, and that would be 1989 or 1990, I 
believe I learned that the number of Hamming bits needed to cover N data 
bits was 1+log2(N), which for 512 bytes would be 1+12 and fits into a 
16-bit field nicely. I don't know that I would go that way (fix any 
one-bit error, detect any two-bit error) rather than a CRC, which gives 
only a one in 64k chance of an undetected data error, but I find it 
interesting.




A Hamming code across the bytes of a sector is pretty darn pointless, 
since that's not a typical failure pattern.

I just thought it was perhaps one of those little-known facts that 
meaningful ECC could fit in 16 bits. I mentioned that I wouldn't go that 
way, mainly because it would be less effective at catching multibit errors. 
This was a "fun fact" for all those folks who missed Hamming codes in 
their education, because they are old tech.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: mismatch_cnt questions

2007-03-08 Thread H. Peter Anvin

Bill Davidsen wrote:


When last I looked at Hamming codes, and that would be 1989 or 1990, I 
believe I learned that the number of Hamming bits needed to cover N data 
bits was 1+log2(N), which for 512 bytes would be 1+12 and fits into a 
16-bit field nicely. I don't know that I would go that way (fix any 
one-bit error, detect any two-bit error) rather than a CRC, which gives 
only a one in 64k chance of an undetected data error, but I find it 
interesting.




A Hamming code across the bytes of a sector is pretty darn pointless, 
since that's not a typical failure pattern.


-hpa


Re: mismatch_cnt questions

2007-03-08 Thread Bill Davidsen

Martin K. Petersen wrote:

"hpa" == H Peter Anvin <[EMAIL PROTECTED]> writes:



  

What we really want is drives that store 520 byte sectors so that a
checksum can be passed all the way up and down through the stack
... or something like that.

  


hpa> A lot of SCSI disks have that option, but I believe it's not
hpa> arbitrary bytes.  In particular, the integrity check portion is
hpa> only 2 bytes, 16 bits.

It's important to distinguish between drives that support 520 byte
sectors and drives that include the Data Integrity Feature which also
uses 520 byte sectors.

Most regular SCSI drives can be formatted with 520 byte sectors and a
lot of disk arrays use the extra space to store an internal checksum.
The downside to 520 byte sectors is that it makes buffer management a
pain as 512 bytes of data is followed by 8 bytes of protection data.
That sucks when writing - say - a 4KB block because your scatterlist
becomes long and twisted having to interleave data and protection
data every sector.

The data integrity feature also uses 520 byte sectors.  The
difference is that the format of the 8 bytes is well defined.  And
that both initiator and target are capable of verifying the integrity
of an I/O.  It is correct that the CRC is only 16 bits.
  


When last I looked at Hamming codes, and that would be 1989 or 1990, I 
believe I learned that the number of Hamming bits needed to cover N data 
bits was 1+log2(N), which for 512 bytes would be 1+12 and fits into a 
16-bit field nicely. I don't know that I would go that way (fix any 
one-bit error, detect any two-bit error) rather than a CRC, which gives 
only a one in 64k chance of an undetected data error, but I find it 
interesting.


I also looked at fire codes, which at the time would still be a viable 
topic for a thesis. I remember nothing about how they worked whatsoever.

DIF is strictly between HBA and disk.  I'm lobbying HBA vendors to
expose it to the OS so we can use it.  I'm also lobbying to get them
to allow us to submit the data and the protection data in separate
scatterlists so we don't have to do the interleaving at the OS level.


hpa> One option, of course, would be to store, say, 16
hpa> sectors/pages/blocks in 17 physical sectors/pages/blocks, where
hpa> the last one is a packing of some sort of high-powered integrity
hpa> checks, e.g. SHA-256, or even an ECC block.  This would hurt
hpa> performance substantially, but it would be highly useful for very
hpa> high data integrity applications.

A while ago I tinkered with something like that.  I actually cheated
and stored the checking data in a different partition on the same
drive.  It was a pretty simple test using my DIF code (i.e. 8 bytes
per sector).

I wanted to see how badly the extra seeks would affect us.  The
results weren't too discouraging but I decided I liked the ZFS
approach better (having the checksum in the fs parent block which
you'll be reading anyway).

  



--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: mismatch_cnt questions

2007-03-08 Thread Martin K. Petersen
> "hpa" == H Peter Anvin <[EMAIL PROTECTED]> writes:

>> What we really want is drives that store 520 byte sectors so that a
>> checksum can be passed all the way up and down through the stack
>> ... or something like that.
>> 

hpa> A lot of SCSI disks have that option, but I believe it's not
hpa> arbitrary bytes.  In particular, the integrity check portion is
hpa> only 2 bytes, 16 bits.

It's important to distinguish between drives that support 520 byte
sectors and drives that include the Data Integrity Feature which also
uses 520 byte sectors.

Most regular SCSI drives can be formatted with 520 byte sectors and a
lot of disk arrays use the extra space to store an internal checksum.
The downside to 520 byte sectors is that it makes buffer management a
pain as 512 bytes of data is followed by 8 bytes of protection data.
That sucks when writing - say - a 4KB block because your scatterlist
becomes long and twisted having to interleave data and protection
data every sector.

The data integrity feature also uses 520 byte sectors.  The
difference is that the format of the 8 bytes is well defined.  And
that both initiator and target are capable of verifying the integrity
of an I/O.  It is correct that the CRC is only 16 bits.
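
For those who have not seen it, the well-defined format is roughly the
following (a sketch based on my reading of the T10 DIF material; the struct
and field names are mine, not from any driver):

    #include <stdint.h>

    /* One sector's protection information: the 8 bytes that follow the 512
     * data bytes on a 520-byte-sector DIF format (big-endian on the wire). */
    struct dif_tuple {
        uint16_t guard_tag;  /* CRC-16 over the 512 data bytes */
        uint16_t app_tag;    /* application tag, opaque to the target */
        uint32_t ref_tag;    /* reference tag, typically low 32 bits of the LBA */
    } __attribute__((packed));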

DIF is strictly between HBA and disk.  I'm lobbying HBA vendors to
expose it to the OS so we can use it.  I'm also lobbying to get them
to allow us to submit the data and the protection data in separate
scatterlists so we don't have to do the interleaving at the OS level.


hpa> One option, of course, would be to store, say, 16
hpa> sectors/pages/blocks in 17 physical sectors/pages/blocks, where
hpa> the last one is a packing of some sort of high-powered integrity
hpa> checks, e.g. SHA-256, or even an ECC block.  This would hurt
hpa> performance substantially, but it would be highly useful for very
hpa> high data integrity applications.

A while ago I tinkered with something like that.  I actually cheated
and stored the checking data in a different partition on the same
drive.  It was a pretty simple test using my DIF code (i.e. 8 bytes
per sector).

I wanted to see how badly the extra seeks would affect us.  The
results weren't too discouraging but I decided I liked the ZFS
approach better (having the checksum in the fs parent block which
you'll be reading anyway).

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: mismatch_cnt questions

2007-03-08 Thread H. Peter Anvin

I have just updated the paper at:

http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

... with this information (in slightly different notation and with a bit 
more detail.)


-hpa


Re: mismatch_cnt questions

2007-03-07 Thread H. Peter Anvin

H. Peter Anvin wrote:

Eyal Lebedinsky wrote:

Neil Brown wrote:
[trim Q re how resync fixes data]

For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
and writing it over all other copies.
For raid5 we assume the data is correct and update the parity.


Can raid6 identify the bad block (two parity blocks could allow this
if only one block has bad data in a stripe)? If so, does it?

This will surely mean more value for raid6 than just the two-disk-failure
protection.



No.  It's not mathematically possible.



Okay, I've thought about it, and I got it wrong the first time 
(off-the-cuff misapplication of the pigeonhole principle.)


It apparently *is* possible (for notation and algebra rules, see my paper):

Let's assume we know exactly one of the data (Dn) drives is corrupt 
(ignoring the case of P or Q corruption for now.)  That means instead of 
Dn we have a corrupt value, Xn.  Note that which data drive is corrupt 
(i.e. the index n) is not known.


We compute P' and Q' as the syndrome values over the corrupt set.  Then:

P+P' = Dn+Xn
Q+Q' = g^n Dn + g^n Xn        (where g = {02})

Q+Q' = g^n (Dn+Xn)

By assumption, Dn != Xn, so P+P' = Dn+Xn != {00}.
g^n is *never* {00}, so Q+Q' = g^n (Dn+Xn) != {00}.

(Q+Q')/(P+P') = [g^n (Dn+Xn)]/(Dn+Xn) = g^n

Since n is known to be in the range [0,255), we thus have:

n = log_g((Q+Q')/(P+P'))

... which is a well-defined relation.

For the case where either the P or the Q drives are corrupt (and the 
data drives are all good), this is easily detected by the fact that if P 
is the corrupt drive, Q+Q' = {00}; similarly, if Q is the corrupt drive, 
P+P' = {00}.  Obviously, if P+P' = Q+Q' = {00}, then as far as RAID-6 
can discover, there is no corruption in the drive set.


So, yes, RAID-6 *can* detect single drive corruption, and even tell you 
which drive it is, if you're willing to compute a full syndrome set (P', 
Q') on every read (as well as on every write).


Note: RAID-6 cannot correctly identify 2-drive corruption, unless of 
course the corruption is in different byte positions.  If the same byte 
position is corrupt on more than one drive, then the algorithm above will 
generally point you to a completely innocent drive.
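
For the curious, here is a rough sketch of that locator in C.  It is
illustrative only (not the md implementation): it works on a single byte
position of the stored vs. recomputed syndromes and uses the same GF(2^8)
generator polynomial, 0x11d, as the paper.

    #include <stdint.h>

    static uint8_t gf_exp[510], gf_log[256];

    /* Build log/exp tables for GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11d). */
    static void gf_init(void)
    {
        unsigned i, x = 1;

        for (i = 0; i < 255; i++) {
            gf_exp[i] = gf_exp[i + 255] = x;
            gf_log[x] = i;
            x <<= 1;
            if (x & 0x100)
                x ^= 0x11d;
        }
    }

    /*
     * p/q are the stored parity bytes, p2/q2 the recomputed ones for the
     * same byte position.  Returns the data-drive index n, or -1 if only P
     * disagrees (P drive bad), -2 if only Q disagrees (Q drive bad), -3 if
     * everything is consistent.
     */
    static int raid6_locate(uint8_t p, uint8_t p2, uint8_t q, uint8_t q2)
    {
        uint8_t dp = p ^ p2;   /* P+P' */
        uint8_t dq = q ^ q2;   /* Q+Q' */

        if (!dp && !dq)
            return -3;
        if (!dq)
            return -1;
        if (!dp)
            return -2;

        /* n = log_g((Q+Q')/(P+P')) */
        return (gf_log[dq] + 255 - gf_log[dp]) % 255;
    }

In practice you would run this over every byte of the stripe and insist
that all the corrupt positions point at the same n before believing the
answer.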


-hpa


Re: mismatch_cnt questions

2007-03-07 Thread H. Peter Anvin

Neil Brown wrote:

On Monday March 5, [EMAIL PROTECTED] wrote:

Neil Brown wrote:
[trim Q re how resync fixes data]

For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
and writing it over all other copies.
For raid5 we assume the data is correct and update the parity.

Can raid6 identify the bad block (two parity blocks could allow this
if only one block has bad data in a stripe)? If so, does it?


No, it doesn't.

I guess that maybe it could:
   Rebuild each block in turn based on the xor parity, and then test 
   if the Q-syndrome is satisfied.

but I doubt the gain would be worth the pain.

What we really want is drives that store 520 byte sectors so that a
checksum can be passed all the way up and down through the stack
... or something like that.



A lot of SCSI disks have that option, but I believe it's not arbitrary 
bytes.  In particular, the integrity check portion is only 2 bytes, 16 bits.


One option, of course, would be to store, say, 16 sectors/pages/blocks 
in 17 physical sectors/pages/blocks, where the last one is a packing of 
some sort of high-powered integrity checks, e.g. SHA-256, or even an ECC 
block.  This would hurt performance substantially, but it would be 
highly useful for very high data integrity applications.
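
The address arithmetic such a layout implies is trivial; a hypothetical
sketch, just to make the 17-for-16 overhead concrete:

    #include <stdint.h>

    /* Map a logical sector into a "16 data + 1 integrity" layout: each
     * group of 16 logical sectors occupies 17 physical sectors, with the
     * 17th holding the packed integrity checks for the whole group. */
    static uint64_t data_sector(uint64_t logical)
    {
        return (logical / 16) * 17 + (logical % 16);
    }

    static uint64_t integrity_sector(uint64_t logical)
    {
        return (logical / 16) * 17 + 16;
    }

Every access to a data sector drags the group's integrity sector along
with it, which is where the performance hit would come from.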


I will look at the mathematics of trying to do this with RAID-6, but I'm 
99% sure RAID-6 isn't sufficient to do it, even with syndrome set 
recomputation on every read.


-hpa


Re: mismatch_cnt questions

2007-03-07 Thread H. Peter Anvin

Eyal Lebedinsky wrote:

Neil Brown wrote:
[trim Q re how resync fixes data]

For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
and writing it over all other copies.
For raid5 we assume the data is correct and update the parity.


Can raid6 identify the bad block (two parity blocks could allow this
if only one block has bad data in a stripe)? If so, does it?

This will surely mean more value for raid6 than just the two-disk-failure
protection.



No.  It's not mathematically possible.

-hpa


Re: mismatch_cnt questions

2007-03-06 Thread Bill Davidsen

Neil Brown wrote:

On Monday March 5, [EMAIL PROTECTED] wrote:
  

Neil Brown wrote:
[trim Q re how resync fixes data]


For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
and writing it over all other copies.
For raid5 we assume the data is correct and update the parity.
  

Can raid6 identify the bad block (two parity blocks could allow this
if only one block has bad data in a stripe)? If so, does it?



No, it doesn't.

I guess that maybe it could:
   Rebuild each block in turn based on the xor parity, and then test 
   if the Q-syndrome is satisfied.

but I doubt the gain would be worth the pain.

What's the value of "I have a drive which returned bad data" vs. "I have 
a whole array and some part of it returned bad data?" What's the cost of 
doing that identification, since it need only be done when the data are 
inconsistent between the drives and give a parity or Q mismatch? It 
seems easy, given that you are going to read all the pertinent sectors 
into memory anyway.


If the drive can be identified the data can be rewritten with confidence.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: mismatch_cnt questions - how about raid10?

2007-03-06 Thread Justin Piszcz



On Tue, 6 Mar 2007, Peter Rabbitson wrote:


Neil Brown wrote:

On Tuesday March 6, [EMAIL PROTECTED] wrote:

Neil Brown wrote:

When we write to a raid1, the data is DMAed from memory out to each
device independently, so if the memory changes between the two (or
more) DMA operations, you will get inconsistency between the devices.

Does this apply to raid10 devices too? And in the case of LVM: if swap is 
on top of an LV which is part of a VG whose single PV is the raid array, 
will this happen as well? Or will the LVM layer take the data once and 
distribute exact copies of it to the PVs (in this case just the raid), 
effectively giving the raid array invariable data?


Yes, it applies to raid10 too.

I don't know the details of the inner workings of LVM, but I doubt it
will make a difference.  Copying the data in memory is just too costly
to do if it can be avoided.  With LVM and raid1/10 it can be avoided
with no significant cost.
With raid4/5/6, not copying into the cache can cause data corruption.
So we always copy.



I see. So basically for those of us who want to run swap on raid 1 or 10, and 
at the same time want to rely on mismatch_cnt for early problem detection, 
the only option is to create a separate md device just for the swap. Is this 
about right?




That is what I do.

/dev/md0 - swap
/dev/md1 - boot
/dev/md2 - root


Re: mismatch_cnt questions - how about raid10?

2007-03-06 Thread Peter Rabbitson

Neil Brown wrote:

On Tuesday March 6, [EMAIL PROTECTED] wrote:

Neil Brown wrote:

When we write to a raid1, the data is DMAed from memory out to each
device independently, so if the memory changes between the two (or
more) DMA operations, you will get inconsistency between the devices.

Does this apply to raid10 devices too? And in the case of LVM: if swap is 
on top of an LV which is part of a VG whose single PV is the raid array, 
will this happen as well? Or will the LVM layer take the data once and 
distribute exact copies of it to the PVs (in this case just the raid), 
effectively giving the raid array invariable data?


Yes, it applies to raid10 too.

I don't know the details of the inner workings of LVM, but I doubt it
will make a difference.  Copying the data in memory is just too costly
to do if it can be avoided.  With LVM and raid1/10 it can be avoided
with no significant cost.
With raid4/5/6, not copying into the cache can cause data corruption.
So we always copy.



I see. So basically for those of us who want to run swap on raid 1 or 
10, and at the same time want to rely on mismatch_cnt for early problem 
detection, the only option is to create a separate md device just for 
the swap. Is this about right?



Re: mismatch_cnt questions - how about raid10?

2007-03-06 Thread Neil Brown
On Tuesday March 6, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > When we write to a raid1, the data is DMAed from memory out to each
> > device independently, so if the memory changes between the two (or
> > more) DMA operations, you will get inconsistency between the devices.
> 
> Does this apply to raid10 devices too? And in the case of LVM: if swap is 
> on top of an LV which is part of a VG whose single PV is the raid array, 
> will this happen as well? Or will the LVM layer take the data once and 
> distribute exact copies of it to the PVs (in this case just the raid), 
> effectively giving the raid array invariable data?

Yes, it applies to raid10 too.

I don't know the details of the inner workings of LVM, but I doubt it
will make a difference.  Copying the data in memory is just too costly
to do if it can be avoided.  With LVM and raid1/10 it can be avoided
with no significant cost.
With raid4/5/6, not copying into the cache can cause data corruption.
So we always copy.

NeilBrown


Re: mismatch_cnt questions - how about raid10?

2007-03-06 Thread Peter Rabbitson

Neil Brown wrote:

When we write to a raid1, the data is DMAed from memory out to each
device independently, so if the memory changes between the two (or
more) DMA operations, you will get inconsistency between the devices.


Does this apply to raid10 devices too? And in the case of LVM: if swap is 
on top of an LV which is part of a VG whose single PV is the raid array, 
will this happen as well? Or will the LVM layer take the data once and 
distribute exact copies of it to the PVs (in this case just the raid), 
effectively giving the raid array invariable data?



Re: mismatch_cnt questions

2007-03-05 Thread Paul Davidson

Hi Neil,

I've been following this thread with interest and I have a few questions.

Neil Brown wrote:

On Monday March 5, [EMAIL PROTECTED] wrote:


Neil Brown wrote:



When a disk fails we know what to rewrite, but when we discover a mismatch
we do not have this knowledge. It may corrupt the good copy of a raid1.


If a block differs between the different drives in a raid1, then no
copy is 'good'.  It is possible that one copy is the one you think you
want, but you probably wouldn't know by looking at it.
The worst situation is to have inconsistent data. If you read and get
one value, then later read and get another value, that is really bad.

For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
and writing it over all other copies.
For raid5 we assume the data is correct and update the parity.


Wouldn't it be better to signal an error rather than potentially
corrupt data - or perhaps this already happens? Does the above only
refer to a 'repair' action?

I'm worrying here about silent data corruption that gets on to my
backup tapes. If an error was (is?) signaled by the raid system
during the backup and could be tracked to the file being copied at
the time, it would allow recovery of the data from a prior
backup. If raid remains silent, the corrupted data eventually
gets copied onto my entire backup rotation. Can you comment on this?

FWIW, my 600GB raid5 array shows mismatch_cnt of 24 when I 'check'
it - that machine has hung up on occasion.

Cheers,
Paul


Re: mismatch_cnt questions

2007-03-05 Thread Neil Brown
On Monday March 5, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> [trim Q re how resync fixes data]
> > For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
> > and writing it over all other copies.
> > For raid5 we assume the data is correct and update the parity.
> 
> Can raid6 identify the bad block (two parity blocks could allow this
> if only one block has bad data in a stripe)? If so, does it?

No, it doesn't.

I guess that maybe it could:
   Rebuild each block in turn based on the xor parity, and then test 
   if the Q-syndrome is satisfied.
but I doubt the gain would be worth the pain.
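
Just to make that concrete, a rough sketch of the brute-force check on a
single byte column of a stripe (hypothetical code, not what md does;
gf_mul is an ordinary GF(2^8) multiply with the RAID-6 polynomial):

    #include <stdint.h>

    /* Multiply in GF(2^8) modulo x^8+x^4+x^3+x^2+1 (the RAID-6 field). */
    static uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        uint8_t r = 0;

        while (b) {
            if (b & 1)
                r ^= a;
            b >>= 1;
            a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        }
        return r;
    }

    /*
     * d[0..n-1] holds one byte from each data disk, p and q the stored
     * parity bytes for that column.  For each disk i, substitute the value
     * implied by the xor parity and see whether Q is then satisfied.
     * Returns the index of the disk whose substitution works, or -1.
     */
    static int find_bad_by_rebuild(const uint8_t *d, int n, uint8_t p, uint8_t q)
    {
        int i, j;

        for (i = 0; i < n; i++) {
            uint8_t cand = p, qc = 0, g = 1;

            for (j = 0; j < n; j++)
                if (j != i)
                    cand ^= d[j];          /* rebuild d[i] from the xor parity */
            for (j = 0; j < n; j++) {
                qc ^= gf_mul(g, j == i ? cand : d[j]);
                g = gf_mul(g, 2);          /* advance to g^(j+1) */
            }
            if (qc == q)
                return i;                  /* Q checks out with this rebuild */
        }
        return -1;
    }

You would only run something like this once a check had already found P or
Q to mismatch, and only act on it if exactly one substitution satisfies Q.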

What we really want is drives that store 520 byte sectors so that a
checksum can be passed all the way up and down through the stack
... or something like that.

NeilBrown


Re: mismatch_cnt questions

2007-03-04 Thread Eyal Lebedinsky
Neil Brown wrote:
[trim Q re how resync fixes data]
> For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
> and writing it over all other copies.
> For raid5 we assume the data is correct and update the parity.

Can raid6 identify the bad block (two parity blocks could allow this
if only one block has bad data in a stripe)? If so, does it?

This will surely mean more value for raid6 than just the two-disk-failure
protection.

-- 
Eyal Lebedinsky ([EMAIL PROTECTED]) 
attach .zip as .dat


Re: mismatch_cnt questions

2007-03-04 Thread Neil Brown
On Monday March 5, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > On Sunday March 4, [EMAIL PROTECTED] wrote:
> >>I have a mismatch_cnt of 384 on a 2-way mirror.
> [trim]
> >>3) Is the "repair" sync action safe to use on the above kernel? Any
> >>other methods / additional steps for fixing this?
> > 
> > "repair" is safe, though it may not be effective.
> > "repair" for raid1 did not work until Jan 26th this year.
> > Before then it was identical in effect to 'check'.
> 
> How is "repair" safe but not effective? When it finds a mismatch, how does
> it know which part is correct and which should be fixed (which copy of
> raid1, or which block in raid5)?

It is not 'effective' in that before 26jan2007 it did not actually
copy the chosen data onto the other drives; i.e. a 'repair' had the
same effect as a 'check', which is 'safe'.

> 
> When a disk fails we know what to rewrite, but when we discover a mismatch
> we do not have this knowledge. It may corrupt the good copy of a raid1.

If a block differs between the different drives in a raid1, then no
copy is 'good'.  It is possible that one copy is the one you think you
want, but you probably wouldn't know by looking at it.
The worst situation is to have inconsistent data. If you read and get
one value, then later read and get another value, that is really bad.

For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
and writing it over all other copies.
For raid5 we assume the data is correct and update the parity.

You might be able to imagine a failure scenario where this produces
the 'wrong' result, but I'm confident that in the majority of cases it
is as good as any other option.

If we had something like ZFS which tracks checksums for all blocks,
and could somehow get that information usefully into the md level,
then maybe we could do something better.

I suspect that it would be very rare for raid5 to detect a mismatch
during a 'check', and raid1 would only see them when a write was
aborted, such as swap can do, and filesystems might do occasionally
(e.g. truncate a file that was recently written to).

NeilBrown


Re: mismatch_cnt questions

2007-03-04 Thread Neil Brown
On Sunday March 4, [EMAIL PROTECTED] wrote:
> Hey, that was quick ... thanks!
> 
> > > 1) Where does the mismatch come from? The box hasn't been down since the 
> > > creation of
> > >  the array.
> >
> > Do you have swap on the mirror at all?
> 
> As a matter of fact I do, /dev/md0_p2 is a swap partition.
> 
> > I recently discovered/realised that when 'swap' writes to a raid1 it can 
> > end up with different
> > data on the different devices.  This is perfectly acceptable as in that 
> > case the data will never
> > be read.
> 
> Interesting ... care to elaborate a little?

When we write to a raid1, the data is DMAed from memory out to each
device independently, so if the memory changes between the two (or
more) DMA operations, you will get inconsistency between the devices.

When the data being written is part of a file, the page will still be
dirty after the write 'completes' so another write will be issued
fairly soon (depending on various VM settings) and so the
inconsistency will only be visible for a short time, and you probably
won't notice.

If this happens when writing to swap - i.e. if the page is dirtied
while the write is happening - then the swap system will just forget
that that page was written out.  It is obviously still active, so some
other page will get swapped out instead.
There will never be any attempt to write out the 'correct' data to the
device as that doesn't really mean anything.

As more swap activity happens it is quite possible that the
inconsistent area of the array will be written again with consistent
data, but it is also quite possible that it won't be written for a
long time.  Long enough that a 'check' will find it.

In any of these cases there is no risk of data corruption as the
inconsistent area of the array will never be read from.

> 
> Would disabling swap, running mkswap again and rerunning check return
> 0 in this case?

Disable swap, write to the entire swap area
   dd if=/dev/zero of=/dev/md0_p2 bs=1M
then mkswap and rerun 'check' and it should return '0'.  It did for
me.

NeilBrown



Re: mismatch_cnt questions

2007-03-04 Thread Eyal Lebedinsky
Neil Brown wrote:
> On Sunday March 4, [EMAIL PROTECTED] wrote:
>>I have a mismatch_cnt of 384 on a 2-way mirror.
[trim]
>>3) Is the "repair" sync action safe to use on the above kernel? Any
>>other methods / additional steps for fixing this?
> 
> "repair" is safe, though it may not be effective.
> "repair" for raid1 did not work until Jan 26th this year.
> Before then it was identical in effect to 'check'.

How is "repair" safe but not effective? When it finds a mismatch, how does
it know which part is correct and which should be fixed (which copy of
raid1, or which block in raid5)?

When a disk fails we know what to rewrite, but when we discover a mismatch
we do not have this knowledge. It may corrupt the good copy of a raid1.

-- 
Eyal Lebedinsky ([EMAIL PROTECTED]) 


Re: mismatch_cnt questions

2007-03-04 Thread Christian Pernegger

Hey, that was quick ... thanks!


> 1) Where does the mismatch come from? The box hasn't been down since the
>    creation of the array.

Do you have swap on the mirror at all?


As a matter of fact I do, /dev/md0_p2 is a swap partition.


I recently discovered/realised that when 'swap' writes to a raid1 it can
end up with different data on the different devices.  This is perfectly
acceptable as in that case the data will never be read.


Interesting ... care to elaborate a little?

Would disabling swap, running mkswap again and rerunning check return
0 in this case?

Regards,

C.


Re: mismatch_cnt questions

2007-03-04 Thread Neil Brown
On Sunday March 4, [EMAIL PROTECTED] wrote:
> Hello,
> 
> these questions apparently got buried in another thread, so here goes again 
> ...
> 
> I have a mismatch_cnt of 384 on a 2-way mirror.
> The box runs 2.6.17.4 and can't really be rebooted or have its kernel
> updated easily
> 
> 1) Where does the mismatch come from?
>  The box hasn't been down since the creation of the array.

Do you have swap on the mirror at all? 
I recently discovered/realised that when 'swap' writes to a raid1 it
can end up with different data on the different devices.  This is
perfectly acceptable as in that case the data will never be read.

If you don't have swap, then I don't know what is happening.

> 
> 2) How much data is 384? Blocks? Chunks? Bytes?

The unit is 'sectors', but the granularity is about 64K, so '384'
(384 sectors x 512 bytes = 192K) means 3 different 64K sections of the
device showed an error.  One day I might reduce the granularity.

> 
> 3) Is the "repair" sync action safe to use on the above kernel? Any
> other methods / additional steps for fixing this?

"repair" is safe, though it may not be effective.
"repair" for raid1 did not work until Jan 26th this year.
Before then it was identical in effect to 'check'.

NeilBrown