raid5: I lost an XFS file system due to a minor IDE cable problem

2007-05-24 Thread Pallai Roland

Hi,

 I'm wondering why md raid5 accepts writes after 2 disks have failed. I have an
array built from 7 drives; the filesystem is XFS. Yesterday, an IDE cable failed
(my friend kicked it out of the box sitting on the floor :) and 2 disks were
kicked from the array, but my download (yafc) did not stop: it kept writing to
the file system for the whole night!
 Now I have replaced the cable and tried to reassemble the array (mdadm -f
--run); the event counter increased from 4908158 up to 4929612 on the failed
disks, but I cannot mount the file system and 'xfs_repair -n' shows a lot of
errors. This is explainable by the partially successful writes. Ext3 and JFS
have an "errors=" mount option to switch the filesystem read-only on any error,
but XFS doesn't: why? That is a good question in itself, but I also think the
md layer could save dumb filesystems like XFS if it denied writes after 2 disks
have failed, and I cannot see a good reason why it does not behave this way.

 Do you have a better idea of how I can avoid such filesystem corruption in the
future? No, I don't want to use ext3 on this box. :)


my mount error:
XFS: Log inconsistent (didn't find previous header)
XFS: failed to find log head
XFS: log mount/recovery failed: error 5
XFS: log mount failed


--
 d



Re: raid5: I lost an XFS file system due to a minor IDE cable problem

2007-05-24 Thread Justin Piszcz

Including XFS mailing list on this one.

On Thu, 24 May 2007, Pallai Roland wrote:



Hi,

I'm wondering why md raid5 accepts writes after 2 disks have failed. I have an
array built from 7 drives; the filesystem is XFS. Yesterday, an IDE cable failed
(my friend kicked it out of the box sitting on the floor :) and 2 disks were
kicked from the array, but my download (yafc) did not stop: it kept writing to
the file system for the whole night!
Now I have replaced the cable and tried to reassemble the array (mdadm -f
--run); the event counter increased from 4908158 up to 4929612 on the failed
disks, but I cannot mount the file system and 'xfs_repair -n' shows a lot of
errors. This is explainable by the partially successful writes. Ext3 and JFS
have an "errors=" mount option to switch the filesystem read-only on any error,
but XFS doesn't: why? That is a good question in itself, but I also think the
md layer could save dumb filesystems like XFS if it denied writes after 2 disks
have failed, and I cannot see a good reason why it does not behave this way.

Do you have a better idea of how I can avoid such filesystem corruption in the
future? No, I don't want to use ext3 on this box. :)


my mount error:
XFS: Log inconsistent (didn't find previous header)
XFS: failed to find log head
XFS: log mount/recovery failed: error 5
XFS: log mount failed


--
d





Re: Fwd: Identify SATA Disks

2007-05-24 Thread Colin McCabe

lewis shobbrook wrote:

Hi All,
I'm wondering if anyone has discovered any nice tricks to assist in
identifying HDD devices.
I have an 8-bay hotswap array with pretty lights, and I have been wondering
what others out there might be doing to determine which disk in an
array is which.

I've noted that device allocation can change with the generation of
new initrds and the installation of new kernels; e.g. /dev/sdc becomes
/dev/sda depending upon what order the modules load, etc.
I'm wondering if one could send a looped read/write task to a swap
partition or something to determine which device is which?



The device UUID in the RAID superblock doesn't change across a reboot.

If you run mdadm --examine on the disk, you should see something like
[EMAIL PROTECTED] root]# mdadm --examine /dev/sda
/dev/sda:
          Magic : a92b4efc
        Version : 01
    Feature Map : 0x0
     Array UUID : eab59421:6ddd9761:05e6ca46:d2342b03
           Name : 408088ETX1:single
  Creation Time : Mon Dec  4 21:25:55 2006
     Raid Level : raid1
   Raid Devices : 2

    Device Size : 117210096 (55.89 GiB 60.01 GB)
     Array Size : 117187500 (55.88 GiB 60.00 GB)
      Used Size : 117187500 (55.88 GiB 60.00 GB)
   Super Offset : 117210224 sectors
          State : active
==> Device UUID : 06795ada:d2fb18a1:2e7de09f:af66a7e3 <==

    Update Time : Thu May 24 11:49:21 2007
       Checksum : 7a695b9f - correct
         Events : 4346975

That number should always uniquely identify your disks.

Maybe an even better way is to run:
[EMAIL PROTECTED] root]# smartctl -d ata /dev/sda -i
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     FUJITSU MHV2060BH
> Serial Number:  NW02T6826LM5 <
Firmware Version: 0028
User Capacity:    60,011,642,880 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Thu May 24 11:51:16 2007 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

That serial number never changes, even if you wipe the disk.
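
If you want to map every bay in one pass, a small loop like this works (the
device range below is just an example, adjust it to your system):

for d in /dev/sd[a-h]; do
    echo "== $d"
    smartctl -d ata -i $d | grep -i 'serial number'
done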

Colin


Cheers,

Lew




Re: Fwd: Identify SATA Disks

2007-05-24 Thread Colin McCabe

lewis shobbrook wrote:
Also I've not had much joy in attempting to "hotswap" SATA on a live 
system.

Can anyone attest to successful hotswap (or rule it out as something that
doesn't work) using standard on-board SATA controllers, as opposed to a
dedicated RAID card, or suggest further reading?
I've spent considerable time here and there in recent years trying to
find some decent info on this...


Hotswap works fine for me. My disks are both FUJITSU MHV2060BH, and the
Serial ATA controller is recognized as an ICH6.

Make sure that your BIOS settings are in serial ATA mode, not legacy 
emulation mode. Also, you should be using the libata driver if you want 
hotswap.
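
If you are not sure which driver actually claimed your controller, something
like this will usually tell you (the module names to look for depend on the
chipset, these are just common ones):

lsmod | egrep 'ahci|ata_piix|sata_'
dmesg | egrep -i 'ahci|libata'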


Colin



Re: Fwd: Identify SATA Disks

2007-05-24 Thread Tomasz Chmielewski

lewis shobbrook wrote:

Hi All,
I'm wondering if anyone has discovered any nice tricks to assist in
identifying HDD devices.
I have an 8-bay hotswap array with pretty lights, and I have been wondering
what others out there might be doing to determine which disk in an
array is which.


Assuming you can see the HDD LEDs blinking, something like:

dd if=/dev/sdb of=/dev/null


should help you identify the disk :)


--
Tomasz Chmielewski
http://wpkg.org


Re: Fwd: Identify SATA Disks

2007-05-24 Thread Gabor Gombas
On Thu, May 24, 2007 at 09:29:04AM +1000, lewis shobbrook wrote:

> I've noted that device allocation can change with the generation of
> new initrds and the installation of new kernels; e.g. /dev/sdc becomes
> /dev/sda depending upon what order the modules load, etc.
> I'm wondering if one could send a looped read/write task to a swap
> partition or something to determine which device is which?

If you're using a relatively modern distro with udev then you can use
paths under /dev/disk/by-{id,path}. Unless you're using a RAID card that
hides the disk IDs...
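
For example (the exact link names depend on your drives; this is just what it
might look like for the Fujitsu disk mentioned earlier in the thread):

$ ls -l /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 May 24 11:51 ata-FUJITSU_MHV2060BH_NW02T6826LM5 -> ../../sda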

> Also I've not had much joy in attempting to "hotswap" SATA on a live
> system.
> Can anyone attest to successful hotswap (or rule it out as something that
> doesn't work) using standard on-board SATA controllers, as opposed to a
> dedicated RAID card, or suggest further reading?

Make sure you have a chipset that supports hotplug (some older ones do
not). Make sure its driver supports hotplug. Make sure you stop using
the disk before pulling it out (umount, swapoff, mdadm --remove,
pvremove, whatever applies). Power down the disk before pulling it out
if your backplane/enclosure does not do that for you. Then it should work.
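
As a rough sketch (the array name, device names and partition layout here are
all made up, substitute your own), the remove-and-replace sequence could be:

umount /mnt/data              # if a filesystem on the disk is mounted
swapoff /dev/sdc2             # if a partition on it is used as swap
mdadm /dev/md0 --fail /dev/sdc1
mdadm /dev/md0 --remove /dev/sdc1
# power the disk down if the enclosure doesn't, then pull it and insert the new one
mdadm /dev/md0 --add /dev/sdc1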

If the chipset does not support sending an interrupt on hotswap, or if the
driver does not implement hotswap signalling, you may need an explicit
"scsiadd -r" before yanking out the old drive and "scsiadd -s" after
inserting the new one.

Also remember that this area is rather new and still evolving, so be
sure to try the latest kernel if you encounter problems.

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: Raid-5 long write wait while reading

2007-05-24 Thread Thomas Jager

Holger Kiehl wrote:

Hello

On Tue, 22 May 2007, Thomas Jager wrote:


Hi list.

I run a file server on MD raid-5.
If a client reads one big file and at the same time another client
tries to write a file, the writing thread just sits in
uninterruptible sleep until the reader has finished. Only a very small
amount of writes gets through while the reader is still working.


I assume from the vmstat numbers the reader does a lot of seeks 
(iowait > 80%!).

I don't think so, unless the file is really fragmented, and I doubt that it is.



I'm having some trouble pinpointing the problem.
It's not consistent either; sometimes it works as expected and both the
reader and the writer get some transactions through. On huge reads I've
seen the writer blocked for 30-40 minutes without any significant writes
happening (maybe a few megabytes, out of several gigabytes waiting). It
happens with NFS, SMB and FTP, and locally with dd, and it seems to be
connected to raid-5: it does not happen on block devices without raid-5.
I'm also wondering if it could have anything to do with loop-AES? I use
loop-AES on top of the md device, but then again I have not observed this
problem on loop devices with a plain disk backend. I do know that
loop-AES degrades performance, but I didn't think it would do something
like this.



What IO scheduler are you using? Maybe try a different scheduler
(e.g. deadline) to see if that makes any difference.

I was using deadline. I tried switching to CFQ but I'm still seeing the
same strange problems.
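
For reference, this is how I check and switch the scheduler for one member
disk via sysfs (shown for one example device, repeat for each member):

cat /sys/block/sda/queue/scheduler
echo cfq > /sys/block/sda/queue/scheduler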




Re: raid5: I lost an XFS file system due to a minor IDE cable problem

2007-05-24 Thread David Chinner
On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote:
> Including XFS mailing list on this one.

Thanks Justin.

> On Thu, 24 May 2007, Pallai Roland wrote:
> 
> >
> >Hi,
> >
> >I'm wondering why md raid5 accepts writes after 2 disks have failed. I have
> >an array built from 7 drives; the filesystem is XFS. Yesterday, an IDE cable
> >failed (my friend kicked it out of the box sitting on the floor :) and 2
> >disks were kicked from the array, but my download (yafc) did not stop: it
> >kept writing to the file system for the whole night!
> >Now I have replaced the cable and tried to reassemble the array (mdadm -f
> >--run); the event counter increased from 4908158 up to 4929612 on the failed
> >disks, but I cannot mount the file system and 'xfs_repair -n' shows a lot of
> >errors. This is explainable by the partially successful writes. Ext3 and JFS
> >have an "errors=" mount option to switch the filesystem read-only on any
> >error, but XFS doesn't: why?

"-o ro,norecovery" will allow you to mount the filesystem and get any
uncorrupted data off it.

You still may get shutdowns if you trip across corrupted metadata in
the filesystem, though.
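
For example, something along these lines (the md device name and mount point
are only illustrative):

mount -t xfs -o ro,norecovery /dev/md0 /mnt/recover
cp -a /mnt/recover/important-stuff /some/other/disk/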

> >That is a good question in itself, but I also think the md layer could save
> >dumb filesystems like XFS if it denied writes after 2 disks have failed, and
> >I cannot see a good reason why it does not behave this way.

How is *any* filesystem supposed to know that the underlying block
device has gone bad if it is not returning errors?

I did mention this exact scenario at the filesystems workshop back
in February - we'd *really* like to know if a RAID block device has gone
into degraded mode (i.e. lost a disk) so we can throttle new writes
until the rebuild has been completed. Stopping writes completely on a
fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
would also be possible if only we could get the information out
of the block layer.

> >Do you have a better idea of how I can avoid such filesystem corruption in
> >the future? No, I don't want to use ext3 on this box. :)

Well, the problem is a bug in MD - it should have detected
drives going away and stopped access to the device until it was
repaired. You would have had the same problem with ext3, or JFS,
or reiser or any other filesystem, too.

> >my mount error:
> >XFS: Log inconsistent (didn't find previous header)
> >XFS: failed to find log head
> >XFS: log mount/recovery failed: error 5
> >XFS: log mount failed

Your MD device is still hosed - error 5 = EIO; the md device is
reporting errors back to the filesystem now. You need to fix that
before trying to recover any data...
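
For example, a first sanity check might look something like this (the md
device and member names are only illustrative):

cat /proc/mdstat
mdadm --examine /dev/hd[aceg]1        # compare event counters across members
mdadm --assemble --force /dev/md0 /dev/hd[aceg]1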

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: raid5: I lost an XFS file system due to a minor IDE cable problem

2007-05-24 Thread Pallai Roland

On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote:
> > On Thu, 24 May 2007, Pallai Roland wrote:
> > >I'm wondering why md raid5 accepts writes after 2 disks have failed. I
> > >have an array built from 7 drives; the filesystem is XFS. Yesterday, an
> > >IDE cable failed (my friend kicked it out of the box sitting on the
> > >floor :) and 2 disks were kicked from the array, but my download (yafc)
> > >did not stop: it kept writing to the file system for the whole night!
> > >Now I have replaced the cable and tried to reassemble the array (mdadm -f
> > >--run); the event counter increased from 4908158 up to 4929612 on the
> > >failed disks, but I cannot mount the file system and 'xfs_repair -n'
> > >shows a lot of errors. This is explainable by the partially successful
> > >writes. Ext3 and JFS have an "errors=" mount option to switch the
> > >filesystem read-only on any error, but XFS doesn't: why?
> 
> "-o ro,norecovery" will allow you to mount the filesystem and get any
> uncorrupted data off it.
> 
> You still may get shutdowns if you trip across corrupted metadata in
> the filesystem, though.
 Thanks, I'll try it.

> > >That is a good question in itself, but I also think the md layer could
> > >save dumb filesystems like XFS if it denied writes after 2 disks have
> > >failed, and I cannot see a good reason why it does not behave this way.
> 
> How is *any* filesystem supposed to know that the underlying block
> device has gone bad if it is not returning errors?
 It is returning errors, I think. If I try to write to a raid5 with 2
failed disks using dd, I get errors on the missing chunks.
 The difference between ext3 and XFS is that ext3 will remount read-only
on the first write error but XFS won't; XFS only fails the current
operation, IMHO. The ext3 method isn't perfect, but in practice it works
well.
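
For reference, the ext3 behaviour I mean is the one you get with the
"errors=remount-ro" option, e.g. an /etc/fstab line like this (the device
and mount point are just examples):

/dev/md0  /data  ext3  defaults,errors=remount-ro  0  2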

> I did mention this exact scenario at the filesystems workshop back
> in February - we'd *really* like to know if a RAID block device has gone
> into degraded mode (i.e. lost a disk) so we can throttle new writes
> until the rebuild has been completed. Stopping writes completely on a
> fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
> would also be possible if only we could get the information out
> of the block layer.
 It would be nice, but as I mentioned above, ext3 does this well in
practice today.

> > >Do you have a better idea of how I can avoid such filesystem corruption
> > >in the future? No, I don't want to use ext3 on this box. :)
> 
> Well, the problem is a bug in MD - it should have detected
> drives going away and stopped access to the device until it was
> repaired. You would have had the same problem with ext3, or JFS,
> or reiser or any other filesystem, too.
> 
> > >my mount error:
> > >XFS: Log inconsistent (didn't find previous header)
> > >XFS: failed to find log head
> > >XFS: log mount/recovery failed: error 5
> > >XFS: log mount failed
> 
> Your MD device is still hosed - error 5 = EIO; the md device is
> reporting errors back to the filesystem now. You need to fix that
> before trying to recover any data...
 I'll play with it tomorrow, thanks for your help.


--
 d




raid1 check/repair read error recovery in 2.6.20

2007-05-24 Thread Mike Accetta
I believe I've come across a bug in the disk read error recovery logic
for raid1 check/repair operations in 2.6.20.  The raid1.c file looks
identical in 2.6.21 so the problem should still exist there as well.

This all surfaced when using a variant of CONFIG_FAIL_MAKE_REQUEST to
inject read errors on one of the mirrors of a raid1 array.  I noticed
that while this would ultimately fail the array, it would always seem
to generate

ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 
ata1.00: WARNING: zero len r/w req 

diagnostics at the same time (no clue why there are six of them).
Delving into this further I eventually settled on sync_request_write()
in raid1.c as a likely culprit and added the WARN_ON (below)

@@ -1386,6 +1393,7 @@
         atomic_inc(&r1_bio->remaining);
         md_sync_acct(conf->mirrors[i].rdev->bdev, wbio->bi_size >> 9);
 
+        WARN_ON(wbio->bi_size == 0);
         generic_make_request(wbio);
     }


to confirm that this code was indeed sending a zero size bio down to
the device layer in this circumstance.

Looking at the preceding code in sync_request_write() it appears that
the loop comparing the results of all reads just skips a mirror where
the read failed (BIO_UPTODATE is clear) without doing any of the sbio
prep or the memcpy() from the pbio.  There is other read/re-write logic
in the following if-clause but this seems to apply only if none of the
mirrors were readable.  Regardless, the fact that a zero length bio is
being issued in the "schedule writes" section is compelling evidence
that something is wrong somewhere.

I tried the following patch to raid1.c, which short-circuits the data
comparison in the read error case but otherwise does the rest of the
sbio prep for the mirror with the error.  It seems to have eliminated
the ATA warning at least.  Is it a correct thing to do?

@@ -1235,17 +1242,20 @@
     }
     r1_bio->read_disk = primary;
     for (i=0; i<mddev->raid_disks; i++)
-        if (r1_bio->bios[i]->bi_end_io == end_sync_read &&
-            test_bit(BIO_UPTODATE, &r1_bio->bios[i]->bi_flags)) {
+        if (r1_bio->bios[i]->bi_end_io == end_sync_read) {
             int j;
             int vcnt = r1_bio->sectors >> (PAGE_SHIFT-9);
             struct bio *pbio = r1_bio->bios[primary];
             struct bio *sbio = r1_bio->bios[i];
-            for (j = vcnt; j-- ; )
-                if (memcmp(page_address(pbio->bi_io_vec[j].bv_page),
-                           page_address(sbio->bi_io_vec[j].bv_page),
-                           PAGE_SIZE))
-                    break;
+            if (test_bit(BIO_UPTODATE, &sbio->bi_flags)) {
+                for (j = vcnt; j-- ; )
+                    if (memcmp(page_address(pbio->bi_io_vec[j].bv_page),
+                               page_address(sbio->bi_io_vec[j].bv_page),
+                               PAGE_SIZE))
+                        break;
+            } else {
+                j = 0;
+            }
             if (j >= 0)
                 mddev->resync_mismatches += r1_bio->sectors;
             if (j < 0 || test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
--
Mike Accetta

ECI Telecom Ltd.
Data Networking Division (previously Laurel Networks)


When does a disk get flagged as bad?

2007-05-24 Thread Alberto Alonso
OK, let's see if I can understand how a disk gets flagged
as bad and removed from an array. I was under the impression
that any read or write operation failure flags the drive as
bad and it gets removed automatically from the array.

However, as I indicated in a prior post, I am having problems
where the array is never degraded. Does an error of the type:
end_request: I/O error, dev sdb, sector 
not count as a read/write error?
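
For reference, this is how I check whether md actually considers the array
degraded (the array name is just an example):

cat /proc/mdstat
mdadm --detail /dev/md0 | egrep 'State|Failed'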

Thanks,

Alberto

-- 
Alberto AlonsoGlobal Gate Systems LLC.
(512) 351-7233http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups



Re: raid5: I lost an XFS file system due to a minor IDE cable problem

2007-05-24 Thread David Chinner
On Fri, May 25, 2007 at 03:35:48AM +0200, Pallai Roland wrote:
> On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> > > >That is a good question in itself, but I also think the md layer could
> > > >save dumb filesystems like XFS if it denied writes after 2 disks have
> > > >failed, and I cannot see a good reason why it does not behave this way.
> > 
> > How is *any* filesystem supposed to know that the underlying block
> > device has gone bad if it is not returning errors?
>  It is returning errors, I think. If I try to write to a raid5 with 2
> failed disks using dd, I get errors on the missing chunks.

Oh, did you look at your logs and find that XFS had spammed them
about writes that were failing?

>  The difference between ext3 and XFS is that ext3 will remount read-only
> on the first write error but XFS won't; XFS only fails the current
> operation, IMHO. The ext3 method isn't perfect, but in practice it works
> well.

XFS will shut down the filesystem if metadata corruption would occur
due to a failed write. We don't immediately fail the filesystem on
data write errors because on large systems you can get *transient*
I/O errors (e.g. FC path failover), and so retrying failed data
writes is useful for preventing unnecessary shutdowns of the
filesystem.

Different design criteria, different solutions...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: raid5: I lost an XFS file system due to a minor IDE cable problem

2007-05-24 Thread Alberto Alonso

> >  The difference between ext3 and XFS is that ext3 will remount read-only
> > on the first write error but XFS won't; XFS only fails the current
> > operation, IMHO. The ext3 method isn't perfect, but in practice it works
> > well.
> 
> XFS will shut down the filesystem if metadata corruption would occur
> due to a failed write. We don't immediately fail the filesystem on
> data write errors because on large systems you can get *transient*
> I/O errors (e.g. FC path failover), and so retrying failed data
> writes is useful for preventing unnecessary shutdowns of the
> filesystem.
> 
> Different design criteria, different solutions...

I think his point was that going into a read-only mode causes a
less catastrophic situation (i.e. a web server can still serve
pages). I think that is a valid point: rather than shutting down
the file system completely, an automatic switch to whatever mode
causes the least disruption of service is always desirable.

Maybe the automatic failure mode could be something that is 
configurable via the mount options.

I personally have found the XFS file system to be great for
my needs (except for issues with NFS interaction, where the bug report
never got answered), but that doesn't mean it cannot be improved.

Just my 2 cents,

Alberto

> Cheers,
> 
> Dave.
-- 
Alberto AlonsoGlobal Gate Systems LLC.
(512) 351-7233http://www.ggsys.net
Hardware, consulting, sysadmin, monitoring and remote backups
