Checksums wrong on one disk of mirror

2006-11-07 Thread David
I recently installed a server with mirrored disks using software RAID.  
Everything was working fine for a few days until a normal reboot (not  
the first).  Now the machine will not boot because it appears the  
superblock is wrong on some of the RAID devices on the first disk.


The rough layout of the disks (sda and sdb):

 sdx1 (md0) - /
 sdx2 (md1) - /var
 sdx3 (md2) - /usr
 extended partition with swap
 sdx6 (md3) - /opt

The exact error is:

 "invalid superblock checksum on sda3
 sda3 has invalid sb, not importing!"

Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is  
not what would be expected for sda1,2,3 but is fine for sda6. All of  
the checksums on drive sdb are correct.


The state is "clean" for all partitions, working 2, active 2 and  
failed 0. The table for sdb1,2,3 shows that the first device has been  
removed and is no longer an active mirror.


What is the best way to proceed here? Can I somehow sync from the  
second disk, which appears to have the correct checksums? Is there an  
easy way to fix this that won't involve losing the data?


Thanks.


Re: Checksums wrong on one disk of mirror

2006-11-07 Thread David

Quoting Neil Brown <[EMAIL PROTECTED]>:

On Tuesday November 7, [EMAIL PROTECTED] wrote:

Booting into a live CD, mdadm -E /dev/sdaX shows that the checksum is
not what would be expected for sda1,2,3 but is fine for sda6. All of
the checksums on drive sdb are correct.


I'm surprised it doesn't boot then.  How are the arrays being
assembled? A more complete kernel log would help.


Neil,

Thanks for such a quick reply.  I will post the kernel logs if the  
below is not enough information.  The old dmesg should also still be  
on the partition.



The state is "clean" for all partitions, working 2, active 2 and
failed 0. The table for sdb1,2,3 shows that the first device has been
removed and is no longer an active mirror.

What is the best way to proceed here? Can I somehow sync from the
second disk, which appears to have the correct checksums? Is there an
easy way to fix this that won't involve losing the data?


While booted from the live CD you should be able to
  mdadm -AR /dev/md0 /dev/sdb1
  mdadm /dev/md0 --add /dev/sda1


Fantastic, this works well for two of the partitions.  However the  
third has a bad sector (as reported by smartmontools) on the disk with  
the "good" superblock.  The disk cannot read the sector, so the  
syncing fails and starts over at 15.7% each time.


Is it safe to mount that partition outside of the md, find the file,  
remove it so that the disk can remap that sector (it is shown as  
Currently_Pending in SMART right now) then resync the array?  I guess  
this will cause problems and break the mirror.  Or is the correct way  
to remove the "bad" superblock drive from the array, mount the md,  
remove the file then resync the array?


If it is possible to do either of the above, how do I stop the  
recovery?  It now starts automatically at live CD boot, repeating from  
15.7% over and over.  My knowledge of the tools is bad but I tried the  
following:


# mdadm /dev/md0 --remove /dev/sda1
and
# mdadm -f /dev/md0 --remove /dev/sda1 (no idea if the -f even makes  
sense there)
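
(Not part of the original mail, but for anyone following along: md will
normally refuse to --remove a member that it still considers active, so the
usual sequence is to throttle the resync, mark the member faulty, and only
then remove it.  A rough sketch, reusing the md0/sda1 names from above:

# echo 1000 > /proc/sys/dev/raid/speed_limit_max  (throttle the resync; value is in KB/s)
# mdadm /dev/md0 --fail /dev/sda1                 (mark the member faulty)
# mdadm /dev/md0 --remove /dev/sda1               (now the remove is allowed)

speed_limit_max can be raised again once the array is sorted out.)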



It is very odd that the checksums are all wrong though.  Kernel
version? mdadm version? hardware architecture?


Kernel installed from Ubuntu 6.06 sources, 2.6.15.  The machine is an x86  
Dell with two identical Maxtor DiamondMax drives on an Intel 82801  
SATA controller.


mdadm is version 1.12.  Looking at the most recently available version  
this seems incredibly out of date, but seems to be the default  
installed in Ubuntu.  Even Debian stable seems to have 1.9.  I can file a  
bug with them for an update if necessary.


Is it possible that a broken init script has tried to fsck an  
individual drive instead of the md?  /etc/fstab only uses /dev/md*  
references but I'll check other scripts when (if? :) I get the system  
back up and running.


Whilst the machine is not critical and is only a new install, I'd like  
to keep fighting rather than give in if possible.


Thanks,

David


Re: Checksums wrong on one disk of mirror

2006-11-08 Thread David

Quoting David <[EMAIL PROTECTED]>:

Or is the correct way
to remove the "bad" superblock drive from the array, mount the md,
remove the file then resync the array?


Common sense says this is correct.


If it is possible to do either of the above, how do I stop the
recovery?  It now starts automatically at live CD boot, repeating from
15.7% over and over.  My knowledge of the tools is bad but I tried the
following:

# mdadm /dev/md0 --remove /dev/sda1
and
# mdadm -f /dev/md0 --remove /dev/sda1 (no idea if the -f even makes
sense there)


Looking at http://smartmontools.sourceforge.net/BadBlockHowTo.txt I  
tried to figure out what file was in the bad blocks but it turned out  
there wasn't one, it was just unused space.


My fix, for completeness, was this:

Force failure of the corrupt half of the mirror, using
# mdadm --manage /dev/md0 --fail /dev/sda

Mount the other one and fill free space with zeros
# mount /dev/md0 /mnt/test
# dd if=/dev/zero of=/mnt/test/bigfile
# sync

smartctl now showed that the pending sector had been reallocated, so I  
removed the bigfile and hot added the other drive

# mdadm --manage /dev/md0 --add /dev/sda

The recovery went fine this time and both partitions were shown as  
correct and active.  I had to fsck another md before it would boot  
correctly but the machine is now back up and working correctly.


Thanks for your help previously, it helped me along the right lines to  
start fixing this one.


David


Swap initialised as an md?

2006-11-10 Thread David

I have two devices mirrored which are partitioned like this:

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *          63    30716279    15358108+  fd  Linux raid autodetect
/dev/sda2        30716280    71682029    20482875   fd  Linux raid autodetect
/dev/sda3        71682030   112647779    20482875   fd  Linux raid autodetect
/dev/sda4       112647780   156248189    21800205    5  Extended
/dev/sda5       112647843   122881184     5116671   82  Linux swap / Solaris
/dev/sda6       122881248   156248189    16683471   fd  Linux raid autodetect

My aim was to have the two swap partitions both mounted, with no RAID (as I  
didn't see any benefit to that, but if I'm wrong then I'd appreciate  
being told!).  However, sda5 is recognised as an md anyway at boot, so  
swapon does not work correctly.  When I initialise the partitions with  
mkswap, the RAID array gets confused and refuses to boot until the  
superblocks are fixed.


At boot, the kernel says:

[17179589.184000] md: md3 stopped.
[17179589.184000] md: bind<sda5>
[17179589.188000] md: bind<sdb5>
[17179589.188000] raid1: raid set md3 active with 2 out of 2 mirrors

Then /proc/mdstat says:

md3 : active raid1 sda5[0] sdb5[1]
  5116544 blocks [2/2] [UU]

The following is present in /etc/mdadm/mdadm.conf; it was created by the  
installer and only lists four arrays.  In actual fact sdx6 is recognised  
as a fifth array, md4.


DEVICE partitions
ARRAY /dev/md3 level=raid1 num-devices=2 UUID=75575384:5fbe10ed:a5a46544:209740b3
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=5d133655:1d034197:c1c19528:56cc420a
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=2cda8230:b2fde7b4:97082351:880c918a
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=7f9abf32:c86071fd:3df4db9d:26ddd001


As /etc is on md0 I doubt this configuration file has anything to do  
with the kernel recognising and setting the arrays active.  However,  
is there any reason that the swap partitions (which have the correct  
partition type) are initialised as an md?  Can I stop it somehow, or is  
the correct method to have them as an md with the md initialised as  
swap?
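
(Not part of the original mail: the likely cause is a stale md superblock
left near the end of sdx5 from an earlier incarnation of an array - mkswap
only rewrites the start of the partition, so the superblock survives and the
partition keeps being assembled.  A rough sketch of the usual cleanup, using
the md3/sda5/sdb5 names from above, and assuming the swap is not in use:

# mdadm --stop /dev/md3
# mdadm --zero-superblock /dev/sda5
# mdadm --zero-superblock /dev/sdb5
# mkswap /dev/sda5 && mkswap /dev/sdb5
# swapon -a

Whether that is the right explanation for this particular install is a
guess from the outside.)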


Brief details are the same as my previous mails last week: 2.6.15,  
mdadm 1.12.0 (on md0, so I can't see that it is at fault).


Thanks,

David


Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread david

On Wed, 16 May 2007, Bill Davidsen wrote:


Jeff Zheng wrote:

 Here is the information of the created raid0. Hope it is enough.



If I read this correctly, the problem is with JFS rather than RAID?


he had the same problem with xfs.

David Lang



RE: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread david

On Thu, 17 May 2007, Neil Brown wrote:


On Thursday May 17, [EMAIL PROTECTED] wrote:



The only difference of any significance between the working
and non-working configurations is that in the non-working,
the component devices are larger than 2Gig, and hence have
sector offsets greater than 32 bits.


Do you mean 2T here?  But in both configurations, the component devices are
larger than 2T (2.25T & 5.5T).


Yes, I meant 2T, and yes, the components are always over 2T.


2T decimal or 2T binary?


So I'm
at a complete loss.  The raid0 code follows the same paths and does
the same things and uses 64bit arithmetic where needed.

So I have no idea how there could be a difference between these two
cases.

I'm at a loss...

NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-29 Thread david

On Wed, 30 May 2007, David Chinner wrote:


On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote:

David Chinner wrote:

The use of barriers in XFS assumes the commit write to be on stable
storage before it returns.  One of the ordering guarantees that we
need is that the transaction (commit write) is on disk before the
metadata block containing the change in the transaction is written
to disk and the current barrier behaviour gives us that.


Barrier != synchronous write,


Of course. FYI, XFS only issues barriers on *async* writes.

But barrier semantics - as far as they've been described by everyone
but you - indicate that the barrier write is guaranteed to be on stable
storage when it returns.


this doesn't match what I have seen

with barriers it's perfectly legal to have the following sequence of 
events


1. app writes block 10 to OS
2. app writes block 4 to OS
3. app writes barrier to OS
4. app writes block 5 to OS
5. app writes block 20 to OS
6. OS writes block 4 to disk drive
7. OS writes block 10 to disk drive
8. OS writes barrier to disk drive
9. OS writes block 5 to disk drive
10. OS writes block 20 to disk drive
11. disk drive writes block 10 to platter
12. disk drive writes block 4 to platter
13. disk drive writes block 20 to platter
14. disk drive writes block 5 to platter

there is nothing that says that when the app finishes step #3 the OS 
has even sent the data to the drive, let alone that the drive has flushed 
it to a platter


if the disk drive doesn't support barriers then step #8 becomes 'issue 
flush' and steps 11 and 12 take place before step #9, 13, 14


David Lang


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread david

On Wed, 30 May 2007, David Chinner wrote:


On Tue, May 29, 2007 at 05:01:24PM -0700, [EMAIL PROTECTED] wrote:

On Wed, 30 May 2007, David Chinner wrote:


On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote:

David Chinner wrote:

The use of barriers in XFS assumes the commit write to be on stable
storage before it returns.  One of the ordering guarantees that we
need is that the transaction (commit write) is on disk before the
metadata block containing the change in the transaction is written
to disk and the current barrier behaviour gives us that.


Barrier != synchronous write,


Of course. FYI, XFS only issues barriers on *async* writes.

But barrier semantics - as far as they've been described by everyone
but you - indicate that the barrier write is guaranteed to be on stable
storage when it returns.


this doesn't match what I have seen

with barriers it's perfectly legal to have the following sequence of
events

1. app writes block 10 to OS
2. app writes block 4 to OS
3. app writes barrier to OS
4. app writes block 5 to OS
5. app writes block 20 to OS


hm - applications can't issue barriers to the filesystem.
However, if you consider the barrier to be an "fsync()" for example,
then it's still the filesystem that is issuing the barrier and
there's a block that needs to be written that is associated with
that barrier (either an inode or a transaction commit) that needs to
be on stable storage before the filesystem returns to userspace.


6. OS writes block 4 to disk drive
7. OS writes block 10 to disk drive
8. OS writes barrier to disk drive
9. OS writes block 5 to disk drive
10. OS writes block 20 to disk drive


Replace OS with filesystem, and combine 7+8 together - we don't have
zero-length barriers and hence they are *always* associated with a
write to a certain block on disk. i.e.:

1. FS writes block 4 to disk drive
2. FS writes block 10 to disk drive
3. FS writes *barrier* block X to disk drive
4. FS writes block 5 to disk drive
5. FS writes block 20 to disk drive

The order that these are expected by the filesystem to hit stable
storage are:

1. block 4 and 10 on stable storage in any order
2. barrier block X on stable storage
3. block 5 and 20 on stable storage in any order

The point I'm trying to make is that in XFS,  block 5 and 20 cannot
be allowed to hit the disk before the barrier block because they
have strict order dependency on block X being stable before them,
just like block X has strict order dependency that block 4 and 10
must be stable before we start the barrier block write.


11. disk drive writes block 10 to platter
12. disk drive writes block 4 to platter
13. disk drive writes block 20 to platter
14. disk drive writes block 5 to platter



if the disk drive doesn't support barriers then step #8 becomes 'issue
flush' and steps 11 and 12 take place before step #9, 13, 14


No, you need a flush on either side of the block X write to maintain
the same semantics as barrier writes currently have.

We have filesystems that require barriers to prevent reordering of
writes in both directions and to ensure that the block associated
with the barrier is on stable storage when I/O completion is
signalled.  The existing barrier implementation (where it works)
provide these requirements. We need barriers to retain these
semantics, otherwise we'll still have to do special stuff in
the filesystems to get the semantics that we need.


one of us is misunderstanding barriers here.

you are understanding barriers to be the same as synchronous writes (and 
therefore the data is on persistent media before the call returns)


I am understanding barriers to only indicate ordering requirements. things 
before the barrier can be reordered freely, things after the barrier can 
be reordered freely, but things cannot be reordered across the barrier.


if I am understanding it correctly, the big win for barriers is that you 
do NOT have to stop and wait until the data is on persistent media before 
you can continue.


in the past barriers have not been fully implemented in most cases, and as 
a result they have been simulated by forcing a full flush of the buffers 
to persistent media before any other writes are allowed. This has made 
them _in practice_ operate the same way as synchronous writes (matching 
your understanding), but the current thread is talking about fixing the 
implementation to match the official semantics for all hardware that can 
actually support barriers (and fixing it at the OS level)


David Lang


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Thu, 31 May 2007, Jens Axboe wrote:


On Thu, May 31 2007, Phillip Susi wrote:

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order?
  They need to be two completely different flags which you can choose
to combine, or use individually.


If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a "real" barrier write.


true, but a "real" barrier write could have significant side effects on 
other writes that wouldn't happen with a synchronous write (a sync write 
can have other, unrelated writes re-ordered around it, a barrier write 
can't)


David Lang


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Fri, 1 Jun 2007, Tejun Heo wrote:


but one
thing we should bear in mind is that harddisks don't have humongous
caches or very smart controller / instruction set.  No matter how
relaxed interface the block layer provides, in the end, it just has to
issue whole-sale FLUSH CACHE on the device to guarantee data ordering on
the media.


if you are talking about individual drives you may be right for the moment 
(but 16M cache on drives is a _lot_ larger than people imagined would be 
there a few years ago)


but when you consider the self-contained disk arrays it's an entirely 
different story. you can easily have a few gig of cache and a complete OS 
pretending to be a single drive as far as you are concerned.


and the price of such devices is plummeting (in large part thanks to Linux 
moving into this space), you can now readily buy a 10TB array for $10k 
that looks like a single drive.


David Lang


limits on raid

2007-06-14 Thread david

what is the limit for the number of devices that can be in a single array?

I'm trying to build a 45x750G array and want to experiment with the 
different configurations. I'm starting with raid6, but mdadm is 
complaining about an invalid number of drives.


David Lang


Re: limits on raid

2007-06-14 Thread david

On Fri, 15 Jun 2007, Neil Brown wrote:


On Thursday June 14, [EMAIL PROTECTED] wrote:

what is the limit for the number of devices that can be in a single array?

I'm trying to build a 45x750G array and want to experiment with the
different configurations. I'm trying to start with raid6, but mdadm is
complaining about an invalid number of drives

David Lang


"man mdadm"  search for "limits".  (forgive typos).


thanks.

why does it still default to the old format after so many new versions? 
(by the way, the documentation said 28 devices, but I couldn't get it to 
accept more than 27)
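
(A side note, not from the thread: the low device ceiling is a property of
the old 0.90 superblock format that mdadm still defaults to; asking for a
version-1 superblock at create time lifts it.  A sketch, with the device
list left as a placeholder:

# mdadm --create /dev/md0 --metadata=1.2 --level=6 --raid-devices=45 <45 devices>

The exact metadata version worth using depends on the mdadm in question.)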


it's now churning away 'rebuilding' the brand new array.

a few questions/thoughts.

why does it need to do a rebuild when making a new array? couldn't it 
just zero all the drives instead? (or better still, just record most of the 
space as 'unused' and initialize it as it starts using it?)
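
(Again not from the thread, just for context: mdadm can record a brand new
array as already clean and skip the initial resync, with the caveat that on
raid5/6 the parity really is bogus until it has been written or a repair
pass has run, so this is mainly useful for throwaway benchmarking like the
experiments described here:

# mdadm --create /dev/md0 --level=6 --raid-devices=45 --assume-clean <45 devices>

The device list is again left as a placeholder.)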


while I consider zfs to be ~80% hype, one advantage it could have (but I 
don't know if it has) is that since the filesystem and raid are integrated 
into one layer they can optimize the case where files are being written 
onto unallocated space and instead of reading blocks from disk to 
calculate the parity they could just put zeros in the unallocated space, 
potentially speeding up the system by reducing the amount of disk I/O.


this wouldn't work if the filesystem is crowded, but a lot of large 
arrays are used for storing large files (i.e. sequential writes of large 
amounts of data) and it would seem that this could be a substantial win in 
these cases.


is there any way that linux would be able to do this sort of thing? or is 
it impossible due to the layering preventing the necessary knowledge from 
being in the right place?


David Lang


Re: limits on raid

2007-06-16 Thread david

On Sat, 16 Jun 2007, Neil Brown wrote:


It would be possible to have a 'this is not initialised' flag on the
array, and if that is not set, always do a reconstruct-write rather
than a read-modify-write.  But the first time you have an unclean
shutdown you are going to resync all the parity anyway (unless you
have a bitmap) so you may as well resync at the start.

And why is it such a big deal anyway?  The initial resync doesn't stop
you from using the array.  I guess if you wanted to put an array into
production instantly and couldn't afford any slowdown due to resync,
then you might want to skip the initial resync but is that really
likely?


in my case it takes 2+ days to resync the array before I can do any 
performance testing with it. for some reason it's only doing the rebuild 
at ~5M/sec (even though I've increased the min and max rebuild speeds and 
a dd to the array seems to be ~44M/sec, even during the rebuild)
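
(For reference, the knobs referred to here are presumably the md sysctls,
whose values are in KB/s:

# cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
# echo 50000 > /proc/sys/dev/raid/speed_limit_min   (allow ~50MB/s even under load)

raising speed_limit_min is usually what matters, since speed_limit_max only
caps the rate when the array is otherwise idle.)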


I want to test several configurations, from a 45 disk raid6 to a 45 disk 
raid0. at 2-3 days per test (or longer, depending on the tests) this 
becomes a very slow process.


also, when a rebuild is slow enough (and has enough of a performance 
impact) it's not uncommon to want to operate in degraded mode just long 
enough to get to a maintenance window and then recreate the array and 
reload from backup.


David Lang


Re: limits on raid

2007-06-16 Thread david

On Sat, 16 Jun 2007, David Greaves wrote:


[EMAIL PROTECTED] wrote:

 On Sat, 16 Jun 2007, Neil Brown wrote:

 I want to test several configurations, from a 45 disk raid6 to a 45 disk
 raid0. at 2-3 days per test (or longer, depending on the tests) this
 becomes a very slow process.
Are you suggesting the code that is written to enhance data integrity is 
optimised (or even touched) to support this kind of test scenario?

Seriously? :)


actually, if it can be done without a huge impact to the maintainability 
of the code I think it would be a good idea for the simple reason that I 
think the increased experimentation would result in people finding out 
what raid level is really appropriate for their needs.


there is a _lot_ of confusion around about what the performance 
implications of different raid levels are (especially when you consider 
things like raid 10/50/60 where you have two layers combined) and anything 
that encourages experimentation would be a good thing.



 also, when a rebuild is slow enough (and has enough of a performance
 impact) it's not uncommon to want to operate in degraded mode just long
 enought oget to a maintinance window and then recreate the array and
 reload from backup.


so would mdadm --remove the rebuilding disk help?


no. let me try again

drive fails monday morning

scenario 1

replace the failed drive, start the rebuild. system will be slow (degraded 
mode + rebuild) for the next three days.


scenario 2

leave it in degraded mode until monday night (accepting the speed penalty 
for degraded mode, but not the rebuild penalty)


monday night, shut down the system, put in the new drive, reinitialize the 
array, and reload the system from backup.


system is back to full speed tuesday morning.

scenario 2 isn't supported with md today, although it sounds as if the 
skip-rebuild idea could do this except for raid 5


on my test system the rebuild says it's running at 5M/s while a dd to a file 
on the array says it's doing 45M/s (even while the rebuild is running), so it 
seems to me that there may be value in this approach.


David Lang



Re: limits on raid

2007-06-17 Thread david

On Sun, 17 Jun 2007, Wakko Warner wrote:


you can also easily move an ext3 journal to an external journal with
tune2fs (see man page).


I only have 2 ext3 file systems (One of which is mounted R/O since it's
full), all my others are reiserfs (v3).

What benefit would I gain by using an external journel and how big would it
need to be?


if you have the journal on a drive by itself you end up doing (almost) 
sequential reads and writes to the journal and the disk head doesn't need 
to move much.


this can greatly increase your write speeds since

1. the journal gets written faster (completing the write as far as your 
software is concerned)


2. the heads don't need to seek back and forth from the journal to the 
final location that the data gets written.


as for how large it should be, it all depends on the volume of your 
writes; once the journal fills up, all writes stall until space is freed in 
the journal. IIRC ext3 is limited to 128M, and with today's drive sizes I 
don't see any reason to make it any smaller.
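
(A rough sketch of the tune2fs route mentioned above, with /dev/md1 as a
hypothetical journal device and /dev/md2 as the filesystem; the journal
device has to be prepared first and the filesystem must be unmounted:

# mke2fs -O journal_dev /dev/md1       (format the small device as an external journal)
# tune2fs -O ^has_journal /dev/md2     (drop the existing internal journal)
# tune2fs -J device=/dev/md1 /dev/md2  (attach the external journal)

The journal and filesystem block sizes need to match.)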


David Lang


Re: limits on raid

2007-06-17 Thread david

On Sun, 17 Jun 2007, dean gaudet wrote:


On Sun, 17 Jun 2007, Wakko Warner wrote:


What benefit would I gain by using an external journel and how big would it
need to be?


i don't know how big the journal needs to be... i'm limited by xfs'
maximum journal size of 128MiB.

i don't have much benchmark data -- but here are some rough notes i took
when i was evaluating a umem NVRAM card.  since the pata disks in the
raid1 have write caching enabled it's somewhat of an unfair comparison,
but the important info is the 88 seconds for internal journal vs. 81
seconds for external journal.


if you turn on disk write caching the difference will be much larger.


-dean

time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'


I know that sync will force everything to get as far as the journal, will 
it force the journal to be flushed?


David Lang



xfs journal   raid5 bitmap   times
internal      none           0.18s user 2.14s system 2% cpu 1:27.95 total
internal      internal       0.16s user 2.16s system 1% cpu 2:01.12 total
raid1         none           0.07s user 2.02s system 2% cpu 1:20.62 total
raid1         internal       0.14s user 2.01s system 1% cpu 1:55.18 total
raid1         raid1          0.14s user 2.03s system 2% cpu 1:20.61 total
umem          none           0.13s user 2.07s system 2% cpu 1:20.77 total
umem          internal       0.15s user 2.16s system 2% cpu 1:51.28 total
umem          umem           0.12s user 2.13s system 2% cpu 1:20.50 total


raid5:
- 4x seagate 7200.10 400GB on marvell MV88SX6081
- mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1

raid1:
- 2x maxtor 6Y200P0 on 3ware 7504
- two 128MiB partitions starting at cyl 1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 /dev/sd[fg]1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 /dev/sd[fg]2
- md1 is used for external xfs journal
- md2 has an ext3 filesystem for the external md4 bitmap

xfs:
- mkfs.xfs issued before each run using the defaults (aside from -l 
logdev=/dev/md1)
- mount -o noatime,nodiratime[,logdev=/dev/md1]

umem:
- 512MiB Micro Memory MM-5415CN
- 2 partitions similar to the raid1 setup




Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 in my case it takes 2+ days to resync the array before I can do any
 performance testing with it. for some reason it's only doing the rebuild
 at ~5M/sec (even though I've increased the min and max rebuild speeds and
 a dd to the array seems to be ~44M/sec, even during the rebuild)


With performance like that, it sounds like you're saturating a bus somewhere 
along the line.  If you're using scsi, for instance, it's very easy for a 
long chain of drives to overwhelm a channel.  You might also want to consider 
some other RAID layouts like 1+0 or 5+0 depending upon your space vs. 
reliability needs.


I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct how can I fire 
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the 
reconstruct to ~4M/sec?


I'm putting 10x as much data through the bus at that point, it would seem 
to prove that it's not the bus that's saturated.


David Lang


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 10:28:38AM -0700, [EMAIL PROTECTED] wrote:

I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct how can I fire
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
reconstruct to ~4M/sec?

I'm putting 10x as much data through the bus at that point, it would seem
to prove that it's not the bus that's saturated.


dd 45MB/s from the raid sounds reasonable.

If you have 45 drives, doing a resync of raid5 or raid6 should probably
involve reading all the disks, and writing new parity data to one drive.
So if you are writing 5MB/s, then you are reading 44*5MB/s from the
other drives, which is 220MB/s.  If your resync drops to 4MB/s when
doing dd, then you have 44*4MB/s which is 176MB/s or 44MB/s less read
capacity, which surprisingly seems to match the dd speed you are
getting.  Seems like you are indeed very much saturating a bus
somewhere.  The numbers certainly agree with that theory.

What kind of setup is the drives connected to?


simple ultra-wide SCSI to a single controller.

I didn't realize that the rate reported by /proc/mdstat was the write 
speed that was taking place; I thought it was the total data rate (reads 
+ writes). the next time this message gets changed it would be a good 
thing to clarify this.


David Lang


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 I plan to test the different configurations.

 however, if I was saturating the bus with the reconstruct how can I fire
 off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
 reconstruct to ~4M/sec?

 I'm putting 10x as much data through the bus at that point, it would seem
 to prove that it's not the bus that's saturated.


I am unconvinced.  If you take ~1MB/s for each active drive, add in SCSI 
overhead, 45M/sec seems reasonable.  Have you look at a running iostat while 
all this is going on?  Try it out- add up the kb/s from each drive and see 
how close you are to your maximum theoretical IO.


I didn't try iostat, I did look at vmstat, and there the numbers look even 
worse: the bo column is ~500 for the resync by itself, but with the dd 
it's ~50,000. when I get access to the box again I'll try iostat to get 
more details
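
(something along the lines of

# iostat -xk 5

summing the per-device kB/s against the channel's theoretical limit should
show how close to saturation it really is)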



Also, how's your CPU utilization?


~30% of one cpu for the raid 6 thread, ~5% of one cpu for the resync 
thread


David Lang


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 11:12:45AM -0700, [EMAIL PROTECTED] wrote:

simple ultra-wide SCSI to a single controller.


Hmm, isn't ultra-wide limited to 40MB/s?  Is it Ultra320 wide?  That
could do a lot more, and 220MB/s sounds plausible for 320 scsi.


yes, sorry, ultra 320 wide.


I didn't realize that the rate reported by /proc/mdstat was the write
speed that was taking place; I thought it was the total data rate (reads
+ writes). the next time this message gets changed it would be a good
thing to clarify this.


Well I suppose it could make sense to show the rate of rebuild, which you can
then compare against the total size of the raid, or you can have the rate of
write, which you then compare against the size of the drive being
synced.  Certainly I would expect much higher speeds if it was the
overall raid size, while the numbers seem pretty reasonable as a write
speed.  4MB/s would take forever if it was the overall raid resync
speed.  I usually see SATA raid1 resync at 50 to 60MB/s or so, which
matches the read and write speeds of the drives in the raid.


as I read it right now what happens is the worst of the options: you show 
the total size of the array for the amount of work that needs to be done, 
but then show only the write speed for the rate of progress being made 
through the job.


total rebuild time was estimated at ~3200 min

David Lang


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

 yes, sorry, ultra 320 wide.


Exactly how many channels and drives?


one channel, 2 OS drives plus the 45 drives in the array.

yes I realize that there will be bottlenecks with this, the large capacity 
is to handle longer history (it's going to be a 30TB circular buffer being 
fed by a pair of OC-12 links)


it appears that my big mistake was not understanding what /proc/mdstat is 
telling me.


David Lang


Re: limits on raid

2007-06-18 Thread david

On Mon, 18 Jun 2007, Wakko Warner wrote:


Subject: Re: limits on raid

[EMAIL PROTECTED] wrote:

On Mon, 18 Jun 2007, Brendan Conoboy wrote:


[EMAIL PROTECTED] wrote:

yes, sorry, ultra 320 wide.


Exactly how many channels and drives?


one channel, 2 OS drives plus the 45 drives in the array.


Given that the drives only have 4 ID bits, how can you have 47 drives on 1
cable?  You'd need a minimum of 3 channels for 47 drives.  Do you have some
sort of external box that holds X number of drives and only uses a single
ID?


yes, I'm using Promise drive shelves, I have them configured to export 
the 15 drives as 15 LUNs on a single ID.


I'm going to be using this as a huge circular buffer that will just be 
overwritten eventually 99% of the time, but once in a while I will need to 
go back into the buffer and extract and process the data.


David Lang


Re: limits on raid

2007-06-19 Thread david

On Tue, 19 Jun 2007, Phillip Susi wrote:


[EMAIL PROTECTED] wrote:

 one channel, 2 OS drives plus the 45 drives in the array.


Huh?  You can only have 16 devices on a scsi bus, counting the host adapter. 
And I don't think you can even manage that much reliably with the newer 
higher speed versions, at least not without some very special cables.


6 devices on the bus (2 OS drives, 3 Promise drive shelves, controller 
card)



 yes I realize that there will be bottlenecks with this, the large capacity
 is to handle longer history (it's going to be a 30TB circular buffer being
 fed by a pair of OC-12 links)


Building one of those nice packet sniffers for the NSA to install on AT&T's 
network eh? ;)


just for going back in time to track hacker actions at a bank.

I'm hoping that once I figure out the drives the rest of the software 
will basically boil down to tcpdump with the right options to write to a 
circular buffer of files.


David Lang


Re: limits on raid

2007-06-19 Thread david

On Tue, 19 Jun 2007, Lennart Sorensen wrote:


On Mon, Jun 18, 2007 at 02:56:10PM -0700, [EMAIL PROTECTED] wrote:

yes, I'm using Promise drive shelves, I have them configured to export
the 15 drives as 15 LUNs on a single ID.

I'm going to be using this as a huge circular buffer that will just be
overwritten eventually 99% of the time, but once in a while I will need to
go back into the buffer and extract and process the data.


I would guess that if you ran 15 drives per channel on 3 different
channels, you would resync in 1/3 the time.  Well unless you end up
saturating the PCI bus instead.

hardware raid of course has an advantage there in that it doesn't have
to go across the bus to do the work (although if you put 45 drives on
one scsi channel on hardware raid, it will still be limited).


I fully realize that the channel will be the bottleneck, I just didn't 
understand what /proc/mdstat was telling me. I thought that it was telling 
me that the resync was processing 5M/sec, not that it was writing 5M/sec 
on each of the two parity locations.


David Lang


Re: limits on raid

2007-06-20 Thread david

On Thu, 21 Jun 2007, David Chinner wrote:


On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:


I have that - apparently naive - idea that drives use strong checksum,
and will never return bad data, only good data or an error.  If this
isn't right, then it would really help to understand what the cause of
other failures are before working out how to handle them


The drive is not the only source of errors, though.  You could
have a path problem that is corrupting random bits between the drive
and the filesystem. So the data on the disk might be fine, and
reading it via a redundant path might be all that is needed.


one of the 'killer features' of zfs is that it does checksums of every 
file on disk, so many people don't consider the disk infallible.


several other filesystems also do checksums

both bitkeeper and git do checksums of files to detect disk corruption

as David C points out there are many points in the path where the data 
could get corrupted besides on the platter.


David Lang


Re: limits on raid

2007-06-21 Thread david

On Thu, 21 Jun 2007, Mattias Wadenstein wrote:


On Thu, 21 Jun 2007, Neil Brown wrote:


 I have that - apparently naive - idea that drives use strong checksum,
 and will never return bad data, only good data or an error.  If this
 isn't right, then it would really help to understand what the cause of
 other failures are before working out how to handle them


In theory, that's how storage should work. In practice, silent data 
corruption does happen. If not from the disks themselves, somewhere along the 
path of cables, controllers, drivers, buses, etc. If you add in fcal, you'll 
get even more sources of failure, but usually you can avoid SANs (if you care 
about your data).


heh, the pitch I get from the self-proclaimed experts is that if you care 
about your data you put it on the SAN (so you can take advantage of the 
more expensive disk arrays, various backup advantages, and replication 
features that tend to be focused on the SAN because it's a big target)


David Lang


Well, here is a couple of the issues that I've seen myself:

A hw-raid controller returning every 64th bit as 0, no matter what's on disk. 
With no error condition at all. (I've also heard from a colleague about this 
on every 64k, but not seen that myself.)


An fcal switch occasionally resetting, garbling the blocks in transit with 
random data. Lost a few TB of user data that way.


Add to this the random driver breakage that happens now and then. I've also 
had a few broken filesystems due to in-memory corruption due to bad ram, not 
sure there is much hope of fixing that though.


Also, this presentation is pretty worrying on the frequency of silent data 
corruption:


https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

/Mattias Wadenstein





Re: limits on raid

2007-06-22 Thread david

On Fri, 22 Jun 2007, David Greaves wrote:

That's not a bad thing - until you look at the complexity it brings - and 
then consider the impact and exceptions when you do, eg hardware 
acceleration? md information fed up to the fs layer for xfs? simple long term 
maintenance?


Often these problems are well worth the benefits of the feature.

I _wonder_ if this is one where the right thing is to "just say no" :)


In this case I think a higher-level system knowing what block sizes are 
efficient to do writes/reads in can potentially be a HUGE 
advantage.


if the upper levels know that you have a 6 disk raid 6 array with a 64K 
chunk size then reads and writes in 256k chunks (aligned) should be able 
to be done at basically the speed of a 4 disk raid 0 array.


what's even more impressive is that this could be done even if the array 
is degraded (if you know the drives have failed you don't even try to read 
from them and you only have to reconstruct the missing info once per 
stripe)


the current approach doesn't give the upper levels any chance to operate 
in this mode, they just don't have enough information to do so.


the part about wanting to know raid 0 chunk size so that the upper layers 
can be sure that data that's supposed to be redundant is on separate 
drives is also possible
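
(for what it's worth, the geometry is already exported; a hypothetical upper
layer, or an admin hand-tuning mkfs, can read it today with something like

# mdadm --detail /dev/md0              (reports level, devices and chunk size)
# cat /sys/block/md0/md/chunk_size     (on reasonably recent kernels)

what's missing is an agreed way for the layers above to consume that
information automatically)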


storage technology is headed in the direction of having the system do more 
and more of the layout decisions, and re-stripe the array as conditions 
change (similar to what md can already do with enlarging raid5/6 arrays), 
but unless you want to eventually put all that decision logic into the md 
layer you should make it possible for other layers to make queries to find 
out what's what, and then they can give directions for what they want to 
have happen.


so for several reasons I don't see this as something that's deserving of 
an automatic 'no'


David Lang


Re: limits on raid

2007-06-22 Thread david

On Fri, 22 Jun 2007, Bill Davidsen wrote:

By delaying parity computation until the first write to a stripe only the 
growth of a filesystem is slowed, and all data are protected without waiting 
for the lengthy check. The rebuild speed can be set very low, because 
on-demand rebuild will do most of the work.


 I'm very much for the fs layer reading the lower block structure so I
 don't have to fiddle with arcane tuning parameters - yes, *please* help
 make xfs self-tuning!

 Keeping life as straightforward as possible low down makes the upwards
 interface more manageable and that goal more realistic... 


Those two paragraphs are mutually exclusive. The fs can be simple because it 
rests on a simple device, even if the "simple device" is provided by LVM or 
md. And LVM and md can stay simple because they rest on simple devices, even 
if they are provided by PATA, SATA, nbd, etc. Independent layers make each 
layer more robust. If you want to compromise the layer separation, some 
approach like ZFS with full integration would seem to be promising. Note that 
layers allow specialized features at each point, trading integration for 
flexibility.


My feeling is that full integration and independent layers each have 
benefits, as you connect the layers to expose operational details you need to 
handle changes in those details, which would seem to make layers more 
complex. What I'm looking for here is better performance in one particular 
layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel 
that the current performance suggests room for improvement.


they both have benefits, but it shouldn't have to be either-or

if you build the separate layers and provide ways for the upper 
layers to query the lower layers to find out what's efficient, then you can 
have some upper layers that don't care about this and treat the lower 
layer as a simple block device, while other upper layers find out what 
sort of things are more efficient to do and use the same lower layer in a 
more complex manner


the alternative is to duplicate effort (and code) to have two codebases 
that try to do the same thing, one stand-alone, and one as a part of an 
integrated solution (and it gets even worse if there end up being multiple 
integrated solutions)


David Lang


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread david

On Sun, 12 Aug 2007, Jan Engelhardt wrote:


On Aug 12 2007 13:35, Al Boldi wrote:

Lars Ellenberg wrote:

meanwhile, please, anyone interessted,
the drbd paper for LinuxConf Eu 2007 is finalized.
http://www.drbd.org/fileadmin/drbd/publications/
drbd8.linux-conf.eu.2007.pdf

but it does give a good overview about what DRBD actually is,
what exact problems it tries to solve,
and what developments to expect in the near future.

so you can make up your mind about
 "Do we need it?", and
 "Why DRBD? Why not NBD + MD-RAID?"


I may have made a mistake when asking for how it compares to NBD+MD.
Let me retry: what's the functional difference between
GFS2 on a DRBD .vs. GFS2 on a DAS SAN?


GFS is a distributed filesystem, DRBD is a replicated block device. you 
wouldn't do GFS on top of DRBD, you would do ext2/3, XFS, etc.


DRBD is much closer to the NBD+MD option.

now, I am not an expert on either option, but there are a couple of things 
that I would question about the DRBD+MD option


1. when the remote machine is down, how does MD deal with it for reads and 
writes?


2. MD over local drive will alternate reads between mirrors (or so I've 
been told), doing so over the network is wrong.


3. when writing, will MD wait for the network I/O to get the data saved on 
the backup before returning from the syscall? or can it sync the data out 
lazily



Now, shared remote block access should theoretically be handled, as does
DRBD, by a block layer driver, but realistically it may be more appropriate
to let it be handled by the combining end user, like OCFS or GFS.


there are times when you want to replicate at the block layer, and there 
are times when you want to have a filesystem do the work. don't force a 
filesystem on use-cases where a block device is the right answer.


David Lang


Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-12 Thread david
per the message below MD (or DM) would need to be modified to work 
reasonably well with one of the disk components being over an unreliable 
link (like a network link)


are the MD/DM maintainers interested in extending their code in this 
direction? or would they prefer to keep it simpler by being able to 
continue to assume that the raid components are connected over a highly 
reliable connection?


if they are interested in adding (and maintaining) this functionality then 
there is a real possibility that NBD+MD/DM could eliminate the need for 
DRBD. however if they are not interested in adding all the code to deal 
with the network-type issues, then the argument that DRBD should not be 
merged because you can do the same thing with MD/DM + NBD is invalid and 
can be dropped/ignored


David Lang

On Sun, 12 Aug 2007, Paul Clements wrote:


Iustin Pop wrote:

 On Sun, Aug 12, 2007 at 07:03:44PM +0200, Jan Engelhardt wrote:
> On Aug 12 2007 09:39, [EMAIL PROTECTED] wrote:
> > now, I am not an expert on either option, but there are a couple of
> > things that I would question about the DRBD+MD option
> >
> > 1. when the remote machine is down, how does MD deal with it for reads
> > and writes?
> I suppose it kicks the drive and you'd have to re-add it by hand unless
> done by a cronjob.


Yes, and with a bitmap configured on the raid1, you just resync the blocks 
that have been written while the connection was down.




From my tests, since NBD doesn't have a timeout option, MD hangs in the
 write to that mirror indefinitely, somewhat like when dealing with a
 broken IDE driver/chipset/disk.


Well, if people would like to see a timeout option, I actually coded up a 
patch a couple of years ago to do just that, but I never got it into mainline 
because you can do almost as well by doing a check at user-level (I basically 
ping the nbd connection periodically and if it fails, I kill -9 the 
nbd-client).



> > 2. MD over local drive will alternate reads between mirrors (or so
> > I've been told), doing so over the network is wrong.
> Certainly. In which case you set "write_mostly" (or even write_only, not
> sure of its name) on the raid component that is nbd.
>
> > 3. when writing, will MD wait for the network I/O to get the data saved
> > on the backup before returning from the syscall? or can it sync the
> > data out lazily
> Can't answer this one - ask Neil :)

 MD has the write-mostly/write-behind options - which help in this case
 but only up to a certain amount.


You can configure write_behind (aka, asynchronous writes) to buffer as much 
data as you have RAM to hold. At a certain point, presumably, you'd want to 
just break the mirror and take the hit of doing a resync once your network 
leg falls too far behind.
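
(For completeness, a hypothetical raid1-over-nbd using those options might
be created along these lines, with the device names made up:

# mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal \
    /dev/sda1 --write-mostly --write-behind=256 /dev/nbd0

--write-mostly keeps normal reads off the network leg, and --write-behind
requires the bitmap.)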


--
Paul




Re: [RFD] Layering: Use-Case Composers (was: DRBD - what is it, anyways? [compare with e.g. NBD + MD raid])

2007-08-13 Thread david

On Mon, 13 Aug 2007, David Greaves wrote:


[EMAIL PROTECTED] wrote:

 per the message below MD (or DM) would need to be modified to work
 reasonably well with one of the disk components being over an unreliable
 link (like a network link)

 are the MD/DM maintainers interested in extending their code in this
 direction? or would they prefer to keep it simpler by being able to
 continue to assume that the raid components are connected over a highly
 reliable connection?

 if they are interested in adding (and maintaining) this functionality then
 there is a real possibility that NBD+MD/DM could eliminate the need for
 DRBD. however if they are not interested in adding all the code to deal
 with the network-type issues, then the argument that DRBD should not be
 merged because you can do the same thing with MD/DM + NBD is invalid and
 can be dropped/ignored

 David Lang


As a user I'd like to see md/nbd be extended to cope with unreliable links.
I think md could be better in handling link exceptions. My unreliable memory 
recalls sporadic issues with hot-plug leaving md hanging and certain lower 
level errors (or even very high latency) causing unsatisfactory behaviour in 
what is supposed to be a fault 'tolerant' subsystem.



Would this just be relevant to network devices or would it improve support 
for jostled usb and sata hot-plugging I wonder?


good question, I suspect that some of the error handling would be similar 
(for devices that are unreachable not hanging the system, for example), but 
a lot of the rest would be different (do you really want to try to 
auto-resync to a drive that you _think_ just reappeared? what if it's a 
different drive? how can you be sure?) the error rate of a network is going 
to be significantly higher than for USB or SATA drives (although I suppose 
iscsi would be similar)


David Lang


Re: Raid5 software problems after loosing 4 disks for 48 hours

2006-06-17 Thread David Greaves
Wilson Wilson wrote:
> Neil great stuff, its online now!!!
Congratulations :)
>
> I am still unsure how this raid5 volume was partially readable with 4
> disks missing.  My understanding is each file is written across all disks
> apart from one, which is used for CRC.  So if 2 disks are offline the
> whole thing should be unreadable.
I'll try :)

md doesn't operate at a file level, it operates on chunks. The chunk
could be 64Kb in size.

For raid5 each stripe is made of n-1 chunks. (raid6 would be n-2).
When a stripe is read, if your file is in one of the chunks that's still
there then you're in luck.

I guess md knows it's degraded and gives as much data back as possible.

This means that you have a certain probability of accessing a given file
depending on its size, the filesystem and the degree to which the array
is degraded.

FWIW I'd *never* try a r/w operation on such a degraded array.

Speculation:
I'm surprised you could mount such a 'sparse' array though. I wonder if
some filesystems (like xfs) would just barf as they mounted because they
have more distributed mount-time data structures and would spot the
missing chunks.  Others (ext3?) may just mount and try to read blocks on
demand.

David


Re: New FAQ entry? (was IBM xSeries stop responding during RAID1 reconstruction)

2006-06-21 Thread David Greaves
OK :)

David

Niccolo Rigacci wrote:
> Thanks to the several guys on this list, I have solved my problem 
> and written this up; could it be a new FAQ entry?
>
>
>
> Q: Sometimes when a RAID volume is resyncing, the system seems to
> lock up: all disk activity is blocked until the resync is done.
>
> A: This is not strictly related to Linux RAID; it is a problem
> with the Linux kernel and the disk subsystem: under no
> circumstances should one process get all the disk resources,
> preventing others from accessing them.
>
> You can control the maximum speed at which RAID reconstruction is
> done by setting it, say to 5 MB/s:
>
>   echo 5000 > /proc/sys/dev/raid/speed_limit_max
>
> This is just a workaround: you have to determine by trial and error
> the maximum speed that does not lock up your system, and you cannot
> predict what the disk load will be the next time the RAID
> is resyncing for some reason.
>
> Starting with version 2.6, the Linux kernel offers a choice of
> I/O schedulers. The default is the anticipatory
> scheduler, which seems to be sub-optimal under heavy resync load. If
> your kernel has the CFQ scheduler compiled in, use it during the
> resync.
>
> From the command line you can see which schedulers are supported
> and change it on the fly (remember to do it for each RAID disk):
>
>   # cat /sys/block/hda/queue/scheduler
>   noop [anticipatory] deadline cfq
>   # echo cfq > /sys/block/hda/queue/scheduler
>
> Otherwise you can recompile your kernel and set CFQ as the 
> default I/O scheduler (CONFIG_DEFAULT_CFQ=y in Block layer, IO 
> Schedulers, Default I/O scheduler).
>
>
>   


-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-24 Thread David Greaves
Adam Talbot wrote:
> OK, this is a topic I really need to get in on.
> I have spent the last few weeks benchmarking my new 1.2TB, 6-disk, RAID6
> array.
Very interesting. Thanks.

Did you get around to any 'tuning'?
Things like raid chunk size, external logs for xfs, blockdev readahead
on the underlying devices and the raid device?
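
For anyone who wants to try the same, a rough sketch of the sort of knobs I
mean (device names, chunk size and log size below are made-up examples, not a
recommendation):

  mdadm --create /dev/md0 --level=6 --raid-devices=6 --chunk=256 /dev/sd[b-g]1
  mkfs.xfs -l logdev=/dev/sda2,size=64m /dev/md0
  mount -o logdev=/dev/sda2 /dev/md0 /data

Chunk size can only be chosen at creation time, and an external log needs the
matching logdev= mount option every time.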

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Large single raid and XFS or two small ones and EXT3?

2006-06-25 Thread David Rees

On 6/23/06, Nix <[EMAIL PROTECTED]> wrote:

On 23 Jun 2006, PFC suggested tentatively:
>   - ext3 is slow if you have many files in one directory, but has
>   more mature tools (resize, recovery etc)

This is much less true if you turn on the dir_index feature.


However, even with dir_index, deleting large files is still much
slower with ext2/3 than xfs or jfs.

-Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid issues after power failure

2006-07-02 Thread David Greaves
Francois Barre wrote:
> 2006/7/1, Ákos Maróy <[EMAIL PROTECTED]>:
>> Neil Brown wrote:
>> > Try adding '--force' to the -A line.
>> > That tells mdadm to try really hard to assemble the array.
>>
>> thanks, this seems to have solved the issue...
>>
>>
>> Akos
>>
>>
> 
> Well, Neil, I'm wondering,
> It seemed to me that Akos' description of the problem was that
> re-adding the drive (with mdadm not complaining about anything) would
> trigger a resync that would not even start.
> But as your '--force' does the trick, it implies that the resync was
> not really triggered after all without it... Or did I miss a bit of
> log Akos provided that did say so ?
> Could there be a place here for an error message ?
> 
> More generally, could it be useful to build up a recovery howto,
> based on the experiences on this list (I guess 90% of the posts are
> related to recoveries)?
> Not in terms of a standard disk loss, but in terms of a power failure
> or a major disk problem. You know, re-creating the array, rolling the
> dice, and *tada!* your data is back again... I could not find a bit
> of doc about this.
> 

Francois,
I have started to put a wiki in place here:
  http://linux-raid.osdl.org/

My reasoning was *exactly* that - there is reference information for md
but sometimes the incantations need a little explanation and often the
diagnostics are not obvious...

I've been subscribed to linux-raid since the middle of last year and
I've been going through old messages looking for nuggets to base some
docs around.

I haven't had a huge amount of time recently so I've just scribbled on
it for now - I wanted to present something a little more polished to the
community - but since you're asking...

So don't consider this an official announcement of a usable work yet -
more a 'Please contact me if you would like to contribute' (just so I
can keep track of interested parties) and we can build something up...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] enable auto=yes by default when using udev

2006-07-03 Thread David Greaves
Neil Brown wrote:
> I guess I could test for both, but then udev might change
> again. I'd really like a more robust check.
> 
> Maybe I could test if /dev was a mount point?

IIRC you can have diskless machines with a shared root and an NFS-mounted
static /dev/.

David

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SWRaid Wiki

2006-07-11 Thread David Greaves
Francois Barre wrote:
> Hello David, all,
> 
> You pointed the http://linux-raid.osdl.org as a future ressource for
> SwRAID and MD knowledge base.
Yes. It's not ready for public use yet so I've not announced it formally
- I just mention it to people when things pop up.

> 
> In fact, the TODO page on the wiki is empty...
Hmm, yes... maybe it should say "build todo list"

One action I am pursuing is "take over the official RAID FAQ". I've made
contact with the authors and we're discussing licenses etc...
Horrid stuff but important to many. Speaking of which, Neil, if you read
this - are the man pages under the GFDL or the GPL?


> But I would like to help on feeding this wiki with all the clues and
> experiences posted on the ML,
That would be worthwhile.

> and it would first be interesting to
> build up the TODO list, which could start by:
> - reference various situations where help can be provided : recovery,
> diagnostics, statistics,
> - create a comprehensive list of success stories & good
> design/techniques, in order to help people design their own RAID
> systems. In my opinion, this both deals with software params (raid
> level, chunk size, fs, ...), and with hardware decisions (sata vs.
> scsi, the right controller, ...)
Well, I wanted to focus more on refining key information from such
stories. After all, a success story is only relevant to a particular
situation.

I'd rather develop a diagnostic approach which leads people through a
diagnostic process and explains when to use certain tools/options. That
would also be something we could keep up to date whereas an actual story
loses relevance over time.


> PS : I really like your "RAID Recovery" page :
> "If this happens then first of all: don't panic. Seriously. Don't rush
> into anything..."

yes, but so true...

David


-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mounting array was read write for about 3 minutes, then Read-only file system error

2006-07-17 Thread David Rientjes

On 7/17/06, Neil Brown <[EMAIL PROTECTED]> wrote:

On Thursday July 6, [EMAIL PROTECTED] wrote:
> I created a raid1 array using /dev/disk/by-id with (2) 250GB USB 2.0
> Drives.  It was working for about 2 minutes until I tried to copy a
> directory tree from one drive to the array and then cancelled it
> midstream.  After cancelling the copy, when I list the contents of the
> directory it doesn't show anything there.
>
> When I try to create a file, I get the following error msg:
>
> [EMAIL PROTECTED] ~]# cd /mnt/usb250
> [EMAIL PROTECTED] usb250]# ls
> lost+found
> [EMAIL PROTECTED] usb250]# touch test.txt
> touch: cannot touch `test.txt': Read-only file system

Sounds like you got some disk errors so the filesystem went readonly.

Is there anything in /var/log/messages about errors at that time?


What's /proc/mounts for /dev/disk/by-id when it's rw?  Is it
rw,errors=remount-ro?

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


md reports: unknown partition table

2006-07-18 Thread David Greaves
Hi

After a powercut I'm trying to mount an array and failing :(

teak:~# mdadm --assemble /dev/media --auto=p /dev/sd[bcdef]1
mdadm: /dev/media has been started with 5 drives.

Good

However:
teak:~# mount /media
mount: /dev/media1 is not a valid block device

teak:~# dd if=/dev/media1 of=/dev/null
dd: opening `/dev/media1': No such device or address

teak:~# dd if=/dev/media of=/dev/null
792442+0 records in
792441+0 records out
405729792 bytes transferred in 4.363571 seconds (92981135 bytes/sec)
(after ^C)

dmesg shows:
raid5: device sdb1 operational as raid disk 0
raid5: device sdf1 operational as raid disk 4
raid5: device sde1 operational as raid disk 3
raid5: device sdd1 operational as raid disk 2
raid5: device sdc1 operational as raid disk 1
raid5: allocated 5235kB for md_d127
raid5: raid level 5 set md_d127 active with 5 out of 5 devices, algorithm 2
RAID5 conf printout:
 --- rd:5 wd:5 fd:0
 disk 0, o:1, dev:sdb1
 disk 1, o:1, dev:sdc1
 disk 2, o:1, dev:sdd1
 disk 3, o:1, dev:sde1
 disk 4, o:1, dev:sdf1
md_d127: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
created bitmap (5 pages) for device md_d127
 md_d127: unknown partition table

That last line looks odd...

It was created like so:

mdadm --create /dev/media --level=5 -n 5 -e1.2 --bitmap=internal
--name=media --auto=p /dev/sd[bcdef]1

and the xfs fstab entry is:
  /dev/media1 /media xfs rw,noatime,logdev=/dev/media2 0 0

fdisk /dev/media
shows:
 Device Boot  Start End  Blocks   Id  System
/dev/media1   1   312536035  1250144138   83  Linux
/dev/media2   312536036   312560448   97652   da  Non-FS data

cfdisk even gets the filesystem right...

Which is expected.

teak:~# ll /dev/media*
brw-rw  1 root disk 254, 192 2006-07-18 17:18 /dev/media
brw-rw  1 root disk 254, 193 2006-07-18 17:18 /dev/media1
brw-rw  1 root disk 254, 194 2006-07-18 17:18 /dev/media2
brw-rw  1 root disk 254, 195 2006-07-18 17:18 /dev/media3
brw-rw  1 root disk 254, 196 2006-07-18 17:18 /dev/media4

teak:~# uname -a
Linux teak 2.6.16.19-teak-060602-01 #3 PREEMPT Sat Jun 3 09:20:24 BST
2006 i686 GNU/Linux
teak:~# mdadm -V
mdadm - v2.5.2 -  27 June 2006


David

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: XFS and write barrier

2006-07-18 Thread David Chinner
On Tue, Jul 18, 2006 at 06:58:56PM +1000, Neil Brown wrote:
> On Tuesday July 18, [EMAIL PROTECTED] wrote:
> > On Mon, Jul 17, 2006 at 01:32:38AM +0800, Federico Sevilla III wrote:
> > > On Sat, Jul 15, 2006 at 12:48:56PM +0200, Martin Steigerwald wrote:
> > > > I am currently gathering information to write an article about journal
> > > > filesystems with emphasis on write barrier functionality, how it
> > > > works, why journalling filesystems need write barrier and the current
> > > > implementation of write barrier support for different filesystems.
> 
> "Journalling filesystems need write barrier" isn't really accurate.
> They can make good use of write barrier if it is supported, and where
> it isn't supported, they should use blkdev_issue_flush in combination
> with regular submit/wait.

blkdev_issue_flush() causes a write cache flush - just like a
barrier typically causes a write cache flush up to the I/O with the
barrier in it.  Both of these mechanisms provide the same thing - an
I/O barrier that enforces ordering of I/Os to disk.

Given that filesystems already indicate to the block layer when they
want a barrier, wouldn't it be better to get the block layer to issue
this cache flush if the underlying device doesn't support barriers
and it receives a barrier request?

FWIW, only XFS and Reiser3 use this function, and then only when
issuing an fsync with barriers disabled, to make sure a common
test (fsync then power cycle) doesn't result in data loss...

> > No one here seems to know; maybe Neil &| the other folks on linux-raid
> > can help us out with details on status of MD and write barriers?
> 
> In 2.6.17, md/raid1 will detect if the underlying devices support
> barriers and if they all do, it will accept barrier requests from the
> filesystem and pass those requests down to all devices.
> 
> Other raid levels will reject all barrier requests.

Any particular reason for not supporting barriers on the other types
of RAID?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md reports: unknown partition table - fixed.

2006-07-18 Thread David Greaves
David Greaves wrote:
> Hi
> 
> After a powercut I'm trying to mount an array and failing :(

A reboot after tidying up /dev/ fixed it.

The first time through I'd forgotten to update the boot scripts and they
were assembling the wrong UUID. That was fine; I realised this and ran
the manual assemble:

  mdadm --assemble /dev/media /dev/sd[bcdef]1
  dmesg
  cat /proc/mdstat

All OK (but I'd forgotten that this was a partitioned array). I suspect
the device entries for /dev/media[1234] from last time were hanging about.

  mount /media
  fdisk /dev/media
So I guess this fails because the major-minor are for a non-p md device?

  mdadm --assemble /dev/media --auto=p /dev/sd[bcdef]1
  mdadm --stop /dev/media
This fails because I'm on mdadm 2.4.1

  mdadm --assemble /dev/media --auto=p /dev/sd[bcdef]1
  cat /proc/mdstat
  mdadm --stop /dev/md_d0
  mdadm --stop /dev/md0
  cat /proc/mdstat
So by now I upgrade to mdadm 2.5.1 in another session.

  mdadm --stop /dev/media
  dmesg
  cat /proc/mdstat
and it stops.

  mdadm --assemble /dev/media --auto=p /dev/sd[bcdef]1
But now it won't create working devices...

Much messing about with assemble, and I try a kernel upgrade - can't,
because the driver for my video card won't compile under 2.6.17 yet, so
WTF; I suspect major/minor numbers, so I just reboot under the same kernel.

All seems well.

I think there's a bug here somewhere. I wonder/suspect that the
superblock should contain the fact that it's a partitioned/able md device?

David

-- 

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Serious XFS bug in 2.6.17 kernels - FYI

2006-07-20 Thread David Greaves
Just an FYI for my friends here who may be running 2.6.17.x kernels and
using XFS and who may not be monitoring lkml :)

There is a fairly serious corruption problem that has recently been
discussed on lkml and affects all 2.6.17 kernels before -stable .7 (not yet
released).

Essentially the fs can be corrupted and it's serious because the current
xfs_repair tools may make the problem worse, not better.

There is a 1-line patch that can be applied:
  http://marc.theaimsgroup.com/?l=linux-kernel&m=115315508506996&w=2

FAQ message here
  http://marc.theaimsgroup.com/?l=linux-xfs&m=115338022506482&w=2

FAQ:
  http://oss.sgi.com/projects/xfs/faq.html#dir2

It appears that efforts are being focused on the repair tools now.

It appears to me that the best response is to patch the kernel, reboot,
back up the fs, recreate the fs and restore - but please read up before
taking any action.

David

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: host based mirror distance in a fc-based SAN environment

2006-07-26 Thread David Greaves
Stefan Majer wrote:
> Hi,
> 
> I'm curious whether there are any numbers out there for the distance up to
> which it's possible to mirror (raid1) 2 FC-LUNs. We have 2 datacenters with an
> effective distance of 11km. The fabrics in one datacenter are connected to the
> fabrics in the other datacenter with 5 dark fibres, each about 11km in
> length.
> 
> I want to set up servers which mirror their LUNs across the SAN-boxen in
> both datacenters. On top of this mirrored LUN I put lvm2.
> 
> So the question is: does anybody have numbers for the distance up to which
> this method works?

No. But have a look at 'man mdadm' in later mdadm versions:

   -W, --write-mostly
  subsequent devices listed in a --build, --create,  or  --add
 command  will  be flagged as 'write-mostly'.  This is valid
 for RAID1 only and means that the 'md'  driver  will  avoid
 reading from these devices if at all possible.  This can be
 useful if mirroring over a slow link.

   --write-behind=
 Specify that write-behind mode should be enabled (valid for
 RAID1  only).  If an argument is specified, it will set the
 maximum number of outstanding writes allowed.  The  default
 value  is  256.  A write-intent bitmap is required in order
 to  use  write-behind  mode,  and  write-behind   is   only
 attempted on drives marked as write-mostly.

Which suggests that the WAN/LAN latency shouldn't impact you except on
failure.
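
A sketch of how those options combine when building such a mirror (the device
names here are just placeholders for the local and remote LUNs):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=internal --write-behind=256 \
        /dev/sdb1 --write-mostly /dev/sdz1

The write-intent bitmap is required before write-behind will be accepted.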

HTH

David

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: let md auto-detect 128+ raid members, fix potential race condition

2006-07-31 Thread David Greaves
Alexandre Oliva wrote:
> On Jul 30, 2006, Neil Brown <[EMAIL PROTECTED]> wrote:
> 
>>  1/
>> It just isn't "right".  We don't mount filesystems from partitions
>> just because they have type 'Linux'.  We don't enable swap on
>> partitions just because they have type 'Linux swap'.  So why do we
>> assemble md/raid from partitions that have type 'Linux raid
>> autodetect'? 
> 
> Similar reason to why vgscan finds and attempts to use any partitions
> that have the appropriate type/signature (difference being that raid
> auto-detect looks at the actual partition type, whereas vgscan looks
> at the actual data, just like mdadm, IIRC): when you have to bootstrap
> from an initrd, you don't want to be forced to have the correct data
> in the initrd image, since then any reconfiguration requires the info
> to be introduced in the initrd image before the machine goes down.
> Sometimes, especially in case of disk failures, you just can't do
> that.
> 
This debate is not about generic autodetection - a good thing (tm) - but about
in-kernel vs userspace autodetection.

Your example supports Neil's case - the proposal is to use initrd to run
mdadm which then (kinda) does what vgscan does.


> 
>> So my preferred solution to the problem is to tell people not to use
(in kernel)
>> autodetect.  Quite possibly this should be documented in the code, and
>> maybe even have a KERN_INFO message if more than 64 devices are
>> autodetected. 
> 
> I wouldn't have a problem with that, since then distros would probably
> switch to a more recommended mechanism that works just as well, i.e.,
> ideally without requiring initrd-regeneration after reconfigurations
> such as adding one more raid device to the logical volume group
> containing the root filesystem.
That's supported in today's mdadm.

look at --uuid and --name

>> So:  Do you *really* need to *fix* this, or can you just use 'mdadm'
>> to assemble you arrays instead?
> 
> I'm not sure.  I'd expect not to need it, but the limited feature
> currently in place, that initrd uses to bring up the raid1 devices
> containing the physical volumes that form the volume group where the
> logical volume with my root filesystem is also brings up various raid6
> physical volumes that form an unrelated volume group, and it does so
> in such a way that the last of them, containing the 128th fd-type
> partition in the box, ends up being left out, so the raid device it's
> a member of is brought up either degraded or missing the spare member,
> none of which are good.
> 
> I don't know that I can easily get initrd to replace nash's
> raidautorun for mdadm unless mdadm has a mode to bring up any arrays
> it can find, as opposed to bringing up a specific array out of a given
> list of members or scanning for members.  Either way, this won't fix
> the problem 2) that you mentioned, but requiring initrd-regeneration
> after extending the volume group containing the root device is another
> problem that the current modes of operation of mdadm AFAIK won't
> contemplate, so switching to it will trade one problem for another,
> and the latter is IMHO more common than the former.
> 

I think you should name your raid1 (maybe "hostname-root") and use
initrd to bring it up by --name using:
 mdadm --assemble --scan --config partitions --name hostname-root


It could also, later in the boot process, bring up "hostname-raid6" by
--name too.
 mdadm --assemble --scan --config partitions --name hostname-raid6

David


-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] md: new bitmap sysfs interface

2006-08-03 Thread David Greaves
Neil Brown wrote:
>   write-bits-here-to-dirty-them-in-the-bitmap
> 
> is probably (no, definitely) too verbose.
> Any better suggestions?

It's not actually a bitmap, is it?
It takes a number or range and *operates* on a bitmap.

so:
 dirty-chunk-in-bitmap

or maybe:
 dirty-bitmap-chunk

David


-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5/lvm setup questions

2006-08-05 Thread David Greaves
Shane wrote:
> Hello all,
> 
> I'm building a new server which will use a number of disks
> and am not sure of the best way to go about the setup. 
> There will be 4 320gb SATA drives installed at first.  I'm
> just wondering how to set the system up for upgradability. 
> I'll be using raid5 but not sure whether to use lvm over
> the raid array.
> 
> By upgradability, I'd like to do several things.  Adding
> another drive of the same size to the array.  I understand
> reshape can be used here to expand the underlying block
> device.
Yes, it can.

> If the block device is the pv of an lvm array,
> would that also automatically expand, in which case I would
> create additional lvs in the new space?  If this isn't
> automatic, are there ways to do it manually?
Not automatic AFAIK - but doable.
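
Manually it's only a couple of commands once the md device has grown (volume
group and LV names below are invented; an untested sketch - check sizes with
pvdisplay/vgdisplay first):

  pvresize /dev/md0
  lvextend -L +200G /dev/vg0/data
  # then grow the filesystem inside the LV, e.g. resize2fs or xfs_growfs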

> What about replacing all four drives with larger units. 
> Say going from 300gbx4 to 500gbx4.  Can one replace them
> one at a time, going through fail/rebuild as appropriate
> and then expand the array into the unused space
Yes.

> or would
> one have to reinstall at that point?
No
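
Roughly, for each drive in turn (device names invented; wait for the rebuild
to finish between drives):

  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  # swap the drive, partition it, then:
  mdadm /dev/md0 --add /dev/sdb1
  # once all members are on the bigger drives:
  mdadm --grow /dev/md0 --size=max

That last step is what expands the array into the unused space.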


None of the requirements above drive you to layering lvm over the top.

That's not to say don't do it - but you certainly don't *need* to do it.

Pros:
* allows snapshots (for consistent backups)
* allows various lvm block movements etc...
* Can later grow vg to use discrete additional block devices without raid5 grow
limitations (eg same-ish size disks etc)

Cons:
* extra complexity -> risk of bugs/admin errors...
* performance impact

As an example of the cons: I've just set up lvm2 over my raid5 and whilst
testing snapshots, the first thing that happened was a kernel BUG and an oops...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5/lvm setup questions

2006-08-07 Thread David Greaves
Shane wrote:
> On Mon, Aug 07, 2006 at 08:57:13PM +0100, Nix wrote:
>> On 5 Aug 2006, David Greaves prattled cheerily:
>>> As an example of the cons: I've just set up lvm2 over my raid5 and whilst
>>> testing snapshots, the first thing that happened was a kernel BUG and an 
>>> oops...
>> I've been backing up using writable snapshots on LVM2 over RAID-5 for
>> some time. No BUGs.
> 
> Just performed some basic throughput tests using 4 SATA
> disks in a raid5 array.  The read performance on the
> /dev/mdx device runs around 180mbps but if lvm is layered
> over that, reads on the lv are around 130mbps.  Not an
> unsubstantial reduction.
Check the readahead at various block levels
blockdev --setra xxx

I think I found the best throughput (for me) was with 0 readahead for /dev/hdX,
0 for /dev/mdX and lots for /dev/vg/lv
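
i.e. something along these lines (numbers and device names are purely
illustrative - measure your own workload):

  blockdev --getra /dev/md0
  blockdev --setra 0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
  blockdev --setra 0 /dev/md0
  blockdev --setra 8192 /dev/vg0/lv0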

> 
> I seem to recall patches to md floating around a couple
> years back for partitioning of md devices.  Are those still
> available somewhere?
man mdadm and see --auto...

David


-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5/lvm setup questions

2006-08-07 Thread David Greaves
Nix wrote:
> On 5 Aug 2006, David Greaves prattled cheerily:
that's me :)
>> As an example of the cons: I've just set up lvm2 over my raid5 and whilst
>> testing snapshots, the first thing that happened was a kernel BUG and an 
>> oops...
> 
> I've been backing up using writable snapshots on LVM2 over RAID-5 for
> some time. No BUGs.
I tried but it didn't recurr.
I sent a report to lkml.

> I think the blame here is likely to be layable at the snapshots' door,
> anyway: they're still a little wobbly and the implementation is pretty
> complex: bugs surface on a regular basis.
Hmmm. Bugs in a backup strategy. Hmmm.

I think I can live with a nightly shutdown of the daemons whilst rsync does its
stuff across the LAN.

David


-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize on dirty array?

2006-08-09 Thread David Greaves
James Peverill wrote:
> 
> In this case the raid WAS the backup... however it seems it turned out
> to be less reliable than the single disks it was supporting.  In the
> future I think I'll make sure my disks have varying ages so they don't
> fail all at once.
> 
> James
> 
>>> RAID is no excuse for backups.

No, it wasn't *less* reliable than a single drive; you benefited as soon as a
drive failed. At that point you would have been just as toasted as you may well
be at the moment. With RAID you then stressed the remaining drives to the point
of a second failure (not that you had much choice - you *could* have spent money
on enough media to mirror your data whilst you played with your only remaining
copy - that's a cost/risk tradeoff you chose not to make. I've made the same
choice in the past - I've been lucky - you were not - sorry.)

I can't see where you mention the kernel version you're running? md can perform
validation syncs on a periodic basis in later kernels - Debian's mdadm enables
this in cron.

David


PS
Reorganise lines from distributed reply as you like :)
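
For reference, the periodic validation on kernels that support it boils down
to something like this (the array name is just an example; Debian's package
wraps roughly this in a checkarray script run from cron):

  echo check > /sys/block/md0/md/sync_action
  cat /proc/mdstat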

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize on dirty array?

2006-08-11 Thread David Rees

On 8/10/06, dean gaudet <[EMAIL PROTECTED]> wrote:

- set up smartd to run long self tests once a month.   (stagger it every
  few days so that your disks aren't doing self-tests at the same time)


I personally prefer to do a long self-test once a week, a month seems
like a lot of time for something to go wrong.


- run nightly diffs of smartctl -a output on all your drives so you see
  when one of them reports problems in the smart self test or otherwise
  has a Current_Pending_Sectors or Realloc event... then launch a
  repair sync_action.


You can (and probably should) setup smartd to automatically send out
email alerts as well.
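
A sketch of the smartd.conf side of that (schedules and address are only
examples - see smartd.conf(5) for the -s syntax):

  /dev/sda -a -d ata -s (S/../.././01|L/../../6/03) -m root
  /dev/sdb -a -d ata -s (S/../.././01|L/../../7/03) -m root

and a repair pass on the array can be kicked off by hand with:

  echo repair > /sys/block/md0/md/sync_action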

-Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Resize on dirty array?

2006-08-11 Thread David Rees

On 8/11/06, dean gaudet <[EMAIL PROTECTED]> wrote:

On Fri, 11 Aug 2006, David Rees wrote:

> On 8/10/06, dean gaudet <[EMAIL PROTECTED]> wrote:
> > - set up smartd to run long self tests once a month.   (stagger it every
> >   few days so that your disks aren't doing self-tests at the same time)
>
> I personally prefer to do a long self-test once a week, a month seems
> like a lot of time for something to go wrong.

unfortunately i found some drives (seagate 400 pata) had a rather negative
effect on performance while doing self-test.


Interesting that you noted negative performance, but I typically
schedule the tests for off-hours anyway where performance isn't
critical.

How much of a performance hit did you notice?

-Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel RAID support

2006-09-03 Thread David Greaves
Richard Scobie wrote:
> Josh Litherland wrote:
>> On Sun, 2006-09-03 at 15:56 +1200, Richard Scobie wrote:
>>
>>> I am building  2.6.18rc5-mm1 and I cannot find the entry under "make
>>> config", to enable the various RAID options.
>>
>>
>> Under "Device Drivers", switch on "Multi-device support".
>>
> 
> Thanks. I must be going nuts, as it does not appear as an option. Below
> is the list under "Device Drivers" if I do a "make menuconfig":

Recently reported on lkml

Andrew Morton said:
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc5/2.6.18-rc5-mm1/hot-fixes/
contains a fix for this.

HTH
David

PS on kernel mailing lists do a reply-all and don't trim cc lists :)

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Messed up creating new array...

2006-09-08 Thread David Rees

On 9/8/06, Ruth Ivimey-Cook <[EMAIL PROTECTED]> wrote:

I messed up slightly when creating a new 6-disk raid6 array, and am wondering
if there is a simple answer. The problem is that I didn't partition the drives,
but simply used the whole drive. All drives are of the same type and using the
Supermicro SAT2-MV8 controller.


This should work:

1. Unmount filesystem.
2. Shrink file system to something a bit smaller. Since it's a big
array, 1GB should give you plenty of room.
3. Shrink raid array to something in between the new fs size and old
fs size. Make sure you don't shrink it smaller than the filesystem!
4. Remove a disk from the array (fail/remove)
5. Partition disk
6. Add partition back to array
7. Repeat steps 4-6 for all disks in the array.
8. Now the whole array should be on partitions. Grow the raid array
back to "max".
9. Grow filesystem to the partition size.
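
Very roughly, assuming an ext3 filesystem (xfs can't be shrunk) and inventing
the device names:

  umount /dev/md0
  e2fsck -f /dev/md0
  resize2fs /dev/md0 <new-smaller-size>
  mdadm --grow /dev/md0 --size=<per-device-size-in-KiB>
  mdadm /dev/md0 --fail /dev/sdb --remove /dev/sdb
  fdisk /dev/sdb           # one partition spanning the disk
  mdadm /dev/md0 --add /dev/sdb1
  # wait for the rebuild, repeat for each disk, then:
  mdadm --grow /dev/md0 --size=max
  resize2fs /dev/md0

Note that --size is the per-component size, not the total array size.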

-Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Simulating Drive Failure on Mirrored OS drive

2006-10-02 Thread David Greaves
andy liebman wrote:
> I tried simply unplugging one drive from its power and from its SATA
> connector. The OS didn't like that at all. My KDE session kept running,
> but I could no longer open any new terminals. I couldn't become root in
> an existing terminal that was already running. And I couldn't SSH into
> the machine.
That's likely to be because SATA hotswap isn't supported (yet).
dmesg should give you more info.

> I know that simply unplugging a drive is not the same as a drive failing
> or timing out. But is there a more realistic way to simulate a failure
> so that I can know that the mirror will work when it's needed?

Read up on the md-faulty device.

Also, FWIW, md works just fine :)
(Lots of other things can go wrong, so testing your setup is a good idea though)
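
The simplest realistic test, though, is just to fail a member by hand and
watch the array rebuild when you add it back (device names are examples):

  mdadm /dev/md0 --fail /dev/sdb1
  cat /proc/mdstat
  mdadm /dev/md0 --remove /dev/sdb1
  mdadm /dev/md0 --add /dev/sdb1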

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recipe for Mirrored OS Drives

2006-10-02 Thread David Greaves
andy liebman wrote:
> A few weeks ago, I promised that I would put my "recipe" here  for
> creating "mirrored OS drives from an existing OS Drive". This "recipe"
> combines what I learned from MANY OTHER sometimes conflicting documents
> on the same subject -- documents that were probably developed for
> earlier kernels and distributions.

Feel free to add it here:
http://linux-raid.osdl.org/index.php/Main_Page

I haven't been able to do much for a few weeks (typical - I find some time and
use it all up just getting the basic setup done - still it's started!)

David

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 003 of 6] md: Remove 'experimental' classification from raid5 reshape.

2006-10-02 Thread David Greaves
Typo in first line of this patch :)

> I have had enough success reports not^H^H^H to believe that this 
> is safe for 2.6.19.

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm and raidtools - noob

2006-10-02 Thread David Greaves
Mark Ryden wrote:
> Hello linux-raid list,
> 
>  I want to create a Linux Software RAID1 on linux FC5 (x86_64),
> from SATA II disks. I am a noob in this.
No problems.

> I looked for it and saw that as far as I understand,
> raidtools is quite old - from 2003.
> for example, http://people.redhat.com/mingo/raidtools/
correct

> So my question is this:
> is raidtools deprecated ?
Yes

> Is it possible at all to use raidtool to create linux software RAID1,
> running 2.6.17-1.2187_FC5 kernel on x86_64 ?
Maybe - don't

> Is using mdadm the way to create linux software RAID1?
Yes

Is it the only way?
No (eg EVMS)


David

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recipe for Mirrored OS Drives

2006-10-02 Thread David Greaves
andy liebman wrote:
> 
>>
>> Feel free to add it here:
>> http://linux-raid.osdl.org/index.php/Main_Page
>>
>> I haven't been able to do much for a few weeks (typical - I find some
>> time and
>> use it all up just getting the basic setup done - still it's started!)
>>
>> David
>>
> 
> Any hints on how to add a page?
> 
> Andy
> 

Yep :)

First off it would help to read up on Wikis :
http://meta.wikimedia.org/wiki/Help:Contents

Basically you:
* go to the page where you want to link from
* edit that page to link to your new (not yet created) page
* save your edit
* click on the (red) link and you'll be given a page to edit
* type...

I suggest you link from http://linux-raid.osdl.org/index.php/RAID_Boot

David


-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recipe for Mirrored OS Drives

2006-10-03 Thread David Greaves
Nix wrote:
> On 2 Oct 2006, David Greaves spake:
>> I suggest you link from http://linux-raid.osdl.org/index.php/RAID_Boot
> 
> The pages don't really have the same purpose. RAID_Boot is `how to boot
> your RAID system using initramfs'; this is `how to set up a RAID system
> in the first place', i.e., setup.
> 
> I'll give it a bit of a tweak-and-rename in a bit.
> 
Fair :)

FYI I've done quite a bit on the Howto section:
http://linux-raid.osdl.org/index.php/Overview

It still needs a lot of work I think but it's getting there...

David

-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multiple Disk Failure Recovery

2006-10-14 Thread David Rees

On 10/14/06, Lane Brooks <[EMAIL PROTECTED]> wrote:

I am wondering if there is a way to cut my losses with these bad sectors
and have it recover what it can so that I can get my raid array back to
functioning.  Right now I cannot get a spare disk recovery to finish
because of these bad sectors.  Is there a way to force as much recovery as
possible so that I can replace this newly faulty drive?


One technique is to use ddrescue to create an image of the failing
drive(s) (I would image all drives if possible) and use those images
to try to retrieve your data.
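
A sketch of that, with made-up paths (write the image and log file somewhere
that is not part of the failing array):

  ddrescue -n /dev/sdc /mnt/space/sdc.img /mnt/space/sdc.log
  ddrescue -r3 /dev/sdc /mnt/space/sdc.img /mnt/space/sdc.log

The first pass skips the difficult areas, the second retries them; the images
can then be attached to loop devices and assembled read-only.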

-Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Need help recovering a raid5 array

2006-10-24 Thread David Greaves
[EMAIL PROTECTED] wrote:
> Hello all,
Hi

First off, don't do anything else without reading up or talking on here :)

The list archive has got a lot of good material - 'help' is usually a good
search term!!!


> 
> I had a disk fail in a raid 5 array (4 disk array, no spares), and am
> having trouble recovering it.  I believe my data is still safe, but I
> cannot tell what is going wrong here.

There's some useful stuff but always include:
* kernel version
* mdadm version
* relevant dmesg or similar output


What went wrong?
Did /dev/sdd fail? If so then why are you adding it back to the array? Or is
this now a replacement?

You should be OK - I'll reply quickly now and see if I can make some suggestions
later (or sooner).

David


> 
> When I try to rebuild the array "mdadm --assemble /dev/md0 /dev/sda2
> /dev/sdb2 /dev/sdc2 /dev/sdd2" I see "failed to RUN_ARRAY /dev/md0:
> Input/output error".
> 
> dmesg shows the following:
> md: bind
> md: bind
> md: bind
> md: bind
> md: md0: raid array is not clean -- starting background reconstruction
> raid5: device sda2 operational as raid disk 0
> raid5: device sdc2 operational as raid disk 2
> raid5: device sdb2 operational as raid disk 1
> raid5: cannot start dirty degraded array for md0
> RAID5 conf printout:
>  --- rd:4 wd:3 fd:1
>  disk 0, o:1, dev:sda2
>  disk 1, o:1, dev:sdb2
>  disk 2, o:1, dev:sdc2
> raid5: failed to run raid set md0
> md: pers->run() failed ...
> 
> 
> 
> /proc mdstat shows:
> md0 : inactive sda2[0] sdd2[3](S) sdc2[2] sdb2[1]
> 
> This seems wrong, as sdd2 should not be a spare - I want it to be the
> fourth disk.
> 
> 
> The output of mdadm -E for each disk is as follows:
> sda2:
> /dev/sda2:
>   Magic : a92b4efc
> Version : 00.90.00
>UUID : c50a81fc:ef4323e6:438a7cb1:25ae35e5
>   Creation Time : Thu Jun  1 21:13:58 2006
>  Raid Level : raid5
> Device Size : 390555904 (372.46 GiB 399.93 GB)
>  Array Size : 1171667712 (1117.39 GiB 1199.79 GB)
>Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 0
> 
> Update Time : Sun Oct 22 23:39:06 2006
>   State : active
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 1
>Checksum : 683f2f5c - correct
>  Events : 0.8831997
> 
>  Layout : left-symmetric
>  Chunk Size : 256K
> 
>   Number   Major   Minor   RaidDevice State
> this 0   820  active sync   /dev/sda2
> 
>0 0   820  active sync   /dev/sda2
>1 1   8   181  active sync   /dev/sdb2
>2 2   8   342  active sync   /dev/sdc2
>3 3   003  faulty removed
>4 4   8   504  spare   /dev/sdd2
> 
> 
> sdb2:
> /dev/sdb2:
>   Magic : a92b4efc
> Version : 00.90.00
>UUID : c50a81fc:ef4323e6:438a7cb1:25ae35e5
>   Creation Time : Thu Jun  1 21:13:58 2006
>  Raid Level : raid5
> Device Size : 390555904 (372.46 GiB 399.93 GB)
>  Array Size : 1171667712 (1117.39 GiB 1199.79 GB)
>Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 0
> 
> Update Time : Sun Oct 22 23:39:06 2006
>   State : active
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 1
>Checksum : 683f2f6e - correct
>  Events : 0.8831997
> 
>  Layout : left-symmetric
>  Chunk Size : 256K
> 
>   Number   Major   Minor   RaidDevice State
> this 1   8   181  active sync   /dev/sdb2
> 
>0 0   820  active sync   /dev/sda2
>1 1   8   181  active sync   /dev/sdb2
>2 2   8   342  active sync   /dev/sdc2
>3 3   003  faulty removed
>4 4   8   504  spare   /dev/sdd2
> 
> 
> sdc2:
> /dev/sdc2:
>   Magic : a92b4efc
> Version : 00.90.00
>UUID : c50a81fc:ef4323e6:438a7cb1:25ae35e5
>   Creation Time : Thu Jun  1 21:13:58 2006
>  Raid Level : raid5
> Device Size : 390555904 (372.46 GiB 399.93 GB)
>  Array Size : 1171667712 (1117.39 GiB 1199.79 GB)
>Raid Devices : 4
>   Total Devices : 4
> Preferred Minor : 0
> 
> Update Time : Sun Oct 22 23:39:06 2006
>   State : active
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 1
>Checksum : 683f2f80 - correct
>  Events : 0.8831997
> 
>

Re: Raid5 or 6 here... ?

2006-10-24 Thread David Greaves
Gordon Henderson wrote:
>1747 ?S<   724:25 [md9_raid5]
> 
> It's kernel 2.6.18 and

Wasn't the module merged to raid456 in 2.6.18?

Are your mdX_raid6s on earlier kernels? My raid 6 is on 2.6.17 and says _raid6.

Could it be that the combined kernel thread is called mdX_raid5?

David

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid5 or 6 here... ?

2006-10-24 Thread David Greaves
David Greaves wrote:
> Gordon Henderson wrote:
>>1747 ?S<   724:25 [md9_raid5]
>>
>> It's kernel 2.6.18 and
> 
> Wasn't the module merged to raid456 in 2.6.18?
> 
> Are your mdX_raid6s on earlier kernels? My raid 6 is on 2.6.17 and says _raid6.
> 
> Could it be that the combined kernel thread is called mdX_raid5?
> 

Yup,
raid5.c now handles raid levels 4, 5 and 6 and says:
mddev->thread = md_register_thread(raid5d, mddev, "%s_raid5");

I think I may actually be able to patch that...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Relabeling UUID

2006-12-13 Thread David Greaves
Neil Brown wrote:
> Patches to the man page to add useful examples are always welcome.

And if people would like to be more verbose, the wiki is available at
http://linux-raid.osdl.org/

It's now kinda useful but definitely not fully migrated from the old RAID FAQ.

David

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Frequent SATA errors / port timeouts in 2.6.18.3?

2006-12-14 Thread David Greaves
Patrik Jonsson wrote:
> Hi all,
> this may not be the best list for this question, but I figure that the
> number of disks connected to users here should be pretty big...
> 
> I upgraded from 2.6.17-rc4 to 2.6.18.3 about a week ago, and I've since
> had 3 drives kicked out of my 10-drive RAID5 array. Previously, I had no
> kicks over almost a year. The kernel message is:
> 
> ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> ata7.00: (BMDMA stat 0x20)
> ata7.00: tag 0 cmd 0xc8 Emask 0x1 stat 0x41 err 0x4 (device error)
> ata7: EH complete

> Any ideas or thought would be appreciated,
SMART?

Read the manpage and then try running:
smartctl -data -S on /dev/...
and
smartctl -data -s on /dev/...

Then look at your smartd timing and see if it's related; possibly just do a
manual smartd poll.

I've had smart/libata problems (well, no, glitches) for about 2 years now but as
the irq handler occasionally says "no one cared" ;)

It may well not be your problem but...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm: what if - crashed OS

2007-01-05 Thread David Greaves
Assuming you can allow some downtime, get yourself a rescue CD such as 'RIP'

This will let you boot into the machine and run mdadm commands.

You don't mention kernel/mdadm versions so you may want to check they're close
on the rescue CD.

Then try looking at the manpage around --assemble.
In particular you may want to try --scan and --uuid (if your RIP/live
kernel/mdadm support it)

Also check out the examples...

Assuming this is a sane machine and you're not in real disaster recovery mode
with drives pulled in from random boxes then look at using the literal string
"--config=partitions" (see the manpage) to avoid creating an mdadm.conf with the
"DEVICE partitions" line - PITA on live CDs where you just want a command line 
;)
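
So the whole thing usually comes down to something like this (the UUID is of
course just a placeholder - get the real one from mdadm --examine):

  mdadm --examine --scan
  mdadm --assemble --scan --config=partitions --uuid=<uuid-from-examine>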

If you can manage it, this will give you a nice warm feeling about recovering
from a problem and it's pretty safe - just common sense like making sure the
live CD kernel/mdadm are either up-to-date or match your production system.

HTH

Also:
> I have thought about this, and I can't understand how 'mdadm' decides the
> health of an array.

Each disk/partition used by md has a superblock which contains a unique UUID and
other info, like the number of devices and the raid level. mdadm --scan looks
into each partition for a superblock and notes this data. It can then group all
the superblocks with the same UUID together and, for each group, knowing how
many devices it should have, how many it has and how many it needs, it can decide
whether the array can safely be assembled.
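
You can see exactly the data it works from with, for example:

  mdadm --examine --scan          # one ARRAY line per UUID found
  mdadm --examine /dev/sda1       # the full superblock of one member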

David
PS Yes, I've done this (too many times!)
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm --grow failed

2007-02-18 Thread David Greaves
Marc Marais wrote:
[snip]
> Unfortunately one of the drives timed out during the operation (not a read 
> error - just a timeout - which I would've thought would be retried but 
> anyway...):
> Help appreciated. (I do have a full backup of course but that's a last 
> resort with my luck I'd get a read error from the tape drive)

Hi Marc
It looks like you've since recreated the array and restored your data - good :)

It doesn't appear that you mentioned the kernel and distro you are using and the
software versions.

I'm sure this is something people will need.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux Software RAID a bit of a weakness?

2007-02-26 Thread David Rees

On 2/25/07, Richard Scobie <[EMAIL PROTECTED]> wrote:

Colin Simpson wrote:
> They therefore do not have the "check" option in the kernel. Is there
> anything else I can do? Would forcing a resync achieve the same result
> (or is that downright dangerous as the array is not considered
> consistent for a while)? Any thoughts, apart from my own, which is to upgrade
> them to RH5 when that appears with (probably) a 2.6.18 kernel (which will
> presumably have "check")?

You could configure smartd to do regular long selftests, which would
notify you on failures and allow you to take the drive offline and dd,
replace etc.


So what do you do when your drives in your array don't support SMART
self tests for some reason?

The best solution I have thought of so far is to do a `dd if=/dev/mdX
of=/dev/null` periodically, but this isn't as nice as running a check
in the later kernels as it's not guaranteed to read blocks from all
disks. I guess you could instead do the same thing but with the
underlying disks instead of the raid device, then make sure you watch
the logs for disk read errors.
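
Something like this, assuming the members are sd[abcd]1 (adjust for your
array):

  for d in /dev/sd[abcd]1; do
    dd if=$d of=/dev/null bs=1M || echo "read error on $d"
  done

dd exits non-zero on a read error, so the || catches it, but it's still worth
watching dmesg/syslog for the kernel-level detail.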

-Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux Software RAID a bit of a weakness?

2007-02-26 Thread David Rees

On 2/26/07, Colin Simpson <[EMAIL PROTECTED]> wrote:

If I say,

dd if=/dev/sda2 of=/dev/null

where /dev/sda2 is a component of an active md device.

Will the RAID subsystem get upset that someone else is fiddling with the
disk (even in just a read only way)? And will a read error on this dd
(caused by a bad block) cause md to knock out that device?


The MD subsystem doesn't care if someone else is reading the disk, and
I'm pretty sure that rear errors will be noticed by the MD system,
either.

-Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Linux Software RAID a bit of a weakness?

2007-02-27 Thread David Rees

On 2/26/07, Neil Brown <[EMAIL PROTECTED]> wrote:

On Monday February 26, [EMAIL PROTECTED] wrote:
> I'm pretty sure that rear errors will be noticed by the MD system,
> either.

:-)  Your typing is nearly as bad as mine often is, but your intent is
correct.  If you independently read from a device in an MD array and get an
error, MD won't notice.  MD only notices errors for requests that it
makes of the devices itself.


Doh, 2 errors in one line! Should have read:

I'm pretty sure that read errors will _not_ be noticed by the MD system, either.

Good thing at least Neil understood me. :-)

-Dave
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BLK_DEV_MD with CONFIG_NET

2007-03-20 Thread David Miller
From: Randy Dunlap <[EMAIL PROTECTED]>
Date: Tue, 20 Mar 2007 20:05:38 -0700

> Build a kernel with CONFIG_NET=n and CONFIG_BLK_DEV_MD=m.
> Unless csum_partial() is built and kept by some arch Makefile,
> the result is:
> ERROR: "csum_partial" [drivers/md/md-mod.ko] undefined!
> make[1]: *** [__modpost] Error 1
> make: *** [modules] Error 2
> 
> 
> Any suggested solutions?

Anything which is ever exported to modules, which ought to
be the situation in this case, should be obj-y not lib-y,
right?
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10 kernel panic on sparc64

2007-04-01 Thread David Miller
From: Jan Engelhardt <[EMAIL PROTECTED]>
Date: Mon, 2 Apr 2007 02:15:57 +0200 (MEST)

> just when I did
> # mdadm -C /dev/md2 -b internal -e 1.0 -l 10 -n 4 /dev/sd[cdef]4
> (created)
> # mdadm -D /dev/md2
> Killed
> 
> dmesg filled up with a kernel oops. A few seconds later, the box
> locked solid. Since I was only in by ssh and there is not (yet) any
> possibility to reset it remotely, this is all I can give right now,
> the last 80x25 screen:

Unfortunately the beginning of the OOPS is the most important part -
it says where exactly the kernel died; the rest of the log you
showed only gives half the registers and the tail of the call trace.

Please try to capture the whole thing.

Please also provide hardware type information as well, which you
should give in any bug report like this.
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10 kernel panic on sparc64

2007-04-12 Thread David Miller
From: Jan Engelhardt <[EMAIL PROTECTED]>
Date: Mon, 2 Apr 2007 02:15:57 +0200 (MEST)

> Kernel is kernel-smp-2.6.16-1.2128sp4.sparc64.rpm from Aurora Corona.
> Perhaps it helps, otherwise hold your breath until I reproduce it.

Jan, if you can reproduce this with the current 2.6.20 vanilla
kernel I'd be very interested in a full trace so that I can
try to fix this.

With the combination of an old kernel and only part of the
crash trace, there isn't much I can do with this report.
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Manually hacking superblocks

2007-04-13 Thread David Greaves
Lasse Kärkkäinen wrote:
> I managed to mess up a RAID-5 array by mdadm -adding a few failed disks
> back, trying to get the array running again. Unfortunately, -add didn't
> do what I expected, but instead made spares out of the failed disks. The
> disks failed due to loose SATA cabling and the data inside should be
> fairly consistent. sdh failed a bit earlier than sdd and sde, so I
> expect to be able to recover by building a degraded array without sdh
> and then syncing.
> 
> The current situation looks like this:
>   Number   Major   Minor   RaidDevice State
>0 0   8   330  active sync   /dev/sdc1
>1 1   001  faulty removed
>2 2   8   972  active sync   /dev/sdg1
>3 3   8  1293  active sync   /dev/sdi1
>4 4   004  faulty removed
>5 5   8   815  active sync   /dev/sdf1
>6 6   006  faulty removed
>7 7   8  1777  spare
>8 8   8  1618  spare
>9 9   8  1459  spare
> 
> ... and before any of this happened, the configuration was:
> 
> disk 0, o:1, dev:sdc1
> disk 1, o:1, dev:sde1
> disk 2, o:1, dev:sdg1
> disk 3, o:1, dev:sdi1
> disk 4, o:1, dev:sdh1
> disk 5, o:1, dev:sdf1
> disk 6, o:1, dev:sdd1
> 
> I gather that I need a way to alter the superblocks of sde and sdd so
> that they seem to be clean up-to-date disks, with their original disk
> numbers 1 and 6. A hex editor comes to mind, but are there any better
> tools for that?

You don't need a tool.
mdadm --force will do what you want.

Read the archives and the man page.

You are correct to assemble the array with a missing disk (or 2 missing disks
for RAID6) - this prevents the kernel from trying to sync. Not syncing is good
because if you do make a slight error in the order, you can end up syncing bad
data over good.

I *THINK* you should try something like (untested):
mdadm --assemble /dev/md0 --force /dev/sdc1 /dev/sde1 /dev/sdg1 /dev/sdi1
missing /dev/sdf1 /dev/sdd1

The order is important and should match the original order.
There's more you could do by looking at device event counts (--examine)

Also you must do a READ-ONLY mount the first time you mount the array - this
will check the consistency and avoid corruption if you get the order wrong.

I really must get around to setting up a test environment so I can check this
out and update the wiki...

I have to go out or a couple of hours. Let me know how it goes if you can't wait
for me to get back.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Partitioned arrays initially missing from /proc/partitions

2007-04-23 Thread David Greaves
Hi Neil

I think this is a bug.

Essentially if I create an auto=part md device then I get md_d0p? partitions.
If I stop the array and just re-assemble, I don't.

It looks like the same (?) problem as Mike (see below - Mike do you have a
patch?) but I'm on 2.6.20.7 with mdadm v2.5.6

FWIW I upgraded from 2.6.16 where it worked (but used in-kernel detection which
isn't working in 2.6.20 for some reason but I don't mind).


Here's a simple sequence of commands:

teak:~# mdadm --stop /dev/md_d0
mdadm: stopped /dev/md_d0

teak:~# mdadm --create /dev/md_d0 -l5 -n5 --bitmap=internal -e1.2 --auto=part
--name media --force /dev/sde1 /dev/sdc1 /dev/sdd1 missing /dev/sdf1
mdadm: /dev/sde1 appears to be part of a raid array:
level=raid5 devices=5 ctime=Mon Apr 23 15:02:13 2007
mdadm: /dev/sdc1 appears to be part of a raid array:
level=raid5 devices=5 ctime=Mon Apr 23 15:02:13 2007
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid5 devices=5 ctime=Mon Apr 23 15:02:13 2007
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid5 devices=5 ctime=Mon Apr 23 15:02:13 2007
Continue creating array? y
mdadm: array /dev/md_d0 started.

teak:~# grep md /proc/partitions
 254 0 1250241792 md_d0
 254 1 1250144138 md_d0p1
 254 2  97652 md_d0p2

teak:~# mdadm --stop /dev/md_d0
mdadm: stopped /dev/md_d0

teak:~# mdadm --assemble /dev/md_d0 --auto=part  /dev/sde1 /dev/sdc1 /dev/sdd1
/dev/sdf1
mdadm: /dev/md_d0 has been started with 4 drives (out of 5).

teak:~# grep md /proc/partitions
 254 0 1250241792 md_d0


If I then run cfdisk it finds the partition table. I write this and get:
teak:~# cfdisk /dev/md_d0

Disk has been changed.

WARNING: If you have created or modified any
DOS 6.x partitions, please see the cfdisk manual
page for additional information.
teak:~# grep md /proc/partitions
 254 0 1250241792 md_d0
 254 1 1250144138 md_d0p1
 254 2  97652 md_d0p2


and the syslog:
Apr 23 15:13:13 localhost kernel: md: md_d0 stopped.
Apr 23 15:13:13 localhost kernel: md: unbind
Apr 23 15:13:13 localhost kernel: md: export_rdev(sde1)
Apr 23 15:13:13 localhost kernel: md: unbind
Apr 23 15:13:13 localhost kernel: md: export_rdev(sdf1)
Apr 23 15:13:13 localhost kernel: md: unbind
Apr 23 15:13:13 localhost kernel: md: export_rdev(sdd1)
Apr 23 15:13:13 localhost kernel: md: unbind
Apr 23 15:13:13 localhost kernel: md: export_rdev(sdc1)
Apr 23 15:13:13 localhost mdadm: DeviceDisappeared event detected on md device
/dev/md_d0
Apr 23 15:13:36 localhost kernel: md: bind
Apr 23 15:13:36 localhost kernel: md: bind
Apr 23 15:13:36 localhost kernel: md: bind
Apr 23 15:13:36 localhost kernel: md: bind
Apr 23 15:13:36 localhost kernel: raid5: device sdf1 operational as raid disk 4
Apr 23 15:13:36 localhost kernel: raid5: device sdd1 operational as raid disk 2
Apr 23 15:13:36 localhost kernel: raid5: device sdc1 operational as raid disk 1
Apr 23 15:13:36 localhost kernel: raid5: device sde1 operational as raid disk 0
Apr 23 15:13:36 localhost kernel: raid5: allocated 5236kB for md_d0
Apr 23 15:13:36 localhost kernel: raid5: raid level 5 set md_d0 active with 4
out of 5 devices, algorithm 2
Apr 23 15:13:36 localhost kernel: RAID5 conf printout:
Apr 23 15:13:36 localhost kernel:  --- rd:5 wd:4
Apr 23 15:13:36 localhost kernel:  disk 0, o:1, dev:sde1
Apr 23 15:13:36 localhost kernel:  disk 1, o:1, dev:sdc1
Apr 23 15:13:36 localhost kernel:  disk 2, o:1, dev:sdd1
Apr 23 15:13:36 localhost kernel:  disk 4, o:1, dev:sdf1
Apr 23 15:13:36 localhost kernel: md_d0: bitmap initialized from disk: read 1/1
pages, set 19078 bits, status: 0
Apr 23 15:13:36 localhost kernel: created bitmap (10 pages) for device md_d0
Apr 23 15:13:36 localhost kernel:  md_d0: p1 p2
Apr 23 15:13:54 localhost kernel: md: md_d0 stopped.
Apr 23 15:13:54 localhost kernel: md: unbind
Apr 23 15:13:54 localhost kernel: md: export_rdev(sdf1)
Apr 23 15:13:54 localhost kernel: md: unbind
Apr 23 15:13:54 localhost kernel: md: export_rdev(sdd1)
Apr 23 15:13:54 localhost kernel: md: unbind
Apr 23 15:13:54 localhost kernel: md: export_rdev(sdc1)
Apr 23 15:13:54 localhost kernel: md: unbind
Apr 23 15:13:54 localhost kernel: md: export_rdev(sde1)
Apr 23 15:13:54 localhost mdadm: DeviceDisappeared event detected on md device
/dev/md_d0
Apr 23 15:14:04 localhost kernel: md: md_d0 stopped.
Apr 23 15:14:04 localhost kernel: md: bind
Apr 23 15:14:04 localhost kernel: md: bind
Apr 23 15:14:04 localhost kernel: md: bind
Apr 23 15:14:04 localhost kernel: md: bind
Apr 23 15:14:04 localhost kernel: raid5: device sde1 operational as raid disk 0
Apr 23 15:14:04 localhost kernel: raid5: device sdf1 operational as raid disk 4
Apr 23 15:14:04 localhost kernel: raid5: device sdd1 operational as raid disk 2
Apr 23 15:14:04 localhost kernel: raid5: device sdc1 operational as raid disk 1
Apr 23 15:14:04 localhost kernel: raid5: allocated 5236kB for md_d0
Apr 23 15:14:04 localhost kernel: raid5: raid level 5 set md_d0 active with 4
out of 5 devices, a

Re: Multiple disk failure, but slot numbers are corrupt and preventing assembly.

2007-04-23 Thread David Greaves
There is some odd stuff in there:

/dev/sda1:
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Events : 0.115909229

/dev/sdb1:
Active Devices : 5
Working Devices : 4
Failed Devices : 1
Events : 0.115909230

/dev/sdc1:
Active Devices : 8
Working Devices : 8
Failed Devices : 1
Events : 0.115909230

/dev/sdd1:
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Events : 0.115909230

but your event counts are consistent. It looks like corruption on 2 disks :(
Or did you try some things?

I think you'll need to recreate the array since assemble can't figure things 
out.

Since you mention SMART errors on /dev/sdb you are taking a big chance by trying
to start up the array with a known faulty disk - especially if you resync as
it's a very IO intensive operation that will read every sector of the bad disk
and is likely to trigger errors that will kick it again leaving you back where
you started (or worse).

If you are desperate for data recovery and you have the space then you should
take disk images using ddrescue *before* trying anything.

Next best is if you are buying new disks and can wait for them to arrive, do so.
You can then use ddrescue to copy the old disk to the new ones and work with
non-broken hardware.

If you have no choice

From this point forward it will be very easy to mess up.


Once you have disks to work on you can try to recreate the array.

You were using 0.9 superblocks, 64k, left symmetric which are defaults.

You should re-create in degraded mode to prevent the sync from starting (if you
got the order wrong then it would get the parity calc wrong).

So:
mdadm --create /dev/md0 --force -l5 -n4 /dev/sda1 /dev/sdb1 missing /dev/sdd1

Then do a *readonly* fsck on the /dev/md0.

If that looks sane you can try a backup or a full (read-write) fsck.
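
For example (assuming the array holds an ext3 filesystem - adjust for whatever
is actually on there):

  fsck -n /dev/md0    # -n: report problems but never write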

Ask if anything isn't clear.

David
PS I recovered from a 2-disk failure last night. Seems to be back up and
re-syncing :) Glad I had a spare disk around!

Leon Woestenberg wrote:
> Hello,
> 
> it's recovery time again. Problem at hand: raid5 consisting of four
> partitions, each on a drive. Two disks have failed. Assembly fails
> because the slot numbers of the array components seem to be corrupt.
> 
> /dev/md0 consisting of /dev/sd[abcd]1, of which b,c failed and of
> which c seems really bad in SMART, b looks reasonably OK judging from
> SMART.
> 
> Checksum of the failed component superblocks was bad.
> 
> Using mdadm.conf we have already tried updating the superblocks. This
> partly succeeded in the sense that checksums came up ok, the slot
> numbers did not.
> 
> mdadm refuses to assemble, even with --force.
> 
> Could you guys peek over the array configuration (mdadm --examine) and
> see if there is a non-destructive way to try and mount the array. If
> not, what is the least intrusive way to do a non-syncing (re)create?
> 
> Data recovery is our prime concern here.
> 
> Below the uname -a, --examine output of all four drives, mdadm.conf of
> what we think the array should look like and finally, the mdadm
> --assemble command and output.
> 
> Note the slot numbers on /dev/sd[bc].
> 
> Thanks for any help,
> 
> with kind regards,
> 
> Leon Woestenberg
> 
> 
> 
> 
> Linux localhost 2.6.16.14-axon1 #1 SMP PREEMPT Mon May 8 17:01:33 CEST
> 2006 i486 pentium4 i386 GNU/Linux
> 
> [EMAIL PROTECTED] ~]# mdadm --examine /dev/sda1
> /dev/sda1:
>  Magic : a92b4efc
>Version : 00.90.00
>   UUID : 51a95144:00af4c77:c1cd173b:94cb1446
>  Creation Time : Mon Sep  5 13:16:42 2005
> Raid Level : raid5
>Device Size : 390620352 (372.52 GiB 400.00 GB)
>   Raid Devices : 4
>  Total Devices : 4
> Preferred Minor : 0
> 
>Update Time : Tue Apr 17 07:03:46 2007
>  State : active
> Active Devices : 4
> Working Devices : 4
> Failed Devices : 0
>  Spare Devices : 0
>   Checksum : f98ed71b - correct
> Events : 0.115909229
> 
> Layout : left-symmetric
> Chunk Size : 64K
> 
>  Number   Major   Minor   RaidDevice State
> this 0   810  active sync   /dev/sda1
> 
>   0 0   810  active sync   /dev/sda1
>   1 1   8   171  active sync   /dev/sdb1
>   2 2   8   332  active sync   /dev/sdc1
>   3 3   8   493  active sync   /dev/sdd1
> [EMAIL PROTECTED] ~]# mdadm --examine /dev/sdb1
> /dev/sdb1:
>  Magic : a92b4efc
>Version : 00.90.00
>   UUID : 51a95144:00af4c77:c1cd173b:94cb1446
>  Creation Time : Mon Sep  5 13:16:42 2005
> Raid Level : raid5
>Device Size : 390620352 (372.52 GiB 400.00 GB)
>   Raid Devices : 4
>  Total Devices : 5
> Preferred Minor : 0
> 
>Update T

Re: Multiple disk failure, but slot numbers are corrupt and preventing assembly.

2007-04-24 Thread David Greaves
Leon Woestenberg wrote:
> On 4/24/07, Leon Woestenberg <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> On 4/23/07, David Greaves <[EMAIL PROTECTED]> wrote:
>> > There is some odd stuff in there:
>> >
>> [EMAIL PROTECTED] ~]# mdadm -v --assemble --scan
>> --config=/tmp/mdadm.conf --force
>> [...]
>> mdadm: no uptodate device for slot 1 of /dev/md0
>> mdadm: no uptodate device for slot 2 of /dev/md0
>> [...]
>>
> So, the problem I am facing is that the slot number (as seen with
> --examine) is invalid on two and therefore they won't be recognized as
> valid drives for the array.
> 
> Is there any way to override the slot number? I could not find
> anything in mdadm or mdadm.conf to override them.
Yes --create, see my original reply.

Essentially all --create does is create superblocks with the data you want (eg
slot numbers). It does not touch other 'on disk data'.
It is safe to run the *exact same* create command on a dormant array at any time
after initial creation - the main side effect is a new UUID.
(Neil - yell if I'm wrong).

The most 'dangerous' part is to create a superblock with a different version.

If you wanted to experiment (maybe with loopback devices) you could try
--create'ing an array with 4 devices to simulate where you were.
Then do the --create again with 2 devices missing. This should end up with 2
devices with one UUID, 2 with another.

Then do an --assemble using --force and --update=uuid.

Report back if you do this...
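
A very rough sketch of that experiment (completely untested; loop devices, sizes
and the md number are all made up):

  for i in 0 1 2 3; do
    dd if=/dev/zero of=/tmp/d$i bs=1M count=64
    losetup /dev/loop$i /tmp/d$i
  done
  mdadm --create /dev/md9 -l5 -n4 /dev/loop[0-3]      # the "original" array
  mdadm --stop /dev/md9
  # re-create with two members missing (mdadm may insist on --force here)
  mdadm --create /dev/md9 -l5 -n4 --force /dev/loop0 missing /dev/loop2 missing
  mdadm --stop /dev/md9
  # two devices now carry one UUID, two carry another - try to pull them together
  mdadm --assemble /dev/md9 --force --update=uuid /dev/loop[0-3]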

Cheers
David


-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Partitioned arrays initially missing from /proc/partitions

2007-04-24 Thread David Greaves
Neil Brown wrote:
> This problem is very hard to solve inside the kernel.
> The partitions will not be visible until the array is opened *after*
> it has been created.  Making the partitions visible before that would
> be possible, but would not be very easy.
> 
> I think the best solution is Mike's solution which is to simply
> open/close the array after it has been assembled.  I will make sure
> this is in the next release of mdadm.
> 
> Note that you can still access the partitions even though they do not
> appear in /proc/partitions. Any attempt to access any of them will
> make them all appear in /proc/partitions.  But I understand there is
> sometimes value in seeing them before accessing them.
> 
> NeilBrown

Um. Are you sure?
The reason I noticed is that I couldn't mount them until they appeared; see
these cut'n'pastes from my terminal history:

teak:~# mount /media/
mount: /dev/md_d0p1 is not a valid block device

teak:~# mount /dev/md_d0p1 /media
mount: you must specify the filesystem type

teak:~# xfs_repair -ln /dev/md_d0p2 /dev/md_d0p1
Usage: xfs_repair [-nLvV] [-o subopt[=value]] [-l logdev] [-r rtdev] devname
teak:~# ll /dev/md*
brw-rw 1 root disk 254, 0 2007-04-23 15:44 /dev/md_d0
brw-rw 1 root disk 254, 1 2007-04-23 14:46 /dev/md_d0p1
brw-rw 1 root disk 254, 2 2007-04-23 14:46 /dev/md_d0p2
brw-rw 1 root disk 254, 3 2007-04-23 15:44 /dev/md_d0p3
brw-rw 1 root disk 254, 4 2007-04-23 15:44 /dev/md_d0p4

/dev/md:
total 0
teak:~# /etc/init.d/mdadm-raid stop
Stopping MD array md_d0...done (stopped).
teak:~# /etc/init.d/mdadm-raid start
Assembling MD array md_d0...done (degraded [4/5]).
Generating udev events for MD arrays...done.
teak:~# cfdisk /dev/md_d0

teak:~# mount /dev/md_d0p1
mount: /dev/md_d0p1 is not a valid block device

and so on...

Notice the cfdisk command above. I did this to check the on-array table (it was
good). I assume cfdisk opens the array - but the partitions were still not there
afterwards. I did not do a 'Write' from in cfdisk this time.

I wouldn't be so concerned at a cosmetic thing in /proc/partitions - the problem
is that I can't mount my array after doing an assemble and I have to --create
each time - not the nicest solution.

Oh, I'm using udev FWIW.

David

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Partitioned arrays initially missing from /proc/partitions

2007-04-24 Thread David Greaves
Mike Accetta wrote:
> David Greaves writes:
> 
> ...
>> It looks like the same (?) problem as Mike (see below - Mike do you have a
>> patch?) but I'm on 2.6.20.7 with mdadm v2.5.6
> ...
> 
> We have since started assembling the array from the initrd using
> --homehost and --auto-update-homehost which takes a different path through
> the code, and in this path the kernel figures out there are partitions
> on the array before mdadm exits.
Just tried that - doesn't work :)


> For the previous code path, we had been ruuning with the patch I described
> in my original post which I've included below.  I'd guess that the bug
> is actually in the kernel code and I looked at it briefly but couldn't
> figure out how things all fit together well enough to come up with a
> patch there.  The user level patch is a bit of a hack and there may be
> other code paths that also need a similar patch.  I only made this patch
> in the assembly code path we were executing at the time.
> 
>  BUILD/mdadm/mdadm.c#2 (text) - BUILD/mdadm/mdadm.c#3 (text)  content
> @@ -983,6 +983,10 @@
> 			NULL,
> 			readonly, runstop, NULL, verbose-quiet, force);
> 		close(mdfd);
> +		mdfd = open(array_list->devname, O_RDONLY);
> +		if (mdfd >= 0) {
> +			close(mdfd);
> +		}
> 	}
> }
> break;

Thanks Mike

But this doesn't work for me either :(

I changed array_list to devlist inline with 2.6.9 and it compiles and runs OK.

teak:~# mdadm --stop /dev/md_d0
mdadm: stopped /dev/md_d0
teak:~# /everything/devel/mdadm/mdadm-2.5.6/mdadm --assemble /dev/md_d0
/dev/sd[bcdef]1
mdadm: With Fudge.
mdadm: /dev/md_d0 has been started with 5 drives.
mdadm: Fudging partition creation.
teak:~# mount /media
mount: /dev/md_d0p1 is not a valid block device
teak:~#

I also wrote a small c program to call the RAID_AUTORUN ioctl - that didn't work
either because I'd compiled RAID as a module so the ioctl isn't defined.

currently recompiling the kernel to allow autorun...

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Partitioned arrays initially missing from /proc/partitions

2007-04-24 Thread David Greaves
David Greaves wrote:

> currently recompiling the kernel to allow autorun...

Which of course won't work because I'm on 1.2 superblocks:
md: Autodetecting RAID arrays.
md: invalid raid superblock magic on sdb1
md: sdb1 has invalid sb, not importing!
md: invalid raid superblock magic on sdc1
md: sdc1 has invalid sb, not importing!
md: invalid raid superblock magic on sdd1
md: sdd1 has invalid sb, not importing!
md: invalid raid superblock magic on sde1
md: sde1 has invalid sb, not importing!
md: invalid raid superblock magic on sdf1
md: sdf1 has invalid sb, not importing!
md: autorun ...
md: ... autorun DONE.


David

PS Dropped Mike from cc since I doubt he's too interested :)
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Partitioned arrays initially missing from /proc/partitions

2007-04-24 Thread David Greaves
Neil Brown wrote:
> This problem is very hard to solve inside the kernel.
> The partitions will not be visible until the array is opened *after*
> it has been created.  Making the partitions visible before that would
> be possible, but would not be very easy.
> 
> I think the best solution is Mike's solution which is to simply
> open/close the array after it has been assembled.  I will make sure
> this is in the next release of mdadm.
> 
> Note that you can still access the partitions even though they do not
> appear in /proc/partitions.  Any attempt to access any of them will
> make them all appear in /proc/partitions.  But I understand there is
> sometimes value in seeing them before accessing them.
> 
> NeilBrown

For anyone else who is in this boat and doesn't fancy finding somewhere in mdadm
 to hack, here's a simple program that issues the BLKRRPART ioctl.
This re-reads the block device partition table and 'works for me'.

I think partx -a would do the same job but for some reason partx isn't in
util-linux for Debian...

Neil, isn't it easy to just do this after an assemble?

David

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKRRPART */

int main(int argc, char *argv[])
{
	int fd;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <block device>\n", argv[0]);
		return -1;
	}

	if ((fd = open(argv[1], O_RDONLY)) == -1) {
		fprintf(stderr, "Can't open md device %s\n", argv[1]);
		return -1;
	}

	/* ask the kernel to re-read the partition table on this device */
	if (ioctl(fd, BLKRRPART, NULL) != 0) {
		fprintf(stderr, "ioctl failed\n");
		close(fd);
		return -1;
	}

	close(fd);

	return 0;
}



Re: Partitioned arrays initially missing from /proc/partitions

2007-04-24 Thread David Greaves
Neil Brown wrote:
> On Tuesday April 24, [EMAIL PROTECTED] wrote:
>> Neil Brown wrote:
>>> This problem is very hard to solve inside the kernel.
>>> The partitions will not be visible until the array is opened *after*
>>> it has been created.  Making the partitions visible before that would
>>> be possible, but would not be very easy.
>>>
>>> I think the best solution is Mike's solution which is to simply
>>> open/close the array after it has been assembled.  I will make sure
>>> this is in the next release of mdadm.
>>>
>>> Note that you can still access the partitions even though they do not
>>> appear in /proc/partitions. Any attempt to access any of them will
>>> make them all appear in /proc/partitions.  But I understand there is
>>> sometimes value in seeing them before accessing them.
>>>
>>> NeilBrown
>> Um. Are you sure?
> 
> "Works for me".
Lucky you ;)

> What happens if you
>   blockdev --rereadpt /dev/md_d0
> ?? It probably works then.
Well, that's probably the same as my BLKRRPART ioctl so I guess yes.
[confirmed - yes, but blockdev seems to do it twice - I get 2 kernel messages]

> It sounds like someone is deliberately removing all the partition
> info.
Gremlins?

> Can you try this patch and see if it reports anyone calling
> '2' on md_d0 ??

Nope, not being called at all.

teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1
mdadm: /dev/md_d0 has been started with 5 drives.

dmesg:
md: bind
md: bind
md: bind
md: bind
md: bind
raid5: device sde1 operational as raid disk 0
raid5: device sdf1 operational as raid disk 4
raid5: device sdb1 operational as raid disk 3
raid5: device sdd1 operational as raid disk 2
raid5: device sdc1 operational as raid disk 1
raid5: allocated 5236kB for md_d0
raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2
RAID5 conf printout:
 --- rd:5 wd:5
 disk 0, o:1, dev:sde1
 disk 1, o:1, dev:sdc1
 disk 2, o:1, dev:sdd1
 disk 3, o:1, dev:sdb1
 disk 4, o:1, dev:sdf1
md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
created bitmap (10 pages) for device md_d0


teak:~# mount /media
mount: special device /dev/md_d0p1 does not exist

no dmesg


teak:~# blockdev --rereadpt /dev/md_d0
dmesg:
 md_d0: p1 p2
 md_d0: p1 p2


did I mention 2.6.20.7 and mdadm v2.5.6 and udev

I'd be happy if I've done something wrong...

anyway, more config data...

teak:~# mdadm --detail /dev/md_d0
/dev/md_d0:
Version : 01.02.03
  Creation Time : Mon Apr 23 15:13:35 2007
 Raid Level : raid5
 Array Size : 1250241792 (1192.32 GiB 1280.25 GB)
Device Size : 625120896 (298.08 GiB 320.06 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent

  Intent Bitmap : Internal

Update Time : Tue Apr 24 12:49:26 2007
  State : active
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

 Layout : left-symmetric
 Chunk Size : 64K

   Name : media
   UUID : f7835ba6:e38b6feb:c0cd2e2d:3079db59
 Events : 25292

Number   Major   Minor   RaidDevice State
   0   8   650  active sync   /dev/sde1
   1   8   331  active sync   /dev/sdc1
   2   8   492  active sync   /dev/sdd1
   5   8   173  active sync   /dev/sdb1
   4   8   81    4  active sync   /dev/sdf1
teak:~# cat /etc/mdadm/mdadm.conf
DEVICE partitions
ARRAY /dev/md_d0 auto=part level=raid5 num-devices=5
UUID=f7835ba6:e38b6feb:c0cd2e2d:3079db59
MAILADDR [EMAIL PROTECTED]



David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Partitioned arrays initially missing from /proc/partitions

2007-04-24 Thread David Greaves
Neil Brown wrote:
> On Tuesday April 24, [EMAIL PROTECTED] wrote:
>> Neil, isn't it easy to just do this after an assemble?
> 
> Yes, but it should not be needed, and I'd like to understand why it
> is.
> One of the last things do_md_run does is
>mddev->changed = 1;
> 
> When you next open /dev/md_d0, md_open is called which calls
> check_disk_change().
> This will call into md_fops->md_media_changed which will return the
> value of mddev->changed, which will be '1'.
> So check_disk_change will then call md_fops->revalidate_disk which
> will set mddev->changed to 0, and will then set bd_invalidated to 1
> (as bd_disk->minors > 1 (being 64)).
> 
> md_open will then return into do_open (in fs/block_dev.c) and because
> bd_invalidated is true, it will call rescan_partitions and the
> partitions will appear.
> 
> Hmmm... there is room for a race there.  If some other process opens
> /dev/md_d0 before mdadm gets to close it, it will call
> rescan_partitions before first calling  bd_set_size to update the size
> of the bdev.  So when we try to read the partition table, it will
> appear to be reading past the EOF, and will not actually read
> anything..
> 
> I guess udev must be opening the block device at exactly the wrong
> time. 
> 
> I can simulate this by holding /dev/md_d0 open while assembling the
> array.  If I do that, the partitions don't get created.
> Yuck.
> 
> Maybe I could call bd_set_size in md_open before calling
> check_disk_change..
> 
> Yep, this patch seems to fix it.  Could you confirm?
almost...

teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1
mdadm: /dev/md_d0 has been started with 5 drives.
teak:~# mount /media
teak:~# umount /media
teak:~# mdadm --stop /dev/md_d0
mdadm: stopped /dev/md_d0
teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1
mdadm: /dev/md_d0 has been started with 5 drives.
teak:~# mount /media
mount: No such file or directory
teak:~# mount /media
teak:~#
(second mount succeeds second time around)



md: md_d0 stopped.
md: bind
md: bind
md: bind
md: bind
md: bind
raid5: device sde1 operational as raid disk 0
raid5: device sdf1 operational as raid disk 4
raid5: device sdb1 operational as raid disk 3
raid5: device sdd1 operational as raid disk 2
raid5: device sdc1 operational as raid disk 1
raid5: allocated 5236kB for md_d0
raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2
RAID5 conf printout:
 --- rd:5 wd:5
 disk 0, o:1, dev:sde1
 disk 1, o:1, dev:sdc1
 disk 2, o:1, dev:sdd1
 disk 3, o:1, dev:sdb1
 disk 4, o:1, dev:sdf1
md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
created bitmap (10 pages) for device md_d0
 md_d0: p1 p2
Filesystem "md_d0p1": Disabling barriers, not supported with external log device
XFS mounting filesystem md_d0p1
Ending clean XFS mount for filesystem: md_d0p1
md: md_d0 stopped.
md: unbind
md: export_rdev(sde1)
md: unbind
md: export_rdev(sdf1)
md: unbind
md: export_rdev(sdb1)
md: unbind
md: export_rdev(sdd1)
md: unbind
md: export_rdev(sdc1)
md: md_d0 stopped.
md: bind
md: bind
md: bind
md: bind
md: bind
raid5: device sde1 operational as raid disk 0
raid5: device sdf1 operational as raid disk 4
raid5: device sdb1 operational as raid disk 3
raid5: device sdd1 operational as raid disk 2
raid5: device sdc1 operational as raid disk 1
raid5: allocated 5236kB for md_d0
raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2
RAID5 conf printout:
 --- rd:5 wd:5
 disk 0, o:1, dev:sde1
 disk 1, o:1, dev:sdc1
 disk 2, o:1, dev:sdd1
 disk 3, o:1, dev:sdb1
 disk 4, o:1, dev:sdf1
md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
created bitmap (10 pages) for device md_d0
 md_d0: p1 p2
XFS: Invalid device [/dev/md_d0p2], error=-2
Filesystem "md_d0p1": Disabling barriers, not supported with external log device
XFS mounting filesystem md_d0p1
Ending clean XFS mount for filesystem: md_d0p1


-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multiple disk failure, but slot numbers are corrupt and preventing assembly.

2007-04-24 Thread David Greaves
Leon Woestenberg wrote:
> David,
> 
> thanks for all the advice so far.

No problem :)

> In first instance we were searching for ways to tell mdadm what we
> know about the array (through mdadm.conf) but from all advice we got
> we have to take the 'usual' non-syncing-recreate approach.
> 
> We will try to make disk clones first. Will dd suffice or do I need
> something more fancy that maybe copes with source drive read errors in
> a better fashion?

ddrescue and dd_rescue are *much* better.

I favour the gnu ddrescue - it's much easier. But sometimes, on some kernels
with some hardware I've had kernel locks that dd_rescue (eventually, after many
minutes) times out from.

The RIP iso is a good place to start.
http://www.tux.org/pub/people/kent-robotti/looplinux/rip/

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Multiple disk failure, but slot numbers are corrupt and preventing assembly.

2007-04-26 Thread David Greaves
Bill Davidsen wrote:
> Leon Woestenberg wrote:
>> We will try to make disk clones first. Will dd suffice or do I need
>> something more fancy that maybe copes with source drive read errors in
>> a better fashion? 
> 
> Yes to both. dd will be fine in most cases, and I suggest using noerror
> to continue after errors, and oflag=direct just for performance. You
> could use ddrescue, it supposedly copes better with errors, although I
> don't know details.
> 

Hi Bill

IIRC dd will continue to operate on error but just retries a set number of times
and continues to the next sector.

ddrescue does clever things like jumping a few sectors when it hits an error,
then, after it has retrieved as much data off the disk as possible it starts to
bisect the error area.

This type of algorithm is good because failing disks often die within a few
minutes of being powered up and deteriorate rapidly as data is recovered -
hammering them with retries on an error at sector 312, 313, 314 when you have
millions of (currently) readable sectors elsewhere is a bad idea because it eats
into the time you have left to read your data.

It's also very fast, continuously reports where it is and the I/O rates it is
currently seeing and averaging, and writes a log of what it has done to a local
file so that a run can be restarted, etc.
try it - IMHO it's the right tool for the (data-recovery) job :)
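
A minimal sketch (GNU ddrescue; the image and logfile names are just examples):

  ddrescue -n /dev/sdb /recovery/sdb.img /recovery/sdb.log    # first pass: skip over errors
  ddrescue -r3 /dev/sdb /recovery/sdb.img /recovery/sdb.log   # then go back and retry the bad areas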

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID rebuild on Create

2007-04-30 Thread David Greaves
Jan Engelhardt wrote:
> Hi list,
> 
> 
> when a user does `mdadm -C /dev/md0 -l  -n  
> `, the array gets rebuilt for at least RAID1 and RAID5, even if 
> the disk contents are most likely not of importance (otherwise we would 
> not be creating a raid array right now). Could not this needless resync 
> be skipped - what do you think?
> 
> 
> Jan

This is an FAQ - and I'll put it there RSN :)

Here's one answer from Neil from the archives (google "avoiding the initial
resync on --create"):

Otherwise I agree.  There is no real need to perform the sync of a
raid1 at creation.
However it seems to be a good idea to regularly 'check' an array to
make sure that all blocks on all disks get read to find sleeping bad
blocks early.  If you didn't sync first, then every check will find
lots of errors.  Of course you could 'repair' instead of 'check'.  Or
do that once.  Or something.

For raid6 it is also safe to not sync first, though with the same
caveat as raid1.  Raid6 always updates parity by reading all blocks in
the stripe that aren't known and calculating P and Q.  So the first
write to a stripe will make P and Q correct for that stripe.
This is current behaviour.  I don't think I can guarantee it will
never change.

For raid5 it is NOT safe to skip the initial sync.  It is possible for
all updates to be "read-modify-write" updates which assume the parity
is correct.  If it is wrong, it stays wrong.  Then when you lose a
drive, the parity blocks are wrong so the data you recover using them
is wrong.

In summary, it is safe to use --assume-clean on a raid1 or raid10,
though I would recommend a "repair" before too long.  For other raid
levels it is best avoided.
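
For the raid1 case that boils down to something like this (sketch only - device
names invented):

  mdadm --create /dev/md0 -l1 -n2 --assume-clean /dev/sda1 /dev/sdb1
  # ...then, at some quiet moment, make the two halves really match:
  echo repair > /sys/block/md0/md/sync_action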

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid10 on centos 5

2007-05-04 Thread David Greaves
Ruslan Sivak wrote:
> So a custom kernel is needed?  Is there a way to do a kickstart install
> with the new kernel?  Or better yet, put it on the install cd?

have you tried:
 modprobe raid10

?

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Partitioned arrays initially missing from /proc/partitions

2007-05-07 Thread David Greaves
Hi Neil

Just wondering what the status is here - do you need any more from me or is it
on your stack?

The patch helped but didn't cure.
After a clean boot it mounted correctly first try.
Then I unmounted, stopped and re-assembled the array.
The next mount failed.
The subsequent mount succeeded.

How do other block devices initialise their partitions on 'discovery'?

David

David Greaves wrote:
> Neil Brown wrote:
>> On Tuesday April 24, [EMAIL PROTECTED] wrote:
>>> Neil, isn't it easy to just do this after an assemble?
>> Yes, but it should not be needed, and I'd like to understand why it
>> is.
>> One of the last things do_md_run does is
>>mddev->changed = 1;
>>
>> When you next open /dev/md_d0, md_open is called which calls
>> check_disk_change().
>> This will call into md_fops->md_media_changed which will return the
>> value of mddev->changed, which will be '1'.
>> So check_disk_change will then call md_fops->revalidate_disk which
>> will set mddev->changed to 0, and will then set bd_invalidated to 1
>> (as bd_disk->minors > 1 (being 64)).
>>
>> md_open will then return into do_open (in fs/block_dev.c) and because
>> bd_invalidated is true, it will call rescan_partitions and the
>> partitions will appear.
>>
>> Hmmm... there is room for a race there.  If some other process opens
>> /dev/md_d0 before mdadm gets to close it, it will call
>> rescan_partitions before first calling  bd_set_size to update the size
>> of the bdev.  So when we try to read the partition table, it will
>> appear to be reading past the EOF, and will not actually read
>> anything..
>>
>> I guess udev must be opening the block device at exactly the wrong
>> time. 
>>
>> I can simulate this by holding /dev/md_d0 open while assembling the
>> array.  If I do that, the partitions don't get created.
>> Yuck.
>>
>> Maybe I could call bd_set_size in md_open before calling
>> check_disk_change..
>>
>> Yep, this patch seems to fix it.  Could you confirm?
> almost...
> 
> teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1
> mdadm: /dev/md_d0 has been started with 5 drives.
> teak:~# mount /media
> teak:~# umount /media
> teak:~# mdadm --stop /dev/md_d0
> mdadm: stopped /dev/md_d0
> teak:~# mdadm --assemble /dev/md_d0 --auto=parts /dev/sd[bcdef]1
> mdadm: /dev/md_d0 has been started with 5 drives.
> teak:~# mount /media
> mount: No such file or directory
> teak:~# mount /media
> teak:~#
> (second mount succeeds second time around)
> 
> 
> 
> md: md_d0 stopped.
> md: bind
> md: bind
> md: bind
> md: bind
> md: bind
> raid5: device sde1 operational as raid disk 0
> raid5: device sdf1 operational as raid disk 4
> raid5: device sdb1 operational as raid disk 3
> raid5: device sdd1 operational as raid disk 2
> raid5: device sdc1 operational as raid disk 1
> raid5: allocated 5236kB for md_d0
> raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2
> RAID5 conf printout:
>  --- rd:5 wd:5
>  disk 0, o:1, dev:sde1
>  disk 1, o:1, dev:sdc1
>  disk 2, o:1, dev:sdd1
>  disk 3, o:1, dev:sdb1
>  disk 4, o:1, dev:sdf1
> md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
> created bitmap (10 pages) for device md_d0
>  md_d0: p1 p2
> Filesystem "md_d0p1": Disabling barriers, not supported with external log 
> device
> XFS mounting filesystem md_d0p1
> Ending clean XFS mount for filesystem: md_d0p1
> md: md_d0 stopped.
> md: unbind
> md: export_rdev(sde1)
> md: unbind
> md: export_rdev(sdf1)
> md: unbind
> md: export_rdev(sdb1)
> md: unbind
> md: export_rdev(sdd1)
> md: unbind
> md: export_rdev(sdc1)
> md: md_d0 stopped.
> md: bind
> md: bind
> md: bind
> md: bind
> md: bind
> raid5: device sde1 operational as raid disk 0
> raid5: device sdf1 operational as raid disk 4
> raid5: device sdb1 operational as raid disk 3
> raid5: device sdd1 operational as raid disk 2
> raid5: device sdc1 operational as raid disk 1
> raid5: allocated 5236kB for md_d0
> raid5: raid level 5 set md_d0 active with 5 out of 5 devices, algorithm 2
> RAID5 conf printout:
>  --- rd:5 wd:5
>  disk 0, o:1, dev:sde1
>  disk 1, o:1, dev:sdc1
>  disk 2, o:1, dev:sdd1
>  disk 3, o:1, dev:sdb1
>  disk 4, o:1, dev:sdf1
> md_d0: bitmap initialized from disk: read 1/1 pages, set 0 bits, status: 0
> created bitmap (10 pages) for device md_d0
>  md_d0: p1 p2
> XFS: Invalid device [/dev/md_d0p2], error=-2
> Filesystem "md_d0p1": Disabling barriers, not supported with external log 
> device
> XFS mounting filesystem md_d0p1
> Ending clean XFS mount for filesystem: md_d0p1
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Swapping out for larger disks

2007-05-08 Thread David Greaves
Brad Campbell wrote:
> G'day all,
> 
> I've got 3 arrays here. A 3 drive raid-5, a 10 drive raid-5 and a 15
> drive raid-6. They are all currently 250GB SATA drives.
> 
> I'm contemplating an upgrade to 500GB drives on one or more of the
> arrays and wondering the best way to do the physical swap.
> 
> The slow and steady way would be to degrade the array, remove a disk,
> add the new disk, lather, rinse, repeat. After which I could use mdadm
> --grow. There is the concern of a degraded array here though (and one of
> the reasons I'm looking to swap is some of the disks have about 30,000
> hours on the clock and are growing the odd defect).


Assuming hotswap and for maximum uptime/minimal exposure to risk... a while back
there was a discussion of a fiddly way that involved adding a disk, making a
mirror, removing the old disk, breaking the mirror. ( See archive for details)

> 
> I was more wondering about the feasibility of using dd to copy the drive
> contents to the larger drives (then I could do 5 at a time) and working
> it from there.
Err, if you can dd the drives, why can't you create a new array and use xfsdump
or equivalent? Is downtime due to copying that bad?
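
For reference, the "slow and steady" route you describe would look roughly like
this (sketch only - device names made up, one member at a time):

  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
  # swap in the 500GB drive, partition it, then:
  mdadm /dev/md0 --add /dev/sdc1
  # wait for the rebuild to finish before touching the next member;
  # once every member has been replaced:
  mdadm --grow /dev/md0 --size=max    # or give the size explicitly if 'max' isn't accepted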

David

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: removed disk && md-device

2007-05-10 Thread David Greaves
Neil Brown wrote:
> On Wednesday May 9, [EMAIL PROTECTED] wrote:
>> Neil Brown <[EMAIL PROTECTED]> [2007.04.02.0953 +0200]:
>>> Hmmm... this is somewhat awkward.  You could argue that udev should be
>>> taught to remove the device from the array before removing the device
>>> from /dev.  But I'm not convinced that you always want to 'fail' the
>>> device.   It is possible in this case that the array is quiescent and
>>> you might like to shut it down without registering a device failure...
>> Hmm, the kernel advised hotplug to remove the device from /dev, but you
>> don't want to remove it from md? Do you have an example for that case?
> 
> Until there is known to be an inconsistency among the devices in an
> array, you don't want to record that there is.
> 
> Suppose I have two USB drives with a mounted but quiescent filesystem
> on a raid1 across them.
> I pull them both out, one after the other, to take them to my friends
> place.
> 
> I plug them both in and find that the array is degraded, because as
> soon as I unplugged one, the other was told that it was now the only
> one.
And, in truth, so it was.

Who updated the event count though?

> Not good.  Best to wait for an IO request that actually returns an
> error.
Ah, now would that be a good time to update the event count?


Maybe you should allow drives to be removed even if they aren't faulty or spare?
A write to a removed device would mark it faulty in the other devices without
waiting for a timeout.

But joggling a usb stick (similar to your use case) would probably be OK since
it would be hot-removed and then hot-added.

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: removed disk && md-device

2007-05-11 Thread David Greaves
> back as something different (sdp?) - though I'm not completely familiar
> with how USB storage works.
Yes, so, assuming my proposal: in the case where you hot-remove sdb (not fail it)
and then hot-add sdp (same drive slot, different drive identifier, maybe a
different usb controller), can the on-disk superblock reliably ensure that the
array just continues (also assuming quiescence)?

> In any case, it should really be a user-space decision what happens
> then.  A hot re-add may well be appropriate, but I wouldn't want to
> have the kernel make that decision.

udev is userspace though - you could have a conservative no-add policy ruleset.


My proposal is simply to allow a hot-remove of a drive without marking it
faulty. This remove event would not update the event counts in other drives.
This allows transient (stupid human in the OP report) drive removal to be
properly communicated via udev to md. You don't end up in the situation of "the
drive formerly known as..."

Just out of interest.
Currently, if I unplug /dev/sdp (which is md0 slot3), wait, plug in a random
non-md usb drive which appears as /dev/sdp, what does md do? Just write to the
new /dev/sdp assuming it's the old one?

David



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to synchronize two devices (RAID-1, but not really?)

2007-05-15 Thread David Greaves
Tomasz Chmielewski wrote:
> Peter Rabbitson schrieb:
>> Tomasz Chmielewski wrote:
>>> I have a RAID-10 setup of four 400 GB HDDs. As the data grows by several
>>> GBs a day, I want to migrate it somehow to RAID-5 on separate disks in a
>>> separate machine.
>>>
>>> Which would be easy, if I didn't have to do it online, without stopping
>>> any services.
>>>
>>>
>>
>> Your /dev/md10 - what is directly on top of it? LVM? XFS? EXT3?
> 
> Good point. I don't want to copy the whole RAID-10.
> I want to copy only one LVM-2 volume (which is like 90% of that RAID-10,
> anyway).
> 
> 
> So I want to synchronize /dev/LVM2/my-volume (ext3) with /dev/sdr (now
> empty; bigger than /dev/LVM2/my-volume).
> 
> 
> (sda2, sdb2, sdc2, sdd2) -> RAID-10 -> LVM-2 -> my volume -> ext3
> 
> 


I've not used iSCSI but I wonder about using nbd : network block device

Use nbd to export /dev/md5 from machine 2.
Import /dev/nbd0 on machine 1.
Add nbd0 to the VG on machine 1
pvmove the data from /dev/md10 to /dev/nbd0 (ie the md5 on machine2 via nbd)
remove /dev/md10 from the VG.
The VG should now exist only on /dev/nbd0 on machine 2
stop the services and lvm on machine 1
start the lvm and services on machine 2.

I'd suggest testing this first.
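
A rough sketch of those steps (untested; I'm assuming the VG is called LVM2, the
port number is arbitrary, and nbd-server/nbd-client syntax varies a little
between versions):

  machine2# nbd-server 2000 /dev/md5
  machine1# nbd-client machine2 2000 /dev/nbd0
  machine1# pvcreate /dev/nbd0
  machine1# vgextend LVM2 /dev/nbd0
  machine1# pvmove /dev/md10 /dev/nbd0
  machine1# vgreduce LVM2 /dev/md10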

David
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-24 Thread David Chinner
On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote:
> Including XFS mailing list on this one.

Thanks Justin.

> On Thu, 24 May 2007, Pallai Roland wrote:
> 
> >
> >Hi,
> >
> >I wondering why the md raid5 does accept writes after 2 disks failed. I've 
> >an
> >array built from 7 drives, filesystem is XFS. Yesterday, an IDE cable 
> >failed
> >(my friend kicked it off from the box on the floor:) and 2 disks have been
> >kicked but my download (yafc) not stopped, it tried and could write the 
> >file
> >system for whole night!
> >Now I changed the cable, tried to reassembly the array (mdadm -f --run),
> >event counter increased from 4908158 up to 4929612 on the failed disks, 
> >but I
> >cannot mount the file system and the 'xfs_repair -n' shows lot of errors
> >there. This is expainable by the partially successed writes. Ext3 and JFS
> >has "error=" mount option to switch filesystem read-only on any error, but
> >XFS hasn't: why?

"-o ro,norecovery" will allow you to mount the filesystem and get any
uncorrupted data off it.
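
i.e. something along the lines of (device and mount point are just examples):

  mount -t xfs -o ro,norecovery /dev/md0 /mnt/recover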

You still may get shutdowns if you trip across corrupted metadata in
the filesystem, though.

> >It's a good question too, but I think the md layer could
> >save dumb filesystems like XFS if denies writes after 2 disks are failed, 
> >and
> >I cannot see a good reason why it's not behave this way.

How is *any* filesystem supposed to know that the underlying block
device has gone bad if it is not returning errors?

I did mention this exact scenario in the filesystems workshop back
in february - we'd *really* like to know if a RAID block device has gone
into degraded mode (i.e. lost a disk) so we can throttle new writes
until the rebuild has been completed. Stopping writes completely on a
fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
would also be possible if only we could get the information out
of the block layer.

> >Do you have better idea how can I avoid such filesystem corruptions in the
> >future? No, I don't want to use ext3 on this box. :)

Well, the problem is a bug in MD - it should have detected
drives going away and stopped access to the device until it was
repaired. You would have had the same problem with ext3, or JFS,
or reiser or any other filesystem, too.

> >my mount error:
> >XFS: Log inconsistent (didn't find previous header)
> >XFS: failed to find log head
> >XFS: log mount/recovery failed: error 5
> >XFS: log mount failed

You MD device is still hosed - error 5 = EIO; the md device is
reporting errors back the filesystem now. You need to fix that
before trying to recover any data...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-24 Thread David Chinner
On Fri, May 25, 2007 at 03:35:48AM +0200, Pallai Roland wrote:
> On Fri, 2007-05-25 at 10:05 +1000, David Chinner wrote:
> > > >It's a good question too, but I think the md layer could
> > > >save dumb filesystems like XFS if denies writes after 2 disks are 
> > > >failed, 
> > > >and
> > > >I cannot see a good reason why it's not behave this way.
> > 
> > How is *any* filesystem supposed to know that the underlying block
> > device has gone bad if it is not returning errors?
>  It is returning errors, I think so. If I try to write raid5 with 2
> failed disks with dd, I've got errors on the missing chunks.

Oh, did you look at your logs and find that XFS had spammed them
about writes that were failing?

>  The difference between ext3 and XFS is that ext3 will remount to
> read-only on the first write error but the XFS won't, XFS only fails
> only the current operation, IMHO. The method of ext3 isn't perfect, but
> in practice, it's working well.

XFS will shutdown the filesystem if metadata corruption will occur
due to a failed write. We don't immediately fail the filesystem on
data write errors because on large systems you can get *transient*
I/O errors (e.g. FC path failover) and so retrying failed data
writes is useful for preventing unnecessary shutdowns of the
filesystem.

Different design criteria, different solutions...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-25 Thread David Chinner
On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
> > >  The difference between ext3 and XFS is that ext3 will remount to
> > > read-only on the first write error but the XFS won't, XFS only fails
> > > only the current operation, IMHO. The method of ext3 isn't perfect, but
> > > in practice, it's working well.
> > 
> > XFS will shutdown the filesystem if metadata corruption will occur
> > due to a failed write. We don't immediately fail the filesystem on
> > data write errors because on large systems you can get *transient*
> > I/O errors (e.g. FC path failover) and so retrying failed data
> > writes is useful for preventing unnecessary shutdowns of the
> > filesystem.
> > 
> > Different design criteria, different solutions...
> 
> I think his point was that going into a read only mode causes a
> less catastrophic situation (ie. a web server can still serve
> pages).

Sure - but once you've detected one corruption or had metadata
I/O errors, can you trust the rest of the filesystem?

> I think that is a valid point, rather than shutting down
> the file system completely, an automatic switch to where the least
> disruption of service can occur is always desired.

I consider the possibility of serving out bad data (i.e after
a remount to readonly) to be the worst possible disruption of
service that can happen ;)

> Maybe the automatic failure mode could be something that is 
> configurable via the mount options.

If only it were that simple. Have you looked to see how many
hooks there are in XFS to shutdown without causing further
damage?

% grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l
116

Changing the way we handle shutdowns would take a lot of time,
effort and testing. When can I expect a patch? ;)

> I personally have found the XFS file system to be great for
> my needs (except issues with NFS interaction, where the bug report
> never got answered), but that doesn't mean it can not be improved.

Got a pointer?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread David Chinner
On Fri, May 25, 2007 at 05:58:25PM +1000, Neil Brown wrote:
> We can think of there being three types of devices:
>  
> 1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
>   there is it is non-volatile.  Once a write completes it is 
>   completely safe.  Such a device does not require barriers
>   or ->issue_flush_fn, and can respond to them either by a
> no-op or with -EOPNOTSUPP (the former is preferred).
> 
> 2/ FLUSHABLE.
>   A FLUSHABLE device may have a volatile write-behind cache.
>   This cache can be flushed with a call to blkdev_issue_flush.
> It may not support barrier requests.

So returns -EOPNOTSUPP to any barrier request?

> 3/ BARRIER.
> A BARRIER device supports both blkdev_issue_flush and
>   BIO_RW_BARRIER.  Either may be used to synchronise any
> write-behind cache to non-volatile storage (media).
> 
> Handling of SAFE and FLUSHABLE devices is essentially the same and can
> work on a BARRIER device.  The BARRIER device has the option of more
> efficient handling.
> 
> How does a filesystem use this?
> ===

> 
> The filesystem will want to ensure that all preceding writes are safe
> before writing the barrier block.  There are two ways to achieve this.

Three, actually.

> 1/  Issue all 'preceding writes', wait for them to complete (bi_endio
>called), call blkdev_issue_flush, issue the commit write, wait
>for it to complete, call blkdev_issue_flush a second time.
>(This is needed for FLUSHABLE)

*nod*

> 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
> block.
>(This is more efficient on BARRIER).

*nod*

3/ Use a SAFE device.

> The second, while much easier, can fail.

So we do a test I/O to see if the device supports them before
enabling that mode.  But, as we've recently discovered, this is not
sufficient to detect *correctly functioning* barrier support.

> So a filesystem should be
> prepared to deal with that failure by falling back to the first
> option.

I don't buy that argument.

> Thus the general sequence might be:
> 
>   a/ issue all "preceding writes".
>   b/ issue the commit write with BIO_RW_BARRIER

At this point, the filesystem has done everything it needs to ensure
that the block layer has been informed of the I/O ordering
requirements. Why should the filesystem now have to detect block
layer breakage, and then use a different block layer API to issue
the same I/O under the same constraints?

>   c/ wait for the commit to complete.
>  If it was successful - done.
>  If it failed other than with EOPNOTSUPP, abort
>  else continue
>   d/ wait for all 'preceding writes' to complete
>   e/ call blkdev_issue_flush
>   f/ issue commit write without BIO_RW_BARRIER
>   g/ wait for commit write to complete
>if it failed, abort
>   h/ call blkdev_issue_flush?

>   DONE
> 
> steps b and c can be left out if it is known that the device does not
> support barriers.  The only way to discover this to try and see if it
> fails.

That's a very linear, single-threaded way of looking at it... ;)

> I don't think any filesystem follows all these steps.
> 
>  ext3 has the right structure, but it doesn't include steps e and h.
>  reiserfs is similar.  It does have a call to blkdev_issue_flush, but 
>   that is only on the fsync path, so it isn't really protecting
>   general journal commits.
>  XFS - I'm less sure.  I think it does 'a' then 'd', then 'b' or 'f'
>depending on a whether it thinks the device handles barriers,
>and finally 'g'.

That's right, except for the "g" (or "c") bit - commit writes are
async and nothing waits for them - the io completion wakes anything
waiting on its completion.

(yes, all XFS barrier I/Os are issued async which is why having to
handle an -EOPNOTSUPP error is a real pain. The fix I currently
have is to reissue the I/O from the completion handler which is
ugly, ugly, ugly.)

> So for devices that support BIO_RW_BARRIER, and for devices that don't
> need any flush, they work OK, but for device that need flushing, but
> don't support BIO_RW_BARRIER, none of them work.  This should be easy
> to fix.

Right - XFS as it stands was designed to work on SAFE devices, and
we've modified it to work on BARRIER devices. We don't support
FLUSHABLE devices at all.

But if the filesystem supports BARRIER devices, I don't see any
reason why a filesystem needs to be modified to support FLUSHABLE
devices - the key point being that by the time the filesystem
has issued the "commit write" it has already waited for all its
dependent I/O, and so all the block device needs to do is
issue flushes on either side of the commit write.

> HOW DO MD or DM USE THIS
> 
> 
> 1/ striping devices.
>  This includes md/raid0 md/linear dm-linear dm-stripe and probably
>  others. 
> 
>These devices can easily supp

Re: raid10 kernel panic on sparc64

2007-05-26 Thread David Miller
From: Jan Engelhardt <[EMAIL PROTECTED]>
Date: Sat, 26 May 2007 17:10:30 +0200 (MEST)

> 
> On Apr 12 2007 14:26, David Miller wrote:
> >
> >> Kernel is kernel-smp-2.6.16-1.2128sp4.sparc64.rpm from Aurora Corona.
> >> Perhaps it helps, otherwise hold your breath until I reproduce it.
> >
> >Jan, if you can reproduce this with the current 2.6.20 vanilla
> >kernel I'd be very interested in a full trace so that I can
> >try to fix this.
> >
> >With the combination of an old kernel and only part of the
> >crash trace, there isn't much I can do with this report.
> 
> Does not seem to happen under 2.6.21-1.3149.al3.2smp anymore.

Thanks for following up on this Jan.

I'd personally really appreciate reports against upstream
instead of dist kernels in the future, and I'm sure the
linux-raid maintainers feel similarly :-)
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

