Re: BUGS: internal bitmap during array create

2006-10-18 Thread Neil Brown
On Wednesday October 18, [EMAIL PROTECTED] wrote:
> 
> 
> I've provided the requested info, attached as two files (typescript
> output):

Thanks for persisting with this.

There is one bug in mdadm that is causing all of these problems.  It
only affects the 'offset' layout with raid10.

The fix is 
http://neil.brown.name/git?p=mdadm;a=commitdiff;h=702b557b1c9

and is included below.
You might like to grab the latest source from 
  git://neil.brown.name/mdadm
and compile that, or just apply the patch.

Thanks again,
NeilBrown

-
Fix bugs related to raid10 and the new offset layout.

Need to mask off bits above the bottom 16 when calculating the number of
copies.

### Diffstat output
 ./ChangeLog |1 +
 ./Create.c  |2 +-
 ./util.c|2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff .prev/ChangeLog ./ChangeLog
--- .prev/ChangeLog 2006-10-19 16:38:07.0 +1000
+++ ./ChangeLog 2006-10-19 16:38:24.0 +1000
@@ -13,6 +13,7 @@ Changes Prior to this release
initramfs, but device doesn't yet exist in /dev.
 -   When --assemble --scan is run, if all arrays that could be found
have already been started, don't report an error.
+-   Fix a couple of bugs related to raid10 and the new 'offset' layout.
 
 Changes Prior to 2.5.4 release
 -   When creating devices in /dev/md/ create matching symlinks

diff .prev/Create.c ./Create.c
--- .prev/Create.c  2006-10-19 16:38:07.0 +1000
+++ ./Create.c  2006-10-19 16:38:24.0 +1000
@@ -363,7 +363,7 @@ int Create(struct supertype *st, char *m
 * which is array.size * raid_disks / ncopies;
 * .. but convert to sectors.
 */
-   int ncopies = (layout>>8) * (layout & 255);
+   int ncopies = ((layout>>8) & 255) * (layout & 255);
bitmapsize = (unsigned long long)size * raiddisks / ncopies * 2;
 /* printf("bms=%llu as=%d rd=%d nc=%d\n", bitmapsize, size, 
raiddisks, ncopies);*/
} else

diff .prev/util.c ./util.c
--- .prev/util.c2006-10-19 16:38:07.0 +1000
+++ ./util.c2006-10-19 16:38:24.0 +1000
@@ -179,7 +179,7 @@ int enough(int level, int raid_disks, in
/* This is the tricky one - we need to check
 * which actual disks are present.
 */
-   copies = (layout&255)* (layout>>8);
+   copies = (layout&255)* ((layout>>8) & 255);
first=0;
do {
/* there must be one of the 'copies' form 'first' */
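To see why only the 'offset' layout was affected: the raid10 'layout' word
packs the near-copy count in bits 0-7, the far/offset-copy count in bits 8-15,
and uses bit 16 as the 'offset' flag, so without the "& 255" the flag bit
leaks into the copy count.  Below is a minimal standalone sketch (not part of
the patch; the layout constants are the usual n2/f2/o2 encodings, shown for
illustration only) comparing the old and patched calculations:

/* Minimal sketch, not part of the patch: decode a raid10 layout word and
 * compare the old and new ncopies calculations. */
#include <stdio.h>

static void show(unsigned int layout)
{
	int near   = layout & 255;          /* near copies, bits 0-7       */
	int far    = (layout >> 8) & 255;   /* far/offset copies, bits 8-15 */
	int offset = (layout >> 16) & 1;    /* 'offset' flag, bit 16        */
	int old    = (layout >> 8) * (layout & 255);  /* unmasked (buggy)   */
	int new    = far * near;                      /* patched            */

	printf("layout 0x%05x: near=%d far=%d offset=%d  ncopies: old=%d new=%d\n",
	       layout, near, far, offset, old, new);
}

int main(void)
{
	show(0x00102);   /* n2: old = new = 2                 */
	show(0x00201);   /* f2: old = new = 2                 */
	show(0x10201);   /* o2: old = 258 instead of 2        */
	return 0;
}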
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Propose of enhancement of raid1 driver

2006-10-18 Thread Neil Brown
On Tuesday October 17, [EMAIL PROTECTED] wrote:
> I would like to propose an enhancement of the raid1 driver in the linux kernel.
> The enhancement would be a speedup of data reading on mirrored partitions.
> The idea is easy.
> If we have a mirrored partition over 2 disks, and these disks are in sync,
> there is the possibility of reading the data from both disks simultaneously,
> in the same way as in raid 0.  So it would be chunk1 read from the master and
> chunk2 read from the slave at the same time.
> As a result it would give a significant speedup of the read operation
> (comparable with the speed of raid 0 disks).

This is not as easy as it sounds.
Skipping over blocks within a track is no faster than reading blocks
in the track, so you would need to make sure that your chunk size is
larger than one track - probably it would need to be several tracks.

Raid1 already does some read-balancing, though it is possible (even
likely) that it doesn't balance very effectively.  Working out how
best to do the balancing in general is a non-trivial task, but it would
be worth spending time on.

The raid10 module in linux supports a layout described as 'far=2'.
In this layout, with two drives, the first half of the drives is used
for a raid0, and the second half is used for a mirrored raid0 with the
data on the other disk.
In this layout reads should certainly go at raid0 speeds, though
there is a cost in the speed of writes.
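
To illustrate the placement just described, here is a rough sketch of the
two-drive 'far=2' case only (the chunk counts are toy values and this is not
the kernel's actual mapping code): each chunk has one copy in the first half
of one drive and a mirror in the second half of the other drive.

/* Rough sketch of 'far=2' placement with two drives, as described above:
 * the first half of each drive holds a raid0 of the chunks, the second
 * half holds the same chunks mirrored onto the other drive. */
#include <stdio.h>

int main(void)
{
	long half_chunks = 8;                 /* chunks per half-drive (toy value) */
	for (long chunk = 0; chunk < 6; chunk++) {
		int  dev0 = chunk % 2;        /* raid0 copy, first half            */
		long off0 = chunk / 2;
		int  dev1 = 1 - dev0;         /* mirror, second half, other drive  */
		long off1 = half_chunks + chunk / 2;
		printf("chunk %ld: copy0 -> disk%d @ chunk %ld, copy1 -> disk%d @ chunk %ld\n",
		       chunk, dev0, off0, dev1, off1);
	}
	return 0;
}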

Maybe you would like to experiment.  Write a program that reads from
two drives in parallel, reading all the 'odd' chunks from one drive
and the 'even' chunks from the other, and find out how fast it is.
Maybe you could get it to try lots of different chunk sizes and see
which is the fastest.
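
For anyone who wants to try it, here is a minimal sketch of such a test
program (the device paths come from the command line; the chunk size and the
amount read are placeholders to tune; this is only the experiment described
above, nothing from md itself):

/* Read NCHUNKS chunks split across two drives: one thread reads the even
 * chunks from the first device, the other reads the odd chunks from the
 * second device, then report the combined throughput. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)     /* chunk size to experiment with */
#define NCHUNKS 1024              /* total chunks across both drives */

struct job { const char *dev; int parity; };   /* parity 0 = even chunks, 1 = odd */

static void *reader(void *arg)
{
	struct job *j = arg;
	char *buf = malloc(CHUNK);
	int fd = open(j->dev, O_RDONLY);

	if (fd < 0 || !buf) {
		perror(j->dev);
		exit(1);
	}
	for (long i = j->parity; i < NCHUNKS; i += 2)
		if (pread(fd, buf, CHUNK, (off_t)i * CHUNK) != CHUNK) {
			perror("pread");
			break;
		}
	close(fd);
	free(buf);
	return NULL;
}

int main(int argc, char **argv)
{
	struct timespec t0, t1;
	pthread_t ta, tb;

	if (argc != 3) {
		fprintf(stderr, "usage: %s /dev/diskA /dev/diskB\n", argv[0]);
		return 1;
	}
	struct job a = { argv[1], 0 }, b = { argv[2], 1 };

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&ta, NULL, reader, &a);
	pthread_create(&tb, NULL, reader, &b);
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	double mib  = (double)NCHUNKS * CHUNK / (1024.0 * 1024.0);
	printf("read %.0f MiB in %.2f s (%.1f MiB/s)\n", mib, secs, mib / secs);
	return 0;
}

Compile with -pthread (and -lrt on older glibc); for honest numbers you would
want to drop the page cache between runs (or use O_DIRECT with aligned
buffers) and vary CHUNK.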

That might be quite helpful in understanding how to get read-balancing
working well.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new features time-line

2006-10-18 Thread Neil Brown
On Tuesday October 17, [EMAIL PROTECTED] wrote:
> We talked about RAID5E a while ago, is there any thought that this would 
> actually happen, or is it one of the "would be nice" features? With 
> larger drives I suspect the number of drives in arrays is going down, 
> and anything which offers performance benefits for smaller arrays would 
> be useful.

So ... RAID5E is RAID5 using (N-1)/N of each drive (or close to that)
and not having a hot spare.
On a drive failure, the data is restriped across N-1 drives so that it
becomes plain RAID5.  This means that instead of having an idle spare,
you have spare space at the end of each drive.
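
To put rough numbers on that, here is a back-of-the-envelope sketch (my own
arithmetic, not from the mail above): for the restriped N-1 drive RAID5 to
hold everything, each of the N drives can carry at most (N-2)/(N-1) of its
capacity as data+parity, and usable space is (N-2)*S either way, i.e. the
same as RAID5 plus a hot spare.

/* Back-of-the-envelope sketch of the RAID5E space trade-off (my own
 * arithmetic, not from the mail): usable capacity matches RAID5 + hot
 * spare, but the spare space lives at the end of every member. */
#include <stdio.h>

int main(void)
{
	double S = 400.0;                                   /* GB per drive, example value  */
	for (int N = 4; N <= 8; N++) {
		double frac   = (double)(N - 2) / (N - 1);  /* usable fraction of each drive */
		double usable = (N - 2) * S;                /* same before and after restripe */
		double spare  = N * (1.0 - frac) * S;       /* spare space spread over members */
		printf("N=%d drives of %.0f GB: use %.1f%% of each, usable %.0f GB, spare %.0f GB\n",
		       N, S, 100.0 * frac, usable, spare);
	}
	return 0;
}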

To implement this you would need kernel code to restripe an array to
reduce the number of devices (currently we can only increase the number
of devices).

Probably not too hard - just needs code and motivation.  

Don't know if/when it will happen, but it probably will, especially if
someone tries writing some code (hint, hint to any potential developers
out there...)

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.

2006-10-18 Thread Eli Stair


FYI, I'm testing 2.6.18.1 and noticed that this mis-numbering of RAID10
members is still present.  Even with this fix applied to raid10.c,
I am still seeing repeatable cases of devices assuming a "Number"
greater than the one they had when removed from a running array.


Issue 1)

I'm seeing inconsistencies in the way a drive is marked (and its 
behaviour) during rebuild after it is removed and added.  In this 
instance, the re-added drive is picked up and marked as "spare 
rebuilding".


 Rebuild Status : 20% complete

   Name : 0
   UUID : ab764369:7cf80f2b:cf61b6df:0b13cd3a
 Events : 1

Number   Major   Minor   RaidDevice State
   0     253        0        0      active sync   /dev/dm-0
   1     253        1        1      active sync   /dev/dm-1
   2     253       10        2      active sync   /dev/dm-10
   3     253       11        3      active sync   /dev/dm-11
   4     253       12        4      active sync   /dev/dm-12
   5     253       13        5      active sync   /dev/dm-13
   6     253        2        6      active sync   /dev/dm-2
   7     253        3        7      active sync   /dev/dm-3
   8     253        4        8      active sync   /dev/dm-4
   9     253        5        9      active sync   /dev/dm-5
  10     253        6       10      active sync   /dev/dm-6
  11     253        7       11      active sync   /dev/dm-7
  12     253        8       12      active sync   /dev/dm-8
  13     253        9       13      active sync   /dev/dm-9
[EMAIL PROTECTED] ~]# cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 dm-9[13] dm-8[12] dm-7[11] dm-6[10] dm-5[9] dm-4[8] 
dm-3[7] dm-2[6] dm-13[5] dm-12[4] dm-11[3] dm-10[2] dm-1[1] dm-0[0]
      1003620352 blocks super 1.2 512K chunks 2 offset-copies [14/14] [UUUUUUUUUUUUUU]
      [====>................]  resync = 21.7% (218664064/1003620352) finish=114.1min speed=114596K/sec





However, on the same configuration, it occasionally is pulled right back
with a state of "active sync", without any indication that it is dirty:


Issue 2)

When a device is removed and subsequently added again (after setting it
failed and removing it from the array), it SHOULD be set back to the
"Number" it originally had in the array, correct?  In the cases where the
drive is NOT automatically marked as "active sync" and all members show
up fine, it is picked up as a spare and a rebuild is started, during which
time it is marked down ("_") in the /proc/mdstat output, and "spare
rebuilding" in the mdadm -D output:




When device "Number" 10


// STATE WHEN CLEAN:
   UUID : 6ccd7974:1b23f5b2:047d1560:b5922692

Number   Major   Minor   RaidDevice State
   0     253        0        0      active sync   /dev/dm-0
   1     253        1        1      active sync   /dev/dm-1
   2     253       10        2      active sync   /dev/dm-10
   3     253       11        3      active sync   /dev/dm-11
   4     253       12        4      active sync   /dev/dm-12
   5     253       13        5      active sync   /dev/dm-13
   6     253        2        6      active sync   /dev/dm-2
   7     253        3        7      active sync   /dev/dm-3
   8     253        4        8      active sync   /dev/dm-4
   9     253        5        9      active sync   /dev/dm-5
  10     253        6       10      active sync   /dev/dm-6
  11     253        7       11      active sync   /dev/dm-7
  12     253        8       12      active sync   /dev/dm-8
  13     253        9       13      active sync   /dev/dm-9


// STATE AFTER FAILURE:
Number   Major   Minor   RaidDevice State
   0     253        0        0      active sync   /dev/dm-0
   1     253        1        1      active sync   /dev/dm-1
   2       0        0        2      removed
   3     253       11        3      active sync   /dev/dm-11
   4     253       12        4      active sync   /dev/dm-12
   5     253       13        5      active sync   /dev/dm-13
   6     253        2        6      active sync   /dev/dm-2
   7     253        3        7      active sync   /dev/dm-3
   8     253        4        8      active sync   /dev/dm-4
   9     253        5        9      active sync   /dev/dm-5
  10     253        6       10      active sync   /dev/dm-6
  11     253        7       11      active sync   /dev/dm-7
  12     253        8       12      active sync   /dev/dm-8
  13     253        9       13      active sync   /dev/dm-9

   2     253       10        -      faulty spare   /dev/dm-10

// STATE AFTER REMOVAL:
Number   Major   Minor   RaidDevice State
   0     253        0        0      active sync   /dev/dm-0
   1     253        1        1      active sync   /dev/dm-1
   2       0        0        2      removed
   3     253       11        3      active sync   /dev/dm-11
   4     253       12

Re: why partition arrays?

2006-10-18 Thread Doug Ledford
On Wed, 2006-10-18 at 15:43 +0200, martin f krafft wrote:
> also sprach Doug Ledford <[EMAIL PROTECTED]> [2006.10.18.1526 +0200]:
> > There are a couple reasons I can think.
> 
> Thanks for your elaborate response. If you don't mind, I shall link
> to it from the FAQ.

Sure.

> I have one other question: do partitionable and traditional arrays
> actually differ in format? Put differently: can I assemble
> a traditional array as a partitionable one simply by specifying:
> 
>   mdadm --create ... /dev/md0 ...
>   mdadm --stop /dev/md0
>   mdadm --assemble --auto=part ... /dev/md0 ...
> 
> ? Or do the superblocks actually differ?

Neil would be more authoritative about what would differ in the
superblocks, but yes, it is possible to do as you listed above.  In
fact, if you create a partitioned array, and your mkinitrd doesn't
restart it as a partitioned array, you'll wonder how to mount your
filesystems since the system will happily start that originally
partitioned array as non-partitioned.

-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


Re: why partition arrays?

2006-10-18 Thread martin f krafft
also sprach Doug Ledford <[EMAIL PROTECTED]> [2006.10.18.1526 +0200]:
> There are a couple of reasons I can think of.

Thanks for your elaborate response. If you don't mind, I shall link
to it from the FAQ.

I have one other question: do partitionable and traditional arrays
actually differ in format? Put differently: can I assemble
a traditional array as a partitionable one simply by specifying:

  mdadm --create ... /dev/md0 ...
  mdadm --stop /dev/md0
  mdadm --assemble --auto=part ... /dev/md0 ...

? Or do the superblocks actually differ?

Thanks,

-- 
martin;  (greetings from the heart of the sun.)
  \ echo mailto: !#^."<*>"|tr "<*> mailto:"; [EMAIL PROTECTED]
 
spamtraps: [EMAIL PROTECTED]
 
the images rushed around his mind and tried
to find somewhere to settle down and make sense.
-- douglas adams, "the hitchhiker's guide to the galaxy"


signature.asc
Description: Digital signature (GPG/PGP)


Re: why partition arrays?

2006-10-18 Thread Doug Ledford
On Wed, 2006-10-18 at 14:42 +0200, martin f krafft wrote:
> Why would anyone want to create a partitionable array and put
> partitions in it, rather than creating separate arrays for each
> filesystem? Intuitively, this makes way more sense as then the
> partitions are independent of each other; one array can fail and the
> rest still works -- part of the reason why you partition in the
> first place.
> 
> Would anyone help me answer this FAQ?

There are a couple of reasons I can think of.

First, not all md types make sense to be split up, aka multipath.  For
those types, when a disk fails, the *entire* disk is considered to be
failed, but with different arrays you won't fail over to the next path
until each md array has attempted to access the bad path.  This can have
obvious bad consequences for certain array types that do automatic
failover from one port to another (you can end up getting the array in a
loop of switching ports repeatedly to satisfy the fact that one array
failed over during a path down, then the path came back up, and another
array stayed on the old path because it didn't send any commands during
the path down time period).

Second, convenience.  Assume you have a 6 disk raid5 array.  If a disk
fails and you are using a partitioned md array, then all the partitions
on the disk will already be handled without using that disk.  No need to
manually fail any still active array members from other arrays.

Third, safety.  Again with the raid5 array.  If you use multiple arrays
on a single disk, and that disk fails, but it only failed on one array,
then you now need to manually fail that disk from the other arrays
before shutting down or hot swapping the disk.  Generally speaking,
that's not a big deal, but people do occasionally have fat finger
syndrome and this is a good opportunity for someone to accidentally fail
the wrong disk, and when you then go to remove the disk you create a two
disk failure instead of one and now you are in real trouble.

Fourth, to respond to what you wrote about the partitions being independent of each other --
part of the reason why you partition.  I would argue that's not true.
If your goal is to salvage as much use from a failing disk as possible,
then OK.  But, generally speaking, people that have something of value
on their disks don't want to salvage any part of a failing disk, they
want that disk gone and replaced immediately.  There simply is little to
no value in an already malfunctioning disk.  They're too cheap and the
data stored on them too valuable to risk losing something in an effort
to further utilize broken hardware.  This of course is written with the
understanding that the latest md raid code will do read error rewrites
to compensate for minor disk issues, so anything that will throw a disk
out of an array is more than just a minor sector glitch.

> (btw: [0] and [1] are obviously for public consumption; they are
> available under the terms of the artistic licence 2.0)
> 
> 0. 
> http://svn.debian.org/wsvn/pkg-mdadm/mdadm/trunk/debian/FAQ?op=file&rev=0&sc=0
> 1. 
> http://svn.debian.org/wsvn/pkg-mdadm/mdadm/trunk/debian/README.recipes?op=file&rev=0&sc=0
> 
-- 
Doug Ledford <[EMAIL PROTECTED]>
  GPG KeyID: CFBFF194
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband


signature.asc
Description: This is a digitally signed message part


why partition arrays?

2006-10-18 Thread martin f krafft
As the Debian mdadm maintainer, I am often subjected to questions
about partitionable arrays; people seem to want to use them in
favour of normal arrays. I don't understand why.

There's possibly an argument to be made about flexibility when it
comes to resizing partitions within the array, but even most MD
array types can be resized now.

There's possibly an argument about saving space because of fewer
sectors used/wasted with superblock information, but I am not going
to buy that.

Why would anyone want to create a partitionable array and put
partitions in it, rather than creating separate arrays for each
filesystem? Intuitively, this makes way more sense as then the
partitions are independent of each other; one array can fail and the
rest still works -- part of the reason why you partition in the
first place.

Would anyone help me answer this FAQ?

(btw: [0] and [1] are obviously for public consumption; they are
available under the terms of the artistic licence 2.0)

0. 
http://svn.debian.org/wsvn/pkg-mdadm/mdadm/trunk/debian/FAQ?op=file&rev=0&sc=0
1. 
http://svn.debian.org/wsvn/pkg-mdadm/mdadm/trunk/debian/README.recipes?op=file&rev=0&sc=0

-- 
martin;  (greetings from the heart of the sun.)
  \ echo mailto: !#^."<*>"|tr "<*> mailto:"; [EMAIL PROTECTED]
 
spamtraps: [EMAIL PROTECTED]
 
"the liar at any rate recognises that recreation, not instruction, is
 the aim of conversation, and is a far more civilised being than the
 blockhead who loudly expresses his disbelief in a story which is told
 simply for the amusement of the company."
-- oscar wilde


signature.asc
Description: Digital signature (GPG/PGP)


Problem with Software RAID5

2006-10-18 Thread Lars Schimmer
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi!

I've got a problem with a software raid5.
The PC runs a debian sarge/sid/etch mix with a self-built 2.6.16 kernel and mdadm
2.5.3.git200608202239-7.
One of six 400GB SATA drives failed and I rebooted the PC.
After the reboot the RAID was resyncing.
But the HD died again and the PC rebooted again, so I pulled out the bad HD,
and now the RAID5 won't resync.
I built a 2.6.17 kernel and replaced mdadm with mdadm 2.5.4-1, and the
RAID5 still won't resync:
cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid5] [raid4]
[raid6] [multipath] [faulty]
md0 : inactive sda1[0] sde1[5] sdd1[4] sdc1[3] sdb1[1]
  1953543680 blocks

unused devices: <none>

 mdadm --detail /dev/md0
/dev/md0:
Version : 00.90.03
  Creation Time : Fri May 12 16:10:24 2006
 Raid Level : raid5
Device Size : 390708736 (372.61 GiB 400.09 GB)
   Raid Devices : 6
  Total Devices : 5
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Tue Oct 17 14:11:56 2006
  State : active, degraded, Not Started
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

 Layout : left-symmetric
 Chunk Size : 256K

   UUID : 5ce125ae:b76d7567:a531953b:fbba92fc
 Events : 0.2818447

Number   Major   Minor   RaidDevice State
   0       8        1        0      active sync   /dev/sda1
   1       8       17        1      active sync   /dev/sdb1
   2       0        0        2      removed
   3       8       33        3      active sync   /dev/sdc1
   4       8       49        4      active sync   /dev/sdd1
   5       8       65        5      active sync   /dev/sde1

mdadm --stop /dev/md0
mdadm: stopped /dev/md0

sinope:~# mdadm --assemble /dev/md0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

Any hints? tips?


MfG,
Lars Schimmer
- --
- -
TU Graz, Institut für ComputerGraphik & WissensVisualisierung
Tel: +43 316 873-5405   E-Mail: [EMAIL PROTECTED]
Fax: +43 316 873-5402   PGP-Key-ID: 0x4A9B1723
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFNfZgmWhuE0qbFyMRAqBBAJoCcE4gMx83NQl8pksSqgEpBHWNiACfTpKr
NVHtinnXRPIbY2Rfv3BUC0s=
=svsf
-END PGP SIGNATURE-
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html