Re: RAID5/10 chunk size and ext2/3 stride parameter

2006-11-03 Thread dean gaudet
On Tue, 24 Oct 2006, martin f krafft wrote:

> Hi,
> 
> I cannot find authoritative information about the relation between
> the RAID chunk size and the correct stride parameter to use when
> creating an ext2/3 filesystem.

you know, it's interesting -- mkfs.xfs somehow gets the right sunit/swidth 
automatically from the underlying md device.

for example, on a box i'm testing:

# mdadm --create --level=5 --raid-devices=4 --assume-clean --auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=9157232 blks
         =                       sectsz=4096  attr=0
data     =                       bsize=4096   blocks=293031424, imaxpct=25
         =                       sunit=16     swidth=48 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=32768, version=2
         =                       sectsz=4096  sunit=1 blks
realtime =none                   extsz=196608 blocks=0, rtextents=0
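
fwiw those numbers make sense if you assume the mdadm default 64k
chunk -- a quick sanity check (my arithmetic here, not something
pulled out of mkfs.xfs):

chunk_kb=64; block_kb=4; ndisks=4
sunit=$(( chunk_kb / block_kb ))     # 64/4 = 16 fs blocks
swidth=$(( sunit * (ndisks - 1) ))   # raid5 has n-1 data disks: 16*3 = 48
echo "sunit=$sunit swidth=$swidth"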

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --zero-superblock /dev/sd[abcd]1
# mdadm --create --level=10 --layout=f2 --raid-devices=4 --assume-clean --auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=6104816 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=195354112, imaxpct=25
         =                       sunit=16     swidth=64 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=262144 blocks=0, rtextents=0
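
and the f2 numbers line up with that too -- a far-2 layout stripes
reads across all n disks, so (again assuming the default 64k chunk):

sunit=16                  # 64k chunk / 4k fs block
swidth=$(( sunit * 4 ))   # far 2 stripes across all 4 disks = 64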


i wonder if the code could be copied into mkfs.ext3?

although hmm, i don't think it gets raid10 "n2" correct:

# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
# mdadm --zero-superblock /dev/sd[abcd]1
# mdadm --create --level=10 --layout=n2 --raid-devices=4 --assume-clean --auto=yes /dev/md0 /dev/sd[abcd]1
mdadm: array /dev/md0 started.
# mkfs.xfs -f /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=6104816 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=195354112, imaxpct=25
         =                       sunit=16     swidth=64 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal log           bsize=4096   blocks=32768, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=262144 blocks=0, rtextents=0


in a "near 2" layout i would expect sunit=16, swidth=32 ...  but swidth=64
probably doesn't hurt.
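
if you wanted to force the near-2 numbers you could always spell them
out by hand -- mkfs.xfs takes sunit/swidth in 512-byte sectors, so 16
blocks = 128 sectors and 32 blocks = 256 sectors (untested here, but
those options are standard):

# mkfs.xfs -f -d sunit=128,swidth=256 /dev/md0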


> My understanding is that (block * stride) == (chunk). So if I create
> a default RAID5/10 with 64k chunks, and create a filesystem with 4k
> blocks on it, I should choose stride 64k/4k = 16.

that's how i think it works -- i don't think ext[23] have a concept of "stripe
width" like xfs does.  they just want to know how to avoid putting all the
critical data on one disk (which needs only the chunk size).  but you should
probably ask on the linux-ext4 mailing list.
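
for completeness, passing that stride to mke2fs would look something
like this (64k chunk / 4k block = 16; note that older e2fsprogs
spelled this -R stride=16 rather than the newer -E form):

# mke2fs -j -b 4096 -E stride=16 /dev/md0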

> Is the chunk size of an array equal to the stripe size? Or is it
> (n-1)*chunk size for RAID5 and (n/2)*chunk size for a plain near=2
> RAID10?

> Also, I understand that it makes no sense to use stride for RAID1 as
> there are no stripes in that sense. But for RAID10 it makes sense,
> right?

yep.

-dean


Re: RAID5 refuses to accept replacement drive.

2006-11-03 Thread greg
On Oct 26,  7:25am, Neil Brown wrote:
} Subject: Re: RAID5 refuses to accept replacement drive.

Hi Neil, hope the end of the week is going well for you.

> On Wednesday October 25, [EMAIL PROTECTED] wrote:
> > Good morning to everyone, hope everyone's day is going well.
> > 
> > Neil, I sent this to your SUSE address a week ago but it may have
> > gotten trapped in a SPAM filter or lost in the shuffle.
> 
> Yes, resending is always a good idea if I seem to be ignoring you.
> 
> (people who are really on-the-ball will probably start telling me it is a
> resend the first time they mail me. I probably wouldn't notice.. :-)

Did you get my reply on what I found when I poked at mdadm with gdb?

> NeilBrown

Have a good weekend.

}-- End of excerpt from Neil Brown

As always,
Dr. G.W. Wettstein, Ph.D.       Enjellic Systems Development, LLC.
4206 N. 19th Ave.               Specializing in information infra-structure
Fargo, ND  58102                development.
PH: 701-281-1686
FAX: 701-281-3949               EMAIL: [EMAIL PROTECTED]
--
"Fools ignore complexity.  Pragmatists suffer it.  Some can avoid it.
Geniuses remove it.
-- Perliss' Programming Proverb #58
   SIGPLAN National, Sept. 1982


Re: New features?

2006-11-03 Thread Gabor Gombas
On Fri, Nov 03, 2006 at 02:39:31PM +1100, Neil Brown wrote:

> mdadm could probably be changed to be able to remove the device
> anyway.  The only difficulty is: how do you tell it which device to
> remove, given that there is no name in /dev to use?
> Suggestions?

Major:minor? If /sys/block still holds an entry for the removed disk,
then the user can figure it out from the name. Or mdadm could just
accept a path under /sys/block instead of a device node.
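
Something like this, say (purely hypothetical syntax in both cases --
current mdadm parses neither):

# mdadm /dev/md0 --remove 8:17
# mdadm /dev/md0 --remove /sys/block/md0/md/dev-sdb1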

Gabor

-- 
 -
 MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
 -


Re: [PATCH 001 of 6] md: Send online/offline uevents when an md array starts/stops.

2006-11-03 Thread Kay Sievers
On Fri, 2006-11-03 at 17:57 +1100, Neil Brown wrote:
> On Thursday November 2, [EMAIL PROTECTED] wrote:
> > On Thu, 2006-11-02 at 23:32 +1100, Neil Brown wrote:

> > We couldn't think of any use of an "offline" event. So we removed the
> > event when the device-mapper device is suspended.
> > 
> > > Should ONLINE and OFFLINE remain and CHANGE be added, or should they
> > > go away?
> > 
> > The current idea is to send only a "change" event if something happens
> > that makes it necessary for udev to reinvestigate the device, like
> > possible filesystem content that creates /dev/disk/by-* links.
> > 
> > Finer-grained device monitoring is likely better served by the
> > poll() infrastructure on a sysfs file, instead of sending pretty
> > expensive uevents.
> > 
> > Udev only hooks into "change" and revalidates all current symlinks for
> > the device. Udev can run programs on "online", but currently it will
> > not update any /dev/disk/by-* link if the device changes its content.
> > 
> 
> OK.  Makes sense.
> I tried it and got an interesting result.
> 
> This is with md generating 'CHANGE' events when an array goes online
> and when it goes offline, and also with another patch which causes md
> devices to disappear when not active, so that we get ADD and REMOVE
> events at reasonably appropriate times.
> 
> It all works fine until I stop an array.
> We get a CHANGE event and then a REMOVE event.
> And then a seemingly infinite series of ADD/REMOVE pairs.
> 
> I guess that udev sees the CHANGE and so opens the device to see what
> is there.  By that time the device has disappeared, so the open causes
> an ADD.  udev doesn't find anything and closes the device, which causes
> it to disappear and we get a REMOVE.
> Now udev sees that ADD and so opens the device again to see what is
> there, triggering another ADD.  Nothing is there so we close it and get
> a REMOVE.
> Now udev sees the second ADD and ...

Hmm, why does the open() of the device node of a stopped device cause an
"add"?  Shouldn't it just return a failure, instead of creating a device?

> A bit unfortunate really.  This didn't happen when I had
> ONLINE/OFFLINE, as udev ignored the OFFLINE.
> I guess I can remove the CHANGE at shutdown, but as there really is a
> change there, that doesn't seem right.

Yeah, it's the same problem we had with device-mapper: nobody could
think of any useful action for a dm-device suspend "change" event, so we
didn't add it. :)

> The real problem is that udev opens the device, and md interprets an
> 'open' as a request to create the device. And udev sees the resulting
> ADD and so opens the device ...

Yes, current udev rules are written to do so; md needs to be excluded
from the list of block devices which are handled by the default
persistent-naming rules, and moved to its own rules file. We did the
same for device-mapper, to ignore some "private" dm-* volumes like
snapshot devices.
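
A rough sketch of what such a rules file might do (the key names and
the array_state check are from memory and untested -- treat this as an
assumption, not a working rule):

# md.rules (hypothetical): skip persistent-naming probes on inactive arrays
KERNEL=="md*", ACTION=="add|change", SYSFS{md/array_state}=="clear|inactive", OPTIONS+="ignore_device"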

> It's not clear to me what the 'right' thing to do here is:
>  - I could stop removing the device on last-close, but I still
>    think that (the current situation) is ugly.
>  - I could delay the remove until udev will have stopped poking,
>    but that is even more ugly
>  - udev could avoid opening md devices until it has poked in
>    /sys/block/mdX to see what the status is, but that is very specific
>    to md
> 
> It would be nice if I could delay the add until later, but that would
> require major surgery and probably break the model badly.
> 
> On the whole, it seems that udev was designed without thought for the
> special needs of md, and md was designed (long ago) without thought for
> the ugliness that "open creates a device" causes.

The persistent naming rules for /dev/disk/by-* are causing this. Md
devices will probably just get their own rules file, which will handle
this and which can be packaged and installed along with the md tools.

If it's acceptable for you, leave the shutdown "change" event out for
now, until someone has a need for it.
We will update the rules in the meantime to read a sysfs file or call
an md tool to query the current state of the device on "add" and
"change" events; this will prevent udev from opening the device when
it's not supposed to.
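
E.g. something along these lines (just a sketch; the exact states to
skip are an assumption, not tested against current md):

# called from the rules file on "add"/"change"; exit non-zero to skip probing
md=${1:-md0}
state=$(cat /sys/block/$md/md/array_state 2>/dev/null)
case "$state" in
  clear|inactive|"") exit 1 ;;   # not assembled -- don't open /dev/$md
esac
exit 0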

Thanks,
Kay
