Re: [PATCH] Use new sb type

2008-01-29 Thread Peter Rabbitson

Tim Southerwood wrote:

David Greaves wrote:


IIRC Doug Ledford did some digging wrt lilo + grub and found that 1.1 and 1.2
wouldn't work with them. I'd have to review the thread though...

David
-


For what it's worth, that was my finding too. -e 0.9+1.0 are fine with
GRUB, but 1.1 and 1.2 won't work under the filesystem that contains
/boot, at least with GRUB 1.x (I haven't used LILO for some time nor
have I tried the development GRUB 2).

The reason IIRC boils down to the fact that GRUB 1 isn't MD aware, and
the only reason one can "get away" with using it on a RAID 1 setup at
all is that the constituent devices present the same data as the
composite MD device, from the start.

Putting an MD SB at/near the beginning of the device breaks this case
and GRUB 1 doesn't know how to deal with it.

Cheers
Tim
-


I read the entire thread, and it seems that the discussion drifted away from 
the issue at hand. I hate flogging a dead horse, but here are my 2 cents:


First the summary:

* Currently LILO and GRUB are the only booting mechanisms widely available 
(GRUB2 is nowhere to be seen, and seems to be badly designed anyway)


* Neither of these boot mechanisms understands RAID at all, so they can 
boot only off a block device containing a hackishly-readable filesystem (lilo: 
files are mappable, grub: a stage1.5 exists)


* The only raid level providing unfettered access to the underlying filesystem 
is RAID1 with a superblock at its end, and it has been common wisdom for years 
that you need a RAID1 boot partition in order to boot anything at all.


The problem is that these three points do not affect any other raid level (as 
you cannot boot from any of them in a reliable fashion anyway). I saw a 
number of voices insisting that backward compatibility must be preserved. I 
don't see any need for that because:


* The distro managers will definitely RTM and will adjust their flashy GUIs to 
do the right thing by explicitly supplying -e 1.0 for boot devices (see the 
example below)


* A clueless user might burn himself by making a single root on a single raid1 
device. But wait - he can burn himself the same way by making the root a raid5 
device and rebooting.
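
To illustrate the -e 1.0 point above, an explicit invocation would look roughly 
like this (array and device names are just placeholders):

mdadm -C /dev/md0 -l 1 -n 2 -e 1.0 /dev/sda1 /dev/sdb1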


Why do we sacrifice "the right thing to do"? To eliminate the possibility of 
someone shooting himself in the foot by not reading the manual?


Cheers
Peter


Re: write-intent bitmaps

2008-01-29 Thread Peter Rabbitson

Russell Coker wrote:
Are there plans for supporting a NVRAM write-back cache with Linux software 
RAID?




AFAIK even today you can place the bitmap in an external file residing on a 
file system which in turn can reside on the nvram...
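
Roughly like this for an existing array (array name and file path are just 
placeholders; the bitmap file must live on a filesystem outside the array 
itself, and an existing internal bitmap has to be dropped first):

mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --bitmap=/nvram/md0-bitmap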


Peter



Re: write-intent bitmaps

2008-01-29 Thread Russell Coker
On Tuesday 29 January 2008 20:13, Peter Rabbitson <[EMAIL PROTECTED]> 
wrote:
> Russell Coker wrote:
> > Are there plans for supporting a NVRAM write-back cache with Linux
> > software RAID?
>
> AFAIK even today you can place the bitmap in an external file residing on a
> file system which in turn can reside on the nvram...

True, and you can also put the journal of a filesystem on an NVRAM device.  But 
that doesn't give the stripe aggregating benefits for RAID-5 or the general 
write-back cache benefits for everything else.

-- 
[EMAIL PROTECTED]
http://etbe.coker.com.au/  My Blog

http://www.coker.com.au/sponsorship.html Sponsoring Free Software development


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Moshe Yudkowsky wrote:


One of the puzzling things about this is that I conceive of RAID10 as 
two RAID1 pairs, with RAID0 on top to join them into a large drive. 
However, when I use --level=10  to create my md drive, I cannot find out 
which two pairs are the RAID1's: the --detail doesn't give that 
information. Re-reading the md(4) man page, I think I'm badly mistaken 
about RAID10.


Furthermore, since grub cannot find the /boot on the md drive, I deduce 
that RAID10 isn't what the 'net descriptions say it is.




It is exactly what the name implies - a new kind of RAID :) The setup you 
describe is not RAID10, it is RAID1+0. As far as how linux RAID10 works - here 
is an excellent article: 
http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10


Peter


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky
Neil, thanks for writing. A couple of follow-up questions to you and the 
group:


Neil Brown wrote:

On Monday January 28, [EMAIL PROTECTED] wrote:
Perhaps I'm mistaken but I thought it was possible to boot from 
/dev/md/all1.


It is my understanding that grub cannot boot from RAID.


Ah. Well, even though LILO seems to be less classy and in current 
disfavor, can I boot RAID10/RAID5 from LILO?



You can boot from raid1 by the expedient of booting from one of the
halves.


One of the puzzling things about this is that I conceive of RAID10 as 
two RAID1 pairs, with RAID0 on top to join them into a large drive. 
However, when I use --level=10  to create my md drive, I cannot find out 
which two pairs are the RAID1's: the --detail doesn't give that 
information. Re-reading the md(4) man page, I think I'm badly mistaken 
about RAID10.


Furthermore, since grub cannot find the /boot on the md drive, I deduce 
that RAID10 isn't what the 'net descriptions say it is.



A common approach is to make a small raid1 which contains /boot and
boot from that.  Then use the rest of your devices for raid10 or raid5
or whatever.


Ah. My understanding from a previous question to this group was that 
using one partition of the drive for RAID1 and the other for RAID5 would 
(a) create inefficiencies in read/write cycles as the two different md 
drives maintained conflicting internal tables of the overall physical 
drive state and (b) would create problems if one or the other failed.


Under the alternative solution (booting from half of a raid1), since I'm 
booting from just one of the halves of the raid1, I would have to set up 
grub on both halves (see the sketch below). If one physical drive fails, grub 
would fail over to the next device.
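
A rough sketch of setting up grub on both halves (grub legacy shell, assuming 
/boot lives on the first partition of both sda and sdb):

grub
grub> device (hd0) /dev/sda
grub> root (hd0,0)
grub> setup (hd0)
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit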


(My original question was prompted by my theory that multiple RAID5s, 
built out of different partitions, would be faster than a single large 
drive -- more threads to perform calculations during writes to different 
parts of the physical drives.)



Am I trying to do something that's basically impossible?


I believe so.


If the answers above don't lead to a resolution, I can create two RAID1 
pairs and join them using LVM. I would take a hit by using LVM to tie 
the pairs instead of RAID0, I suppose, but I would avoid the performance 
hit of multiple md drives on a single physical drive, and I could even 
run a hot spare through a sparing group. Any comments on the performance 
hit -- is raid1L a really bad idea for some reason?


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "It's a sobering thought, for example, to realize that by the time
  he was my age, Mozart had been dead for two years."
-- Tom Lehrer


Yes, but please provide the clue (was Re: [PATCH] Use new sb type)

2008-01-29 Thread Moshe Yudkowsky


* The only raid level providing unfettered access to the underlying 
filesystem is RAID1 with a superblock at its end, and it has been common 
wisdom for years that you need a RAID1 boot partition in order to boot 
anything at all.


Ah. This shines light on my problem...

The problem is that these three points do not affect any other raid 
level (as you cannot boot from any of them in a reliable fashion 
anyway). I saw a number of voices insisting that backward compatibility must be 
preserved. I don't see any need for that because:


* The distro managers will definitely RTM and will adjust their flashy 
GUIs to do the right thing by explicitly supplying -e 1.0 for boot devices


The Debian stable distro won't let you create /boot on an LVM RAID1, but 
that seems to be the extent of current RAID awareness. Using the GUI, if 
you create a large RAID5 and attempt to boot off it -- well, you're 
toast, but you don't find out until the LILO/grub portion of the 
installation fails.


* A clueless user might burn himself by making a single root on a single 
raid1 device. But wait - he can burn himself the same way by making the 
root a raid5 device and rebooting.


Okay, but:

Why do we sacrifice "the right thing to do"? To eliminate the 
possibility of someone shooting himself in the foot by not reading the 
manual?


Speaking for clueless users everywhere: I'd love to Read The Fine 
Manual, but the Fine md, mdadm, and mdadm.conf Manuals that I've read 
don't have information about grub/LILO issues. A hint such as "grub and 
LILO can only work from RAID 1 and superblocks greater than 1.0 will 
toast your system in any case" is crucial information to have. Not 
everyone will catch this particular thread -- they're going to RTFM and 
make a mistake *regardless*.


And now, please excuse me while I RTFM to find out if I can change the 
superblocks to 1.0 from 1.2 on a running array...


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "If you're not part of the solution, you're part of the process."
-- Mark A. Johnson


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Michael Tokarev wrote:
 > Raid10 IS RAID1+0 ;)

It's just that linux raid10 driver can utilize more.. interesting ways
to lay out the data.


This is misleading, and adds to the confusion existing even before linux 
raid10. When you say raid10 in the hardware raid world, what do you mean? 
Stripes of mirrors? Mirrors of stripes? Some proprietary extension?


What Neil did was generalize the concept of N drives - M copies, and called it 
10 because it could exactly mimic the layout of conventional 1+0 [*]. However, 
thinking about md level 10 in terms of RAID 1+0 is wrong. Two examples 
(there are many more):


* mdadm -C -l 10 -n 3 -p f2 /dev/md10 /dev/sda1 /dev/sdb1 /dev/sdc1
Odd number of drives, no parity calculation overhead, yet the setup can still 
survive the loss of a single drive


* mdadm -C -l 10 -n 2 -p f2 /dev/md10 /dev/sda1 /dev/sdb1
This seems useless at first, as it effectively creates a RAID1 setup, without 
preserving the FS format on disk. However md10 has read-balancing code, so one 
could get a sustained single-threaded read at twice the speed one could 
possibly get with md1 in the current implementation
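
A crude way to eyeball that sequential read speed (device name and count are 
just placeholders):

dd if=/dev/md10 of=/dev/null bs=1M count=2048 iflag=direct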


I guess I will sit down tonight and craft some patches to the existing md* man 
pages. Some things are indeed left unsaid.


Peter

[*] The layout is the same but the functionality is different. If you have 1+0 
on 4 drives, you can survive a loss of 2 drives as long as they are part of 
different mirrors. mdadm -C -l 10 -n 4 -p n2  however will _NOT_ 
survive a loss of 2 drives.



Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Moshe Yudkowsky wrote:
> Peter Rabbitson wrote:
> 
>> It is exactly what the name implies - a new kind of RAID :) The setup
>> you describe is not RAID10, it is RAID1+0. As far as how linux RAID10
>> works - here is an excellent article:
>> http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
> 
> Thanks. Let's just say that the md(4) man page was finally penetrating
> my brain, but the Wikipedia article helped a great deal. I had thought
> md's RAID10 was more "standard."

It is exactly "standard" - when you create it with default settings
and with even number of drives (2, 4, 6, 8, ...), it will be exactly
"standard" raid10 (or raid1+0, whatever) as described in various
places on the net.

But if you use an odd number of drives, or if you pass some fancy --layout
option, it will look different.  Still not suitable for lilo or
grub, at least in their current versions.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky

Peter Rabbitson wrote:

[*] The layout is the same but the functionality is different. If you 
have 1+0 on 4 drives, you can survive a loss of 2 drives as long as they 
are part of different mirrors. mdadm -C -l 10 -n 4 -p n2  
however will _NOT_ survive a loss of 2 drives.


In my 4 drive system, I'm clearly not getting 1+0's ability to use grub 
out of the RAID10.  I expect it's because I used 1.2 superblocks (why 
not use the latest, I said, foolishly...) and therefore the RAID10 -- 
with even number of drives -- can't be read by grub. If you'd patch that 
information into the man pages that'd be very useful indeed.


Thanks for your attention to this!

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "no user serviceable parts below this line"
-- From a Perl program by [EMAIL PROTECTED]


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Keld Jørn Simonsen wrote:
> On Tue, Jan 29, 2008 at 09:57:48AM -0600, Moshe Yudkowsky wrote:
>> In my 4 drive system, I'm clearly not getting 1+0's ability to use grub 
>> out of the RAID10.  I expect it's because I used 1.2 superblocks (why 
>> not use the latest, I said, foolishly...) and therefore the RAID10 -- 
>> with even number of drives -- can't be read by grub. If you'd patch that 
>> information into the man pages that'd be very useful indeed.
> 
> If you have 4 drives, I think the right thing is to use a raid1 with 4
> drives, for your /boot partition. Then you can survive a crash of 3 disks!

By the way, on all our systems I use a small (256MB for small-software systems,
sometimes 512MB, but 1GB should be sufficient) partition for the root filesystem
(/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 on all (usually
identical) drives - be it 4 or 6 or more of them.  The root filesystem does not
change often, or at least its write speed isn't that important.  But doing it
this way, you always have all the tools necessary to repair a damaged system
even in case your raid didn't start, or you forgot where your root disk is
etc etc.

But in this setup, /usr, /home, /var and so on should be separate partitions.
Also, placing /dev on a tmpfs helps a lot to minimize the number of writes
necessary for the root fs.
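
A sketch of the corresponding fstab line (size and mode are only example values):

tmpfs  /dev  tmpfs  size=5M,mode=0755  0  0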

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Peter Rabbitson wrote:
[]
> However if you want to be so anal about names and specifications: md
> raid 10 is not a _full_ 1+0 implementation. Consider the textbook
> scenario with 4 drives:
> 
> (A mirroring B) striped with (C mirroring D)
> 
> When only drives A and C are present, md raid 10 with near offset will
> not start, whereas "standard" RAID 1+0 is expected to keep clunking away.

Ugh.  Yes, offset is a linux extension.

But md raid 10 with default, n2 (without offset), configuration will behave
exactly like in "classic" docs.

Again.  The Linux md raid10 module implements standard raid10 as known in
all widely used docs.  And IN ADDITION, it can do OTHER FORMS, which
differ from the "classic" variant.  Pretty much like a hardware raid card from
a brand-name vendor probably implements its own variations of standard
raid levels.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Keld Jørn Simonsen wrote:
> On Tue, Jan 29, 2008 at 06:13:41PM +0300, Michael Tokarev wrote:
>> Linux raid10 MODULE (which implements that standard raid10
>> LEVEL in full) adds some quite.. unusual extensions to that
>> standard raid10 LEVEL.  The resulting layout is also called
>> raid10 in linux (ie, not giving new names), but it's not that
>> raid10 (which is again the same as raid1+0) as commonly known
>> in various literature and on the internet.  Yet raid10 module
>> fully implements STANDARD raid10 LEVEL.
> 
> My understanding is that you can have a linux raid10 of only 2
> drives, while the standard RAID 1+0 requires 4 drives, so this is a huge
> difference.

Ugh.  2-drive raid10 is effectively just a raid1.  I.e., mirroring
without any striping. (Or, backwards, striping without mirroring).

So to say, raid1 is just one particular configuration of raid10 -
with only one mirror.

Pretty much like with raid5 of 2 disks - it's the same as raid1.

> I am not sure what properties vanilla linux raid10 (near=2, far=1)
> has. I think it can run with only 1 disk, but I think it

number of copies should be <= number of disks, so no.

> does not have striping capabilities. It would be nice to have more 
> info on this, eg in the man page. 

It's all in there really.  See md(4).  Maybe it's not that
verbose, but it's not a user's guide (as in: a large book),
after all.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Moshe Yudkowsky wrote:
> Michael Tokarev wrote:
> 
>> There are more-or-less standard raid LEVELS, including
>> raid10 (which is the same as raid1+0, or a stripe on top
>> of mirrors - note it does not mean 4 drives, you can
>> use 6 - stripe over 3 mirrors each of 2 components; or
>> the reverse - stripe over 2 mirrors of 3 components each
>> etc).
> 
> Here's a baseline question: if I create a RAID10 array using default
> settings, what do I get? I thought I was getting RAID1+0; am I really?

..default settings AND an even (4, 6, 8, 10, ...) number of drives.  It
will be "standard" raid10 or raid1+0, which is the same: as many stripes
of mirrored (2 copies) data as fit the number of disks.  With an odd
number of disks it obviously will be something else, not a "standard"
raid10.

> My superblocks, by the way, are marked version 01; my metadata in
> mdadm.conf asked for 1.2. I wonder what I really got. The real question

Ugh.  Another source of confusion.  In --metadata=1.2, "1" stands
for the format, and "2" stands for the placement.  So it's really
format version 1.  From mdadm(8):

  1, 1.0, 1.1, 1.2
 Use  the  new  version-1 format superblock.  This has few
 restrictions.   The  different  sub-versions  store   the
 superblock  at  different locations on the device, either
 at the end (for 1.0), at the start (for 1.1) or  4K  from
 the start (for 1.2).


> in my mind now is why grub can't find the info, and either it's because
> of 1.2 superblocks or because of sub-partitioning of components.

As has been said numerous times in this thread, grub can't be used with
anything but raid1 to start with (the same is true for lilo).  Raid10
(or raid1+0, which is the same) - be it standard or linux extension format -
is NOT raid1.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Moshe Yudkowsky wrote:
Here's a baseline question: if I create a RAID10 array using default 
settings, what do I get? I thought I was getting RAID1+0; am I really?


Maybe you are, depending on your settings, but this is beside the point. No 
matter what 1+0 you have (linux, classic, or otherwise) you cannot boot from 
it, as there is no way to see the underlying filesystem without the RAID layer.


With the current state of affairs (available mainstream bootloaders) the rule 
is:
Block devices containing the kernel/initrd image _must_ be either:
* a regular block device (/sda1, /hda, /fd0, etc.)
	* or a linux RAID 1 with the superblock at the end of the device (0.9 
or 1.0)


My superblocks, by the way, are marked version 01; my metadata in 
mdadm.conf asked for 1.2. I wonder what I really got.


This is how you find the actual raid version:

mdadm -D /dev/md[X] | grep Version

This will return a string of the form XX.YY.ZZ. Your superblock version is 
XX.YY.


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky

Michael Tokarev wrote:


There are more-or-less standard raid LEVELS, including
raid10 (which is the same as raid1+0, or a stripe on top
of mirrors - note it does not mean 4 drives, you can
use 6 - stripe over 3 mirrors each of 2 components; or
the reverse - stripe over 2 mirrors of 3 components each
etc).


Here's a baseline question: if I create a RAID10 array using default 
settings, what do I get? I thought I was getting RAID1+0; am I really?


My superblocks, by the way, are marked version 01; my metadata in 
mdadm.conf asked for 1.2. I wonder what I really got. The real question 
in my mind now is why grub can't find the info, and either it's because 
of 1.2 superblocks or because of sub-partitioning of components.



--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "You may not be interested in war, but war is interested in you."
-- Leon Trotsky


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 06:13:41PM +0300, Michael Tokarev wrote:
> 
> Linux raid10 MODULE (which implements that standard raid10
> LEVEL in full) adds some quite.. unusual extensions to that
> standard raid10 LEVEL.  The resulting layout is also called
> raid10 in linux (ie, not giving new names), but it's not that
> raid10 (which is again the same as raid1+0) as commonly known
> in various literature and on the internet.  Yet raid10 module
> fully implements STANDARD raid10 LEVEL.

My understanding is that you can have a linux raid10 of only 2
drives, while the standard RAID 1+0 requires 4 drives, so this is a huge
difference.

I am not sure what properties vanilla linux raid10 (near=2, far=1)
has. I think it can run with only 1 disk, but I think it
does not have striping capabilities. It would be nice to have more 
info on this, eg in the man page. 

Is there an official web page for mdadm?
And maybe the raid faq could be updated?

best regards
keld 


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Moshe Yudkowsky wrote:

Keld Jørn Simonsen wrote:


raid10 has a number of ways to do layout, namely the near, far and
offset ways, layout=n2, f2, o2 respectively.


The default layout, according to --detail, is "near=2, far=1." If I 
understand what's been written so far on the topic, that's automatically 
incompatible with 1+0.




Unfortunately you are interpreting this wrong as well. far=1 is just a way of 
saying 'no copies of type far'.



Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Peter Rabbitson wrote:
> Michael Tokarev wrote:
>  > Raid10 IS RAID1+0 ;)
>> It's just that linux raid10 driver can utilize more.. interesting ways
>> to lay out the data.
> 
> This is misleading, and adds to the confusion existing even before linux
> raid10. When you say raid10 in the hardware raid world, what do you
> mean? Stripes of mirrors? Mirrors of stripes? Some proprietary extension?

Mirrors of stripes makes no sense.

> What Neil did was generalize the concept of N drives - M copies, and
> called it 10 because it could exactly mimic the layout of conventional
> 1+0 [*]. However, thinking about md level 10 in terms of RAID 1+0 is
> wrong. Two examples (there are many more):
> 
> * mdadm -C -l 10 -n 3 -p f2 /dev/md10 /dev/sda1 /dev/sdb1 /dev/sdc1
                        ^^^^^

Those are "interesting ways"

> Odd number of drives, no parity calculation overhead, yet the setup can
> still suffer a loss of a single drive
> 
> * mdadm -C -l 10 -n 2 -p f2 /dev/md10 /dev/sda1 /dev/sdb1
                        ^^^^^

And this one too.

There are more-or-less standard raid LEVELS, including
raid10 (which is the same as raid1+0, or a stripe on top
of mirrors - note it does not mean 4 drives, you can
use 6 - stripe over 3 mirrors each of 2 components; or
the reverse - stripe over 2 mirrors of 3 components each
etc).

Vendors often add their own extensions, sometimes calling
them by the original level's name, and sometimes giving them new
names, especially in marketing speak.

Linux raid10 MODULE (which implements that standard raid10
LEVEL in full) adds some quite.. unusual extensions to that
standard raid10 LEVEL.  The resulting layout is also called
raid10 in linux (ie, not giving new names), but it's not that
raid10 (which is again the same as raid1+0) as commonly known
in various literature and on the internet.  Yet raid10 module
fully implements STANDARD raid10 LEVEL.

/mjt



Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 05:07:27PM +0300, Michael Tokarev wrote:
> Peter Rabbitson wrote:
> > Moshe Yudkowsky wrote:
> >>
> 
> > It is exactly what the name implies - a new kind of RAID :) The setup
> > you describe is not RAID10, it is RAID1+0.
> 
> Raid10 IS RAID1+0 ;)
> It's just that linux raid10 driver can utilize more.. interesting ways
> to lay out the data.

My understanding is that raid10 is different from RAID1+0

Traditional  RAID1+0 is composed of two RAID1's combined into one RAID0.
It takes 4 drives to make it work. Linux raid10 only takes 2 drives to
work.

Traditional RAID1+0 has only one way of laying out the blocks. 

raid10 has a number of ways to do layout, namely the near, far and
offset ways, layout=n2, f2, o2 respectively.

Traditional RAID1+0 can only do striping of half of the disks involved,
while raid10 can do striping on all disks in the far and offset layouts.
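
For concreteness, the three layouts could be created along these lines (device 
names and the 4-drive count are just placeholders):

mdadm -C /dev/md0 -l 10 -n 4 -p n2 /dev/sd[abcd]1
mdadm -C /dev/md0 -l 10 -n 4 -p f2 /dev/sd[abcd]1
mdadm -C /dev/md0 -l 10 -n 4 -p o2 /dev/sd[abcd]1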



I looked around on the net for documentation of this. The first hits (on
Google) for mdadm did not have descriptions of raid10. Wikipedia
describes raid 10 as a synonym for raid1+0. I think there is too much
confusion around the raid10 term, and that the marvelous linux raid10
layouts are a little-known secret beyond maybe the circles of this
linux-raid list. We should tell others more about the wonders of raid10.

And then I would like a good reference describing how raid10,o2
works and why bigger chunks help. 

Best regards
keld


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-29 Thread Carlos Carvalho
Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29:
 >Subtitle: Patch to mainline yet?
 >
 >Hi
 >
 >I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand
 >on my server.

I applied all 4 pending patches to .24. It's been better than .22 and
.23... Unfortunately the bitmap and raid1 patches don't go in .22.16.


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Michael Tokarev
Peter Rabbitson wrote:
> Moshe Yudkowsky wrote:
>>
>> One of the puzzling things about this is that I conceive of RAID10 as
>> two RAID1 pairs, with RAID0 on top to join them into a large drive.
>> However, when I use --level=10  to create my md drive, I cannot find
>> out which two pairs are the RAID1's: the --detail doesn't give that
>> information. Re-reading the md(4) man page, I think I'm badly mistaken
>> about RAID10.
>>
>> Furthermore, since grub cannot find the /boot on the md drive, I
>> deduce that RAID10 isn't what the 'net descriptions say it is.

In fact, everything matches.  For lilo to work, it basically needs
the whole filesystem on the same physical drive.  That's exactly the case
with raid1 (and only raid1).  With raid10, half of the filesystem is on one
mirror, and the other half is on the other mirror.  Like this:

 filesystem        blocks on raid0
 blocks            DiskA    DiskB

    0                0
    1                         1
    2                2
    3                         3
    4                4
    5                         5
   ..

 (this is          (this is the
  what LILO         actual layout)
  expects)

(Difference between raid10 and raid0 is that
each of diskA and diskB is in fact composed of
two identical devices).

If your kernel is located in filesystem blocks
number 2 and 3 for example, lilo has to read
BOTH halves, but it is not smart enough to
figure it out - it can only read everything
from a single drive.

> It is exactly what the name implies - a new kind of RAID :) The setup
> you describe is not RAID10, it is RAID1+0.

Raid10 IS RAID1+0 ;)
It's just that linux raid10 driver can utilize more.. interesting ways
to lay out the data.

/mjt


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 05:02:57AM -0600, Moshe Yudkowsky wrote:
> Neil, thanks for writing. A couple of follow-up questions to you and the 
> group:
> 
> If the answers above don't lead to a resolution, I can create two RAID1 
> pairs and join them using LVM. I would take a hit by using LVM to tie 
> the pairs instead of RAID0, I suppose, but I would avoid the performance 
> hit of multiple md drives on a single physical drive, and I could even 
> run a hot spare through a sparing group. Any comments on the performance 
> hit -- is raid1L a really bad idea for some reason?

You can of course construct a traditional raid-1+0 in Linux as you describe here,
but this is different from linux raid10 (with its different layout
possibilities). And installing grub/lilo on both disks of a raid1
for /boot seems to be the right way for a reasonably secured system.

best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky

Peter Rabbitson wrote:

It is exactly what the name implies - a new kind of RAID :) The setup 
you describe is not RAID10, it is RAID1+0. As far as how linux RAID10 
works - here is an excellent article: 
http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10


Thanks. Let's just say that the md(4) man page was finally penetrating 
my brain, but the Wikipedia article helped a great deal. I had thought 
md's RAID10 was more "standard."


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
"Rumor is information distilled so finely that it can filter through 
anything."

 --  Terry Pratchet, _Feet of Clay_


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 09:57:48AM -0600, Moshe Yudkowsky wrote:
> 
> In my 4 drive system, I'm clearly not getting 1+0's ability to use grub 
> out of the RAID10.  I expect it's because I used 1.2 superblocks (why 
> not use the latest, I said, foolishly...) and therefore the RAID10 -- 
> with even number of drives -- can't be read by grub. If you'd patch that 
> information into the man pages that'd be very useful indeed.

If you have 4 drives, I think the right thing is to use a raid1 with 4
drives, for your /boot partition. Then you can survive a crash of 3 disks!


If you want the extra performance, then I think you should not worry
too much about the kernel and initrd load time - which of course does not get
striping across the disks, but some performance improvement can be expected.
Then you can have the rest of /root on a raid10,f2 with 4 disks (see the
sketch below).
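
Something like this, roughly (device names and the partition split are just 
placeholders):

mdadm -C /dev/md0 -l 1 -n 4 -e 1.0 /dev/sd[abcd]1    # small raid1 for /boot
mdadm -C /dev/md1 -l 10 -n 4 -p f2 /dev/sd[abcd]2    # raid10,f2 for the rest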

best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky

Keld Jørn Simonsen wrote:


raid10 has a number of ways to do layout, namely the near, far and
offset ways, layout=n2, f2, o2 respectively.


The default layout, according to --detail, is "near=2, far=1." If I 
understand what's been written so far on the topic, that's automatically 
incompatible with 1+0.


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Michael Tokarev wrote:

Linux raid10 MODULE (which implements that standard raid10
LEVEL in full) adds some quite.. unusual extensions to that
standard raid10 LEVEL.  The resulting layout is also called
raid10 in linux (ie, not giving new names), but it's not that
raid10 (which is again the same as raid1+0) as commonly known
in various literature and on the internet.  Yet raid10 module
fully implements STANDARD raid10 LEVEL.


I will let Neil speak about what he meant by RAID10: whether it is raid10 + 
weird extensions, or a generalization of drive/stripe layouts.


However if you want to be so anal about names and specifications: md raid 10 
is not a _full_ 1+0 implementation. Consider the textbook scenario with 4 drives:


(A mirroring B) striped with (C mirroring D)

When only drives A and C are present, md raid 10 with near offset will not 
start, whereas "standard" RAID 1+0 is expected to keep clunking away.


Peter


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 07:51:07PM +0300, Michael Tokarev wrote:
> Peter Rabbitson wrote:
> []
> > However if you want to be so anal about names and specifications: md
> > raid 10 is not a _full_ 1+0 implementation. Consider the textbook
> > scenario with 4 drives:
> > 
> > (A mirroring B) striped with (C mirroring D)
> > 
> > When only drives A and C are present, md raid 10 with near offset will
> > not start, whereas "standard" RAID 1+0 is expected to keep clunking away.
> 
> Ugh.  Yes, offset is a linux extension.
> 
> But md raid 10 with default, n2 (without offset), configuration will behave
> exactly like in "classic" docs.

I would like to understand this fully. What Peter described for mdraid10:
" md raid 10 with near offset " I believe is vanilla raid10 without any
options (or near=2, far=1). Will that not start if we are unlucky to
have 2 drives failing, but we are lucky that the two
remaining drives actually hold all the data?

Same question for a raid10,f2 array. I think it would be easy to
investigate, when the number of drives is even, if all data is present,
and then happily run an array with some failing disks.

Say for a 4 drive raid10,f2 disks A and D are failing, then all data
should be present on drives B and C, given that A and C have the even
chunks, and B and D have the odd chunks. Likewise for a 6 drive array,
etc for all multiples of 2, with F2.
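
An easy way to investigate without risking real disks is a throwaway array on 
loop devices, along these lines (file sizes, names and the disks chosen to fail 
are arbitrary):

for i in 0 1 2 3; do dd if=/dev/zero of=/tmp/d$i bs=1M count=64; losetup /dev/loop$i /tmp/d$i; done
mdadm -C /dev/md9 -l 10 -n 4 -p f2 /dev/loop[0-3]
mdadm /dev/md9 --fail /dev/loop0
mdadm /dev/md9 --fail /dev/loop3
cat /proc/mdstat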

best regards
keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 07:46:58PM +0300, Michael Tokarev wrote:
> Keld Jørn Simonsen wrote:
> > On Tue, Jan 29, 2008 at 06:13:41PM +0300, Michael Tokarev wrote:
> >> Linux raid10 MODULE (which implements that standard raid10
> >> LEVEL in full) adds some quite.. unusual extensions to that
> >> standard raid10 LEVEL.  The resulting layout is also called
> >> raid10 in linux (ie, not giving new names), but it's not that
> >> raid10 (which is again the same as raid1+0) as commonly known
> >> in various literature and on the internet.  Yet raid10 module
> >> fully implements STANDARD raid10 LEVEL.
> > 
> > My understanding is that you can have a linux raid10 of only 2
> > drives, while the standard RAID 1+0 requires 4 drives, so this is a huge
> > difference.
> 
> Ugh.  2-drive raid10 is effectively just a raid1.  I.e, mirroring
> without any striping. (Or, backwards, striping without mirroring).

OK.  

uhm, well, I did not understand: "(Or, backwards, striping without
mirroring)."  I don't think a 2 drive vanilla raid10 will do striping.
Please explain.

> Pretty much like with raid5 of 2 disks - it's the same as raid1.

I think in raid5 of 2 disks, half of the chunks are parity chunks which
are evenly distributed over the two disks, and the parity chunk is the
XOR of the data chunk. But maybe I am wrong. Also the behaviour of such
a raid5 is different from a raid1 as the parity chunk is not used as
data.
> 
> > I am not sure what properties vanilla linux raid10 (near=2, far=1)
> > has. I think it can run with only 1 disk, but I think it
> 
> number of copies should be <= number of disks, so no.

I have a clear understanding that in a vanilla linux raid10 (near=2, far=1)
you can run with one failing disk, that is with only one working disk.
Am I wrong?

> > does not have striping capabilities. It would be nice to have more 
> > info on this, eg in the man page. 
> 
> It's all in there really.  See md(4).  Maybe it's not that
> verbose, but it's not a user's guide (as in: a large book),
> after all.

Some man pages have examples. Or info could be written in the faq or in
wikipedia.

Best regards
keld


linux raid faq

2008-01-29 Thread Keld Jørn Simonsen
Hmm, I read the Linux raid faq on
http://www.faqs.org/contrib/linux-raid/x37.html

It looks pretty outdated, referring to how to patch 2.2 kernels and
not mentioning new mdadm, nor raid10. It was not dated. 
It seemed to be related to the linux-raid list, telling where to find
archives of the list.

Maybe time for an update? or is this not the right place to write stuff?

When I searched on google for "raid faq", the first, say, 5-7 hits did not
mention raid10. 

Maybe wikipedia is the way to go? I did contribute myself a little
there.

The software raid howto is dated v. 1.1 3rd of June 2004,
http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO.html
also pretty old.

best regards
keld


How many disks per SATA-II bus?

2008-01-29 Thread Bruce Miller
The beginning of Section 4 of the Linux Software-RAID-HOWTO
states emphatically that "you should only have one device per
IDE bus. Running disks as master/slave is horrible for
performance. IDE is really bad at accessing more than one drive
per bus".

Do the same cautions apply to building a RAID array using
SATA-II disks?

--
Bruce Miller
Ottawa ON, Canada
[EMAIL PROTECTED]
(613) 745-1151
This message is from a webmail login and not from my regular mail system. It 
does not have my customary digital signature.


Re: How many disks per SATA-II bus?

2008-01-29 Thread Jeff Garzik

Bruce Miller wrote:

The beginning of Section 4 of the Linux Software-RAID-HOWTO
states emphatically that "you should only have one device per
IDE bus. Running disks as master/slave is horrible for
performance. IDE is really bad at accessing more than one drive
per bus".

Do the same cautions apply to building a RAID array using
SATA-II disks?


SATA is a point-to-point serial connection, totally different technology.

Jeff





Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky
I'd like to thank everyone who wrote in with comments and explanations. 
And in particular it's nice to see that I'm not the only one who's confused.


I'm going to convert back to the RAID 1 setup I had before for /boot, 2 
hot and 2 spare across four drives. No, that's wrong: 4 hot makes the 
most sense.


And given that RAID 10 doesn't seem to confer (for me, as far as I can 
tell) advantages in speed or reliability -- or the ability to mount just 
one surviving disk of a mirrored pair -- over RAID 5, I think I'll 
convert back to RAID 5, put in a hot spare, and do regular backups (as 
always). Oh, and use reiserfs with data=journal.


Comments back:

Peter Rabbitson wrote:

Maybe you are, depending on your settings, but this is beside the point. 
No matter what 1+0 you have (linux, classic, or otherwise) you cannot 
boot from it, as there is no way to see the underlying filesystem 
without the RAID layer.


Sir, thank you for this unequivocal comment. This comment clears up all 
my confusion. I had a wrong mental model of how file system maps work.


With the current state of affairs (available mainstream bootloaders) the 
rule is:

Block devices containing the kernel/initrd image _must_ be either:
* a regular block device (/sda1, /hda, /fd0, etc.)
* or a linux RAID 1 with the superblock at the end of the device 
(0.9 or 1.0)


Thanks even more: 1.0 it is.


This is how you find the actual raid version:

mdadm -D /dev/md[X] | grep Version

This will return a string of the form XX.YY.ZZ. Your superblock version 
is XX.YY.


Ah hah!

Mr. Tokarev wrote:


By the way, on all our systems I use small (256Mb for small-software systems,
sometimes 512M, but 1G should be sufficient) partition for a root filesystem
(/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 on all...
... doing [it]
this way, you always have all the tools necessary to repair a damaged system
even in case your raid didn't start, or you forgot where your root disk is
etc etc.


An excellent idea. I was going to put just /boot on the RAID 1, but 
there's no reason why I can't add a bit more room and put them all 
there. (Because I was having so much fun on the install, I'm using 4GB 
that I was going to use for swap space to mount base install and I'm 
working from there to build the RAID. Same idea.)


Hmmm... I wonder if this more expansive /bin, /sbin, and /lib causes 
hits on the RAID1 drive which ultimately degrade overall performance? 
/lib is hit only at boot time to load the kernel, I'll guess, but /bin 
includes such common tools as bash and grep.



Also, placing /dev on a tmpfs helps a lot to minimize the number of writes
necessary for the root fs.


Another interesting idea. I'm not familiar with using tmpfs (no need, 
until now); but I wonder how you create the devices you need when you're 
doing a rescue.


Again, my thanks to everyone who responded and clarified.

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
"Practically perfect people never permit sentiment to muddle their 
thinking."

-- Mary Poppins


Re: linux raid faq

2008-01-29 Thread Janek Kozicki
Keld Jørn Simonsen said: (by the date of Tue, 29 Jan 2008 20:17:55 +0100)

> Hmm, I read the Linux raid faq on
> http://www.faqs.org/contrib/linux-raid/x37.html

I've found some information in 

/usr/share/doc/mdadm/FAQ.gz

I'm wondering why this file is not advertised anywhere
(eg. in 'man mdadm'). Does it exist only in debian packages, or what?
With 'man 4 md' I've found a little sparse info about raid10. But
still I don't get it.

-- 
Janek Kozicki |


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 01:34:37PM -0600, Moshe Yudkowsky wrote:
> 
> I'm going to convert back to the RAID 1 setup I had before for /boot, 2 
> hot and 2 spare across four drives. No, that's wrong: 4 hot makes the 
> most sense.
> 
> And given that RAID 10 doesn't seem to confer (for me, as far as I can 
> tell) advantages in speed or reliability -- or the ability to mount just 
> one surviving disk of a mirrored pair -- over RAID 5, I think I'll 
> convert back to RAID 5, put in a hot spare, and do regular backups (as 
> always). Oh, and use reiserfs with data=journal.

Hmm, my idea was to use a raid10,f2 4 disk raid for the /root, or an o2
layout. I think it would offer quite some speed advantage over raid5. 
At least I had on a 4 disk raid5 only a random performance of about 130
MB/s while the raid10 gave 180-200 MB/s. Also sequential read was
significantly faster on raid10. I do think I can get about 320 MB/s 
on the raid10,f2, but I need to have a bigger power supply to support my
disks before I can go on testing. The key here is bigger readahead.
I only got 150 MB/s for raid5 sequential reads. 
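
For reference, readahead can be bumped with something like this (the value is 
in 512-byte sectors and, like the device name, is only an example):

blockdev --setra 65536 /dev/md0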

I think the sequential read could be significant in the boot time,
and then for the single user running on the system, namely the system
administrator (=me), even under reasonable load.

I would be interested if you would experiment with this wrt boot time,
for example the difference between /root on a raid5, raid10,f2 and raid10,o2.



> Comments back:
> 
> Mr. Tokarev wrote:
> 
> >By the way, on all our systems I use small (256Mb for small-software 
> >systems,
> >sometimes 512M, but 1G should be sufficient) partition for a root 
> >filesystem
> >(/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 on all...
> >... doing [it]
> >this way, you always have all the tools necessary to repair a damaged 
> >system
> >even in case your raid didn't start, or you forgot where your root disk is
> >etc etc.
> 
> An excellent idea. I was going to put just /boot on the RAID 1, but 
> there's no reason why I can't add a bit more room and put them all 
> there. (Because I was having so much fun on the install, I'm using 4GB 
> that I was going to use for swap space to mount base install and I'm 
> working from their to build the RAID. Same idea.)

If you put more than /boot on the raid1, then you will not get the added
performance of raid10 for all your system utilities. 

I am not sure about redundancy, but a raid1 and a raid10 should be
equally vulnerable to a 1 disk failure. If you use a 4 disk raid1 for 
/root, then of course you can survive 3 disk crashes.

I am not sure that 4 disks in a raid1 for /root give added performance, 
as grub only sees the /root raid1 as a normal disk, but maybe some kind of
remounting makes it get its raid behaviour.


> >Also, placing /dev on a tmpfs helps a lot to minimize the number of writes
> >necessary for the root fs.

I thought of using the noatime mount option for /root.
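
Something like this in fstab, for instance (device and filesystem type are only 
placeholders):

/dev/md1  /  ext3  defaults,noatime  0  1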

best regards
Keld


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky

Keld Jørn Simonsen wrote:

Based on your reports of better performance on RAID10 -- which are more 
significant than I'd expected -- I'll just go with RAID10. The only 
question now is if LVM is worth the performance hit or not.



I would be interested if you would experiment with this wrt boot time,
for example the difference between /root on a raid5, raid10,f2 and raid10,o2.


According to man md(4), the o2 is likely to offer the best combination 
of read and write performance. Why would you consider f2 instead?


I'm unlikely to do any testing beyond running bonnie++ or something 
similar once it's installed.
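
For what it's worth, a typical bonnie++ run would look something like this (the 
directory, user and size are placeholders; the size should be at least twice 
the machine's RAM):

bonnie++ -d /mnt/test -u nobody -s 8192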



--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-29 Thread Bill Davidsen

Carlos Carvalho wrote:

Tim Southerwood ([EMAIL PROTECTED]) wrote on 28 January 2008 17:29:
 >Subtitle: Patch to mainline yet?
 >
 >Hi
 >
 >I don't see evidence of Neil's patch in 2.6.24, so I applied it by hand
 >on my server.

I applied all 4 pending patches to .24. It's been better than .22 and
.23... Unfortunately the bitmap and raid1 patches don't go in .22.16.


Neil, have these been sent up against 24-stable and 23-stable?

--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 





Re: [PATCH] Use new sb type

2008-01-29 Thread Bill Davidsen

David Greaves wrote:

Jan Engelhardt wrote:
  

This makes 1.0 the default sb type for new arrays.




IIRC there was a discussion a while back on renaming mdadm options (google "Time
to  deprecate old RAID formats?") and the superblocks to emphasise the location
and data structure. Would it be good to introduce the new names at the same time
as changing the default format/on-disk-location?
  


Yes, I suggested some layout names, as did a few other people, and a few 
changes to separate metadata type and position were discussed. BUT, 
changing the default layout, no matter how much "better" it seems, is trumped 
by "breaks existing setups and user practice." For all of the reasons 
something else might be preferable, 1.0 *works*.


--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismark 





Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Bill Davidsen

Moshe Yudkowsky wrote:
I'd like to thank everyone who wrote in with comments and 
explanations. And in particular it's nice to see that I'm not the only 
one who's confused.


I'm going to convert back to the RAID 1 setup I had before for /boot, 
2 hot and 2 spare across four drives. No, that's wrong: 4 hot makes 
the most sense.


And given that RAID 10 doesn't seem to confer (for me, as far as I can 
tell) advantages in speed or reliability -- or the ability to mount 
just one surviving disk of a mirrored pair -- over RAID 5, I think 
I'll convert back to RAID 5, put in a hot spare, and do regular 
backups (as always). Oh, and use reiserfs with data=journal.


Depending on near/far choices, raid10 should be faster than raid5; with 
far, reads should be quite a bit faster. You can't boot off raid10, and if 
you put your swap on it many recovery CDs won't use it. But for general 
use and swap on a normally booted system it is quite fast.

Comments back:

Peter Rabbitson wrote:

Maybe you are, depending on your settings, but this is beside the 
point. No matter what 1+0 you have (linux, classic, or otherwise) you 
cannot boot from it, as there is no way to see the underlying 
filesystem without the RAID layer.


Sir, thank you for this unequivocal comment. This comment clears up 
all my confusion. I had a wrong mental model of how file system maps 
work.


With the current state of affairs (available mainstream bootloaders) 
the rule is:

Block devices containing the kernel/initrd image _must_ be either:
* a regular block device (/sda1, /hda, /fd0, etc.)
* or a linux RAID 1 with the superblock at the end of the device 
(0.9 or 1.0)


Thanks even more: 1.0 it is.


This is how you find the actual raid version:

mdadm -D /dev/md[X] | grep Version

This will return a string of the form XX.YY.ZZ. Your superblock 
version is XX.YY.


Ah hah!

Mr. Tokarev wrote:

By the way, on all our systems I use small (256Mb for small-software 
systems,
sometimes 512M, but 1G should be sufficient) partition for a root 
filesystem

(/etc, /bin, /sbin, /lib, and /boot), and put it on a raid1 on all...
... doing [it]
this way, you always have all the tools necessary to repair a damaged 
system
even in case your raid didn't start, or you forgot where your root 
disk is

etc etc.


An excellent idea. I was going to put just /boot on the RAID 1, but 
there's no reason why I can't add a bit more room and put them all 
there. (Because I was having so much fun on the install, I'm using 4GB 
that I was going to use for swap space to mount base install and I'm 
working from there to build the RAID. Same idea.)


Hmmm... I wonder if this more expansive /bin, /sbin, and /lib causes 
hits on the RAID1 drive which ultimately degrade overall performance? 
/lib is hit only at boot time to load the kernel, I'll guess, but /bin 
includes such common tools as bash and grep.



Also, placing /dev on a tmpfs helps a lot to minimize the number of writes
necessary for the root fs.


Another interesting idea. I'm not familiar with using tmpfs (no need, 
until now); but I wonder how you create the devices you need when 
you're doing a rescue.
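
(A minimal sketch, assuming a static /dev without udev -- the handful 
of nodes a rescue typically needs can be recreated by hand, using the 
standard numbers from Documentation/devices.txt:

   mount -t tmpfs none /dev
   mknod /dev/console c 5 1
   mknod /dev/null c 1 3
   mknod /dev/sda b 8 0
   mknod /dev/md0 b 9 0
)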


Again, my thanks to everyone who responded and clarified.




--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismarck 



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Bill Davidsen

Moshe Yudkowsky wrote:

Keld Jørn Simonsen wrote:

Based on your reports of better performance on RAID10 -- which are 
more significant than I'd expected -- I'll just go with RAID10. The 
only question now is whether LVM is worth the performance hit or not.



I would be interested if you would experiment with this wrt boot time,
for example the difference between /root on a raid5, raid10,f2 and 
raid10,o2.


According to man md(4), the o2 is likely to offer the best combination 
of read and write performance. Why would you consider f2 instead?



f2 is faster for reads, and most systems spend more time reading than writing.

I'm unlikely to do any testing beyond running bonnie++ or something 
similar once it's installed.






--
Bill Davidsen <[EMAIL PROTECTED]>
 "Woe unto the statesman who makes war without a reason that will still
 be valid when the war is over..." Otto von Bismarck 




-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky

Bill Davidsen wrote:

According to man md(4), the o2 is likely to offer the best combination 
of read and write performance. Why would you consider f2 instead?



f2 is faster for reads, and most systems spend more time reading than writing.


According to md(4), offset "should give similar read characteristics to 
'far' if a suitably large chunk size is used, but without as much 
seeking for writes."


Is the man page not correct, conditionally true, or simply not 
understood by me (most likely case)?


I wonder what "suitably large" is...

--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "The seconds marched past, transversing that mysterious boundary that
  separates the future from the past."
-- Jack Vance, "The Face"
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 04:14:24PM -0600, Moshe Yudkowsky wrote:
> Keld Jørn Simonsen wrote:
> 
> Based on your reports of better performance on RAID10 -- which are more 
> significant than I'd expected -- I'll just go with RAID10. The only 
> question now is whether LVM is worth the performance hit or not.

Hmm, LVM for what purpose? For the root system, I think it is not 
an issue. Just have a large enough partition; it is not more than 10-20
GB anyway, which is around 1% of the disk sizes that we talk about
today with new disks in raids.

> >I would be interested if you would experiment with this wrt boot time,
> >for example the difference between /root on a raid5, raid10,f2 and 
> >raid10,o2.
> 
> According to man md(4), the o2 is likely to offer the best combination 
> of read and write performance. Why would you consider f2 instead?

I have no experience with o2, and little experience with f2.
But I kind of designed f2. I have not fully grasped o2 yet. 

But my take is that writes here would be random writes, and those are
almost the same for all layouts. However, when/if a disk is faulty, 
f2 has considerably worse performance for sequential reads,
approximating the performance of random reads, which in some cases is
about half the speed of sequential reads. For sequential reads and
random reads I think f2 would be faster than o2, due to the smaller 
average seek times and the use of the faster part of the disk.

I am still wondering how o2 gets to do striping; I don't understand it
given the layout schemes I have seen. f2, OTOH, is designed for striping.

I would like to see some figures, though. My testing environment is, as
said, not operational right now, but will be OK possibly later this
week.

> I'm unlikely to do any testing beyond running bonnie++ or something 
> similar once it's installed.

I do some crude testing, like concurrently reading 1000 files of 20 MB 
each, and then just cat-ing a 4 GB file to /dev/null. The RAM cache must
not be big enough to hold the files.
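
(A rough sketch of such a test -- paths and file names are placeholders, 
and the page cache is dropped first so RAM can't serve the reads:

   echo 3 > /proc/sys/vm/drop_caches
   for i in $(seq 1 1000); do cat /data/f$i > /dev/null & done; wait
   time cat /data/big-4g.bin > /dev/null
)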

Looking at boot times could also be interesting. I would like as little
downtime as possible.

But it depends on your purpose and thus pattern of use. Many systems
tend to be read oriented, and for that I think f2 is the better
alternative.

best regards
keld
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 06:44:20PM -0500, Bill Davidsen wrote:

> Depending on near/far choices, raid10 should be faster than raid5, with 
> far read should be quite a bit faster. You can't boot off raid10, and if 
> you put your swap on it many recovery CDs won't use it. But for general 
> use and swap on a normally booted system it is quite fast.

Hmm, why would you put swap on a raid10? In a production environment I 
would always put it on separate swap partitions, possibly several, given 
that a number of drives are available.
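
(As an illustration -- device names are made up -- giving plain swap 
partitions equal priority lets the kernel interleave across them, in 
/etc/fstab:

   /dev/sda2  none  swap  sw,pri=1  0  0
   /dev/sdb2  none  swap  sw,pri=1  0  0
   /dev/sdc2  none  swap  sw,pri=1  0  0
)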

best regards
keld
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Peter Rabbitson

Keld Jørn Simonsen wrote:

On Tue, Jan 29, 2008 at 06:44:20PM -0500, Bill Davidsen wrote:

Depending on near/far choices, raid10 should be faster than raid5, with 
far read should be quite a bit faster. You can't boot off raid10, and if 
you put your swap on it many recovery CDs won't use it. But for general 
use and swap on a normally booted system it is quite fast.


Hmm, why would you put swap on a raid10? I would in a production
environment always put it on separate swap partitions, possibly a number,
given that a number of drives are available.



Because you want some redundancy for the swap as well. A swap partition/file 
becoming inaccessible is equivalent to yanking a stick of memory out of 
your motherboard.


Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky



Hmm, why would you put swap on a raid10? I would in a production
environment always put it on separate swap partitions, possibly a number,
given that a number of drives are available.


I put swap onto non-RAID, separate partitions on all 4 disks.

In a production server, however, I'd use swap on RAID in order to 
prevent server downtime if a disk fails -- a suddenly bad swap can 
easily (will absolutely?) cause the server to crash (even though you can 
boot the server up again afterwards on the surviving swap partitions).


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "She will have fun who knows when to work
  and when not to work."
-- Segami
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Keld Jørn Simonsen
On Tue, Jan 29, 2008 at 06:32:54PM -0600, Moshe Yudkowsky wrote:
> 
> >Hmm, why would you put swap on a raid10? I would in a production
> >environment always put it on separate swap partitions, possibly a number,
> >given that a number of drives are available.
> 
> In a production server, however, I'd use swap on RAID in order to 
> prevent server downtime if a disk fails -- a suddenly bad swap can 
> easily (will absolutely?) cause the server to crash (even though you can 
> boot the server up again afterwards on the surviving swap partitions).

I see. Which file system type would be good for this?
I normally use XFS, but maybe another FS is better, given that swap is used
very randomly (read/write).

Will a bad swap crash the system?

best regards
keld
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: In this partition scheme, grub does not find md information?

2008-01-29 Thread Moshe Yudkowsky

Keld Jørn Simonsen wrote:

On Tue, Jan 29, 2008 at 06:32:54PM -0600, Moshe Yudkowsky wrote:

Hmm, why would you put swap on a raid10? I would in a production
environment always put it on separate swap partitions, possibly a number,
given that a number of drives are available.
In a production server, however, I'd use swap on RAID in order to 
prevent server downtime if a disk fails -- a suddenly bad swap can 
easily (will absolutely?) cause the server to crash (even though you can 
boot the server up again afterwards on the surviving swap partitions).


I see. Which file system type would be good for this?
I normally use XFS, but maybe another FS is better, given that swap is used
very randomly (read/write).

Will a bad swap crash the system?


Well, Peter says it will, and that's good enough for me. :-)

As for which file system: I would use fdisk to partition the md device and 
then run mkswap on the partition to make it a swap partition. It's 
a naive approach, but I suspect it's almost certainly the correct one.
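
(A minimal sketch, with made-up device names -- swap needs no regular 
filesystem at all, just mkswap either on the whole md array or on a 
partition of a partitionable one:

   mdadm --create /dev/md1 --level=10 --layout=f2 --raid-devices=4 /dev/sd[abcd]2
   mkswap /dev/md1
   swapon /dev/md1
)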


--
Moshe Yudkowsky * [EMAIL PROTECTED] * www.pobox.com/~moshe
 "There are more ways to skin a cat than nuking it from orbit
-- but it's the only way to be sure."
-- Eliezer Yudkowsky
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock

2008-01-29 Thread K.Tanaka
Hi,

>Also, md raid10 seems to have the same problem.
>I will test raid10 applying this patch as well.

Sorry for the late response. I had trouble reproducing the problem,
but it turns out that the 2.6.24 kernel needs the latest (possibly testing)
version of systemtap-0.6.1-1 to run systemtap for the fault injection tool.

I've reproduced the stall on both raid1 and raid10 using 2.6.24.
I've also tested the patch applied to 2.6.24 and confirmed that
it fixes the stall in both cases.

K.Tanaka wrote:
> Hi,
> 
> Thank you for the patch.
> I have applied the patch to 2.6.23.14 and it works well.
> 
> - In case of 2.6.23.14, the problem is reproduced.
> - In case of 2.6.23.14 with this patch, raid1 works well so far.
>   The fault injection script continues to run, and it doesn't deadlock.
>   I will keep it running for a while.
> 
> Also, md raid10 seems to have the same problem.
> I will test raid10 applying this patch as well.
> 
> 
> Neil Brown wrote:
>> On Tuesday January 15, [EMAIL PROTECTED] wrote:
>>> This message describes the details about md-RAID1 issue found by
>>> testing the md RAID1 using the SCSI fault injection framework.
>>>
>>> Abstract:
>>> Both the error handler for md RAID1 and write access requests to the md RAID1
>>> use the raid1d kernel thread. The nr_pending flag could cause a race condition
>>> in raid1d, resulting in a raid1d deadlock.
>> Thanks for finding and reporting this.
>>
>> I believe the following patch should fix the deadlock.
>>
>> If you are able to repeat your test and confirm this I would
>> appreciate it.
>>
>> Thanks,
>> NeilBrown
>>
>>
>>
>> Fix deadlock in md/raid1 when handling a read error.
>>
>> When handling a read error, we freeze the array to stop any other
>> IO while attempting to over-write with correct data.
>>

-- 
-
Kenichi TANAKA| Open Source Software Platform Development Division
  | Computers Software Operations Unit, NEC Corporation
  | [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html