Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Gabor Gombas
On Wed, Dec 19, 2007 at 10:31:12AM -0500, Justin Piszcz wrote:

> Some nice graphs found here:
> http://sqlblog.com/blogs/linchi_shea/archive/2007/02/01/performance-impact-of-disk-misalignment.aspx

Again, this is a HW RAID, and the partitioning is done _on top of_ the
RAID.

Gabor



Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Gabor Gombas
On Wed, Dec 19, 2007 at 04:01:43PM +0100, Mattias Wadenstein wrote:

> From that setup it seems simple, scrap the partition table and use the disk 
> device for raid. This is what we do for all data storage disks (hw raid) 
> and sw raid members.

And that's _exactly_ when you run into the alignment problem. The common
SW RAID case (partitioning first, then building RAID arrays from the
individual partitions) does not suffer from this issue.

Gabor



Re: Linux RAID Partition Offset 63 cylinders / 30% performance hit?

2007-12-20 Thread Gabor Gombas
On Wed, Dec 19, 2007 at 12:55:16PM -0500, Justin Piszcz wrote:

> unaligned, just fdisk /dev/sdc, mkpartition, fd raid.
>  aligned, fdisk, expert, start at 512 as the offset

No, that won't show any difference. You need to partition _the RAID
device_. If the partitioning is below the RAID level, then alignment does
not matter.

What is missing from your original quote is that the original reporter
used fake HW RAID, which can only handle full disks, not individual
partitions. So if you want to reproduce the same performance drop, you
should also RAID full disks together and then put partitions on top of
the RAID array.
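
Something like this should reproduce that setup (untested sketch, the
device names are made up):

# RAID the whole disks together, using a partitionable array
mdadm --create /dev/md_d0 --auto=part --level=5 --raid-devices=4 \
    /dev/sda /dev/sdb /dev/sdc /dev/sdd
# then partition _the array_; fdisk's default start sector of 63 is
# exactly the misalignment being measured
fdisk /dev/md_d0
mkfs.ext3 /dev/md_d0p1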

Gabor



Re: optimal IO scheduler choice?

2007-12-13 Thread Gabor Gombas
On Thu, Dec 13, 2007 at 06:24:15AM -0500, Justin Piszcz wrote:

> Sequential:
> Output of CFQ: (horrid): 311,683 KiB/s
>  Output of AS: 443,103 KiB/s

OTOH AS had worse latencies than CFQ, AFAIR (it was quite some time ago
that I last experimented). So it depends on what you want to optimize for.

Gabor



Re: raid6 check/repair

2007-12-07 Thread Gabor Gombas
On Wed, Dec 05, 2007 at 03:31:14PM -0500, Bill Davidsen wrote:

> BTW: if this can be done in a user program, mdadm, rather than by code in 
> the kernel, that might well make everyone happy. Okay, realistically "less 
> unhappy."

I'm starting to like the idea. Of course you can't repair a running array
from user space (just think of something rewriting the full stripe while
mdadm is trying to fix the old data - you could end up with the data disks
containing the new data but the "fixed" disks rewritten with the old
data).

We just need to make the kernel not try to fix anything but merely
report that something is wrong - but wait, using "check" instead of
"repair" does that already.

So the kernel is fine as it is, we just need a simple user-space utility
that can take the components of a non-running array and repair a given
stripe using whatever method is appropriate. Shouldn't be too hard to
write for anyone interested...

Gabor



Re: Implementing low level timeouts within MD

2007-10-30 Thread Gabor Gombas
On Tue, Oct 30, 2007 at 12:08:07AM -0500, Alberto Alonso wrote:

> > > * Internal serverworks PATA controller on a netengine server. The
> > >   server if off waiting to get picked up, so I can't get the important
> > >   details.
> > 
> > 1 PATA failure.
> 
> I was surprised on this one, I did have good luck with PATA in
> the past. The kernel is whatever came standard in Fedora Core 2

The keyword here is probably not "PATA" but "Serverworks"... AFAIR that
chipset was always considered somewhat problematic. You may want to try
with the libata driver, it has a nice comment:

 *  Note that we don't copy the old serverworks code because the old
 *  code contains obvious mistakes

But even the new driver retained this comment from the old driver:

 * Documentation:
 *  Available under NDA only. Errata info very hard to get.

It doesn't exactly give me warm feelings about trusting data to this chipset...

Gabor



Re: Raid-10 mount at startup always has problem

2007-10-29 Thread Gabor Gombas
On Mon, Oct 29, 2007 at 08:41:39AM +0100, Luca Berra wrote:

> consider a storage with 64 spt, an io size of 4k and partition starting
> at sector 63.
> first io request will require two ios from the storage (1 for sector 63,
> and one for sectors 64 to 70)
> the next 7 io (71-78,79-86,87-94,95-102,103-110,111-118,119-126) will be
> on the same track
> the 8th will again require to be split, and so on.
> this causes the storage to do 1 unnecessary io every 8. YMMV.

That's only true for random reads. If the OS does sufficient read-ahead
then sequential reads are affected much less. But the real killer is
misaligned random writes, since then (considering RAID5/6 for simplicity)
the whole stripe has to be read from all component disks before it can be
written back.

Gabor



Re: Raid-10 mount at startup always has problem

2007-10-27 Thread Gabor Gombas
On Sat, Oct 27, 2007 at 09:50:55AM +0200, Luca Berra wrote:

>> Because you didn't stripe align the partition, your bad.
> :)
> by default fdisk misaligns partition tables
> and aligning them is more complex than just doing without.

Why use fdisk then? Use parted instead. It's not the kernel's fault if
you use tools not suited for a given task...
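
For the record, an aligned partition with parted looks something like
this (untested; assuming a 64 KiB stripe unit, i.e. a start at sector
128, and /dev/sdc is just an example):

parted -s /dev/sdc mklabel gpt
parted -s /dev/sdc mkpart primary 128s 100%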

>> Linux works properly with a partition table, so this is a specious
>> statement.
> It should also work properly without one.

It does:

sd 0:0:2:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
sd 0:0:2:0: [sdc] 7812333568 512-byte hardware sectors (315 MB)
sd 0:0:2:0: [sdc] Write Protect is off
sd 0:0:2:0: [sdc] Mode Sense: 23 00 00 00
sd 0:0:2:0: [sdc] Write cache: enabled, read cache: disabled, doesn't support 
DPO or FUA
 sdc: unknown partition table

Works perfectly without any partition tables...

You seem to be annoyed because the kernel tells you that there is no
partition table it recognizes - but if that bothers you so much, simply
stop reading the kernel logs. My kernel also tells me that it failed to
find an AGP bridge - by your logic, should everyone still using an
AGP-capable motherboard toss their system in the junkyard?!?

Gabor



Re: Time to deprecate old RAID formats?

2007-10-26 Thread Gabor Gombas
On Fri, Oct 26, 2007 at 02:52:59PM -0400, Doug Ledford wrote:

> In fact, no you can't.  I know, because I've created a device that had
> both but wasn't a raid device.  And it's matching partner still existed
> too.  What you are talking about would have misrecognized this
> situation, guaranteed.

Maybe we need a 2.0 superblock that contains the physical size of every
component, not just the logical size that is used for RAID. That way if
the size read from the superblock does not match the size of the device,
you know that this device should be ignored.

Gabor



Re: Time to deprecate old RAID formats?

2007-10-26 Thread Gabor Gombas
On Fri, Oct 26, 2007 at 02:41:56PM -0400, Doug Ledford wrote:

> * When using lilo to boot from a raid device, it automatically installs
> itself to the mbr, not to the partition.  This can not be changed.  Only
> 0.90 and 1.0 superblock types are supported because lilo doesn't
> understand the offset to the beginning of the fs otherwise.

Huh? I have several machines that boot with LILO and the root is on
RAID1. All install LILO to the boot sector of the mdX device (having
"boot=/dev/mdX" in lilo.conf), while the MBR is installed by
install-mbr. Since install-mbr has its own prompt that is displayed
before LILO's prompt on boot, I can be pretty sure that LILO did not
write anything to the MBR...
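
For reference, the relevant bits look something like this (paths and
device names are just examples):

# /etc/lilo.conf fragment
boot=/dev/md0
root=/dev/md0
image=/vmlinuz
    label=linux
    initrd=/initrd.img

# plus a generic MBR on each member disk (the Debian 'mbr' package)
install-mbr /dev/sda
install-mbr /dev/sdb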

What you say is only true for "skewed" RAID setups, but I always
considered such a setup too risky for anything critical (not because of
LILO, but because of the increased administrative complexity).

Gabor



Re: Time to deprecate old RAID formats?

2007-10-26 Thread Gabor Gombas
On Fri, Oct 26, 2007 at 06:22:27PM +0200, Gabor Gombas wrote:

> You got the ordering wrong. You should get userspace support ready and
> accepted _first_, and then you can start the
> flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code
> configurable.

Oh wait, that is possible even today. So you can build your own kernel
without any partition table format support - problem solved.

Gabor



Re: Raid-10 mount at startup always has problem

2007-10-26 Thread Gabor Gombas
On Fri, Oct 26, 2007 at 11:15:13AM +0200, Luca Berra wrote:

> on a pc maybe, but that is 20 years old design.
> partition table design is limited because it is still based on C/H/S,
> which do not exist anymore.

The MS-DOS format is not the only possible partition table layout. Other
formats such as GPT do not have such limitations.

> Put a partition table on a big storage, say a DMX, and enjoy a 20%
> performance decrease.

I assume your "big storage" uses some kind of RAID. Are your partitions
stripe-aligned? (Btw. that has nothing to do with partitions, LVM can
also suffer if PEs are not aligned).

>> Oh, and let's not go into what can happen if you're talking about a dual
>> boot machine and what Windows might do to the disk if it doesn't think
>> the disk space is already spoken for by a linux partition.
> Why the hell should the existence of windows limit the possibility of
> linux working properly.

Well, if you want to convert a Windows partition to Linux by just
changing the partition type, running mke2fs over it, and filling it with
data, Windows will happily ignore the partition table change and will
overwrite your data without any notice on the next boot (this happened to
a colleague; it was not fun to debug). So much for automatic device type
detection...

> On the opposite, i once inserted an mmc memory card, which had been
> initialized on my mobile phone, into the mmc slot of my laptop, and was
> faced with a load of error about mmcblk0 having an invalid partition
> table. Obviously it had none, it was a plain fat filesystem.
> Is the solution partitioning it? I don't think the phone would
> agree.

Well, it said it could not find a valid partition table. That was the
truth. Why is it a problem if the kernel states a fact?

Gabor



Re: Time to deprecate old RAID formats?

2007-10-26 Thread Gabor Gombas
On Fri, Oct 26, 2007 at 11:54:18AM +0200, Luca Berra wrote:

> but the fix is easy.
> remove the partition detection code from the kernel and start working on
> a smart userspace replacement for device detection. we already have
> vol_id from udev and blkid from ext3 which support detection of many
> device formats.

You got the ordering wrong. You should get userspace support ready and
accepted _first_, and then you can start the
flamew^H^H^H^H^H^Hdiscussion to make the in-kernel partitioning code
configurable. But even if you have the perfect userspace solution ready
today, removing partitioning support from the kernel is a pretty
invasive ABI change so it will take many years if it ever happens at
all.

I saw the "let's move partition detection to user space" argument
several times on l-k in the past years but it never gained support...
So if you want to make it happen, stop talking and start coding, and
persuade all major distros to accept your changes. _Then_ you can start
arguing to remove partition detection from the kernel, and even then it
won't be easy.

Gabor



Re: Software RAID5 Horrible Write Speed On 3ware Controller!!

2007-07-18 Thread Gabor Gombas
On Wed, Jul 18, 2007 at 01:51:16PM +0100, Robin Hill wrote:

> Just to pick up on this one (as I'm about to reformat my array as XFS) -
> does this actually work with a hardware controller?  Is there any
> assurance that the XFS stripes align with the hardware RAID stripes?  Or
> could you just end up offsetting everything so that every 128k chunk on
> the XFS side of things fits half-and-half into two hardware raid chunks
> (especially if the array has been partitioned)?

If you partition the device and do not explicitly align the partitions
on a stripe boundary, then you'll get that effect. Also, if you do not use
partitions but use LVM instead, then the stripe size should be a power of
2, meaning the number of data disks should also be a power of 2, to get
the best performance.

> In which case would
> it be better (performance-wise) to provide the su,sw values or not?

Only testing can tell... But if one logical file system block spans
multiple stripes then you will lose some performance; whether that is
noticeable depends on your usage pattern.
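
If you do want to pass the hints explicitly, it is just something like
(values are examples for a 64k chunk and 9 data disks; adjust to your
array):

mkfs.xfs -d su=64k,sw=9 /dev/sda1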

Gabor



Re: Software RAID5 Horrible Write Speed On 3ware Controller!!

2007-07-18 Thread Gabor Gombas
On Wed, Jul 18, 2007 at 06:23:25AM -0400, Justin Piszcz wrote:

> I recently got a chance to test SW RAID5 using 750GB disks (10) in a RAID5 
> on a 3ware card, model no: 9550SXU-12
>
> The bottom line is the controller is doing some weird caching with writes 
> on SW RAID5 which makes it not worth using.

Did you use the settings documented in
http://www.3ware.com/KB/article.aspx?id=11050 ? Setting nr_requests and
the deadline scheduler doubled the seq write performance for me. Do you
have the latest firmware? Firmware updates can improve performance - at
least for RAID5/6; I somewhat doubt that they care about JBOD
performance that much...
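
I.e. something along these lines (sdb is just an example, and the exact
nr_requests value is something to experiment with):

echo deadline > /sys/block/sdb/queue/scheduler
echo 512 > /sys/block/sdb/queue/nr_requests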

Gabor



Re: Customize the error emails of `mdadm --monitor`

2007-06-06 Thread Gabor Gombas
On Wed, Jun 06, 2007 at 04:24:31PM +0200, Peter Rabbitson wrote:

> So I was asking if the component _number_, which is unique to a specific 
> device regardless of the assembly mechanism, can be reported in case of a 
> failure.

So you need to write an event-handling script and pass it to mdadm
(--program). In the script you can walk sysfs and/or call the
appropriate helper programs to extract all the information you need and
format it in the way you want. For example, if you want the slot number
of a failed disk, you can get it from /sys/block/$2/md/dev-$3/slot
(according to the manpage, not tested).
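
A minimal sketch of such a script (untested, like the path above):

#!/bin/sh
# mdadm --monitor --program=/path/to/this/script passes:
#   $1 = event name, $2 = md device, $3 = component device (if relevant)
EVENT="$1"
MD="$(basename "$2")"
DEV="$(basename "$3")"
SLOT="$(cat /sys/block/$MD/md/dev-$DEV/slot 2>/dev/null)"
echo "$EVENT on $MD: component $DEV is slot ${SLOT:-unknown}" \
    | mail -s "md event on $(hostname)" root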

Gabor



Re: Customize the error emails of `mdadm --monitor`

2007-06-06 Thread Gabor Gombas
On Wed, Jun 06, 2007 at 02:23:31PM +0200, Peter Rabbitson wrote:

> This would not work as arrays are assembled by the kernel at boot time, at 
> which point there is no udev or anything else for that matter other than 
> /dev/sdX. And I am pretty sure my OS (debian) does not support udev in 
> initrd as of yet.

But I think sending mails from the initrd isn't supported either, so if
you already hack the initrd, you can get the path information from
sysfs. udev is nothing magical, it just walks the sysfs tree and calls
some little helper programs when collecting the information for building
/dev/disk; you can do that yourself if you want.

Gabor



Re: Fwd: Identify SATA Disks

2007-05-24 Thread Gabor Gombas
On Thu, May 24, 2007 at 09:29:04AM +1000, lewis shobbrook wrote:

> I've noted that device allocation can change with the generation of
> new initrd's and installation of new kernels; i.e. /dev/sdc becomes
> /dev/sda depending upon what order the modules load etc.
> I'm wondering if one could send a looped read/write task to a swap
> partition or something to determine which the device is?

If you're using a relatively modern distro with udev then you can use
paths under /dev/disk/by-{id,path}. Unless you're using a RAID card that
hides the disk IDs...
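
E.g. (the ID string below is made up):

$ ls -l /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 ... scsi-SATA_ST3750640AS_5QD0XXXX -> ../../sda

These names stay the same across reboots, so you can also use them e.g.
in the DEVICE line of mdadm.conf.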

> Also I've not had much joy in attempting to "hotswap" SATA on a live 
> system.
> Can anyone attest to successful hotswap (or blanket rule out as
> doesn't work) using std on board SATA controllers,  cf dedicated raid
> card, or suggest further reading?

Make sure you have a chipset that supports hotplug (some older ones do
not). Make sure its driver supports hotplug. Make sure you stop using
the disk before pulling it out (umount, swapoff, mdadm --remove,
pvremove whatever). Power down the disk before pulling it out if your
backplane/enclosure does not do that for you. Then it should work.
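
The removal sequence is roughly (device names are examples):

mdadm /dev/md0 --fail /dev/sdc1
mdadm /dev/md0 --remove /dev/sdc1
# tell the SCSI layer to detach the disk before pulling it
echo 1 > /sys/block/sdc/device/delete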

If the chipset does not support sending an interrupt on hotswap, or if
the driver does not implement hotswap signalling, you may need an explicit
"scsiadd -r" before yanking out the old drive and "scsiadd -s" after
inserting the new one.

Also remember that this area is rather new and still evolving, so be
sure to try the latest kernel if you encounter problems.

Gabor



Re: RAID1, hot-swap and boot integrity

2007-03-06 Thread Gabor Gombas
On Mon, Mar 05, 2007 at 06:32:32PM -0500, Mike Accetta wrote:

> Yes, we actually have a separate (smallish) boot partition at the front of
> the array.  This does reduce the at-risk window substantially.  I'll have to
> ponder whether it reduces it close enough to negligible to then ignore, but
> that is indeed a good point to consider.

Replacing a failed disk requires a human to pull out the old disk and
insert the new one. The /boot partition should resync in less than 1
minute (if not, it's _way_ too big), so the same human should still be
around to kick the machine if something goes wrong.

Gabor



Re: RAID1, hot-swap and boot integrity

2007-03-02 Thread Gabor Gombas
On Fri, Mar 02, 2007 at 10:40:32AM -0500, Justin Piszcz wrote:

> AFAIK mdadm/kernel raid can handle this, I had a number of occasions when 
> my UPS shut my machine down when I was rebuilding a RAID5 array, when the 
> box came back up, the rebuild picked up where it left off.

_If_ the resync got far enough that the kernel image is already copied.
The original mail is about the case when the sectors where the
kernel/initramfs should be are not yet synced...

Gabor



Re: RAID1, hot-swap and boot integrity

2007-03-02 Thread Gabor Gombas
On Fri, Mar 02, 2007 at 09:04:40AM -0500, Mike Accetta wrote:

> Thoughts or other suggestions anyone?

This is a case where a very small /boot partition is still a very good
idea... 50-100MB is a good choice (some initramfs generators require
quite a bit of space under /boot while generating the initramfs image
esp. if you use distro-provided "contains-everything-and-the-kitchen-sink"
kernels, so it is not wise to make /boot _too_ small).

But if you do not want /boot to be separate, a moderately sized root
partition is equally good. What you want to avoid is the "whole disk is
a single partition/file system" kind of setup.

Gabor



Re: Odd (slow) RAID performance

2006-12-08 Thread Gabor Gombas
On Thu, Dec 07, 2006 at 10:51:25AM -0500, Bill Davidsen wrote:

> I also suspect that write are not being combined, since writing the 2GB 
> test runs at one-drive speed writing 1MB blocks, but floppy speed 
> writing 2k blocks. And no, I'm not running out of CPU to do the 
> overhead, it jumps from 2-4% to 30% of one CPU, but on an unloaded SMP 
> system it's not CPU bound.

You could use blktrace to see the actual requests that the md code sends
down to the device, including request merging actions. That may provide
more insight into what really happens.
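
Something like this, run on a component disk while the test is running
(sda is just an example):

blktrace -d /dev/sda -o - | blkparse -i -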

Gabor



Re: Swap initialised as an md?

2006-11-12 Thread Gabor Gombas
On Fri, Nov 10, 2006 at 12:55:57PM +0100, Mogens Kjaer wrote:

> If one of your disks fails, and you have pages in the swapfile
> on the failing disk, your machine will crash when the pages are
> needed again.

IMHO the machine will not crash; just the application that the page
belongs to will be killed. Of course, if that application happens to be
init or your mission-critical daemon then the effect is not much
different from a crash...

Gabor



Re: Too much ECC?

2006-11-09 Thread Gabor Gombas
On Thu, Nov 09, 2006 at 03:30:55PM +0100, Dexter Filmore wrote:

> 195 Hardware_ECC_Recovered  3344107

For some models that's perfectly normal.

> Looking at a 5 year old 40GB Maxtor that's not been cooled too well I see "3" 
> as the raw value.

Different technology, different vendor, different meaning of the
attribute.

> Should I be worried or am I just not properly reading this?

If the other attributes are OK then the raw value of
Hardware_ECC_Recovered does not mean much.

Gabor



Re: New features?

2006-11-03 Thread Gabor Gombas
On Fri, Nov 03, 2006 at 02:39:31PM +1100, Neil Brown wrote:

> mdadm could probably be changed to be able to remove the device
> anyway.  The only difficulty is: how do you tell it which device to
> remove", given that there is no name in /dev to use.
> Suggestions?

Major:minor? If /sys/block still holds an entry for the removed disk,
then the user can figure it out from the name. Or mdadm could just
accept a path under /sys/block instead of a device node.

Gabor



Re: libata hotplug and md raid?

2006-10-17 Thread Gabor Gombas
On Tue, Oct 17, 2006 at 10:07:07AM +0200, Gabor Gombas wrote:

> Vanilla 2.6.18 kernel. In fact, all the /sys/block/*/holders directories
> are empty here.

Never mind, I just found the per-partition holders directories. Argh.
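
For the archives:

$ ls /sys/block/sda/sda1/holders/
md0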

Gabor



Re: libata hotplug and md raid?

2006-10-17 Thread Gabor Gombas
On Tue, Oct 17, 2006 at 11:58:03AM +1000, Neil Brown wrote:

> udev can find out what needs to be done by looking at
> /sys/block/whatever/holders. 

Are you sure?

$ cat /proc/mdstat
[...]
md0 : active raid1 sdd1[1] sdc1[0] sdb1[2] sda1[3]
  393472 blocks [4/4] [UUUU]
[...]
$ ls -l /sys/block/sda/holders
total 0

Vanilla 2.6.18 kernel. In fact, all the /sys/block/*/holders directories
are empty here.

Gabor



Re: avoiding the initial resync on --create

2006-10-10 Thread Gabor Gombas
On Tue, Oct 10, 2006 at 01:47:56PM -0400, Doug Ledford wrote:

> Not at all true.  Every filesystem, no matter where it stores its
> metadata blocks, still writes to every single metadata block it
> allocates to initialize that metadata block.  The same is true for
> directory blocks...they are created with a . and .. entry and nothing
> else.  What exactly do you think mke2fs is doing when it's writing out
> the inode groups, block groups, bitmaps, etc.?  Every metadata block
> needed by fsck is written either during mkfs or during use as the
> filesystem data is grown.

You don't get my point. I'm not talking about normal operation, but
about the case when the filesystem becomes corrupt, and fsck has to glue
together the pieces. Consider reiserfs: it stores metadata in a single
tree. If an internal node of the tree gets corrupted, reiserfsck has
absolutely no information where the child nodes are. So it must scan the
whole device, and perform a "does this block look like reiserfs
metadata?" test for every single block.

Btw. that's the reason why you can't store reiserfs3 file system images
on a reiserfs3 file system - reiserfsck simply can't tell whether a block
that looks like metadata is really part of the filesystem or just part of
a regular file. AFAIK this design flaw is only fixed in reiser4.

> So, like my original email said, fsck has no business reading any block
> that hasn't been written to either by the install or since the install
> when the filesystem was filled up more.

But fsck has _ZERO_ information about what blocks were written since the
filesystem was created, because that information is part of the metadata
that got corrupted. If you could trust the metadata, you'd not need
fsck.

> It certainly does *not* read
> blocks just for the fun of it, nor does it rely on anything the
> filesystem didn't specifically write.

That's only true for "traditional" UNIX file systems like ext2/3. But
there are many other filesystems out there...

Gabor



Re: avoiding the initial resync on --create

2006-10-10 Thread Gabor Gombas
On Mon, Oct 09, 2006 at 12:32:00PM -0400, Doug Ledford wrote:

> You don't really need to.  After a clean install, the operating system
> has no business reading any block it didn't write to during the install
> unless you are just reading disk blocks for the fun of it.

What happens if you have a crash, and fsck for some reason tries to read
into that uninitialized area? This may happen even years after the
install if the array was never resynced and the filesystem was never
100% full... And what happens if fsck reads the same area twice but gets
different data, because the second time the read went to a different
disk?

And yes, fsck is exactly an application that reads blocks just "for the
fun of it" when it tries to find all the pieces of the filesystem, esp.
for filesystems that (unlike e.g. ext3) do not keep metadata at fixed
locations.

Gabor



Re: Can you IMAGE Mirrored OS Drives?

2006-08-22 Thread Gabor Gombas
On Sat, Aug 19, 2006 at 09:05:39AM +0200, Luca Berra wrote:

> please, can we try not to resurrect again the kernel-level autodetection
> flamewar on this list.

There is no need for a flame war. In some situations one is better, in
other situations the other is better.

Gabor



Re: remark and RFC

2006-08-18 Thread Gabor Gombas
On Thu, Aug 17, 2006 at 08:28:07AM +0200, Peter T. Breuer wrote:

> 1) if the network disk device has decided to shut down wholesale
>(temporarily) because of lack of contact over the net, then
>retries and writes are _bound_ to fail for a while, so there
>is no point in sending them now.  You'd really do infinitely
>better to wait a while.

On the other hand, if it's a physical disk that's gone, you _know_ it
will not come back, and stalling your mission-critical application
waiting for a never-occurring event instead of just continuing to use the
other disk does not seem right.

>You think the device has become unreliable because write failed, but
>it hasn't ... that's just the net. Try again later! If you like
>we can set the req error count to -ETIMEDOUT to signal it. Real
>remote write breakage can be signalled with -EIO or something.
>Only boot the device on -EIO.

Depending on the application, if one device is gone for an extended
period of time (and a range of seconds is a looong time), it may be
much more appropriate to just forget about that disk and continue instead
of stalling the system waiting for the device to come back.

IMHO if you want to rely on the network, use equipment that can provide
the required QoS parameters. It may cost a lot - c'est la vie.

Gabor



Re: Can you IMAGE Mirrored OS Drives?

2006-08-18 Thread Gabor Gombas
On Wed, Aug 16, 2006 at 06:06:24AM -0400, andy liebman wrote:

> There is absolutely NO PROBLEM making images of single disks and 
> restoring them to new disks (thus, creating clones). And it is very 
> fast. For an OS drive with about 4 GBs of data, it only takes about 5 
> minutes to make the image and 3 to restore it. So, after making the 
> first set of images, it would in theory take under 10 minutes to restore 
> a mirrored pair.

Be prepared for these times to be much larger if you try to clone
RAID5, since by looking at just a single disk the imaging program
will not be able to read the file system and identify which parts of the
partition contain useful data and which parts are empty.

Gabor



Re: Can you IMAGE Mirrored OS Drives?

2006-08-18 Thread Gabor Gombas
On Wed, Aug 16, 2006 at 09:38:54AM +0200, Luca Berra wrote:

> The only risk is if you ever move one disk from one machine to another.
> To work around this you can change the uuid by recreating the array with
> mdadm,

No need to re-create, --update=uuid should be enough according to the
man page.
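
I.e. something like (untested, device names are examples):

# generates a random new UUID; a specific one can be given with --uuid=
mdadm --assemble /dev/md0 --update=uuid /dev/sda1 /dev/sdb1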

> but you would have to regenerate the initrd and fight again with lilo :(

Or you can just build a kernel with md built-in and use the kernel-level
RAID autodetection. In situations like this it is _much_ easier and
_much_ more robust than all the initrd-based solutions I have seen.

Also, if you install lilo to the RAID device instead of the MBR, and
install a bog-standard MBR using the install-mbr command on every drive,
your life will be easier.

Gabor



Re: second controller: what will my discs be called, and does it matter?

2006-07-07 Thread Gabor Gombas
On Thu, Jul 06, 2006 at 08:12:14PM +0200, Dexter Filmore wrote:

> How can I tell if the discs on the new controller will become sd[e-h] or if 
> they'll be the new a-d and push the existing ones back?

If they are the same type (or more precisely, if they use the same
driver), then their order on the PCI bus will decide. Otherwise, if you
are using modules, then the order you load the drivers will decide. If
the drivers are built into the kernel, then their link order will
decide.

Gabor



Re: IBM xSeries stop responding during RAID1 reconstruction

2006-06-20 Thread Gabor Gombas
On Tue, Jun 20, 2006 at 08:00:13AM -0700, Mr. James W. Laferriere wrote:

> At least one can do a ls of the /sys/block area & then do an automated
> echo cfq down the tree. Does anyone know of a method to set a default
> scheduler?

RTFM: Documentation/kernel-parameters.txt in the kernel source.
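
(For the archives: the parameter in question is elevator=, e.g. adding

elevator=cfq

to the append= line in lilo.conf or to the kernel line in GRUB makes cfq
the default scheduler for all devices.)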

Gabor



Re: IBM xSeries stop responding during RAID1 reconstruction

2006-06-20 Thread Gabor Gombas
On Tue, Jun 20, 2006 at 03:08:59PM +0200, Niccolo Rigacci wrote:

> Do you know if it is possible to switch the scheduler at runtime?

echo cfq > /sys/block/<device>/queue/scheduler
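
And to do it for all disks at once, something like:

for f in /sys/block/sd*/queue/scheduler; do echo cfq > "$f"; done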

Gabor



Re: IBM xSeries stop responding during RAID1 reconstruction

2006-06-19 Thread Gabor Gombas
On Wed, Jun 14, 2006 at 10:46:09AM -0500, Bill Cizek wrote:

> I was able to work around this by lowering 
> /proc/sys/dev/raid/speed_limit_max to a value
> below my disk thruput value (~ 50 MB/s) as follows:

IMHO a much better fix is to use the cfq I/O scheduler during the
rebuild. The default anticipatory scheduler gives horrible latencies
and can cause the machine to appear 'locked up' if there is heavy
I/O load like a RAID reconstruct or heavy database usage.

The price of cfq is lower throughput (higher RAID rebuild time) than
with the anticipatory I/O scheduler.

Gabor



Re: replace disk in raid5 without linux noticing?

2006-04-20 Thread Gabor Gombas
On Wed, Apr 19, 2006 at 02:16:10PM -0400, Ming Zhang wrote:

> is this possible? 
> * stop RAID5
> * set a mirror between current disk X and a new added disk Y, and X as
> primary one (which means copy X to Y to full sync, and before this ends,
> only read from X); also this mirror will not have any metadata or mark
> on existing disk;

The mirror should be created without persistent superblocks (obviously),
but --build does not seem to allow RAID1.

> * add this mirror to RAID5
> * start RAID5;
> 
> ... mirror will continue copy data from X to Y, once end
> 
> * stop RAID5
> * split mirror
> * put DISK Y back to RAID5
> * restart RAID5.
> 
> since this is a mirror, all metadata are same. it will be even greater
> if no need to stop raid5 to do this.

The process seems rather fragile. If I created a RAID5 array to protect
my data, I most certainly would not like to perform so many steps where I
could mess up.

Gabor



Re: help wanted - 6-disk raid5 borked: _ _ U U U U

2006-04-20 Thread Gabor Gombas
On Mon, Apr 17, 2006 at 09:30:32AM +1000, Neil Brown wrote:

> It is arguable that for a read error on a degraded raid5, that may not
> be the best thing to do, but I'm not completely convinced.

My opinion would be that in the degraded case md should behave as if it
was a single physical drive, and just pass the error to the upper layers.
Then the filesystem can decide what to do (for example, ext3 can be told
to just continue, remount read-only, or simply panic).

> A read error will mean that a write to the same stripe will have to
> fail, so at the very least we would want to switch the array
> read-only.

Not necessarily. Just fail the write with -EIO and let the file system
decide. Or at least make the error-handling behaviour run-time configurable.

Gabor
