Re: How many drives are bad?

2008-02-20 Thread Neil Brown
On Tuesday February 19, [EMAIL PROTECTED] wrote:
> So I had my first "failure" today, when I got a report that one drive
> (/dev/sdam) failed. I've attached the output of "mdadm --detail". It
> appears that two drives are listed as "removed", but the array is
> still functioning. What does this mean? How many drives actually
> failed?

The array is configured for 8 devices, but only 6 are active.  So you
have lost data.
Of the two missing devices, one is still in the array and is marked as
faulty.  The other is simply not present at all.
Hence "Failed Devices: 1", i.e. there is one failed device in the
array.

It looks like you have been running a degraded array for a while
(maybe not a long while) and the device has then failed.

"mdadm --monitor"

will send you mail if you have a degraded array.
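For example, something along these lines (just a sketch; adjust the mail
address and polling interval to suit your setup):

   mdadm --monitor --scan --daemonise --mail=root@localhost --delay=1800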

NeilBrown

> 
> This is all a test system, so I can dink around as much as necessary.
> Thanks for any advice!
> 
> Norman Elton
> 
> == OUTPUT OF MDADM =
> 
> Version : 00.90.03
>   Creation Time : Fri Jan 18 13:17:33 2008
>  Raid Level : raid5
>  Array Size : 6837319552 (6520.58 GiB 7001.42 GB)
> Device Size : 976759936 (931.51 GiB 1000.20 GB)
>Raid Devices : 8
>   Total Devices : 7
> Preferred Minor : 4
> Persistence : Superblock is persistent
> 
> Update Time : Mon Feb 18 11:49:13 2008
>   State : clean, degraded
>  Active Devices : 6
> Working Devices : 6
>  Failed Devices : 1
>   Spare Devices : 0
> 
>  Layout : left-symmetric
>  Chunk Size : 64K
> 
>UUID : b16bdcaf:a20192fb:39c74cb8:e5e60b20
>  Events : 0.110
> 
> Number   Major   Minor   RaidDevice State
>    0      66       1        0      active sync   /dev/sdag1
>    1      66      17        1      active sync   /dev/sdah1
>    2      66      33        2      active sync   /dev/sdai1
>    3      66      49        3      active sync   /dev/sdaj1
>    4      66      65        4      active sync   /dev/sdak1
>    5       0       0        5      removed
>    6       0       0        6      removed
>    7      66     113        7      active sync   /dev/sdan1
> 
>    8      66      97        -      faulty spare   /dev/sdam1


Re: suns raid-z / zfs

2008-02-18 Thread Neil Brown
On Monday February 18, [EMAIL PROTECTED] wrote:
> On Mon, Feb 18, 2008 at 03:07:44PM +1100, Neil Brown wrote:
> > On Sunday February 17, [EMAIL PROTECTED] wrote:
> > > Hi
> > > 
> > 
> > > It seems like a good way to avoid the performance problems of raid-5
> > > /raid-6
> > 
> > I think there are better ways.
> 
> Interesting! What do you have in mind?

A "Log Structured Filesystem" always does large contiguous writes.
Aligning these to the raid5 stripes wouldn't be too hard and then you
would never have to do any pre-reading.
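As a rough illustration only: on an N-drive raid5 with chunk size C, a full
stripe holds (N-1)*C of data, so a log that issues suitably aligned writes in
multiples of that size never needs to read old data or parity first.  E.g.

   echo $(( (8 - 1) * 64 ))   # hypothetical 8-drive raid5, 64KiB chunks: 448KiB of data per stripe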

> 
> and what are the problems with zfs?

Recovery after a failed drive would not be an easy operation, and I
cannot imagine it being even close to the raw speed of the device.

> 
> > > 
> > > But does it stripe? One could think that rewriting stripes
> > > other places would damage the striping effects.
> > 
> > I'm not sure what you mean exactly.  But I suspect your concerns here
> > are unjustified.
> 
> More precisely. I understand that zfs always writes the data anew.
> That would mean at other blocks on the partitions, for the logical blocks
> of the file in question. So the blocks on the partitions will not be
> adjacent. And striping will not be possible, generally.

The important part of striping is that a write is spread out over
multiple devices, isn't it.

If ZFS can choose where to put each block that it writes, it can
easily choose to write a series of blocks to a collection of different
devices, thus getting the major benefit of striping.


NeilBrown


Re: Create Raid6 with 1 missing member fails

2008-02-17 Thread Neil Brown
On Sunday February 17, [EMAIL PROTECTED] wrote:
> I tried to create a raid6 with one missing member, but it fails.
> It works fine to create a raid6 with two missing members. Is it supposed 
> to be like that ?

No, it isn't supposed to be like that, but currently it is.

The easiest approach is to create it with 2 drives missing, and then add the
extra drive immediately.
This is essentially what mdadm will do when I fix it.
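A minimal sketch of that approach (the device names here are only examples):

   mdadm --create /dev/md0 --level=6 --raid-devices=4 \
         /dev/sda1 /dev/sdb1 missing missing
   mdadm /dev/md0 --add /dev/sdc1     # add the extra drive straight away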

Alternately you can use --assume-clean to tell it that the array is
clean.  It is actually a lie, but it is a harmless lie. Whenever any
data is written to the array, that little part of the array will get
"cleaned". (Note that this isn't true of raid5, only of raid6).

NeilBrown


Re: suns raid-z / zfs

2008-02-17 Thread Neil Brown
On Sunday February 17, [EMAIL PROTECTED] wrote:
> Hi
> 
> any opinions on suns zfs/raid-z?

It's vaguely interesting.  I'm not sold on the idea though.

> It seems like a good way to avoid the performance problems of raid-5
> /raid-6

I think there are better ways.

> 
> But does it stripe? One could think that rewriting stripes
> other places would damage the striping effects.

I'm not sure what you mean exactly.  But I suspect your concerns here
are unjustified.

> 
> Or is the performance only meant to be good for random read/write?

I suspect it is meant to be good for everything.  But you would have to
ask Sun that.

> 
> Can the code be lifted to Linux? I understand that it is already in
> freebsd. Does Suns licence prevent this?

My understanding is that the sun license prevents it.

However raid-z only makes sense in the context of a specific
filesystem such as ZFS.  It isn't something that you could just layer
any filesystem on top of.

> 
> And could something like this be built into existing file systems like
> ext3 and xfs? They could have a multipartition layer in their code, and
> then the heuristics to optimize block access could also apply to stripe
> access.

I doubt it, but I haven't thought deeply enough about it to see if
there might be some relatively non-intrusive way.

NeilBrown

> 
> best regards
> keld


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Neil Brown
On Sunday February 17, [EMAIL PROTECTED] wrote:
> On Sun, 17 Feb 2008 14:31:22 +0100
> Janek Kozicki <[EMAIL PROTECTED]> wrote:
> 
> > oh, right - Sevrin Robstad has a good idea to solve your problem -
> > create raid6 with one missing member. And add this member, when you
> > have it, next year or such.
> > 
> 
> I thought I read that would involve a huge performance hit, since
> then everything would require parity calculations.  Or would that
> just be w/ 2 missing drives?

A raid6 with one missing drive would have a little bit of a
performance hit over raid5.

Partly there is a CPU hit to calculate the Q block which is slower
than calculating normal parity.

Partly there is the fact that raid6 never does "read-modify-write"
cycles, so to update one block in a stripe, it has to read all the
other data blocks.

But the worst aspect of doing this is that if you have a system crash,
you could get hidden data corruption.
After a system crash you cannot trust parity data (as it may have been
in the process of being updated) so you have to regenerate it from
known good data.  But if your array is degraded, you don't have all
the known good data, so you lose.

It is really best to avoid degraded raid4/5/6 arrays when at all
possible.

NeilBrown


Re: RAID5 to RAID6 reshape?

2008-02-17 Thread Neil Brown
On Saturday February 16, [EMAIL PROTECTED] wrote:
> found was a few months old.  Is it likely that RAID5 to RAID6
> reshaping will be implemented in the next 12 to 18 months (my rough

Certainly possible.

I won't say it is "likely" until it is actually done.  And by then it
will be definite :-)

i.e. no concrete plans.
It is always best to base your decisions on what is available today.


NeilBrown


Re: raid5: two writing algorithms

2008-02-07 Thread Neil Brown
On Friday February 8, [EMAIL PROTECTED] wrote:
> On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
> > On Thursday February 7, [EMAIL PROTECTED] wrote:
> 
> > > So I hereby give the idea for inspiration to kernel hackers.
> > 
> > and I hereby invite you to read the code ;-)
> 
> I did some reading.  Is there somewhere a description of it, especially
> the raid code, or are the comments and the code the best documentation?

No.  If a description was written (and various people have tried to
describe various parts) it would be out of date within a few months :-(

Look for "READ_MODIFY_WRITE" and "RECONSTRUCT_WRITE"... no, that
only applies to the raid6 code now.
Look instead for the 'rcw' and 'rmw' counters, and then at
'handle_write_operations5'  which does different things based on the
'rcw' variable.
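As a toy illustration only (this is not the kernel code): 'rmw' reads the old
data and the old parity and XORs the difference in, while 'rcw' reads the rest
of the stripe and recomputes parity from scratch; both end up with the same
parity:

   D0=0x3c; D1=0x5a; D2=0xf0        # pretend data chunks of one stripe
   P=$(( D0 ^ D1 ^ D2 ))            # existing parity
   D1new=0x99                       # new data for chunk 1
   echo $(( P ^ D1 ^ D1new ))       # rmw path
   echo $(( D0 ^ D1new ^ D2 ))      # rcw path (prints the same value)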

It used to be a lot clearer before we implemented xor-offload.  The
xor-offload stuff is good, but it does make the code more complex.


> 
> Do you say that this is already implemented?

Yes.

> 
> I am sorry if you think I am mailing too much on the list.

You aren't.

> But I happen to think it is fun.

Good.

> And I do try to give something back.

We'll look forward to that.

> 
> > Code reading is a good first step to being a
> > > 
> > > Yoyr kernel hacker wannabe
> >^
> > 
> > NeilBrown
> 
> Well, I do have a hack in mind, on the raid10,f2.
> I need to investigate some more, and possibly test out
> what really happens. But maybe the code already does what I want it to.
> You are possibly the one that knows the code best, so maybe you can tell
> me if raid10,f2 always does its reading in the first part of the disks?

Yes, I know the code best.

No, raid10,f2 doesn't always use the first part of the disk.  Getting
it to do that would be a fairly small change in 'read_balance' in
md/raid10.c.

I'm not at all convinced that the read balancing code in raid10 (or
raid1) really does the best thing.  So any improvements - backed up
with broad testing - would be most welcome.

NeilBrown


Re: when is a disk "non-fresh"?

2008-02-07 Thread Neil Brown
On Thursday February 7, [EMAIL PROTECTED] wrote:
> On Tuesday 05 February 2008 03:02:00 Neil Brown wrote:
> > On Monday February 4, [EMAIL PROTECTED] wrote:
> > > Seems the other topic wasn't quite clear...
> >
> > not necessarily.  sometimes it helps to repeat your question.  there
> > is a lot of noise on the internet and sometimes important things get
> > missed... :-)
> >
> > > Occasionally a disk is kicked for being "non-fresh" - what does this mean
> > > and what causes it?
> >
> > The 'event' count is too small.
> > Every event that happens on an array causes the event count to be
> > incremented.
> 
> An 'event' here is any atomic action? Like "write byte there" or "calc XOR"?

An 'event' is
   - switch from clean to dirty
   - switch from dirty to clean
   - a device fails
   - a spare finishes recovery
things like that.
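To compare the counts across the members you can do something like this (a
sketch, with made-up device names):

   for d in /dev/sd[a-d]1; do
       printf '%s: ' "$d"; mdadm -E "$d" | grep Events
   done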

> 
> 
> > If the event counts on different devices differ by more than 1, then
> > the smaller number is 'non-fresh'.
> >
> > You need to look to the kernel logs of when the array was previously
> > shut down to figure out why it is now non-fresh.
> 
> The kernel logs show absolutely nothing. Log's fine, next time I boot up, one 
> disk is kicked, I got no clue why, badblocks is fine, smartctl is fine, self 
> test fine, dmesg and /var/log/messages show nothing apart from that news that 
> the disk was kicked and mdadm -E doesn't say anything suspicious either.

Can you get "mdadm -E" on all devices *before* attempting to assemble
the array?

> 
> Question: what events occured on the 3 other disks that didn't occur on the 
> last? It only happens after reboots, not while the machine is up so the 
> closest assumption is that the array is not properly shut down somehow during 
> system shutdown - only I wouldn't know why.

Yes, most likely is that the array didn't shut down properly.

> Box is Slackware 11.0, 11 doesn't come with raid script of its own so I 
> hacked 
> them into the boot scripts myself and carefully watched that everything 
> accessing the array is down before mdadm --stop --scan is issued.
> No NFS, no Samba, no other funny daemons, disks are synced and so on.
> 
> I could write some failsafe inot it by checking if the event count is the 
> same 
> on all disks before --stop, but even if it wasn't, I really wouldn't know 
> what to do about it.
> 
> (btw mdadm -E gives me: Events : 0.1149316 - what's with the 0. ?)
> 

The events count is a 64-bit number and for historical reasons it is
printed as two 32-bit numbers.  I agree this is ugly.
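So "0.1149316" is the high and low 32-bit halves; as a quick sanity check
(just arithmetic, nothing md-specific):

   echo $(( (0 << 32) + 1149316 ))   # the actual 64-bit count: 1149316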

NeilBrown


Re: raid5: two writing algorithms

2008-02-07 Thread Neil Brown
On Thursday February 7, [EMAIL PROTECTED] wrote:
> As I understand it, there are 2 valid algoritms for writing in raid5.
> 
> 1. calculate the parity data by XOR'ing all data of the relevant data
> chunks.
> 
> 2. calculate the parity data by kind of XOR-subtracting the old data to
> be changed, and then XOR-adding the new data. (XOR-subtract and XOR-add
> is actually the same).
> 
> There are situations where method 1 is the fastest, and situations where
> method 2 is the fastest.
> 
> My idea is then that the raid5 code in the kernel can calculate which
> method is the faster. 
> 
> method 1 is faster, if all data is already available. I understand that
> this method is employed in the current kernel. This would eg be the case
> with sequential writes.
> 
> Method 2 is faster, if no data is available in core. It would require
> 2 reads and two writes, which always will be faster than n reads and 1
> write, possibly except for n=2. method 2 is thus faster normally for
> random writes.
> 
> I think that method 2 is not used in the kernel today. Mayby I am wrong,
> but I did have a look in the kernel code.

It is very odd that you would think something about the behaviour of
the kernel without actually having looked.

It also seems a little arrogant to have a clever idea and assume that
no one else has thought of it before.

> 
> So I hereby give the idea for inspiration to kernel hackers.

and I hereby invite you to read the code ;-)

Code reading is a good first step to being a
> 
> Yoyr kernel hacker wannabe
   ^

NeilBrown


> keld


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Thursday February 7, [EMAIL PROTECTED] wrote:
> 
> Anyway, why does a SATA-II drive not deliver something like 300 MB/s?


Are you serious?

A high-end 15000RPM enterprise-grade drive such as the Seagate
Cheetah® 15K.6 only delivers 164MB/sec.

The SATA Bus might be able to deliver 300MB/s, but an individual drive
would be around 80MB/s unless it is really expensive.

(or was that yesterday?  I'm having trouble keeping up with the pace
 of improvement :-)

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
> Keld Jørn Simonsen wrote:
> > Hi
> >
> > I am looking at revising our howto. I see a number of places where a
> > chunk size of 32 kiB is recommended, and even recommendations on
> > maybe using sizes of 4 kiB. 
> >
> >   
> Depending on the raid level, a write smaller than the chunk size causes 
> the chunk to be read, altered, and rewritten, vs. just written if the 
> write is a multiple of chunk size. Many filesystems by default use a 4k 
> page size and writes. I believe this is the reasoning behind the 
> suggestion of small chunk sizes. Sequential vs. random and raid level 
> are important here, there's no one size to work best in all cases.

Not in md/raid.

RAID4/5/6 will do a read-modify-write if you are writing less than one
*page*, but then they often do a read-modify-write anyway for parity
updates.

No level will ever read a whole chunk just because it is a chunk.

To answer the original question:  The only way to be sure is to test
your hardware with your workload with different chunk sizes.
But I suspect that around 256K is good on current hardware.
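A minimal sketch of such a test, assuming four scratch partitions whose
contents you can destroy (names are examples only):

   # WARNING: this destroys any data on the listed partitions
   for c in 64 128 256 512 1024; do
       mdadm --create /dev/md9 --run --level=5 --raid-devices=4 --chunk=$c \
             /dev/sd[b-e]1
       # (let the initial resync finish first, or it will skew the numbers)
       dd if=/dev/zero of=/dev/md9 bs=1M count=2048 oflag=direct
       mdadm --stop /dev/md9
   done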

NeilBrown


Re: recommendations for stripe/chunk size

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
> 
> We implemented the option to select kernel page sizes of  4,  16,  64
> and  256  kB for some PowerPC systems (440SPe, to be precise). A nice
> graphics of the effect can be found here:
> 
> https://www.amcc.com/MyAMCC/retrieveDocument/PowerPC/440SPe/RAIDinLinux_PB_0529a.pdf

Thanks for the link!


The second improvement is to remove a memory copy that is internal to the MD
driver. The MD driver stages strip data ready to be written next to the I/O
controller in a page size pre-allocated buffer. It is possible to bypass this
memory copy for sequential writes thereby saving SDRAM access cycles.


I sure hope you've checked that the filesystem never (ever) changes a
buffer while it is being written out.  Otherwise the data written to
disk might be different from the data used in the parity calculation
:-)

And what are the "Second memcpy" and "First memcpy" in the graph?
I assume one is the memcpy mentioned above, but what is the other?

NeilBrown


Re: Deleting mdadm RAID arrays

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
> 
> % cat /proc/partitions
> major minor  #blocks  name
> 
>8 0  390711384 sda
>8 1  390708801 sda1
>816  390711384 sdb
>817  390708801 sdb1
>832  390711384 sdc
>833  390708801 sdc1
>848  390710327 sdd
>849  390708801 sdd1
>864  390711384 sde
>865  390708801 sde1
>880  390711384 sdf
>881  390708801 sdf1
>364   78150744 hdb
>3651951866 hdb1
>3667815622 hdb2
>3674883760 hdb3
>368  1 hdb4
>369 979933 hdb5
>370 979933 hdb6
>371   61536951 hdb7
>9 1  781417472 md1
>9 0  781417472 md0

So all the expected partitions are known to the kernel - good.

> 
> /etc/udev/rules.d % cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md0 : active(auto-read-only) raid5 sdc1[0] sde1[3](S) sdd1[1]
>   781417472 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
> 
> md1 : active(auto-read-only) raid5 sdf1[0] sdb1[3](S) sda1[1]
>   781417472 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]
> 
> md0 consists of sdc1, sde1 and sdd1 even though when creating I asked it to 
> use d_1, d_2 and d_3 (this is probably written on the particular 
> disk/partition itself,
> but I have no idea how to clean this up - mdadm --zero-superblock /dev/d_1
> again produces "mdadm: Couldn't open /dev/d_1 for write - not zeroing")
> 

I suspect it is related to the (auto-read-only).
The array is degraded and has a spare, so it wants to do a recovery to
the spare.  But it won't start the recovery until the array is not
read-only.

But the recovery process has partly started (you'll see an md1_resync
thread) so it won't let go of any failed devices at the moment.
If you 
  mdadm -w /dev/md0

the recovery will start.
Then
  mdadm /dev/md0 -f /dev/d_1

will fail d_1, abort the recovery, and release d_1.

Then
  mdadm --zero-superblock /dev/d_1

should work.

It is currently failing with EBUSY - --zero-superblock opens the
device with O_EXCL to ensure that it isn't currently in use, and as
long as it is part of an md array, O_EXCL will fail.
I should make that more explicit in the error message.

NeilBrown


Re: raid10 on three discs - few questions.

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
> >
> >   
> >> 4. Would it be possible to later '--grow' the array to use 4 discs in
> >>raid10 ? Even with far=2 ?
> >>
> >> 
> >
> > No.
> >
> > Well if by "later" you mean "in five years", then maybe.  But the
> > code doesn't currently exist.
> >   
> 
> That's a reason to avoid raid10 for certain applications, then, and go 
> with a more manual 1+0 or similar.

Not really.  You cannot reshape a raid0 either.

> 
> Can you create a raid10 with one drive "missing" and add it later? I 
> know, I should try it when I get a machine free... but I'm being lazy today.

Yes, but then the array would be degraded and a single failure could
destroy your data.

NeilBrown


Re: Deleting mdadm RAID arrays

2008-02-06 Thread Neil Brown
On Wednesday February 6, [EMAIL PROTECTED] wrote:
> 
> > Maybe the kernel has  been told to forget about the partitions of
> > /dev/sdb.
> 
> But fdisk/cfdisk has no problem whatsoever finding the partitions .

It is looking at the partition table on disk.  Not at the kernel's
idea of partitions, which is initialised from that table...

What does

  cat /proc/partitions

say?

> 
> > mdadm will sometimes tell it to do that, but only if you try to
> > assemble arrays out of whole components.
> 
> > If that is the problem, then
> >blockdev --rereadpt /dev/sdb
> 
> I deleted LVM devices that were sitting on top of RAID and reinstalled mdadm.
> 
> % blockdev --rereadpt /dev/sdf
> BLKRRPART: Device or resource busy
> 

Implies that some partition is in use.

> % mdadm /dev/md2 --fail /dev/sdf1
> mdadm: set /dev/sdf1 faulty in /dev/md2
> 
> % blockdev --rereadpt /dev/sdf
> BLKRRPART: Device or resource busy
> 
> % mdadm /dev/md2 --remove /dev/sdf1
> mdadm: hot remove failed for /dev/sdf1: Device or resource busy

OK, that's weird.  If sdf1 is faulty, then you should be able to
remove it.  What does
  cat /proc/mdstat
  dmesg | tail

say at this point?

NeilBrown


Re: Deleting mdadm RAID arrays

2008-02-05 Thread Neil Brown
On Tuesday February 5, [EMAIL PROTECTED] wrote:
> 
> % mdadm --zero-superblock /dev/sdb1
> mdadm: Couldn't open /dev/sdb1 for write - not zeroing

That's weird.
Why can't it open it?

Maybe you aren't running as root (The '%' prompt is suspicious).
Maybe the kernel has  been told to forget about the partitions of
/dev/sdb.
mdadm will sometimes tell it to do that, but only if you try to
assemble arrays out of whole components.

If that is the problem, then
   blockdev --rereadpt /dev/sdb

will fix it.

NeilBrown


Re: Re[2]: mdadm 2.6.4 : How i can check out current status of reshaping ?

2008-02-05 Thread Neil Brown
On Tuesday February 5, [EMAIL PROTECTED] wrote:
> Feb  5 11:56:12 raid01 kernel: BUG: unable to handle kernel paging request at 
> virtual address 001cd901

This looks like some sort of memory corruption.

> Feb  5 11:56:12 raid01 kernel: EIP is at md_do_sync+0x629/0xa32

This tells us what code is executing.

> Feb  5 11:56:12 raid01 kernel: Code: 54 24 48 0f 87 a4 01 00 00 72 0a 3b 44 
> 24 44 0f 87 98 01 00 00 3b 7c 24 40 75 0a 3b 74 24 3c 0f 84 88 01 00 00 0b 85 
> 30 01 00 00 <88> 08 0f 85 90 01 00 00 8b 85 30 01 00 00 a8 04 0f 85 82 01 00

This tells us what the actual byte of code were.
If I feed this line (from "Code:" onwards) into "ksymoops" I get 

   0:   54                      push   %esp
   1:   24 48                   and    $0x48,%al
   3:   0f 87 a4 01 00 00       ja     1ad <_EIP+0x1ad>
   9:   72 0a                   jb     15 <_EIP+0x15>
   b:   3b 44 24 44             cmp    0x44(%esp),%eax
   f:   0f 87 98 01 00 00       ja     1ad <_EIP+0x1ad>
  15:   3b 7c 24 40             cmp    0x40(%esp),%edi
  19:   75 0a                   jne    25 <_EIP+0x25>
  1b:   3b 74 24 3c             cmp    0x3c(%esp),%esi
  1f:   0f 84 88 01 00 00       je     1ad <_EIP+0x1ad>
  25:   0b 85 30 01 00 00       or     0x130(%ebp),%eax
Code;   Before first symbol
  2b:   88 08                   mov    %cl,(%eax)
  2d:   0f 85 90 01 00 00       jne    1c3 <_EIP+0x1c3>
  33:   8b 85 30 01 00 00       mov    0x130(%ebp),%eax
  39:   a8 04                   test   $0x4,%al
  3b:   0f                      .byte 0xf
  3c:   85                      .byte 0x85
  3d:   82                      (bad)
  3e:   01 00                   add    %eax,(%eax)


I removed the "Code;..." lines as they are just noise, except for the
one that points to the current instruction in the middle.
Note that it is dereferencing %eax, after just 'or'ing some value into
it, which is rather unusual.
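For reference, a sketch of how to run that decode yourself, assuming the oops
text has been saved to a file called oops.txt:

   ksymoops < oops.txt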

Now get the "md-mod.ko" for the kernel you are running.
run
   gdb md-mod.ko

and give the command

   disassemble md_do_sync

and look for code at offset 0x629, which is 1577 in decimal.

I found a similar kernel to what you are running, and the matching code
is 

0x55c0 :   cmp    0x30(%esp),%eax
0x55c4 :   ja     0x5749
0x55ca :   cmp    0x2c(%esp),%edi
0x55ce :   jne    0x55da
0x55d0 :   cmp    0x28(%esp),%esi
0x55d4 :   je     0x5749
0x55da :   mov    0x130(%ebp),%eax
0x55e0 :   test   $0x8,%al
0x55e2 :   jne    0x575f
0x55e8 :   mov    0x130(%ebp),%eax
0x55ee :   test   $0x4,%al
0x55f0 :   jne    0x575f
0x55f6 :   mov    0x38(%esp),%ecx
0x55fa :   mov    0x0,%eax
-

Note the sequence "cmp, ja, cmp, jne, cmp, je"
where the "cmp" arguments are consecutive 4byte values on the stack
(%esp).
In the code from your oops, the offsets are 0x44 0x40 0x3c.
In the kernel I found they are 0x30 0x2c 0x28.  The difference is some
subtle difference in the kernel, possibly a different compiler or
something.

Anyway, your code crashed at 


  25:   0b 85 30 01 00 00       or     0x130(%ebp),%eax
Code;   Before first symbol
  2b:   88 08                   mov    %cl,(%eax)

The matching code in the kernel I found is 

0x55da :   mov    0x130(%ebp),%eax
0x55e0 :   test   $0x8,%al

Note that you have an 'or', the kernel I found has 'mov'.

If we look at the actual bytes of code for those two instructions,
the code that crashed shows the bytes above:

0b 85 30 01 00 00
88 08

if I get the same bytes with gdb:

(gdb) x/8b 0x55da
0x55da :   0x8b    0x85    0x30    0x01    0x00    0x00    0xa8    0x08
(gdb) 

So what should be "8b" has become "0b", and what should be "a8" has
become "08".

If you look for the same data in your md-mod.ko, you might find
slightly different details but it is clear to me that the code in
memory is bad.

Possibly you have bad memory, or a bad CPU, or you are overclocking
the CPU, or it is getting hot, or something.


But you clearly have a hardware error.

NeilBrown


Re: when is a disk "non-fresh"?

2008-02-04 Thread Neil Brown
On Monday February 4, [EMAIL PROTECTED] wrote:
> Seems the other topic wasn't quite clear...

not necessarily.  sometimes it helps to repeat your question.  there
is a lot of noise on the internet and sometimes important things get
missed... :-)

> Occasionally a disk is kicked for being "non-fresh" - what does this mean and 
> what causes it?

The 'event' count is too small.  
Every event that happens on an array causes the event count to be
incremented.
If the event counts on different devices differ by more than 1, then
the smaller number is 'non-fresh'.

You need to look to the kernel logs of when the array was previously
shut down to figure out why it is now non-fresh.

NeilBrown


> 
> Dex
> 
> 
> 
> -- 
> -BEGIN GEEK CODE BLOCK-
> Version: 3.12
> GCS d--(+)@ s-:+ a- C UL++ P+>++ L+++> E-- W++ N o? K-
> w--(---) !O M+ V- PS+ PE Y++ PGP t++(---)@ 5 X+(++) R+(++) tv--(+)@ 
> b++(+++) DI+++ D- G++ e* h>++ r* y?
> --END GEEK CODE BLOCK--
> 
> http://www.vorratsdatenspeicherung.de


Re: mdadm 2.6.4 : How i can check out current status of reshaping ?

2008-02-04 Thread Neil Brown
On Monday February 4, [EMAIL PROTECTED] wrote:
> 
> [EMAIL PROTECTED]:/# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
> [multipath] [faulty]
> md1 : active raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
>   1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] 
> [_]
> 
> unused devices: <none>
> 
> ##
> But how can I see the status of the reshaping?
> Is it really reshaping? Or has it maybe just hung? Or is mdadm perhaps not
> doing anything at all?
> How long should I wait for the reshaping to finish?
> ##
> 

The reshape hasn't restarted.

Did you do that "mdadm -w /dev/md1" like I suggested?  If so, what
happened?

Possibly you tried mounting the filesystem before trying the "mdadm
-w".  There seems to be a bug such that doing this would cause the
reshape not to restart, and "mdadm -w" would not help any more.

I suggest you:

  echo 0 > /sys/module/md_mod/parameters/start_ro

stop the array 
  mdadm -S /dev/md1
(after unmounting if necessary).

Then assemble the array again.
Then
  mdadm -w /dev/md1

just to be sure.

If this doesn't work, please report exactly what you did, exactly what
message you got and exactly where message appeared in the kernel log.

NeilBrown


Re: Re[2]: problem with spare, acive device, clean degrated, reshaip RADI5, anybody can help ?

2008-02-03 Thread Neil Brown
On Monday February 4, [EMAIL PROTECTED] wrote:
> 
> raid01:/etc# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
> [multipath] [faulty]
> md1 : active(auto-read-only) raid5 sdc[0] sdb[5](S) sdf[3] sde[2] sdd[1]
  ^^^
>   1465159488 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/4] 
> [_]
> 
> unused devices: <none>

That explains it.  The array is still 'read-only' and won't write
anything until you allow it to.
The easiest way is
  mdadm -w /dev/md1

That should restart the reshape.

NeilBrown


Re: problem with spare, acive device, clean degrated, reshaip RADI5, anybody can help ?

2008-02-03 Thread Neil Brown
On Thursday January 31, [EMAIL PROTECTED] wrote:
> Hello linux-raid.
> 
> i have DEBIAN.
> 
> raid01:/# mdadm -V
> mdadm - v2.6.4 - 19th October 2007
> 
> raid01:/# mdadm -D /dev/md1
> /dev/md1:
> Version : 00.91.03
>   Creation Time : Tue Nov 13 18:42:36 2007
>  Raid Level : raid5

>   Delta Devices : 1, (4->5)

So the array is in the middle of a "reshape".

It should automatically complete...  Presumably it isn't doing that?

What does
   cat /proc/mdstat
say?

What kernel log messages do you get when you assemble the array?


The spare device will not be added to the array until the reshape has
finished.

Hopefully you aren't using a 2.6.23 kernel?
That kernel had a bug which corrupted data when reshaping a degraded
raid5 array.

NeilBrown


Re: /dev/sdb has different metadata to chosen array /dev/md1 0.91 0.90.

2008-02-03 Thread Neil Brown
On Saturday February 2, [EMAIL PROTECTED] wrote:
> Hello, linux-raid.
> 
> Please help, how can I deal with THIS:
> 
> [EMAIL PROTECTED]:~# mdadm -I /dev/sdb
> mdadm: /dev/sdb has different metadata to chosen array /dev/md1 0.91 0.90.
> 

Apparently "mdadm -I" doesn't work with arrays that are in the middle
of a reshape.  I'll try to fix that for the next release.

Thanks for the report.

NeilBrown


Re: raid10 on three discs - few questions.

2008-02-03 Thread Neil Brown
On Sunday February 3, [EMAIL PROTECTED] wrote:
> Hi,
> 
> Maybe I'll buy three HDDs to put a raid10 on them. And get the total
> capacity of 1.5 of a disc. 'man 4 md' indicates that this is possible
> and should work.
> 
> I'm wondering - how a single disc failure is handled in such configuration?
> 
> 1. does the array continue to work in a degraded state?

Yes.

> 
> 2. after the failure I can disconnect faulty drive, connect a new one,
>start the computer, add disc to array and it will sync automatically?
> 

Yes.

> 
> Question seems a bit obvious, but the configuration is, at least for
> me, a bit unusual. This is why I'm asking. Anybody here tested such
> configuration, has some experience?
> 
> 
> 3. Another thing - would raid10,far=2 work when three drives are used?
>Would it increase the read performance?

Yes.
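For what it's worth, a sketch of creating such an array (hypothetical device
names):

   mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=3 \
         /dev/sda1 /dev/sdb1 /dev/sdc1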

> 
> 4. Would it be possible to later '--grow' the array to use 4 discs in
>raid10 ? Even with far=2 ?
> 

No.

Well if by "later" you mean "in five years", then maybe.  But the
code doesn't currently exist.

NeilBrown


Re: Linux md and iscsi problems

2008-02-02 Thread Neil Brown
On Friday February 1, [EMAIL PROTECTED] wrote:
> 
> 
> Summarizing, I have two questions about the behavior of Linux md with  
> slow devices:
> 
> 1. Is it possible to modify some kind of time-out parameter on the  
> mdadm tool so the slow device wouldn't be marked as faulty because of  
> its slow performance.

No.  md doesn't do timeouts at all.  The underlying device does.
So if you are getting time out errors from the iscsi initiator, then
you need to change the timeout value used by the iscsi initiator.  md
has no part to play in this.  It just sends a request and eventually
gets either 'success' or 'fail'.

> 
> 2. Is it possible to control the "buffer" size of the RAID?, in other  
> words, can I control the amount of data I can write to the local disc  
> before I receive an acknowledgment from the slow device when I am  
> using the write-behind option.

No.  md/raid1 simply calls 'kmalloc' to get space to buffer each write
as the write arrives.  If the allocation succeeds, it is used to
perform the write lazily.  If the allocation fails, the write is
performed synchronously.
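For reference, a sketch of how write-behind is normally set up: it needs a
bitmap, and the slow device is flagged write-mostly (the device names here
are made up):

   mdadm --create /dev/md0 --level=1 --raid-devices=2 \
         --bitmap=internal --write-behind=256 \
         /dev/local_disk --write-mostly /dev/iscsi_disk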

What did you hope to achieve by such tuning?  It can probably be
added if it is generally useful.

NeilBrown


Re: raid problem: after every reboot /dev/sdb1 is removed?

2008-02-01 Thread Neil Brown
On Friday February 1, [EMAIL PROTECTED] wrote:
> Hi!
> 
> I have the following problem with my softraid (raid 1). I'm running
> Ubuntu 7.10 64bit with kernel 2.6.22-14-generic.
> 
> After every reboot my first boot partition in md0 is not in sync. One
> of the disks (the sdb1) is removed. 
> After a resynch every partition is synching. But after a reboot the
> state is "removed". 

Please send boot logs (e.g. dmesg > afile).

NeilBrown


Re: In this partition scheme, grub does not find md information?

2008-01-28 Thread Neil Brown
On Monday January 28, [EMAIL PROTECTED] wrote:
> 
> Perhaps I'm mistaken but I thought it was possible to boot from 
> /dev/md/all1.

It is my understanding that grub cannot boot from RAID.
You can boot from raid1 by the expedient of booting from one of the
halves.
A common approach is to make a small raid1 which contains /boot and
boot from that.  Then use the rest of your devices for raid10 or raid5
or whatever.
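A sketch of that layout (hypothetical partitions):

   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1   # small /boot
   mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sda2 /dev/sdb2 /dev/sdc2
   # point grub at one half of md0 (e.g. /dev/sda1) as an ordinary filesystem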
> 
> Am I trying to do something that's basically impossible?

I believe so.

NeilBrown


Re: BUG: possible array corruption when adding a component to a degraded raid5 (possibly other levels too)

2008-01-28 Thread Neil Brown
On Monday January 28, [EMAIL PROTECTED] wrote:
> Hello,
> 
> It seems that mdadm/md do not perform proper sanity checks before adding a 
> component to a degraded array. If the size of the new component is just 
> right, 
> the superblock information will overlap with the data area. This will happen 
> without any error indications in the syslog or otherwise.
> 
> I came up with a reproducible scenario which I am attaching to this email 
> alongside with the entire test script. I have not tested it for other raid 
> levels, or other types of superblocks, but I suspect the same problem will 
> occur for many other configurations.
> 
> I am willing to test patches, however the attached script is non-intrusive 
> enough to be executed anywhere.

Thanks for the report and the test script.

This patch for mdadm should fix this problem.  I hate the fact that
we sometimes use K and sometimes use sectors for
sizes/offsets... groan.

I'll probably get a test in the kernel as well to guard against this.

Thanks,
NeilBrown


### Diffstat output
 ./Manage.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/Manage.c ./Manage.c
--- .prev/Manage.c  2008-01-29 11:15:54.0 +1100
+++ ./Manage.c  2008-01-29 11:16:15.0 +1100
@@ -337,7 +337,7 @@ int Manage_subdevs(char *devname, int fd
 
 		/* Make sure device is large enough */
 		if (tst->ss->avail_size(tst, ldsize/512) <
-		    array.size) {
+		    array.size*2) {
 			fprintf(stderr, Name ": %s not large enough to join array\n",
 				dv->devname);
 			return 1;


Re: BUG: possible array corruption when adding a component to a degraded raid5 (possibly other levels too)

2008-01-28 Thread Neil Brown
On Monday January 28, [EMAIL PROTECTED] wrote:
> Hello,
> 
> It seems that mdadm/md do not perform proper sanity checks before adding a 
> component to a degraded array. If the size of the new component is just 
> right, 
> the superblock information will overlap with the data area. This will happen 
> without any error indications in the syslog or otherwise.

I thought I fixed that.  What versions of the Linux kernel and mdadm are
you using for your tests?

Thanks,
NeilBrown


Re: striping of a 4 drive raid10

2008-01-27 Thread Neil Brown
On Sunday January 27, [EMAIL PROTECTED] wrote:
> On Mon, Jan 28, 2008 at 07:13:30AM +1100, Neil Brown wrote:
> > On Sunday January 27, [EMAIL PROTECTED] wrote:
> > > Hi
> > > 
> > > I have tried to make a striping raid out of my new 4 x 1 TB
> > > SATA-2 disks. I tried raid10,f2 in several ways:
> > > 
> > > 1: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, md2 = raid0
> > > of md0+md1
> > > 
> > > 2: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, md2 = raid01,f2
> > > of md0+md1
> > > 
> > > 3: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, chunksize 
> > > of 
> > > md0 =md1 =128 KB,  md2 = raid0 of md0+md1 chunksize = 256 KB
> > > 
> > > 4: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, chunksize
> > > of md0 = md1 = 128 KB, md2 = raid01,f2 of md0+md1 chunksize = 256 KB
> > > 
> > > 5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1
> > 
> > Try
> >   6: md0 = raid10,f2 of sda1+sdb1+sdc1+sdd1
> 
> That I already tried, (and I wrongly stated that I used f4 in stead of
> f2). I had two times a thruput of about 300 MB/s but since then I could
> not reproduce the behaviour. Are there errors on this that has been
> corrected in newer kernels?

No, I don't think any performance related changes have been made to
raid10 lately.

You could try increasing the read-ahead size.  For a 4-drive raid10 it
defaults to 4 times the read-ahead setting of a single drive, but
increasing substantially (e.g. 64 times) seem to increase the speed of
"dd" reading a gigabyte.
Whether that will actually affect your target workload is a different question.
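For example (a sketch; the value is in 512-byte sectors, and the device name
is just an example):

   blockdev --getra /dev/md0         # current read-ahead
   blockdev --setra 65536 /dev/md0   # increase it substantially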

> 
> 
> > Also try raid10,o2 with a largeish chunksize (256KB is probably big
> > enough).
> 
> I tried that too, but my mdadm did not allow me to use the o flag.
> 
> My kernel is 2.6.12  and mdadm is v1.12.0 - 14 June 2005.
> can I upgrade the mdadm alone to a newer version, and then which is
> recommendable?

You would need a newer kernel and a newer mdadm to get raid10 - offset
mode.

NeilBrown


Re: striping of a 4 drive raid10

2008-01-27 Thread Neil Brown
On Sunday January 27, [EMAIL PROTECTED] wrote:
> Hi
> 
> I have tried to make a striping raid out of my new 4 x 1 TB
> SATA-2 disks. I tried raid10,f2 in several ways:
> 
> 1: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, md2 = raid0
> of md0+md1
> 
> 2: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, md2 = raid01,f2
> of md0+md1
> 
> 3: md0 = raid10,f2 of sda1+sdb1, md1= raid10,f2 of sdc1+sdd1, chunksize of 
> md0 =md1 =128 KB,  md2 = raid0 of md0+md1 chunksize = 256 KB
> 
> 4: md0 = raid0 of sda1+sdb1, md1= raid0 of sdc1+sdd1, chunksize
> of md0 = md1 = 128 KB, md2 = raid01,f2 of md0+md1 chunksize = 256 KB
> 
> 5: md0= raid10,f4 of sda1+sdb1+sdc1+sdd1

Try
  6: md0 = raid10,f2 of sda1+sdb1+sdc1+sdd1

Also try raid10,o2 with a largeish chunksize (256KB is probably big
enough).

NeilBrown


> 
> My new disks give a transfer rate of about 80 MB/s, so I expected
> to have something like 320 MB/s for the whole raid, but I did not get
> more than about 180 MB/s.
> 
> I think it may be something with the layout, that in effect 
> the drives should be something like:
> 
>   sda1 sdb1sdc1  sdd1
>01   2 3
>45   6 7
> 
> And this was not really doable for the combination of raids,
> because those combinations give different block layouts.
> 
> How can it be done? Do we need a new raid type?
> 
> Best regards
> keld


Re: write-intent bitmaps

2008-01-27 Thread Neil Brown
On Sunday January 27, [EMAIL PROTECTED] wrote:
> http://lists.debian.org/debian-devel/2008/01/msg00921.html
> 
> Are they regarded as a stable feature?  If so I'd like to see distributions 
> supporting them by default.  I've started a discussion in Debian on this 
> topic, see the above URL for details.

Yes, it is regarded as stable.

However it can be expected to reduce write throughput.  A reduction of
several percent would not be surprising, and depending on the workload it
could probably be much higher.

It is quite easy to add or remove a bitmap on an active array, so
making it a default would probably be fine providing it was easy for
an admin to find out about it and remove the bitmap if they wanted the
extra performance.
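For example (the device name is just an example):

   mdadm --grow /dev/md0 --bitmap=internal   # add a write-intent bitmap
   mdadm --grow /dev/md0 --bitmap=none       # remove it again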

NeilBrown


Re: idle array consuming cpu ??!!

2008-01-23 Thread Neil Brown
On Tuesday January 22, [EMAIL PROTECTED] wrote:
> Neil Brown ([EMAIL PROTECTED]) wrote on 21 January 2008 12:15:
>  >On Sunday January 20, [EMAIL PROTECTED] wrote:
>  >> A raid6 array with a spare and bitmap is idle: not mounted and with no
>  >> IO to it or any of its disks (obviously), as shown by iostat. However
>  >> it's consuming cpu: since reboot it used about 11min in 24h, which is 
> quite
>  >> a lot even for a busy array (the cpus are fast). The array was cleanly
>  >> shutdown so there's been no reconstruction/check or anything else.
>  >> 
>  >> How can this be? Kernel is 2.6.22.16 with the two patches for the
>  >> deadlock ("[PATCH 004 of 4] md: Fix an occasional deadlock in raid5 -
>  >> FIX") and the previous one.
>  >
>  >Maybe the bitmap code is waking up regularly to do nothing.
>  >
>  >Would you be happy to experiment?  Remove the bitmap with
>  >   mdadm --grow /dev/mdX --bitmap=none
>  >
>  >and see how that affects cpu usage?
> 
> Confirmed, removing the bitmap stopped cpu consumption.

Thanks.

This patch should substantiallly reduce cpu consumption on an idle
bitmap.

NeilBrown

--
Reduce CPU wastage on idle md array with a write-intent bitmap.

On an md array with a write-intent bitmap, a thread wakes up every few
seconds and scans the bitmap looking for work to do.  If the
array is idle, there will be no work to do, but a lot of scanning is
done to discover this.

So cache the fact that the bitmap is completely clean, and avoid
scanning the whole bitmap when the cache is known to be clean.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/bitmap.c |   19 +--
 ./include/linux/raid/bitmap.h |2 ++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c   2008-01-24 15:53:45.0 +1100
+++ ./drivers/md/bitmap.c   2008-01-24 15:54:29.0 +1100
@@ -1047,6 +1047,11 @@ void bitmap_daemon_work(struct bitmap *b
if (time_before(jiffies, bitmap->daemon_lastrun + 
bitmap->daemon_sleep*HZ))
return;
bitmap->daemon_lastrun = jiffies;
+   if (bitmap->allclean) {
+   bitmap->mddev->thread->timeout = MAX_SCHEDULE_TIMEOUT;
+   return;
+   }
+   bitmap->allclean = 1;
 
for (j = 0; j < bitmap->chunks; j++) {
bitmap_counter_t *bmc;
@@ -1068,8 +1073,10 @@ void bitmap_daemon_work(struct bitmap *b
clear_page_attr(bitmap, page, 
BITMAP_PAGE_NEEDWRITE);
 
spin_unlock_irqrestore(&bitmap->lock, flags);
-   if (need_write)
+   if (need_write) {
write_page(bitmap, page, 0);
+   bitmap->allclean = 0;
+   }
continue;
}
 
@@ -1098,6 +1105,9 @@ void bitmap_daemon_work(struct bitmap *b
 /*
   if (j < 100) printk("bitmap: j=%lu, *bmc = 0x%x\n", j, *bmc);
 */
+   if (*bmc)
+   bitmap->allclean = 0;
+
if (*bmc == 2) {
*bmc=1; /* maybe clear the bit next time */
set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
@@ -1132,6 +1142,8 @@ void bitmap_daemon_work(struct bitmap *b
}
}
 
+   if (bitmap->allclean == 0)
+   bitmap->mddev->thread->timeout = bitmap->daemon_sleep * HZ;
 }
 
 static bitmap_counter_t *bitmap_get_counter(struct bitmap *bitmap,
@@ -1226,6 +1238,7 @@ int bitmap_startwrite(struct bitmap *bit
sectors -= blocks;
else sectors = 0;
}
+   bitmap->allclean = 0;
return 0;
 }
 
@@ -1296,6 +1309,7 @@ int bitmap_start_sync(struct bitmap *bit
}
}
spin_unlock_irq(&bitmap->lock);
+   bitmap->allclean = 0;
return rv;
 }
 
@@ -1332,6 +1346,7 @@ void bitmap_end_sync(struct bitmap *bitm
}
  unlock:
spin_unlock_irqrestore(&bitmap->lock, flags);
+   bitmap->allclean = 0;
 }
 
 void bitmap_close_sync(struct bitmap *bitmap)
@@ -1399,7 +1414,7 @@ static void bitmap_set_memory_bits(struc
set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
}
spin_unlock_irq(&bitmap->lock);
-
+   bitmap->allclean = 0;
 }
 
 /* dirty the memory and file bits for bitmap chunks "s" to "e" */

diff .prev/include/linux/raid/bitmap.h ./include/linux/raid/bitmap.h
--- .prev/incl

Re: [BUG] The kernel thread for md RAID1 could cause a md RAID1 array deadlock

2008-01-23 Thread Neil Brown
On Tuesday January 15, [EMAIL PROTECTED] wrote:
> 
> This message describes the details about md-RAID1 issue found by
> testing the md RAID1 using the SCSI fault injection framework.
> 
> Abstract:
> Both the error handler for md RAID1 and write access request to the md RAID1
> use raid1d kernel thread. The nr_pending flag could cause a race condition
> in raid1d, resulting in a raid1d deadlock.

Thanks for finding and reporting this.

I believe the following patch should fix the deadlock.

If you are able to repeat your test and confirm this I would
appreciate it.

Thanks,
NeilBrown



Fix deadlock in md/raid1 when handling a read error.

When handling a read error, we freeze the array to stop any other
IO while attempting to over-write with correct data.

This is done in the raid1d thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the
retry queue - these are counted in ->nr_queued and will stay there during
a freeze).

However write requests need attention from raid1d as bitmap updates
might be required.  This can cause a deadlock as raid1 is waiting for
requests to finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention,
and call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to "K.Tanaka" <[EMAIL PROTECTED]> for finding and reporting
this problem.

Cc: "K.Tanaka" <[EMAIL PROTECTED]>
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid1.c |   66 ++-
 1 file changed, 45 insertions(+), 21 deletions(-)

diff .prev/drivers/md/raid1.c ./drivers/md/raid1.c
--- .prev/drivers/md/raid1.c2008-01-18 11:19:09.0 +1100
+++ ./drivers/md/raid1.c2008-01-24 14:21:55.0 +1100
@@ -592,6 +592,37 @@ static int raid1_congested(void *data, i
 }
 
 
+static int flush_pending_writes(conf_t *conf)
+{
+   /* Any writes that have been queued but are awaiting
+* bitmap updates get flushed here.
+* We return 1 if any requests were actually submitted.
+*/
+   int rv = 0;
+
+   spin_lock_irq(&conf->device_lock);
+
+   if (conf->pending_bio_list.head) {
+   struct bio *bio;
+   bio = bio_list_get(&conf->pending_bio_list);
+   blk_remove_plug(conf->mddev->queue);
+   spin_unlock_irq(&conf->device_lock);
+   /* flush any pending bitmap writes to
+* disk before proceeding w/ I/O */
+   bitmap_unplug(conf->mddev->bitmap);
+
+   while (bio) { /* submit pending writes */
+   struct bio *next = bio->bi_next;
+   bio->bi_next = NULL;
+   generic_make_request(bio);
+   bio = next;
+   }
+   rv = 1;
+   } else
+   spin_unlock_irq(&conf->device_lock);
+   return rv;
+}
+
 /* Barriers
  * Sometimes we need to suspend IO while we do something else,
  * either some resync/recovery, or reconfigure the array.
@@ -678,10 +709,14 @@ static void freeze_array(conf_t *conf)
spin_lock_irq(&conf->resync_lock);
conf->barrier++;
conf->nr_waiting++;
+   spin_unlock_irq(&conf->resync_lock);
+
+   spin_lock_irq(&conf->resync_lock);
wait_event_lock_irq(conf->wait_barrier,
conf->barrier+conf->nr_pending == conf->nr_queued+2,
conf->resync_lock,
-   raid1_unplug(conf->mddev->queue));
+   ({ flush_pending_writes(conf);
+  raid1_unplug(conf->mddev->queue); }));
spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(conf_t *conf)
@@ -907,6 +942,9 @@ static int make_request(struct request_q
blk_plug_device(mddev->queue);
spin_unlock_irqrestore(&conf->device_lock, flags);
 
+   /* In case raid1d snuck into freeze_array */
+   wake_up(&conf->wait_barrier);
+
if (do_sync)
md_wakeup_thread(mddev->thread);
 #if 0
@@ -1473,28 +1511,14 @@ static void raid1d(mddev_t *mddev)

for (;;) {
char b[BDEVNAME_SIZE];
-   spin_lock_irqsave(&conf->device_lock, flags);
-
-   if (conf->pending_bio_list.head) {
-   bio = bio_list_get(&conf->pending_bio_list);
-   blk_remove_plug(mddev->queue);
-   spin_unlock_irqrestore(&conf->device_lock, flags);
-   /* flush any pending bitmap writes to disk before 
proceeding w/ I/O */
-   bitmap_unplug(mddev->bitmap)

Re: Fwd: Error on /dev/sda, but takes down RAID-1

2008-01-23 Thread Neil Brown
On Wednesday January 23, [EMAIL PROTECTED] wrote:
> Hi, 
> 
> I'm not sure this is completely linux-raid related, but I can't figure out 
> where to start: 
> 
> A few days ago, my server died. I was able to log in and salvage this content 
> of dmesg: 
> http://pastebin.com/m4af616df 

At line 194:

   end_request: I/O error, dev sdb, sector 80324865

then at line 384

   end_request: I/O error, dev sda, sector 80324865

> 
> I talked to my hosting-people and they said it was an io-error on /dev/sda, 
> and replaced that drive. 
> After this, I was able to boot into a PXE-image and re-build the two RAID-1 
> devices with no problems - indicating that sdb was fine. 
> 
> I expected RAID-1 to be able to stomach exactly this kind of error - one 
> drive dying. What did I do wrong? 

Trouble is it wasn't "one drive dying".  You got errors from two
drives, at almost exactly the same time.  So maybe the controller
died.  Or maybe when one drive died, the controller or the driver got
confused and couldn't work with the other drive any more.

Certainly the "blk: request botched" message (line 233 onwards)
suggest some confusion in the driver.

Maybe post to [EMAIL PROTECTED] - that is where issues with
SATA drivers and controllers can be discussed.

NeilBrown




Re: array doesn't run even with --force

2008-01-20 Thread Neil Brown
On Monday January 21, [EMAIL PROTECTED] wrote:
> 
> The command is
> 
> mdadm -A --verbose -f -R /dev/md3 /dev/sda4 /dev/sdc4 /dev/sde4 /dev/sdd4
> 
> The failed areas are sdb4 (which I didn't include above) and sdd4. I
> did a "dd if=/dev/sdb4 of=/dev/hda4 bs=512 conv=noerror" and it
> complained about roughly 10 bad sectors. I did "dd if=/dev/sdd4
> of=/dev/hdc4 bs=512 conv=noerror" and there were no errors, that's why
> I used sdd4 above. I tried to substitute hdc4 for sdd4, and hda4 for
> sdb4, to no avail.
> 
> I don't have kernel logs because the failed area has /home and /var.
> The double fault occurred during the holidays, so I don't know which
> happened first. Below are the output of the command above and of
> --examine.
> 
> mdadm: looking for devices for /dev/md3
> mdadm: /dev/sda4 is identified as a member of /dev/md3, slot 0.
> mdadm: /dev/sdc4 is identified as a member of /dev/md3, slot 2.
> mdadm: /dev/sde4 is identified as a member of /dev/md3, slot 4.
> mdadm: /dev/sdd4 is identified as a member of /dev/md3, slot 5.
> mdadm: no uptodate device for slot 1 of /dev/md3
> mdadm: added /dev/sdc4 to /dev/md3 as 2
> mdadm: no uptodate device for slot 3 of /dev/md3
> mdadm: added /dev/sde4 to /dev/md3 as 4
> mdadm: added /dev/sdd4 to /dev/md3 as 5
> mdadm: added /dev/sda4 to /dev/md3 as 0
> mdadm: failed to RUN_ARRAY /dev/md3: Input/output error
> mdadm: Not enough devices to start the array.

So no device claims to be member '1' or '3' of the array, and as you
cannot start an array with 2 devices missing, there is nothing that
mdadm can do.  It has no way of knowing what should go in as '1' or
'3'.

As you note, sda4 says that it thinks slot 1 is still active/sync, but
it doesn't seem to know which device should go there either.
However that does indicate that slot 3 failed first and slot 1 failed
later.  So if we have candidates for both, slot 1 is probably more
uptodate.

You need to tell mdadm what goes where by creating the array.
e.g. if you think that sdb4 is adequately reliable and that it was in
slot 1, then

 mdadm -C /dev/md3 -l5 -n5 -c 128 /dev/sda4 /dev/sdb4 /dev/sdc4 missing 
/dev/sde4

alternately if you think it best to use sdd, and it was in slot 3,
then

 mdadm -C /dev/md3 -l5 -n5 -c 128 /dev/sda4 missing /dev/sdc4 /dev/sdd4 
/dev/sde4

would be the command to use.

Note that this command will not touch any data.  It will just
overwrite the superblock and assemble the array.
You can then 'fsck' or whatever to confirm that the data looks good.
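
For example, a read-only check along these lines (assuming an ext2/3
filesystem sits directly on /dev/md3 - adjust to your setup) will not
write anything while you decide whether you picked the right device:

   fsck -n /dev/md3      # answers "no" to every question, changes nothing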

good luck.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: idle array consuming cpu ??!!

2008-01-20 Thread Neil Brown
On Sunday January 20, [EMAIL PROTECTED] wrote:
> A raid6 array with a spare and bitmap is idle: not mounted and with no
> IO to it or any of its disks (obviously), as shown by iostat. However
> it's consuming cpu: since reboot it used about 11min in 24h, which is quite
> a lot even for a busy array (the cpus are fast). The array was cleanly
> shutdown so there's been no reconstruction/check or anything else.
> 
> How can this be? Kernel is 2.6.22.16 with the two patches for the
> deadlock ("[PATCH 004 of 4] md: Fix an occasional deadlock in raid5 -
> FIX") and the previous one.

Maybe the bitmap code is waking up regularly to do nothing.

Would you be happy to experiment?  Remove the bitmap with
   mdadm --grow /dev/mdX --bitmap=none

and see how that affects cpu usage?
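
(If the experiment doesn't show anything interesting, the bitmap can be
put back afterwards with

   mdadm --grow /dev/mdX --bitmap=internal

so nothing is lost by trying.)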

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: array doesn't run even with --force

2008-01-20 Thread Neil Brown
On Sunday January 20, [EMAIL PROTECTED] wrote:
> I've got a raid5 array with 5 disks where 2 failed. The failures are
> occasional and only on a few sectors so I tried to assemble it with 4
> disks anyway:
> 
> # mdadm -A -f -R /dev/md /dev/disk1 /dev/disk2 /dev/disk3 /dev/disk4
> 
> However mdadm complains that one of the disks has an out-of-date
> superblock and kicks it out, and then it cannot run the array with
> only 3 disks.
> 
> Shouldn't it adjust the superblock and assemble-run it anyway? That's
> what -f is for, no? This is with kernel 2.6.22.16 and mdadm 2.6.4.

Please provide actual commands and actual output.
Also add "--verbose" to the assemble command
Also provide "--examine" for all devices.
Also provide any kernel log messages.
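
That is, something along these lines - the array and device names here
are just placeholders based on your earlier mail, substitute your real ones:

   mdadm --assemble --force --run --verbose /dev/mdX /dev/disk1 /dev/disk2 /dev/disk3 /dev/disk4
   mdadm --examine /dev/disk1 /dev/disk2 /dev/disk3 /dev/disk4
   dmesg | tail -n 100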

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: do_md_run returned -22 [Was: 2.6.24-rc8-mm1]

2008-01-17 Thread Neil Brown
On Thursday January 17, [EMAIL PROTECTED] wrote:
> On Thu, 17 Jan 2008 16:23:30 +0100 Jiri Slaby <[EMAIL PROTECTED]> wrote:
> 
> > On 01/17/2008 11:35 AM, Andrew Morton wrote:
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24-rc8/2.6.24-rc8-mm1/
> > 
> > still the same md issue (do_md_run returns -22=EINVAL) as in -rc6-mm1 
> > reported 
> > by Thorsten here:
> > http://lkml.org/lkml/2007/12/27/45
> 
> hm, I must have been asleep when that was reported.  Neil, did you see it?

No, even though it was Cc:ed to me - sorry.
Maybe a revised subject line would have helped... maybe not.

> 
> > Is there around any fix for this?
> 
> Well, we could bitbucket md-allow-devices-to-be-shared-between-md-arrays.patch

Yeah, do that.  I'll send you something new.
I'll move that chunk into a different patch and add the extra bits
needed to make that test correct in *all* cases rather than just the
ones I was thinking about at the time.
My test suite does try in-kernel-autodetect (the problem case) but it
didn't catch this bug due to another bug.  I'll fix that too.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How do I get rid of old device?

2008-01-16 Thread Neil Brown
On Wednesday January 16, [EMAIL PROTECTED] wrote:
> p34:~# mdadm /dev/md3 --zero-superblock
> p34:~# mdadm --examine --scan
> ARRAY /dev/md0 level=raid1 num-devices=2 
> UUID=f463057c:9a696419:3bcb794a:7aaa12b2
> ARRAY /dev/md1 level=raid1 num-devices=2 
> UUID=98e4948c:c6685f82:e082fd95:e7f45529
> ARRAY /dev/md2 level=raid1 num-devices=2 
> UUID=330c9879:73af7d3e:57f4c139:f9191788
> ARRAY /dev/md3 level=raid0 num-devices=10 
> UUID=6dc12c36:b3517ff9:083fb634:68e9eb49
> p34:~#
> 
> I cannot seem to get rid of /dev/md3, its almost as if there is a piece of 
> it on the root (2) disks or reference to it?
> 
> I also dd'd the other 10 disks (non-root) and /dev/md3 persists.

You don't zero the superblock on the array device, because the array
device does not have a superblock.  The component devices have the
superblock.

So
  mdadm --zero-superblock /dev/sd*
or whatever.
Maybe
  mdadm --examine --scan -v

then get the list of devices it found for the array you want to kill,
and  --zero-superblock that list.
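
e.g. something like (device names purely illustrative):

   mdadm --examine --scan -v          # note the devices= list for the array
   mdadm --zero-superblock /dev/sdk1 /dev/sdl1 /dev/sdm1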

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 001 of 6] md: Fix an occasional deadlock in raid5

2008-01-16 Thread Neil Brown
On Tuesday January 15, [EMAIL PROTECTED] wrote:
> On Wed, 16 Jan 2008 00:09:31 -0700 "Dan Williams" <[EMAIL PROTECTED]> wrote:
> 
> > > heheh.
> > >
> > > it's really easy to reproduce the hang without the patch -- i could
> > > hang the box in under 20 min on 2.6.22+ w/XFS and raid5 on 7x750GB.
> > > i'll try with ext3... Dan's experiences suggest it won't happen with ext3
> > > (or is even more rare), which would explain why this has is overall a
> > > rare problem.
> > >
> > 
> > Hmmm... how rare?
> > 
> > http://marc.info/?l=linux-kernel&m=119461747005776&w=2
> > 
> > There is nothing specific that prevents other filesystems from hitting
> > it, perhaps XFS is just better at submitting large i/o's.  -stable
> > should get some kind of treatment.  I'll take altered performance over
> > a hung system.
> 
> We can always target 2.6.25-rc1 then 2.6.24.1 if Neil is still feeling
> wimpy.

I am feeling wimpy.  There've been a few too many raid5 breakages
recently and it is very hard to really judge the performance impact of
this change.  I even have a small uncertainty of correctness - could
it still hang in some other way?  I don't think so, but this is
complex code...

If it were really common I would have expected more noise on the
mailing list.  Sure, there has been some, but not much.  However maybe
people are searching the archives and finding the "increase stripe
cache size" trick, and not reporting anything  seems unlikely
though.
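
(For reference, the trick in question is just something like

   echo 4096 > /sys/block/mdX/md/stripe_cache_size

with a value big enough that the hang stops.)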

How about we queue it for 2.6.25-rc1 and then about when -rc2 comes
out, we queue it for 2.6.24.y?  Any one (or any distro) that really
needs it can of course grab the patch them selves...

??

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
> 
> Thanks.  I'll see what I can come up with.

How about this, against current -mm

On both the read and write path for an rdev attribute, we
call mddev_lock, first checking that mddev is not NULL.
Once we get the lock, we check again.
If rdev->mddev is not NULL, we know it will stay that way as it only
gets cleared under the same lock.

While in the rdev show/store routines, we know that the mddev cannot
get freed, due to the kobject relationships.

rdev_size_store is awkward because it has to drop the lock.  So we
take a copy of rdev->mddev before the drop, and we are safe...

Comments?

NeilBrown

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c |   35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2008-01-14 12:26:15.0 +1100
+++ ./drivers/md/md.c   2008-01-14 17:05:53.0 +1100
@@ -1998,9 +1998,11 @@ rdev_size_store(mdk_rdev_t *rdev, const 
char *e;
unsigned long long size = simple_strtoull(buf, &e, 10);
unsigned long long oldsize = rdev->size;
+   mddev_t *my_mddev = rdev->mddev;
+
if (e==buf || (*e && *e != '\n'))
return -EINVAL;
-   if (rdev->mddev->pers)
+   if (my_mddev->pers)
return -EBUSY;
rdev->size = size;
if (size > oldsize && rdev->mddev->external) {
@@ -2013,7 +2015,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
int overlap = 0;
struct list_head *tmp, *tmp2;
 
-   mddev_unlock(rdev->mddev);
+   mddev_unlock(my_mddev);
for_each_mddev(mddev, tmp) {
mdk_rdev_t *rdev2;
 
@@ -2033,7 +2035,7 @@ rdev_size_store(mdk_rdev_t *rdev, const 
break;
}
}
-   mddev_lock(rdev->mddev);
+   mddev_lock(my_mddev);
if (overlap) {
/* Someone else could have slipped in a size
 * change here, but doing so is just silly.
@@ -2045,8 +2047,8 @@ rdev_size_store(mdk_rdev_t *rdev, const 
return -EBUSY;
}
}
-   if (size < rdev->mddev->size || rdev->mddev->size == 0)
-   rdev->mddev->size = size;
+   if (size < my_mddev->size || my_mddev->size == 0)
+   my_mddev->size = size;
return len;
 }
 
@@ -2067,10 +2069,21 @@ rdev_attr_show(struct kobject *kobj, str
 {
struct rdev_sysfs_entry *entry = container_of(attr, struct 
rdev_sysfs_entry, attr);
mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
+   mddev_t *mddev = rdev->mddev;
+   ssize_t rv;
 
if (!entry->show)
return -EIO;
-   return entry->show(rdev, page);
+
+   rv = mddev ? mddev_lock(mddev) : -EBUSY;
+   if (!rv) {
+   if (rdev->mddev == NULL)
+   rv = -EBUSY;
+   else
+   rv = entry->show(rdev, page);
+   mddev_unlock(mddev);
+   }
+   return rv;
 }
 
 static ssize_t
@@ -2079,15 +2092,19 @@ rdev_attr_store(struct kobject *kobj, st
 {
struct rdev_sysfs_entry *entry = container_of(attr, struct 
rdev_sysfs_entry, attr);
mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
-   int rv;
+   ssize_t rv;
+   mddev_t *mddev = rdev->mddev;
 
if (!entry->store)
return -EIO;
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
-   rv = mddev_lock(rdev->mddev);
+   rv = mddev ? mddev_lock(mddev): -EBUSY;
if (!rv) {
-   rv = entry->store(rdev, page, length);
+   if (rdev->mddev == NULL)
+   rv = -EBUSY;
+   else
+   rv = entry->store(rdev, page, length);
mddev_unlock(rdev->mddev);
}
return rv;
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
> On Mon, Jan 14, 2008 at 02:21:45PM +1100, Neil Brown wrote:
> 
> > Maybe it isn't there any more
> > 
> > Once upon a time, when I 
> >echo remove > /sys/block/mdX/md/dev-YYY/state
> 
> Egads.  And just what will protect you from parallel callers
> of state_store()?  buffer->mutex does *not* do that - it only
> gives you exclusion on given struct file.  Run the command
> above from several shells and you've got independent open
> from each redirect => different struct file *and* different
> buffer for each => no exclusion whatsoever.

well in -mm, rdev_attr_store gets a lock on
rdev->mddev->reconfig_mutex. 
It doesn't test is rdev->mddev is NULL though, so if the write happens
after unbind_rdev_from_array, we lose.
A test for NULL would be easy enough.  And I think that the mddev
won't actually disappear until the rdevs are all gone (you subsequent
comment about kobject_del ordering seems to confirm that) so a simple test
for NULL should be sufficient.

> 
> And _that_ is present right in the mainline tree - it's unrelated
> to -mm kobject changes.
> 
> BTW, yes, you do have a deadlock there - kobject_del() will try to evict
> children, which will include waiting for currently running ->store()
> to finish, which will include the caller since .../state *is* a child of
> that sucker.
> 
> The real problem is the lack of any kind of exclusion considerations in
> md.c itself, AFAICS.  Fun with ordering is secondary (BTW, yes, it is
> a problem - will sysfs ->store() to attribute between export_rdev() and
> kobject_del() work correctly?)

Probably not.  The possibility that rdev->mddev could be NULL would
break a lot of these.  Maybe I should delay setting rdev->mddev to
NULL until after the kobject_del.  Then audit them all.

Thanks.  I'll see what I can come up with.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 002 of 6] md: Fix use-after-free bug when dropping an rdev from an md array.

2008-01-13 Thread Neil Brown
On Monday January 14, [EMAIL PROTECTED] wrote:
> On Mon, Jan 14, 2008 at 12:45:31PM +1100, NeilBrown wrote:
> > 
> > Due to possible deadlock issues we need to use a schedule work to
> > kobject_del an 'rdev' object from a different thread.
> > 
> > A recent change means that kobject_add no longer gets a reference, and
> > kobject_del doesn't put a reference.  Consequently, we need to
> > explicitly hold a reference to ensure that the last reference isn't
> > dropped before the scheduled work get a chance to call kobject_del.
> > 
> > Also, rename delayed_delete to md_delayed_delete so that it is more
> > obvious in a stack trace which code is to blame.
> 
> I don't know...  You still get kobject_del() and export_rdev()
> in unpredictable order; sure, it won't be freed under you, but...

I cannot see that that would matter.
kobject_del deletes the object from the kobj tree and frees its sysfs entries.
export_rdev disconnects the objects from md structures and releases
the connection with the device.  They are quite independent.

> 
> What is that deadlock problem, anyway?  I don't see anything that
> would look like an obvious candidate in the stuff you are delaying...

Maybe it isn't there any more

Once upon a time, when I 
   echo remove > /sys/block/mdX/md/dev-YYY/state

sysfs_write_file would hold buffer->sem while calling my store
handler.
When my store handler tried to delete the relevant kobject, it would
eventually call orphan_all_buffers which would try to take buf->sem
and deadlock.

orphan_all_buffers doesn't exist any more, so maybe the deadlock is
gone too.
However the comment at the top of sysfs_schedule_callback in
sysfs/file.c says:

 *
 * sysfs attribute methods must not unregister themselves or their parent
 * kobject (which would amount to the same thing).  Attempts to do so will
 * deadlock, since unregistration is mutually exclusive with driver
 * callbacks.
 *

so I'm inclined to leave the code as it is... of course the comment
could be well out of date.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
> On Jan 10, 2008 12:13 AM, dean gaudet <[EMAIL PROTECTED]> wrote:
> > w.r.t. dan's cfq comments -- i really don't know the details, but does
> > this mean cfq will misattribute the IO to the wrong user/process?  or is
> > it just a concern that CPU time will be spent on someone's IO?  the latter
> > is fine to me... the former seems sucky because with today's multicore
> > systems CPU time seems cheap compared to IO.
> >
> 
> I do not see this affecting the time slicing feature of cfq, because
> as Neil says the work has to get done at some point.   If I give up
> some of my slice working on someone else's I/O chances are the favor
> will be returned in kind since the code does not discriminate.  The
> io-priority capability of cfq currently does not work as advertised
> with current MD since the priority is tied to the current thread and
> the thread that actually submits the i/o on a stripe is
> non-deterministic.  So I do not see this change making the situation
> any worse.  In fact, it may make it a bit better since there is a
> higher chance for the thread submitting i/o to MD to do its own i/o to
> the backing disks.
> 
> Reviewed-by: Dan Williams <[EMAIL PROTECTED]>

Thanks.
But I suspect you didn't test it with a bitmap :-)
I ran the mdadm test suite and it hit a problem - easy enough to fix.

I'll look out for any other possible related problem (due to raid5d
running in different processes) and then submit it.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: The effects of multiple layers of block drivers

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
> Hello,
> 
> I am starting to dig into the Block subsystem to try and uncover the
> reason for some data I lost recently.  My situation is that I have
> multiple block drivers on top of each other and am wondering how the
> effects of a raid 5 rebuild would affect the block devices above it.

It should "just work" - no surprises.  raid5 is just a block device
like any other.  When doing a rebuild it might be a bit slower, but
that is all.

> 
> The layers are raid 5 -> lvm -> cryptoloop.  It seems that after the
> raid 5 device was rebuilt by adding in a new disk, that the cryptoloop
> doesn't have a valid ext3 partition on it.

There was a difference of opinion between raid5 and dm-crypt which
could cause some corruption.
What kernel version are you using, and are you using dm-crypt or loop
(e..g losetup) with encryption?


> 
> As a raid device re-builds is there any rearranging of sectors or
> corresponding blocks that would affect another block device on top of it?

No.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md rotates RAID5 spare at boot

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
> (Sorry- yes it looks like I posted an incorrect dmesg extract)

This still doesn't seem to match your description.
I see:

> [   41.247389] md: bind
> [   41.247584] md: bind
> [   41.247787] md: bind
> [   41.247971] md: bind
> [   41.248151] md: bind
> [   41.248325] md: bind
> [   41.256718] raid5: device sde1 operational as raid disk 0
> [   41.256771] raid5: device sdc1 operational as raid disk 4
> [   41.256821] raid5: device sda1 operational as raid disk 3
> [   41.256870] raid5: device sdb1 operational as raid disk 2
> [   41.256919] raid5: device sdf1 operational as raid disk 1
> [   41.257426] raid5: allocated 5245kB for md0
> [   41.257476] raid5: raid level 5 set md0 active with 5 out of 5 
> devices, algorithm 2

which looks like 'md0' started with 5 of 5 drives, plus sdg1 is there as
a spare.  And

> [   41.312250] md: bind
> [   41.312476] md: bind
> [   41.312711] md: bind
> [   41.312922] md: bind
> [   41.313138] md: bind
> [   41.313343] md: bind
> [   41.313452] md: md1: raid array is not clean -- starting background 
> reconstruction
> [   41.322189] raid5: device sde2 operational as raid disk 0
> [   41.322243] raid5: device sdc2 operational as raid disk 4
> [   41.322292] raid5: device sdg2 operational as raid disk 3
> [   41.322342] raid5: device sdb2 operational as raid disk 2
> [   41.322391] raid5: device sdf2 operational as raid disk 1
> [   41.322823] raid5: allocated 5245kB for md1
> [   41.322872] raid5: raid level 5 set md1 active with 5 out of 5 
> devices, algorithm 2

md1 also assembled with 5/5 drives and sda2 as a spare.  
This one was not shut down cleanly so it started a resync.  But there
is no evidence of anything starting degraded.



NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md rotates RAID5 spare at boot

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
> One quick question about those rules.  The 65-mdadm rule looks like it 
> checks ACTIVE arrays for filesystems, and the 85 rule assembles arrays.  
> Shouldn't they run in the other order?
> 

They are fine.  The '65' rule applies to arrays.  I.e. it fires on an
array device once it has been started.
The '85' rule applies to component devices.

They are quite independent.

NeilBrown


> 
> 
> 
> distro: Ubuntu 7.10
> 
> Two files show up...
> 
> 85-mdadm.rules:
> # This file causes block devices with Linux RAID (mdadm) signatures to
> # automatically cause mdadm to be run.
> # See udev(8) for syntax
> 
> SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
> RUN+="watershed /sbin/mdadm --assemble --scan --no-degraded"
> 
> 
> 
> 65-mdadm.vol_id.rules:
> # This file causes Linux RAID (mdadm) block devices to be checked for
> # further filesystems if the array is active.
> # See udev(8) for syntax
> 
> SUBSYSTEM!="block", GOTO="mdadm_end"
> KERNEL!="md[0-9]*", GOTO="mdadm_end"
> ACTION!="add|change", GOTO="mdadm_end"
> 
> # Check array status
> ATTR{md/array_state}=="|clear|inactive", GOTO="mdadm_end"
> 
> # Obtain array information
> IMPORT{program}="/sbin/mdadm --detail --export $tempnode"
> ENV{MD_NAME}=="?*", SYMLINK+="disk/by-id/md-name-$env{MD_NAME}"
> ENV{MD_UUID}=="?*", SYMLINK+="disk/by-id/md-uuid-$env{MD_UUID}"
> 
> # by-uuid and by-label symlinks
> IMPORT{program}="vol_id --export $tempnode"
> OPTIONS="link_priority=-100"
> ENV{ID_FS_USAGE}=="filesystem|other|crypto", ENV{ID_FS_UUID_ENC}=="?*", \
> SYMLINK+="disk/by-uuid/$env{ID_FS_UUID_ENC}"
> ENV{ID_FS_USAGE}=="filesystem|other", ENV{ID_FS_LABEL_ENC}=="?*", \
> SYMLINK+="disk/by-label/$env{ID_FS_LABEL_ENC}"
> 
> 
> I see.  So udev is invoking the assemble command as soon as it detects 
> the devices.  So is it possible that the spare is not the last drive to 
> be detected and mdadm assembles too soon?
> 
> 
> 
> Neil Brown wrote:
> > On Thursday January 10, [EMAIL PROTECTED] wrote:
> >   
> >> It looks to me like md inspects and attempts to assemble after each 
> >> drive controller is scanned (from dmesg, there appears to be a failed 
> >> bind on the first three devices after they are scanned, and then again 
> >> when the second controller is scanned).  Would the scan order cause a 
> >> spare to be swapped in?
> >>
> >> 
> >
> > This suggests that "mdadm --incremental" is being used to assemble the
> > arrays.  Every time udev finds a new device, it gets added to
> > whichever array it should be in.
> > If it is called as "mdadm --incremental --run", then it will get
> > started as soon as possible, even if it is degraded.  Without the
> > "--run", it will wait until all devices are available.
> >
> > Even with "mdadm --incremental --run", you shouldn't get a resync if
> > the last device is added before the array is written to.
> >
> > What distro are you running?
> > What does
> >grep -R mdadm /etc/udev
> >
> > show?
> >
> > NeilBrown
> >
> >   
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md rotates RAID5 spare at boot

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
> distro: Ubuntu 7.10
> 
> Two files show up...
> 
> 85-mdadm.rules:
> # This file causes block devices with Linux RAID (mdadm) signatures to
> # automatically cause mdadm to be run.
> # See udev(8) for syntax
> 
> SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", \
> RUN+="watershed /sbin/mdadm --assemble --scan --no-degraded"

> 
> I see.  So udev is invoking the assemble command as soon as it detects 
> the devices.  So is it possible that the spare is not the last drive to 
> be detected and mdadm assembles too soon?

The "--no-degraded' should stop it from assembling until all expected
devices have been found.  It could assemble before the spare is found,
but should not assemble before all the data devices have been found.

The "dmesg" trace you included in your first mail doesn't actually
show anything wrong - it never starts an incomplete array.
Can you try again and get a trace where there definitely is a rebuild
happening.

And please don't drop linux-raid from the 'cc' list.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md rotates RAID5 spare at boot

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
> 
> It looks to me like md inspects and attempts to assemble after each 
> drive controller is scanned (from dmesg, there appears to be a failed 
> bind on the first three devices after they are scanned, and then again 
> when the second controller is scanned).  Would the scan order cause a 
> spare to be swapped in?
> 

This suggests that "mdadm --incremental" is being used to assemble the
arrays.  Every time udev finds a new device, it gets added to
whichever array it should be in.
If it is called as "mdadm --incremental --run", then it will get
started as soon as possible, even if it is degraded.  Without the
"--run", it will wait until all devices are available.

Even with "mdadm --incremental --run", you shouldn't get a resync if
the last device is added before the array is written to.

What distro are you running?
What does
   grep -R mdadm /etc/udev

show?

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 stuck in degraded, inactive and dirty mode

2008-01-10 Thread Neil Brown
On Thursday January 10, [EMAIL PROTECTED] wrote:
> On Wed, Jan 09, 2008 at 07:16:34PM +1100, CaT wrote:
> > > But I suspect that "--assemble --force" would do the right thing.
> > > Without more details, it is hard to say for sure.
> > 
> I suspect so as well but throwing caution into the wind irks me wrt this
> > raid array. :)
> 
> Sorry. Not to be a pain but considering the previous email with all the
> examine dumps, etc would the above be the way to go? I just don't want
> to have missed something and bugger the array up totally.

Yes, definitely.

The superblocks look perfectly normal for a single drive failure
followed by a crash.  So "--assemble --force" is the way to go.

Technically you could have some data corruption if a write was under
way at the time of the crash.  In that case the parity block of that
stripe could be wrong, so the recovered data for the missing device
could be wrong.
This is why you are required to use "--force" - to confirm that you
are aware that there could be a problem.

It would be worth running "fsck" just to be sure that nothing critical
has been corrupted.  Also if you have a recent backup, I wouldn't
recycle it until I was fairly sure that all your data was really safe.

But in my experience the chance of actual data corruption in this
situation is fairly low.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote:
> On Jan 9, 2008 5:09 PM, Neil Brown <[EMAIL PROTECTED]> wrote:
> > On Wednesday January 9, [EMAIL PROTECTED] wrote:
> >
> > Can you test it please?
> 
> This passes my failure case.

Thanks!

> 
> > Does it seem reasonable?
> 
> What do you think about limiting the number of stripes the submitting
> thread handles to be equal to what it submitted?  If I'm a stripe that
> only submits 1 stripe worth of work should I get stuck handling the
> rest of the cache?

Dunno
Someone has to do the work, and leaving it all to raid5d means that it
all gets done on one CPU.
I expect that most of the time the queue of ready stripes is empty so
make_request will mostly only handle its own stripes anyway.
The times that it handles other threads' stripes will probably balance
out with the times that other threads handle this thread's stripes.

So I'm inclined to leave it as "do as much work as is available to be
done" as that is simplest.  But I can probably be talked out of it
with a convincing argument

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.24-rc6 reproducible raid5 hang

2008-01-09 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote:
> On Sun, 2007-12-30 at 10:58 -0700, dean gaudet wrote:
> > i have evidence pointing to d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > 
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1
> > 
> > which was Neil's change in 2.6.22 for deferring generic_make_request 
> > until there's enough stack space for it.
> > 
> 
> Commit d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 reduced stack utilization
> by preventing recursive calls to generic_make_request.  However the
> following conditions can cause raid5 to hang until 'stripe_cache_size' is
> increased:
> 

Thanks for pursuing this guys.  That explanation certainly sounds very
credible.

The generic_make_request_immed is a good way to confirm that we have
found the bug,  but I don't like it as a long term solution, as it
just reintroduced the problem that we were trying to solve with the
problematic commit.

As you say, we could arrange that all request submission happens in
raid5d and I think this is the right way to proceed.  However we can
still take some of the work into the thread that is submitting the
IO by calling "raid5d()" at the end of make_request, like this.

Can you test it please?  Does it seem reasonable?

Thanks,
NeilBrown


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/md.c|2 +-
 ./drivers/md/raid5.c |4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c   2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/md.c   2008-01-10 11:08:02.0 +1100
@@ -5774,7 +5774,7 @@ void md_check_recovery(mddev_t *mddev)
if (mddev->ro)
return;
 
-   if (signal_pending(current)) {
+   if (current == mddev->thread->tsk && signal_pending(current)) {
if (mddev->pers->sync_request) {
printk(KERN_INFO "md: %s in immediate safe mode\n",
   mdname(mddev));

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2008-01-07 13:32:10.0 +1100
+++ ./drivers/md/raid5.c2008-01-10 11:06:54.0 +1100
@@ -3432,6 +3432,7 @@ static int chunk_aligned_read(struct req
}
 }
 
+static void raid5d (mddev_t *mddev);
 
 static int make_request(struct request_queue *q, struct bio * bi)
 {
@@ -3547,7 +3548,7 @@ static int make_request(struct request_q
goto retry;
}
finish_wait(&conf->wait_for_overlap, &w);
-   handle_stripe(sh, NULL);
+   set_bit(STRIPE_HANDLE, &sh->state);
release_stripe(sh);
} else {
/* cannot get stripe for read-ahead, just give-up */
@@ -3569,6 +3570,7 @@ static int make_request(struct request_q
  test_bit(BIO_UPTODATE, &bi->bi_flags)
? 0 : -EIO);
}
+   raid5d(mddev);
return 0;
 }
 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 stuck in degraded, inactive and dirty mode

2008-01-08 Thread Neil Brown
On Wednesday January 9, [EMAIL PROTECTED] wrote:
> 
> I'd provide data dumps of --examine and friends but I'm in a situation
> where transferring the data would be a right pain. I'll do it if need
> be, though.
> 
> So, what can I do? 

Well, providing the output of "--examine" would help a lot.

But I suspect that "--assemble --force" would do the right thing.
Without more details, it is hard to say for sure.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid 1, new disk can't be added after replacing faulty disk

2008-01-07 Thread Neil Brown
On Monday January 7, [EMAIL PROTECTED] wrote:
> On Jan 7, 2008 6:44 AM, Radu Rendec <[EMAIL PROTECTED]> wrote:
> > I'm experiencing trouble when trying to add a new disk to a raid 1 array
> > after having replaced a faulty disk.
> >
> [..]
> > # mdadm --version
> > mdadm - v2.6.2 - 21st May 2007
> >
> [..]
> > However, this happens with both mdadm 2.6.2 and 2.6.4. I downgraded to
> > 2.5.4 and it works like a charm.
> 
> Looks like you are running into the issue described here:
> http://marc.info/?l=linux-raid&m=119892098129022&w=2

I cannot easily reproduce this.  I suspect it is sensitive to the
exact size of the devices involved.

Please test this patch and see if it fixes the problem.
If not, please tell me the exact sizes of the partition being used
(e.g. cat /proc/partitions) and I will try harder to reproduce it.

Thanks,
NeilBrown



diff --git a/super1.c b/super1.c
index 2b096d3..9eec460 100644
--- a/super1.c
+++ b/super1.c
@@ -903,7 +903,7 @@ static int write_init_super1(struct supertype *st, void 
*sbv,
 * for a bitmap.
 */
array_size = __le64_to_cpu(sb->size);
-   /* work out how much space we left of a bitmap */
+   /* work out how much space we left for a bitmap */
bm_space = choose_bm_space(array_size);
 
switch(st->minor_version) {
@@ -913,6 +913,8 @@ static int write_init_super1(struct supertype *st, void 
*sbv,
sb_offset &= ~(4*2-1);
sb->super_offset = __cpu_to_le64(sb_offset);
sb->data_offset = __cpu_to_le64(0);
+   if (sb_offset - bm_space < array_size)
+   bm_space = sb_offset - array_size;
sb->data_size = __cpu_to_le64(sb_offset - bm_space);
break;
case 1:
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is that normal a removed part in RAID0 still showed as "active sync"

2008-01-07 Thread Neil Brown
On Monday January 7, [EMAIL PROTECTED] wrote:
> 
> The /dev/md0 is set as RAID0
> "cat /proc/mdstat" shows
> md0 : active raid0 sda1[0] sdd1[3] sdc1[2] sdb1[1]
> 157307904 blocks 64k chunks
> 
> Then sdd is removed.
> 
> But  "cat /proc/mdsta" still shows the same information as above, while two
> RAID5 devices show their sdd parts as (F)
> md0 : active raid0 sda1[0] sdd1[3] sdc1[2] sdb1[1]
> 157307904 blocks 64k chunks
> 
> Is this normal?

Yes.

raid0 is not real raid.  It is not able to cope with disk failures, so
it doesn't even try.  Devices in a raid0 are never marked failed as
doing so would be of no benefit.

NeilBrown


-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Why "mdadm --monitor --program" sometimes only gives 2 command-line arguments to the program?

2008-01-07 Thread Neil Brown
On Saturday January 5, [EMAIL PROTECTED] wrote:
> 
> Hi all,
> 
> I need to monitor my RAID and if it fails, I'd like to call my-script to
> deal with the failure.
> 
> I did: 
> mdadm --monitor --program my-script --delay 60 /dev/md1
> 
> And then, I simulate a failure with
> mdadm --manage --set-faulty /dev/md1 /dev/sda2
> mdadm /dev/md1 --remove /dev/sda2
> 
> I hope the mdadm monitor function can pass all three command-line
> arguments to my-script, including the name of the event, the name of the
> md device and the name of a related device if relevant.
> 
> But my-script doesn't get the third one, which should be /dev/sda2. Is
> this not "relevant"?
> 
> If I really need to know it's /dev/sda2 that fails, what can I do?

What version of mdadm are you using?
I'm guessing 2.6, 2.6.1, or 2.6.2.
There was a bug introduced in 2.6 that was fixed in 2.6.3 that would
have this effect.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid 1, can't get the second disk added back in.

2008-01-07 Thread Neil Brown
On Monday January 7, [EMAIL PROTECTED] wrote:
> Problem is not raid, or at least not obviously raid related.  The 
> problem is that the whole disk, /dev/hdb is unavailable. 

Maybe check /sys/block/hdb/holders ?  lsof /dev/hdb ?

good luck :-)

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid 1, can't get the second disk added back in.

2008-01-06 Thread Neil Brown
On Saturday January 5, [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED]:~# mdadm /dev/md0 --add /dev/hdb5
> mdadm: Cannot open /dev/hdb5: Device or resource busy
> 
> All the solutions I've been able to google fail with the busy.  There is 
> nothing that I can find that might be  using /dev/hdb5 except the raid 
> device and it appears it's not either.

Very odd. But something must be using it.

What does
   ls -l /sys/block/hdb/hdb5/holders
show?
What about
   cat /proc/mounts
   cat /proc/swaps
   lsof /dev/hdb5
  
??
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid 1, can't get the second disk added back in.

2008-01-06 Thread Neil Brown
On Saturday January 5, [EMAIL PROTECTED] wrote:
> 
> Since /dev/hdb5 has been part of this array before you should use  
> --re-add instead of --add.
> Kind regards,
> Alex.

That is not correct.

--re-add is only needed for arrays without metadata, for which you use
"--build" to start them.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] md: Fix data corruption when a degraded raid5 array is reshaped.

2008-01-03 Thread Neil Brown
On Thursday January 3, [EMAIL PROTECTED] wrote:
> 
> On closer look the safer test is:
> 
>   !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending).
> 
> The 'req_compute' field only indicates that a 'compute_block' operation
> was requested during this pass through handle_stripe so that we can
> issue a linked chain of asynchronous operations.
> 
> ---
> 
> From: Neil Brown <[EMAIL PROTECTED]>

Technically that should probably be
  From: Dan Williams <[EMAIL PROTECTED]>

now, and then I add
  Acked-by: NeilBrown <[EMAIL PROTECTED]>

because I completely agree with your improvement.

We should keep an eye out for when Andrew commits this and make sure
the right patch goes in...

Thanks,
NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: RAID5 reshape data corruption

2008-01-03 Thread Neil Brown
On Monday December 31, [EMAIL PROTECTED] wrote:
> Ok, since my previous thread didn't seem to attract much attention,
> let me try again.

Thank you for your report and your patience.

> An interrupted RAID5 reshape will cause the md device in question to
> contain one corrupt chunk per stripe if resumed in the wrong manner.
> A testcase can be found at http://www.nagilum.de/md/ .
> The first testcase can be initialized with "start.sh" the real test
> can then be run with "test.sh". The first testcase also uses dm-crypt
> and xfs to show the corruption.

It looks like this can be fixed with the patch:

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2008-01-04 09:20:54.0 +1100
+++ ./drivers/md/raid5.c2008-01-04 09:21:05.0 +1100
@@ -2865,7 +2865,7 @@ static void handle_stripe5(struct stripe
md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
}
 
-   if (s.expanding && s.locked == 0)
+   if (s.expanding && s.locked == 0 && s.req_compute == 0)
handle_stripe_expansion(conf, sh, NULL);
 
if (sh->ops.count)


With this patch in place, the v2 test only reports errors after the end
of the original array, as you would expect (the new space is
initialised to 0).

> I'm not just interested in a simple behaviour fix I'm also interested
> in what actually happens and if possible a repair program for that
> kind of data corruption.

What happens is that when reshape happens while a device is missing,
the data on that device should be computed from the other data devices
and parity.  However because of the above bug, the data is copied into
the new layout before the compute is complete.  This means that the
data that was on that device is really lost beyond recovery.

I'm really sorry about that, but there is nothing that can be done to
recover the lost data.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stopped array, but /sys/block/mdN still exists.

2008-01-03 Thread Neil Brown
On Thursday January 3, [EMAIL PROTECTED] wrote:
> 
> So what happens if I try to _use_ that /sys entry? For instance run a 
> script which reads data, or sets the stripe_cache_size higher, or 
> whatever? Do I get back status, ignored, or system issues?

Try it:-)

The stripe_cache_size attributes will disappear (it is easy to remove
attributes, and stripe_cache_size is only meaningful for certain raid
levels).
Other attributes will return 0 or some equivalent, though I think
chunk_size will have the old value.
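
i.e. something like this - md2 as in your case, exact results will vary
with kernel version:

   cat /sys/block/md2/md/chunk_size          # may still show the old value
   cat /sys/block/md2/md/stripe_cache_size   # fails, the attribute is gone after --stop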

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stopped array, but /sys/block/mdN still exists.

2008-01-02 Thread Neil Brown
On Wednesday January 2, [EMAIL PROTECTED] wrote:
> This isn't a high priority issue or anything, but I'm curious:
> 
> I --stop(ped) an array but /sys/block/md2 remained largely populated.
> Is that intentional?

It is expected.
Because of the way that md devices are created (just open the
device-special file), it is very hard to make them disappear in a
race-free manner.  I tried once and failed.  It is probably getting
close to trying again, but as you say: it isn't a high priority issue.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Last ditch plea on remote double raid5 disk failure

2007-12-31 Thread Neil Brown
On Monday December 31, [EMAIL PROTECTED] wrote:
> 
> I'm hoping that if I can get raid5 to continue despite the errors, I
> can bring back up enough of the server to continue, a bit like the
> remount-ro option in ext2/ext3.
> 
> If not, oh well...

Sorry, but it is "oh well".

I could probably make it behave a bit better in this situation, but
not in time for you.


NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm --stop goes off and never comes back?

2007-12-22 Thread Neil Brown
On Wednesday December 19, [EMAIL PROTECTED] wrote:
> On 12/19/07, Jon Nelson <[EMAIL PROTECTED]> wrote:
> > On 12/19/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> > > On Tuesday December 18, [EMAIL PROTECTED] wrote:
> > > >
> > > > I tried to stop the array:
> > > >
> > > > mdadm --stop /dev/md2
> > > >
> > > > and mdadm never came back. It's off in the kernel somewhere. :-(

Looking at your stack traces, you have the "mdadm -S" holding
an md lock and trying to get a sysfs lock as part of tearing down the
array, and 'hald' is trying to read some attribute in
   /sys/block/md
and is holding the sysfs lock and trying to get the md lock.
A classic AB-BA deadlock.

> 
> NOTE: kernel is stock openSUSE 10.3 kernel, x86_64, 2.6.22.13-0.3-default.
> 

It is fixed in mainline with some substantial changes to sysfs.
I don't imagine they are likely to get back ported to openSUSE, but
you could try logging a bugzilla if you like.

The 'hald' process is interruptible and killing it would release the
deadlock.
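
e.g. (purely illustrative - check which process is actually stuck first):

   ps -eo pid,stat,wchan:30,comm | grep hald
   kill <pid-of-hald>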

I suspect you have to be fairly unlucky to lose the race but it is
obviously quite possible.

I don't think there is anything I can do on the md side to avoid the
bug.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm --stop goes off and never comes back?

2007-12-19 Thread Neil Brown
On Tuesday December 18, [EMAIL PROTECTED] wrote:
> This just happened to me.
> Create raid with:
> 
> mdadm --create /dev/md2 --level=raid10 --raid-devices=3
> --spare-devices=0 --layout=o2 /dev/sdb3 /dev/sdc3 /dev/sdd3
> 
> cat /proc/mdstat
> 
> md2 : active raid10 sdd3[2] sdc3[1] sdb3[0]
>   5855424 blocks 64K chunks 2 offset-copies [3/3] [UUU]
>   [==>..]  resync = 14.6% (859968/5855424)
> finish=1.3min speed=61426K/sec
> 
> Some log messages:
> 
> Dec 18 15:02:28 turnip kernel: md: md2: raid array is not clean --
> starting background reconstruction
> Dec 18 15:02:28 turnip kernel: raid10: raid set md2 active with 3 out
> of 3 devices
> Dec 18 15:02:28 turnip kernel: md: resync of RAID array md2
> Dec 18 15:02:28 turnip kernel: md: minimum _guaranteed_  speed: 1000
> KB/sec/disk.
> Dec 18 15:02:28 turnip kernel: md: using maximum available idle IO
> bandwidth (but not more than 20 KB/sec) for resync.
> Dec 18 15:02:28 turnip kernel: md: using 128k window, over a total of
> 5855424 blocks.
> Dec 18 15:03:36 turnip kernel: md: md2: resync done.
> Dec 18 15:03:36 turnip kernel: md: checkpointing resync of md2.
> 
> I tried to stop the array:
> 
> mdadm --stop /dev/md2
> 
> and mdadm never came back. It's off in the kernel somewhere. :-(
> 
> kill, of course, has no effect.
> The machine still runs fine, the rest of the raids (md0 and md1) work
> fine (same disks).
> 
> The output (snipped, only mdadm) of 'echo t > /proc/sysrq-trigger'
> 
> Dec 18 15:09:13 turnip kernel: mdadm S 0001e5359fa38fb0 0
> 3943  1 (NOTLB)
> Dec 18 15:09:13 turnip kernel:  810033e7ddc8 0086
>  0092
> Dec 18 15:09:13 turnip kernel:  0fc7 810033e7dd78
> 80617800 80617800
> Dec 18 15:09:13 turnip kernel:  8061d210 80617800
> 80617800 
> Dec 18 15:09:13 turnip kernel: Call Trace:
> Dec 18 15:09:13 turnip kernel:  []
> __mutex_lock_interruptible_slowpath+0x8b/0xca
> Dec 18 15:09:13 turnip kernel:  [] do_open+0x222/0x2a5
> Dec 18 15:09:13 turnip kernel:  [] md_seq_show+0x127/0x6c1
> Dec 18 15:09:13 turnip kernel:  [] vma_merge+0x141/0x1ee
> Dec 18 15:09:13 turnip kernel:  [] seq_read+0x1bf/0x28b
> Dec 18 15:09:13 turnip kernel:  [] vfs_read+0xcb/0x153
> Dec 18 15:09:13 turnip kernel:  [] sys_read+0x45/0x6e
> Dec 18 15:09:13 turnip kernel:  [] system_call+0x7e/0x83
> 
> 
> 
> What happened? Is there any debug info I can provide before I reboot?

Don't know very odd.

The rest of the 'sysrq' output would possibly help.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 resizing

2007-12-19 Thread Neil Brown
On Wednesday December 19, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I'm thinking of slowly replacing disks in my raid5 array with bigger
> disks and then resize the array to fill up the new disks. Is this
> possible? Basically I would like to go from:
> 
> 3 x 500gig RAID5 to 3 x 1tb RAID5, thereby going from 1tb to 2tb of
> storage.
> 
> It seems like it should be, but... :)

Yes.

mdadm --grow /dev/mdX --size=max
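
The disk-replacement step before that is just the usual add/fail/remove
cycle, e.g. (device names illustrative; this assumes the old and new
disk can be attached at the same time):

   mdadm /dev/mdX --add /dev/sdd1                      # new, bigger disk comes in as a spare
   mdadm /dev/mdX --fail /dev/sda1 --remove /dev/sda1  # old disk drops out, rebuild starts
   # wait for the rebuild to finish, repeat for each disk, then run the --grow above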

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Raid over 48 disks

2007-12-18 Thread Neil Brown
On Tuesday December 18, [EMAIL PROTECTED] wrote:
> We're investigating the possibility of running Linux (RHEL) on top of  
> Sun's X4500 Thumper box:
> 
> http://www.sun.com/servers/x64/x4500/
> 
> Basically, it's a server with 48 SATA hard drives. No hardware RAID.  
> It's designed for Sun's ZFS filesystem.
> 
> So... we're curious how Linux will handle such a beast. Has anyone run  
> MD software RAID over so many disks? Then piled LVM/ext3 on top of  
> that? Any suggestions?
> 
> Are we crazy to think this is even possible?

Certainly possible.
The default metadata is limited to 28 devices, but with
--metadata=1

you can easily use all 48 drives or more in the one array.  I'm not
sure if you would want to though.

If you just wanted an enormous scratch space and were happy to lose
all your data on a drive failure, then you could make a raid0 across
all the drives which should work perfectly and give you lots of
space.  But that probably isn't what you want.

I wouldn't create a raid5 or raid6 on all 48 devices.
RAID5 only survives a single device failure and with that many
devices, the chance of a second failure before you recover becomes
appreciable.

RAID6 would be much more reliable, but probably much slower.  RAID6
always needs to read or write every block in a stripe (i.e. it always
uses reconstruct-write to generate the P and Q blocks; it never does
a read-modify-write like raid5 does).  This means that every write
touches every device so you have less possibility for parallelism
among your many drives.
It might be instructive to try it out though.

RAID10 would be a good option if you are happy with 24 drives' worth of
space.  I would probably choose a largish chunk size (256K) and use
the 'offset' layout.

Alternately, eight 6-drive RAID5s or six 8-drive RAID6s, and use RAID0 to
combine them together.  This would give you adequate reliability and
performance and still a large amount of storage space.
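
A sketch of that layered variant, with purely illustrative device names:

   # six 8-drive RAID6 sets...
   mdadm --create /dev/md1 --level=6 --raid-devices=8 /dev/sd[a-h]1
   # ...likewise md2..md6 from the remaining disks, then stripe them together:
   mdadm --create /dev/md10 --level=0 --raid-devices=6 /dev/md[1-6]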

Have fun!!!

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cannot re-assemble Degraded RAID6 after crash

2007-12-17 Thread Neil Brown
On Monday December 17, [EMAIL PROTECTED] wrote:
> My system has crashed a couple of times, each time the two drives have
> dropped off of the RAID.
> 
> Previously I simply did the following, which would take all night:
> 
> mdadm -a --re-add /dev/md2 /dev/sde3
> mdadm -a --re-add /dev/md2 /dev/sdf3
> mdadm -a --re-add /dev/md3 /dev/sde5
> mdadm -a --re-add /dev/md3 /dev/sde5
> 
> When I woke up in the morning, everything was happy...until it crashed
> again yesterday. This time, I get a message: "/dev/md3 assembled from
> 4 drives - not enough to start the array while not clean - consider
> --force."
> 
> I can re-assemble /dev/md3 (sda5, sdb5, sdc5, sdd5, sde5 and sdf5) if
> I use -f, although all the other sets seem fine. I cannot "--re-add"
> the other partitions. 

What happens when you try to re-add those devices?
How about just "--add"?  --re-add is only needed for arrays without
metadata, in your case it should behave the same as --add.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 007 of 7] md: Get name for block device in sysfs

2007-12-16 Thread Neil Brown
On Saturday December 15, [EMAIL PROTECTED] wrote:
> On Dec 14, 2007 7:26 AM, NeilBrown <[EMAIL PROTECTED]> wrote:
> >
> > Given an fd on a block device, returns a string like
> >
> > /block/sda/sda1
> >
> > which can be used to find related information in /sys.

> 
> As pointed out to when you came up with the idea, we can't do this. A devpath
> is a path to the device and will not necessarily start with "/block" for block
> devices. It may start with "/devices" and can be much longer than
> BDEVNAME_SIZE*2  + 10.


When you say "will not necessarily" can I take that to mean that it
currently does, but it might (will) change??
In that case can we have the patch as it stands and when the path to
block devices in /sys changes, the ioctl can be changed at the same
time to match?

Or are you saying that as the kernel is today, some block devices
appear under /devices/..., in which case could you please give an
example?

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Please Help!!! Raid 5 reshape failed!

2007-12-16 Thread Neil Brown
On Friday December 14, [EMAIL PROTECTED] wrote:
> 
> gentoofs ~#mdadm --assemble /dev/md1 /dev/sdc /dev/sdd /dev/sdf
> mdadm: /dev/md1 assembled from 2 drives - not enough to start the array

Try adding "--run".  or maybe "--force".

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm break / restore soft mirror

2007-12-14 Thread Neil Brown
On Thursday December 13, [EMAIL PROTECTED] wrote:
> > What you could do is set the number of devices in the array to 3 so
> > they it always appears to be degraded, then rotate your backup drives
> > through the array.  The number of dirty bits in the bitmap will
> > steadily grow and so resyncs will take longer.  Once it crosses some
> > threshold you set the array back to having 2 devices to that it looks
> > non-degraded and clean the bitmap.  Then each device will need a full
> > resync after which you will get away with partial resyncs for a while.
> 
> I don't undertand why clearing the bitmap causes a rebuild of
> all devices. I think I have a conceptual misunderstanding.  Consider
> a RAID-1 and three physical disks involved, A,B,C
> 
> 1) A and B are in the RAID, everything is synced
> 2) Create a bitmap on the array
> 3) Fail + remove B
> 4) Hot add C, wait for C to sync
> 5) Fail + remove C
> 6) Hot add B, wait for B to resync
> 7) Goto step 3
> 
> I understand that after a while we might want to clean the bitmap
> and that would trigger a full resync for drives B and C. I don't
> understand why it would ever cause a resync for drive A.

You are exactly correct.  That is what I meant, though I probably
didn't express it very clearly.

After you clean out the bitmap, any devices that are not in the array
at that time will need a full resync to come back in to the array.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: mdadm break / restore soft mirror

2007-12-13 Thread Neil Brown
On Thursday December 13, [EMAIL PROTECTED] wrote:

> How do I create the internal bitmap?  man mdadm didn't shed any
> light and my brief excursion into google wasn't much more helpful. 

  mdadm --grow --bitmap=internal /dev/mdX

> 
> The version I have installed is mdadm-1.12.0-5.i386 from RedHat
> which would appear to be way out of date! 

WAY!  mdadm 2.0 would be an absolute minimum, and linux 2.6.13 the
absolute minimum kernel; probably something closer to 2.6.20 would be a good
idea.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Auto assembly errors with mdadm and 64K aligned partitions.

2007-12-13 Thread Neil Brown
On Thursday December 13, [EMAIL PROTECTED] wrote:
> Good morning to Neil and everyone on the list, hope your respective
> days are going well.
> 
> Quick overview.  We've isolated what appears to be a failure mode with
> mdadm assembling RAID1 (and presumably other level) volumes which
> kernel based RAID autostart is able to do correctly.
> 
> We picked up on the problem with OES based systems with SAN attached
> volumes.  I am able to reproduce the problem under 2.6.23.9 UML with
> version 2.6.4 of mdadm.
> 
> The problem occurs when partitions are aligned on a 64K boundary.  Any
> 64K boundary seems to work, ie 128, 256 and 512 sector offsets.
> 
> Block devices look like the following:
> 
> ---
> cat /proc/partitions:
> 
> major minor  #blocks  name
> 
>   98 0 262144 ubda
>   9816  10240 ubdb
>   9817  10176 ubdb1
>   9832  10240 ubdc
>   9833  10176 ubdc1
> ---
> 
> 
> A RAID1 device was created and started consisting of the /dev/ubdb1
> and /dev/ubdc1 partitions.  An /etc/mdadm.conf file was generated
> which contains the following:
> 
> ---
> DEVICE partitions
> ARRAY /dev/md0 level=raid1 num-devices=2 
> UUID=e604c49e:d3a948fd:13d9bc11:dbc82862
> ---
> 
> 
> The RAID1 device was shutdown.  The following assembly command yielded:
> 
> ---
> mdadm -As
> 
> mdadm: WARNING /dev/ubdc1 and /dev/ubdc appear to have very similar 
> superblocks.  If they are really different, please --zero the superblock 
> on one
>   If they are the same or overlap, please remove one from the
>   DEVICE list in mdadm.conf.
> ---

Yes.  This is one of the problems with v0.90 metadata, and with
"DEVICE partitions".

As the partitions start on a 64K alignment, and the metadata is 64K
aligned, the metadata appears to be right for both the whole device
and for the last partition on the device, and mdadm cannot tell the
difference.

With v1.x metadata, we store the superblock offset which allows us to
tell if we have mis-identified a superblock that was meant to be part
of a partition or of the whole device.

If you make your "DEVICE" line a little more restrictive. e.g.

 DEVICE /dev/ubc?1

then it will also work.

Or just don't use partitions.  Make the array from /dev/ubdb and
/dev/ubdc.


NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm break / restore soft mirror

2007-12-12 Thread Neil Brown
On Wednesday December 12, [EMAIL PROTECTED] wrote:
> 
> >If you can be certain that the device that you break out of the mirror
> >is never altered, then you could add an internal bitmap while the
> >array is split and the rebuild will go much faster.
> 
> Is this also a viable speedup for the "kep rotating backup drives through
> the array" strategy? If so, how much speedup are we talking about? Assume
> the array changes by 1% before a backup drive gets rotated in again.
> 

Not really...

The bitmap only records areas of the array that have changed since one
particular moment in time.  For rotating backs you would really want
several moments in time.
Whenever the array is non-degraded, the bitmap forgets any old state.

What you could do is set the number of devices in the array to 3 so
that it always appears to be degraded, then rotate your backup drives
through the array.  The number of dirty bits in the bitmap will
steadily grow and so resyncs will take longer.  Once it crosses some
threshold you set the array back to having 2 devices so that it looks
non-degraded and clean the bitmap.  Then each device will need a full
resync after which you will get away with partial resyncs for a while.

Not ideal, but it might work.

If 1% changes each time, then you will initially get a 100 fold
speedup, dropping away after that.
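
As a rough sketch only (array and device names are placeholders):

   mdadm --grow /dev/mdX --raid-devices=3       # array now always looks degraded
   mdadm /dev/mdX --add /dev/sdY1               # first time: full sync of the backup drive
   mdadm /dev/mdX --fail /dev/sdY1 --remove /dev/sdY1
   mdadm /dev/mdX --re-add /dev/sdY1            # later rotations: only dirty bits resync
   # ...and eventually, to clean the bitmap:
   mdadm --grow /dev/mdX --raid-devices=2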

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mdadm break / restore soft mirror

2007-12-12 Thread Neil Brown
On Wednesday December 12, [EMAIL PROTECTED] wrote:
> Hi, 
> 
>   Question for you guys. 
> 
>   A brief history: 
>   RHEL 4 AS 
>   I have a partition with way to many small files on (Usually around a couple 
> of million) that needs to be backed up, standard
> 
>   methods mean that a restore is impossibly slow due to the sheer volume of 
> files. 
>   Solution, raw backup /restore of the device.  However the partition is 
> permanently being accessed. 
> 
>   Proposed solution is to use software raid mirror.  Before backup starts, 
> break the soft mirror unmount and backup partition
> 
>   restore soft mirror and let it resync / rebuild itself. 
> 
>   Would the above intentional break/fix of the mirror cause any problems? 

No, it should work fine.

If you can be certain that the device that you break out of the mirror
is never altered, then you could add an internal bitmap while the
array is split and the rebuild will go much faster.
However even mounting a device readonly will sometimes alter the
content (e.g. if ext3 needs to replay the journal) so you need to be
very careful.
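
A minimal sketch of that break/backup/restore cycle (device names are
placeholders):

   mdadm --grow /dev/mdX --bitmap=internal              # once, while the mirror is whole
   mdadm /dev/mdX --fail /dev/sdY1 --remove /dev/sdY1   # break the mirror
   # ...back up from /dev/sdY1, strictly read-only...
   mdadm /dev/mdX --re-add /dev/sdY1                    # only changed blocks resync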

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID mapper device size wrong after replacing drives

2007-12-06 Thread Neil Brown

I think you would have more luck posting this to
[EMAIL PROTECTED] - I think that is where support for device mapper
happens.

NeilBrown


On Thursday December 6, [EMAIL PROTECTED] wrote:
> 
> Hi,
> 
> I have a problem with my RAID array under Linux after upgrading to larger
> drives. I have a machine with Windows and Linux dual-boot which had a pair
> of 160GB drives in a RAID-1 mirror with 3 partitions: partiton 1 = Windows
> boot partition (FAT32), partiton 2 = Linux /boot (ext3), partiton 3 =
> Windows system (NTFS). The Linux /root is on a separate physical drive. The
> dual boot is via Grub installed on the /boot partiton, and this was all
> working fine.
> 
> But I just upgraded the drives in the RAID pair, replacing them with 500GB
> drives. I did this by replacing one of the 160s with a new 500 and letting
> the RAID copy the drive, splitting the drives out of the RAID array and
> increasing the size of the last partition of the 500 (which I did under
> Windows since its the Windows partiton) then replacing the last 160 with the
> other 500 and having the RAID controller create a new array with the two
> 500s, copying the drive that I'd copied from the 160. This worked great for
> Windows, and that now boots and sees a 500GB RAID drive with all the data
> intact.
> 
> However, Linux has a problem and will not now boot all the way. It reports
> that the RAID /dev/mapper volume failed - the partition is beyond the
> boundaries of the disk. Running fdisk shows that it is seeing the larger
> partiton, but still sees the size of the RAID /dev/mapper drive as 160GB.
> Here is the fdisk output for one of the physical drives and for the RAID
> mapper drive:
> 
> Disk /dev/sda: 500.1 GB, 500107862016 bytes
> 255 heads, 63 sectors/track, 60801 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> 
>Device Boot  Start End  Blocks   Id  System
> /dev/sda1   1 625 5018624b  W95 FAT32
> Partition 1 does not end on cylinder boundary.
> /dev/sda2 626 637   96390   83  Linux
> /dev/sda3   * 638   60802   4832645127  HPFS/NTFS
> 
> 
> Disk /dev/mapper/isw_bcifcijdi_Raid-0: 163.9 GB, 163925983232 bytes
> 255 heads, 63 sectors/track, 19929 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> 
> Device Boot  Start End  Blocks  
> Id  System
> /dev/mapper/isw_bcifcijdi_Raid-0p1   1 625 5018624   
> b  W95 FAT32
> Partition 1 does not end on cylinder boundary.
> /dev/mapper/isw_bcifcijdi_Raid-0p2 626 637   96390  
> 83  Linux
> /dev/mapper/isw_bcifcijdi_Raid-0p3   * 638   60802   483264512   
> 7  HPFS/NTFS
> 
> 
> They differ only in the drive capacity and number of cylinders.
> 
> I started to try to run a Linux reinstall, but it reports that the partiion
> table on the mapper drive is invalid, giving an option to re-initialize it
> but saying that doing so will lose all the data on the drive.
> 
> So questions:
> 
> 1. Where is the drive size information for the RAID mapper drive kept, and
> is there some way to patch it?
> 
> 2. Is there some way to re-initialize the RAID mapper drive without
> destroying the data on the drive?
> 
> Thanks,
> Ian
> -- 
> View this message in context: 
> http://www.nabble.com/RAID-mapper-device-size-wrong-after-replacing-drives-tf4958354.html#a14200241
> Sent from the linux-raid mailing list archive at Nabble.com.
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] (2nd try) force parallel resync

2007-12-06 Thread Neil Brown
On Thursday December 6, [EMAIL PROTECTED] wrote:
> Hello,
> 
> here is the second version of the patch. With this version also on  
> setting /sys/block/*/md/sync_force_parallel the sync_thread is woken up. 
> Though, I still don't understand why md_wakeup_thread() is not working.

Could give a little more detail on why you want this?  When do you
want multiple arrays on the same device to sync at the same time?
What exactly is the hardware like?

md threads generally run for a little while to perform some task, then
stop and wait to be needed again.  md_wakeup_thread says "you are
needed again".

The resync/recovery thread is a bit different.  It just runs md_do_sync
once.  md_wakeup_thread is not really meaningful in that context.

What you want is:
wake_up(&resync_wait);

that will get any thread that is waiting for some other array to
resync to wake up and see if something needs to be done.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: problem with software raid1 on 2.6.22.10: check/rebuild hangs

2007-12-02 Thread Neil Brown
On Monday December 3, [EMAIL PROTECTED] wrote:
> Hello,
> 
> with kernel 2.6.22.10 checking a raid1 or rebuilding it does not work on one 
> of our machines. After a short time the rebuild/check does not make progress 
> any more. Processes which then access the filesystems on those raids are 
> blocked.
> 
> Nothing gets logged. Access to other filesystems works fine.
> 
> If we boot 2.6.17.10 (the kernel we used before upgrading to 2.6.22), the 
> check/rebuild of the raids is done without any problems.
> 

Sounds like a driver problem.
Your symptoms are completely consistent with a request being submitted
to the underlying device, and that request never completing.

What controller runs your drives for you?  You should probably report
the problem to the relevant maintainer.

Do you compile your own kernels?  Would you be comfortable using "git
bisect" to narrow down exactly which change breaks things?  It should
not take more than a dozen or so tests.
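
Roughly (using the mainline tags closest to the kernels you mentioned):

   git bisect start
   git bisect bad  v2.6.22
   git bisect good v2.6.17
   # build and boot the kernel git checks out, try a check/rebuild, then
   git bisect good      # or 'git bisect bad', and repeat until one commit is named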

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reading takes 100% precedence over writes for mdadm+raid5?

2007-12-02 Thread Neil Brown
On Sunday December 2, [EMAIL PROTECTED] wrote:
> 
> Was curious: when running 10 DD's (which are writing to the RAID 5)
> fine, with no issues, why do they all suddenly go into D-state and give
> a read 100% priority?

So are you saying that the writes completely stalled while the read
was progressing?  How exactly did you measure that?

What kernel version are you running?

> 
> Is this normal?

It shouldn't be.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Spontaneous rebuild

2007-12-02 Thread Neil Brown
On Sunday December 2, [EMAIL PROTECTED] wrote:
> 
> Anyway, the problems are back: To test my theory that everything is
> alright with the CPU running within its specs, I removed one of the
> drives while copying some large files yesterday. Initially, everything
> seemed to work out nicely, and by the morning, the rebuild had finished.
> Again, I unmounted the filesystem and ran badblocks -svn on the LVM. It
> ran without gripes for some hours, but just now I saw md had started to
> rebuild the array again out of the blue:
> 
> Dec  1 20:04:49 quassel kernel: usb 4-5.2: reset high speed USB device
> using ehci_hcd and address 4
> Dec  2 01:06:02 quassel kernel: md: data-check of RAID array md0
  ^^
> Dec  2 01:06:02 quassel kernel: md: minimum _guaranteed_  speed: 1000
> KB/sec/disk.
> Dec  2 01:06:02 quassel kernel: md: using maximum available idle IO
> bandwidth (but not more than 20 KB/sec) for data-check.
  ^^
> Dec  2 01:06:02 quassel kernel: md: using 128k window, over a total of
> 488383936 blocks.
> Dec  2 03:57:24 quassel kernel: usb 4-5.2: reset high speed USB device
> using ehci_hcd and address 4
> 

This isn't a resync, it is a data check.  "Dec  2" is the first Sunday
of the month.  You probably have a crontab entry that does
   echo check > /sys/block/mdX/md/sync_action

early on the first Sunday of the month.  I know that Debian does this.

It is good to do this occasionally to catch sleeping bad blocks.
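
A cron entry for that is typically something like the following sketch
(timing and array name are only illustrative, not the actual Debian
script):

   # /etc/cron.d/mdcheck: first Sunday of each month, 1am
   0 1 * * 0 root [ $(date +\%d) -le 7 ] && echo check > /sys/block/md0/md/sync_action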

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: assemble vs create an array.......

2007-11-29 Thread Neil Brown
On Thursday November 29, [EMAIL PROTECTED] wrote:
> Hello,
> I had created a raid 5 array on 3 232GB SATA drives. I had created one 
> partition (for /home) formatted with either xfs or reiserfs (I do not 
> recall).
> Last week I reinstalled my box from scratch with Ubuntu 7.10, with mdadm 
> v. 2.6.2-1ubuntu2.
> Then I made a rookie mistake: I --create instead of --assemble. The 
> recovery completed. I then stopped the array, realizing the mistake.
> 
> 1. Please make the warning more descriptive: ALL DATA WILL BE LOST, when 
> attempting to created an array over an existing one.

No matter how loud the warning is, people will get it wrong... unless
I make it actually impossible to corrupt data (which may not be
possible) in which case it will inconvenience many more people.

> 2. Do you know of any way to recover from this mistake? Or at least what 
> filesystem it was formated with.

If you created the same array with the same devices and layout etc,
the data will still be there, untouched.
Try to assemble the array and use "fsck" on it.

When you create a RAID5 array, all that is changed is the metadata (at
the end of the device) and one drive is changed to be the xor of all
the others.
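
i.e. something along these lines (the device names are placeholders for
your three SATA drives):

   mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
   file -s /dev/md0       # should report which filesystem is on it
   fsck /dev/md0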

> 
> Any help would be greatly appreciated. I have hundreds of family digital 
> pictures and videos that are irreplaceable.

You have probably heard it before, but RAID is no replacement for
backups. 
My photos are on two separate computers, one with RAID.  And I will
be backing them up to DVD any day now ... really!!  Or maybe next
year, if I remember :-)

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid6 check/repair

2007-11-28 Thread Neil Brown
On Tuesday November 27, [EMAIL PROTECTED] wrote:
> Thiemo Nagel wrote:
> > Dear Neil,
> >
> > thank you very much for your detailed answer.
> >
> > Neil Brown wrote:
> >> While it is possible to use the RAID6 P+Q information to deduce which
> >> data block is wrong if it is known that either 0 or 1 datablocks is 
> >> wrong, it is *not* possible to deduce which block or blocks are wrong
> >> if it is possible that more than 1 data block is wrong.
> >
> > If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
> > it *is* possible, to distinguish three cases:
> > a) exactly zero bad blocks
> > b) exactly one bad block
> > c) more than one bad block
> >
> > Of course, it is only possible to recover from b), but one *can* tell,
> > whether the situation is a) or b) or c) and act accordingly.
> I was waiting for a response before saying "me too," but that's exactly 
> the case, there is a class of failures other than power failure or total 
> device failure which result in just the "one identifiable bad sector" 
> result. Given that the data needs to be read to realize that it is bad, 
> why not go the extra inch and fix it properly instead of redoing the p+q 
> which just makes the problem invisible rather than fixing it.
> 
> Obviously this is a subset of all the things which can go wrong, but I 
> suspect it's a sizable subset.

Why do you think that it is a sizable subset?  Disk drives have internal
checksums which are designed to prevent corrupted data being returned.

If the data is getting corrupted on some bus between the CPU and the
media, then I suspect that your problem is big enough that RAID cannot
meaningfully solve it, and "New hardware plus possibly restore from
backup" would be the only credible option.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid6 check/repair

2007-11-28 Thread Neil Brown
On Thursday November 22, [EMAIL PROTECTED] wrote:
> Dear Neil,
> 
> thank you very much for your detailed answer.
> 
> Neil Brown wrote:
> > While it is possible to use the RAID6 P+Q information to deduce which
> > data block is wrong if it is known that either 0 or 1 datablocks is 
> > wrong, it is *not* possible to deduce which block or blocks are wrong
> > if it is possible that more than 1 data block is wrong.
> 
> If I'm not mistaken, this is only partly correct.  Using P+Q redundancy,
> it *is* possible, to distinguish three cases:
> a) exactly zero bad blocks
> b) exactly one bad block
> c) more than one bad block
> 
> Of course, it is only possible to recover from b), but one *can* tell,
> whether the situation is a) or b) or c) and act accordingly.

It would seem that either you or Peter Anvin is mistaken.

On page 9 of 
  http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
at the end of section 4 it says:

  Finally, as a word of caution it should be noted that RAID-6 by
  itself cannot even detect, never mind recover from, dual-disk
  corruption. If two disks are corrupt in the same byte positions,
  the above algorithm will in general introduce additional data
  corruption by corrupting a third drive.

> 
> The point that I'm trying to make is, that there does exist a specific
> case, in which recovery is possible, and that implementing recovery for
> that case will not hurt in any way.

Assuming that is true (maybe hpa got it wrong), what specific
conditions would lead to one drive having corrupt data, and would
correcting it on an occasional 'repair' pass be an appropriate
response?

Does the value justify the cost of extra code complexity?

> 
> > RAID is not designed to protect against bad RAM, bad cables, chipset 
> > bugs, driver bugs, etc.  It is only designed to protect against drive 
> > failure, where the drive failure is apparent.  i.e. a read must 
> > return either the same data that was last written, or a failure 
> > indication. Anything else is beyond the design parameters for RAID.
> 
> I'm taking a more pragmatic approach here.  In my opinion, RAID should
> "just protect my data", against drive failure, yes, of course, but if it
> can help me in case of occasional data corruption, I'd happily take
> that, too, especially if it doesn't cost extra... ;-)

Everything costs extra.  Code uses bytes of memory, requires
maintenance, and possibly introduced new bugs.  I'm not convinced the
failure mode that you are considering actually happens with a
meaningful frequency.

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid5 reshape/resync

2007-11-28 Thread Neil Brown
On Sunday November 25, [EMAIL PROTECTED] wrote:
> - Message from [EMAIL PROTECTED] -
>  Date: Sat, 24 Nov 2007 12:02:09 +0100
>  From: Nagilum <[EMAIL PROTECTED]>
> Reply-To: Nagilum <[EMAIL PROTECTED]>
>   Subject: raid5 reshape/resync
>To: linux-raid@vger.kernel.org
> 
> > Hi,
> > I'm running 2.6.23.8 x86_64 using mdadm v2.6.4.
> > I was adding a disk (/dev/sdf) to an existing raid5 (/dev/sd[a-e] -> md0)
> > During that reshape (at around 4%) /dev/sdd reported read errors and
> > went offline.

Sad.

> > I replaced /dev/sdd with a new drive and tried to reassemble the array
> > (/dev/sdd was shown as removed and now as spare).

There must be a step missing here.
Just because one drive goes offline, that  doesn't mean that you need
to reassemble the array.  It should just continue with the reshape
until that is finished.  Did you shut the machine down or did it crash
or what?

> > Assembly worked but it would not run unless I use --force.

That suggests an unclean shutdown.  Maybe it did crash?


> > Since I'm always reluctant to use force I put the bad disk back in,
> > this time as /dev/sdg . I re-added the drive and could run the array.
> > The array started to resync (since the disk can be read until 4%) and
> > then I marked the disk as failed. Now the array is "active, degraded,
> > recovering":

It should have restarted the reshape from whereever it was up to, so
it should have hit the read error almost immediately.  Do you remember
where it started the reshape from?  If it restarted from the beginning
that would be bad.

Did you just "--assemble" all the drives or did you do something else?

> >
> > What I find somewhat confusing/disturbing is that it does not appear to
> > utilize /dev/sdd. What I see here could be explained by md doing a
> > RAID5 resync from the 4 drives sd[a-c,e] to sd[a-c,e,f] but I would
> > have expected it to use the new spare sdd for that. Also the speed is

md cannot recover to a spare while a reshape is happening.  It
completes the reshape, then does the recovery (as you discovered).

> > unusually low which seems to indicate a lot of seeking as if two
> > operations are happening at the same time.

Well reshape is always slow as it has to read from one part of the
drive and write to another part of the drive.

> > Also when I look at the data rates it looks more like the reshape is
> > continuing even though one drive is missing (possible but risky).

Yes, that is happening.

> > Can someone relief my doubts as to whether md does the right thing here?
> > Thanks,

I believe it is doing "the right thing".

> >
> - End message from [EMAIL PROTECTED] -
> 
> Ok, so the reshape tried to continue without the failed drive and  
> after that resynced to the new spare.

As I would expect.

> Unfortunately the result is a mess. On top of the Raid5 I have  

Hmm.  This I would not expect.

> dm-crypt and LVM.
> Although dmcrypt and LVM don't appear to have a problem, the filesystems  
> on top are a mess now.

Can you be more specific about what sort of "mess" they are in?

NeilBrown


> I still have the failed drive, I can read the superblock from that  
> drive and up to 4% from the beginning and probably backwards from the  
> end towards that point.
> So in theory it could be possible to reorder the stripe blocks which  
> appears to have been messed up.(?)
> Unfortunately I'm not sure what exactly went wrong or what I did  
> wrong. Can someone please give me hint?
> Thanks,
> Alex.
> 
> 
> #_  __  _ __ http://www.nagilum.org/ \n icq://69646724 #
> #   / |/ /__  _(_) /_  _  [EMAIL PROTECTED] \n +491776461165 #
> #  // _ `/ _ `/ / / // /  ' \  Amiga (68k/PPC): AOS/NetBSD/Linux   #
> # /_/|_/\_,_/\_, /_/_/\_,_/_/_/_/   Mac (PPC): MacOS-X / NetBSD /Linux #
> #   /___/ x86: FreeBSD/Linux/Solaris/Win2k  ARM9: EPOC EV6 #
> 
> 
> 
> 
> cakebox.homeunix.net - all the machine one needs..
> 
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Skip bio copy in full-stripe write ops

2007-11-23 Thread Neil Brown
On Friday November 23, [EMAIL PROTECTED] wrote:
> 
>  Hello all,
> 
>  Here is a patch which allows skipping the intermediate data copying
> between the bio requested to write and the disk cache if a full-stripe
> write operation is on the way.
> 
>  This improves the performance of write operations for some dedicated cases
> when big chunks of data are being sequentially written to RAID array, but in
> general eliminating disk cache slows the performance down.

There is a subtlety here that we need to be careful not to miss. 
The stripe cache has an important 'correctness' aspect that you might be
losing.

When a write request is passed to generic_make_request, it is entirely
possible for the data in the buffer to be changing while the write is
being processed.  This can happen particularly with memory mapped
files, but also in other cases.
If we perform the XOR operation against the data in the buffer, and
then later DMA that data out to the storage device, the data could
have changed in the mean time.  The net result will be that the
parity block is wrong.
That is one reason why we currently copy the data before doing the XOR
(though copying at the same time as doing the XOR would be a suitable
alternative).

I can see two possible approaches where it could be safe to XOR out of
the provided buffer.

 1/ If we can be certain that the data in the buffer will not change
until the write completes.  I think this would require the
filesystem to explicitly promise not to change the data, possibly by
setting some flag in the BIO.  The filesystem would then need its
own internal interlock mechanisms to be able to keep the promise,
and we would only be able to convince filesystems to do this if
there were significant performance gains.

 2/ We allow the parity to be wrong for a little while (it happens
anyway) but make sure that:
a/ future writes to the same stripe use reconstruct_write, not
  read_modify_write, as the parity block might be wrong.
b/ We don't mark the array or (with bitmaps) region 'clean' until
  we have good reason to believe that it is.  i.e. somehow we
  would need to check that the last page written to each device
  were still clean when the write completed.

I think '2' is probably too complex.  Part 'a' makes it particularly
difficult to achieve efficiently.

I think that '1' might be possible for some limited cases, and it
could be that those limited cases form 99% for all potential
stripe-wide writes.
e.g. If someone was building a dedicated NAS device and wanted this
performance improvement, they could work with the particular
filesystem that they choose, and ensure that - for the applications
that they use on top of it - the filesystem does not update in-flight
data.


But without the above issues being considered and addressed, we cannot
proceed with this patch.

> 
>  The performance results obtained on the ppc440spe-based board using the
> PPC440SPE ADMA driver, Xdd benchmark, and the RAID-5 of 4 disks are as
> follows:
> 
>  SKIP_BIO_SET = 'N': 40 MBps;
>  SKIP_BIO_SET = 'Y': 70 MBps.


...which is a shame, because that is a very significant performance
increase.  I wonder if that comes from simply avoiding the copy, or
whether there are some scheduling improvements that account for some
of it.  After all, a CPU can copy data around at much more than
30MBps.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: md RAID 10 on Linux 2.6.20?

2007-11-22 Thread Neil Brown
On Thursday November 22, [EMAIL PROTECTED] wrote:
> Hi all,
> 
> I am running a home-grown Linux 2.6.20.11 SMP 64-bit build, and I am 
> wondering if there is indeed a RAID 10 "personality" defined in md that 
> can be implemented using mdadm. If so, is it available in 2.6.20.11, or 
> is it in a later kernel version? In the past, to create RAID 10, I 
> created RAID 1's and a RAID 0, so an 8 drive RAID 10 would actually 
> consist of 5 md devices (four RAID 1's and one RAID 0). But if I could 
> just use RAID 10 natively, and simply create one RAID 10, that would of 
> course be better both in terms of management and probably performance I 
> would guess. Is this possible?

Why don't you try it and see, or check the documentation?

But yes, there is native RAID10 in 2.6.20.
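
For example, a hypothetical 8-drive array in one step (device names,
layout and chunk size are only illustrative):

   mdadm --create /dev/md0 --level=10 --raid-devices=8 \
 --layout=n2 --chunk=64 /dev/sd[b-i]1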

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: soft lockup detected on CPU#1! (was Re: raid6 resync blocks the entire system)

2007-11-21 Thread Neil Brown
On Tuesday November 20, [EMAIL PROTECTED] wrote:
> 
> My personal (wild) guess for this problem is, that there is somewhere a 
> global 
> lock, preventing all other CPUs from doing anything. At 100% (at 80 MB/s) 
> there's probably no time frame left to wake up the other CPUs, or it's 
> sufficiently small to only allow high priority kernel threads to do 
> something.
> When I limit the sync to 40MB/s each resync-CPU has to wait sufficiently long 
> to allow the other CPUs to wake up.
> 
> 

md doesn't hold any locks that would interfere with other parts of the
kernel from working.

I cannot imagine what would be causing your problems.  The resync
thread makes a point of calling cond_resched() periodically so that it
will let other processes run even if it constantly has work to do.

If you have nothing that could write to the RAID6 arrays, then I
cannot see how the resync could affect the rest of the system except
to reduce the amount of available CPU time.  And as CPU is normally
much faster than drives, you wouldn't expect that effect to be very
great.

Very strange.

Can you do 'alt-sysrq-T' when it is frozen and get the process traces
from the kernel logs?

Can you send me "cat /proc/mdstat" after the resync has started, but
before the system has locked up.

I'm sorry that I cannot suggest anything more useful.

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid6 check/repair

2007-11-21 Thread Neil Brown
On Wednesday November 21, [EMAIL PROTECTED] wrote:
> Dear Neal,
> 
> >> I have been looking a bit at the check/repair functionality in the
> >> raid6 personality.
> >> 
> >> It seems that if an inconsistent stripe is found during repair, md
> >> does not try to determine which block is corrupt (using e.g. the
> >> method in section 4 of HPA's raid6 paper), but just recomputes the
> >> parity blocks - i.e. the same way as inconsistent raid5 stripes are
> >> handled.
> >> 
> >> Correct?
> > 
> > Correct!
> > 
> > The mostly likely cause of parity being incorrect is if a write to
> > data + P + Q was interrupted when one or two of those had been
> > written, but the other had not.
> > 
> > No matter which was or was not written, correctly P and Q will produce
> > a 'correct' result, and it is simple.  I really don't see any
> > justification for being more clever.
> 
> My opinion about that is quite different.  Speaking just for myself:
> 
> a) When I put my data on a RAID running on Linux, I'd expect the 
> software to do everything which is possible to protect and when 
> necessary to restore data integrity.  (This expectation was one of the 
> reasons why I chose software RAID with Linux.)

Yes, of course.  "possible" is an important aspect of this.

> 
> b) As a consequence of a):  When I'm using a RAID level that has extra 
> redundancy, I'd expect Linux to make use of that extra redundancy during 
> a 'repair'.  (Otherwise I'd consider repair a misnomer and rather call 
> it 'recalc parity'.)

The extra redundancy in RAID6 is there to enable you to survive two
drive failure.  Nothing more.

While it is possible to use the RAID6 P+Q information to deduce which
data block is wrong if it is known that either 0 or 1 datablocks is
wrong, it is *not* possible to deduce which block or blocks are wrong
if it is possible that more than 1 data block is wrong.
As it is quite possible for a write to be aborted in the middle
(during unexpected power down) with an unknown number of blocks in a
given stripe updated but others not, we do not know how many blocks
might be "wrong" so we cannot try to recover some wrong block.  Doing
so would quite possibly corrupt a block that is not wrong.

The "repair" process "repairs" the parity (redundancy information).
It does not repair the data.  It cannot.

The only possible scenario that md/raid recognises for the parity
information being wrong is the case of an unexpected shutdown in the
middle of a stripe write, where some blocks have been written and some
have not.
Further (for raid 4/5/6), it only supports this case when your array
is not degraded.  If you have a degraded array, then an unexpected
shutdown is potentially fatal to your data (the chances of it actually
being fatal are quite small, but the potential is still there).
There is nothing RAID can do about this.  It is not designed to
protect against power failure.  It is designed to protect against drive
failure.  It does that quite well.

If you have wrong data appearing on your device for some other reason,
then you have a serious hardware problem and RAID cannot help you.

The best approach to dealing with data on drives getting spontaneously
corrupted is for the filesystem to perform strong checksums on the
data block, and store the checksums in the indexing information.  This
provides detection, not recovery of course.

> 
> c) Why should 'repair' be implemented in a way that only works in most 
> cases when there exists a solution that works in all cases?  (After all, 
> possibilities for corruption are many, e.g. bad RAM, bad cables, chipset 
> bugs, driver bugs, last but not least human mistake.  From all these 
> errors I'd like to be able to recover gracefully without putting the 
> array at risk by removing and readding a component device.)

As I said above - there is no solution that works in all cases.  If
more that one block is corrupt, and you don't know which ones, then
you lose and there is now way around that.
RAID is not designed to protect against bad RAM, bad cables, chipset
bugs, driver bugs, etc.  It is only designed to protect against drive
failure, where the drive failure is apparent.  i.e. a read must return
either the same data that was last written, or a failure indication.
Anything else is beyond the design parameters for RAID.
It might be possible to design a data storage system that was
resilient to these sorts of errors.  It would be much more
sophisticated than RAID though.

NeilBrown


> 
> Bottom line:  So far I was talking about *my* expectations, is it 
> reasonable to assume that it is shared by others?  Are there any 
> arguments that I'm not aware of speaking against an improved 
> implementation of 'repair'?
> 
> BTW:  I just checked, it's the same for RAID 1:  When I intentionally 
> corrupt a sector in the first device of a set of 16, 'repair' copies the 
> corrupted data to the 15 remaining devices instead of restoring the 
> correct sector from one of the other fifteen devices.

Re: raid6 check/repair

2007-11-15 Thread Neil Brown
On Thursday November 15, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I have been looking a bit at the check/repair functionality in the
> raid6 personality.
> 
> It seems that if an inconsistent stripe is found during repair, md
> does not try to determine which block is corrupt (using e.g. the
> method in section 4 of HPA's raid6 paper), but just recomputes the
> parity blocks - i.e. the same way as inconsistent raid5 stripes are
> handled.
> 
> Correct?

Correct!

The mostly likely cause of parity being incorrect is if a write to
data + P + Q was interrupted when one or two of those had been
written, but the other had not.

No matter which was or was not written, correctly P and Q will produce
a 'correct' result, and it is simple.  I really don't see any
justification for being more clever.


NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Chnaging partition types of RAID array members

2007-11-15 Thread Neil Brown
On Thursday November 15, [EMAIL PROTECTED] wrote:
> 
> Hi.  I have two RAID5 arrays on an opensuse 10.3 system.  They are used
>  together in a large LVM volume that contains a lot of data I'd rather
>  not have to try and backup/recreate.
> 
> md1 comes up fine and is detected by the OS on boot and assembled
>  automatically.  md0 however, doesn't, and needs to be brought up manually,
>  followed by a manual start of lvm.  This is a real pain of course.  The
>  issue I think is that md0 was created through EVMS, which I have
>  stopped using some time ago since its support seems to have been deprecated.
>   EVMS created the array fine, but using partitions that were not 0xFD
>  (Linux RAID), but rather 0x83 (linux native).  Since stopping the use
>  of EVMS on boot, the array has not come up automatically.
> 
> I have tried failing one of the array members, recreating the partition
>  as linux RAID though the yast partition manager, and then trying to
>  add it, but I get a "mdadm: Cannot open /dev/sdb1: Device or resource
>  busy" error.  If the partition is type 0x83 (linux native) and formatted
>  with a filesystem first, then re-adding it is no problem at all, and the 
> array rebuilds
>  fine.

You don't need to fail a device just to change the partition type.
Just use "cfdisk" to change all the partition types to 'fd', then
reboot and see what happens.
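
For instance, with the older util-linux tools something like this should
do it non-interactively for each member (partition numbers are just an
example):

   sfdisk --change-id /dev/sdb 1 fd
   sfdisk --change-id /dev/sdc 1 fd

or run "cfdisk /dev/sdb" and set the type by hand.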

NeilBrown

> 
> In googling the topic I can't seem to find out why I get the error
>  message, and how to fix this.  I'd really like to get this problem
>  resolved.  Does anyone out there know how to fix this, so I can get 
> partitions
>  correctly flagged as Linux RAID and the array autodetected at start?
> 
> Sorry if I missed something obvious.
> 
> Thanks,
> Mike
> 
> 
> 
> 
> 
> 
>   
> 
> Never miss a thing.  Make Yahoo your home page. 
> http://www.yahoo.com/r/hs
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal: non-striping RAID4

2007-11-14 Thread Neil Brown
On Thursday November 15, [EMAIL PROTECTED] wrote:
> 
> Neil: any comments on whether this would be desirable / useful / feasible?

1/ Having a raid4 variant which arranges the data like 'linear' is
   something I am planning to do eventually.  If your filesystem knows
   about the geometry of the array, then it can distribute the data
   across the drives and can make up for a lot of the benefits of
   striping.  The big advantage of such an arrangement is that it is
   trivial to add a drive - just zero it and make it part of the
   array.  No need to re-arrange what is currently there.
   However I was not thinking of supporting different sized devices in
   such a configuration.

2/ Having an array with redundancy where drives are of different sizes
   is awkward, primarily because if there was a spare that was not as
   large as the largest device, you may or may not be able to rebuild
   in that situation.   Certainly I could code up those decisions, but
   I'm not sure the scenario is worth the complexity.
   If you have drives of different sizes, use raid0 to combine pairs
   of smaller ones to match larger ones, and do raid5 across devices
   that look like the same size (see the sketch after this list).

3/ If you really want to use exactly what you have, you can partition
   them into bits and make a variety of raid5 arrays as you suggest.
   md will notice and will resync in series so that you don't kill
   performance.
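
A sketch of option 2 with made-up names and sizes - say two 250GB drives
(sde, sdf) and three 500GB drives (sdb, sdc, sdd):

   mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sde1 /dev/sdf1
   mdadm --create /dev/md1 --level=5 --raid-devices=4 \
 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/md0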

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [stable] [PATCH 000 of 2] md: Fixes for md in 2.6.23

2007-11-14 Thread Neil Brown
On Tuesday November 13, [EMAIL PROTECTED] wrote:
> 
> raid5-fix-unending-write-sequence.patch is in -mm and I believe is
> waiting on an Acked-by from Neil?
> 

It seems to have just been sent on to Linus, so it probably will go in
without:

   Acked-By: NeilBrown <[EMAIL PROTECTED]>

I'm beginning to think that I really should sit down and make sure I
understand exactly how those STRIPE_OP_ flags are uses.  They
generally make sense but there seem to be a number of corner cases
where they aren't quite handled properly...  Maybe they are all found
now, or maybe not.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Building a new raid6 with bitmap does not clear bits during resync

2007-11-12 Thread Neil Brown
On Monday November 12, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> >
> > However there is value in regularly updating the bitmap, so add code
> > to periodically pause while all pending sync requests complete, then
> > update the bitmap.  Doing this only every few seconds (the same as the
> > bitmap update time) does not noticeably affect resync performance.
> >   
> 
> I wonder if a minimum time and minimum number of stripes would be 
> better. If a resync is going slowly because it's going over a slow link 
> to iSCSI, nbd, or a box of cheap drives fed off a single USB port, just 
> writing the updated bitmap may represent as much data as has been 
> resynced in the time slice.
> 
> Not a suggestion, but a request for your thoughts on that.

Thanks for your thoughts.
Choosing how often to update the bitmap during a sync is certainly not
trivial.   In different situations, different requirements might rule.

I chose to base it on time, and particularly on the time we already
have for "how soon to write back clean bits to the bitmap" because it
is fairly easy for users to understand the implications (if I set the
time to 30 seconds, then I might have to repeat 30 seconds of resync)
and it is already configurable (via the "--delay" option to --create
--bitmap).

Presumably if someone has a very slow system and wanted to use
bitmaps, they would set --delay relatively large to reduce the cost
and still provide significant benefits.  This would affect both normal
clean-bit writeback and during-resync clean-bit-writeback.
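
e.g. a hypothetical array behind a slow link might create its bitmap
with a long delay:

   mdadm --grow /dev/mdX --bitmap=internal --delay=60   # write clean bits back every 60s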

Hope that clarifies my approach.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Building a new raid6 with bitmap does not clear bits during resync

2007-11-11 Thread Neil Brown
On Thursday November 8, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I have created a new raid6:
> 
> md0 : active raid6 sdb1[0] sdl1[5] sdj1[4] sdh1[3] sdf1[2] sdd1[1]
>   6834868224 blocks level 6, 512k chunk, algorithm 2 [6/6] [UU]
>   [>]  resync = 21.5% (368216964/1708717056) 
> finish=448.5min speed=49808K/sec
>   bitmap: 204/204 pages [816KB], 4096KB chunk
> 
> The raid is totally idle, not mounted and nothing.
> 
> So why does the "bitmap: 204/204" not sink? I would expect it to clear
> bits as it resyncs so it should count slowly down to 0. As a side
> effect of the bitmap being all dirty the resync will restart from the
> beginning when the system is hard reset. As you can imagine that is
> pretty annoying.
> 
> On the other hand on a clean shutdown it seems the bitmap gets updated
> before stopping the array:
> 
> md3 : active raid6 sdc1[0] sdm1[5] sdk1[4] sdi1[3] sdg1[2] sde1[1]
>   6834868224 blocks level 6, 512k chunk, algorithm 2 [6/6] [UU]
>   [===>.]  resync = 38.4% (656155264/1708717056) 
> finish=17846.4min speed=982K/sec
>   bitmap: 187/204 pages [748KB], 4096KB chunk
> 
> Consequently the rebuild did restart and is already further along.
> 

Thanks for the report.

> 
> Any ideas why that is so?

Yes.  The following patch should explain (a bit tersely) why this was
so, and should also fix it so it will no longer be so.  Test reports
always welcome.

NeilBrown

Status: ok

Update md bitmap during resync.

Currently an md array with a write-intent bitmap does not update
that bitmap to reflect successful partial resync.  Rather the entire
bitmap is updated when the resync completes.

This is because there is no guarantee that resync requests will
complete in order, and tracking each request individually is
unnecessarily burdensome.

However there is value in regularly updating the bitmap, so add code
to periodically pause while all pending sync requests complete, then
update the bitmap.  Doing this only every few seconds (the same as the
bitmap update time) does not noticeably affect resync performance.

Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/bitmap.c |   34 +-
 ./drivers/md/raid1.c  |1 +
 ./drivers/md/raid10.c |2 ++
 ./drivers/md/raid5.c  |3 +++
 ./include/linux/raid/bitmap.h |3 +++
 5 files changed, 38 insertions(+), 5 deletions(-)

diff .prev/drivers/md/bitmap.c ./drivers/md/bitmap.c
--- .prev/drivers/md/bitmap.c   2007-10-22 16:55:52.0 +1000
+++ ./drivers/md/bitmap.c   2007-11-12 16:36:30.0 +1100
@@ -1349,14 +1349,38 @@ void bitmap_close_sync(struct bitmap *bi
 */
sector_t sector = 0;
int blocks;
-   if (!bitmap) return;
+   if (!bitmap)
+   return;
while (sector < bitmap->mddev->resync_max_sectors) {
bitmap_end_sync(bitmap, sector, &blocks, 0);
-/*
-   if (sector < 500) printk("bitmap_close_sync: sec %llu blks %d\n",
-(unsigned long long)sector, blocks);
-*/ sector += blocks;
+   sector += blocks;
+   }
+}
+
+void bitmap_cond_end_sync(struct bitmap *bitmap, sector_t sector)
+{
+   sector_t s = 0;
+   int blocks;
+
+   if (!bitmap)
+   return;
+   if (sector == 0) {
+   bitmap->last_end_sync = jiffies;
+   return;
+   }
+   if (time_before(jiffies, (bitmap->last_end_sync
+ + bitmap->daemon_sleep * HZ)))
+   return;
+   wait_event(bitmap->mddev->recovery_wait,
+  atomic_read(&bitmap->mddev->recovery_active) == 0);
+
+   sector &= ~((1ULL << CHUNK_BLOCK_SHIFT(bitmap)) - 1);
+   s = 0;
+   while (s < sector && s < bitmap->mddev->resync_max_sectors) {
+   bitmap_end_sync(bitmap, s, &blocks, 0);
+   s += blocks;
}
+   bitmap->last_end_sync = jiffies;
 }
 
 static void bitmap_set_memory_bits(struct bitmap *bitmap, sector_t offset, int 
needed)

diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c   2007-10-30 13:50:45.0 +1100
+++ ./drivers/md/raid10.c   2007-11-12 16:06:39.0 +1100
@@ -1671,6 +1671,8 @@ static sector_t sync_request(mddev_t *md
if (!go_faster && conf->nr_waiting)
msleep_interruptible(1000);
 
+   bitmap_cond_end_sync(mddev->bitmap, sector_nr);
+
/* Again, very different code for resync and recovery.
 * Both must result in an r10bio with a list of bios that
 * have bi_end_io, bi_sector, bi_bdev set,

diff .prev/drivers/md/raid1.c ./drivers/md/

Re: 2.6.23.1: mdadm/raid5 hung/d-state

2007-11-04 Thread Neil Brown
On Sunday November 4, [EMAIL PROTECTED] wrote:
> # ps auxww | grep D
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> root   273  0.0  0.0  0 0 ?DOct21  14:40 [pdflush]
> root   274  0.0  0.0  0 0 ?DOct21  13:00 [pdflush]
> 
> After several days/weeks, this is the second time this has happened, while 
> doing regular file I/O (decompressing a file), everything on the device 
> went into D-state.

At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2


NeilBrown

Fix misapplied patch in raid5.c

commit 4ae3f847e49e3787eca91bced31f8fd328d50496 did not get applied
correctly, presumably due to substantial similarities between
handle_stripe5 and handle_stripe6.

This patch (with lots of context) moves the chunk of new code from
handle_stripe6 (where it isn't needed (yet)) to handle_stripe5.


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2007-11-02 12:10:49.0 +1100
+++ ./drivers/md/raid5.c2007-11-02 12:25:31.0 +1100
@@ -2607,40 +2607,47 @@ static void handle_stripe5(struct stripe
struct bio *return_bi = NULL;
struct stripe_head_state s;
struct r5dev *dev;
unsigned long pending = 0;
 
memset(&s, 0, sizeof(s));
pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
"ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state,
atomic_read(&sh->count), sh->pd_idx,
sh->ops.pending, sh->ops.ack, sh->ops.complete);
 
spin_lock(&sh->lock);
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
 
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */
 
+   /* clean-up completed biofill operations */
+   if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+   clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+   clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+   clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+   }
+
rcu_read_lock();
for (i=disks; i--; ) {
mdk_rdev_t *rdev;
struct r5dev *dev = &sh->dev[i];
clear_bit(R5_Insync, &dev->flags);
 
pr_debug("check %d: state 0x%lx toread %p read %p write %p "
"written %p\n", i, dev->flags, dev->toread, dev->read,
dev->towrite, dev->written);
 
/* maybe we can request a biofill operation
 *
 * new wantfill requests are only permitted while
 * STRIPE_OP_BIOFILL is clear
 */
if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
!test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
set_bit(R5_Wantfill, &dev->flags);
 
/* now count some things */
@@ -2880,47 +2887,40 @@ static void handle_stripe6(struct stripe
struct stripe_head_state s;
struct r6_state r6s;
struct r5dev *dev, *pdev, *qdev;
 
r6s.qd_idx = raid6_next_disk(pd_idx, disks);
pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
"pd_idx=%d, qd_idx=%d\n",
   (unsigned long long)sh->sector, sh->state,
   atomic_read(&sh->count), pd_idx, r6s.qd_idx);
memset(&s, 0, sizeof(s));
 
spin_lock(&sh->lock);
clear_bit(STRIPE_HANDLE, &sh->state);
clear_bit(STRIPE_DELAYED, &sh->state);
 
s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
/* Now to look around and see what can be done */
 
-   /* clean-up completed biofill operations */
-   if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
-   clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
-   clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-  

Re: Very small internal bitmap after recreate

2007-11-02 Thread Neil Brown
On Friday November 2, [EMAIL PROTECTED] wrote:
> 
> Am 02.11.2007 um 10:22 schrieb Neil Brown:
> 
> > On Friday November 2, [EMAIL PROTECTED] wrote:
> >> I have a 5 disk version 1.0 superblock RAID5 which had an internal
> >> bitmap that has been reported to have a size of 299 pages in /proc/
> >> mdstat. For whatever reason I removed this bitmap (mdadm --grow --
> >> bitmap=none) and recreated it afterwards (mdadm --grow --
> >> bitmap=internal). Now it has a reported size of 10 pages.
> >>
> >> Do I have a problem?
> >
> > Not a big problem, but possibly a small problem.
> > Can you send
> >mdadm -E /dev/sdg1
> > as well?
> 
> Sure:
> 
> # mdadm -E /dev/sdg1
> /dev/sdg1:
>Magic : a92b4efc
>  Version : 01
>  Feature Map : 0x1
>   Array UUID : e1a335a8:fc0f0626:d70687a6:5d9a9c19
> Name : 1
>Creation Time : Wed Oct 31 14:30:55 2007
>   Raid Level : raid5
> Raid Devices : 5
> 
>Used Dev Size : 625137008 (298.09 GiB 320.07 GB)
>   Array Size : 2500547584 (1192.35 GiB 1280.28 GB)
>Used Size : 625136896 (298.09 GiB 320.07 GB)
> Super Offset : 625137264 sectors

So there are 256 sectors before the superblock where a bitmap could go,
or about 6 sectors afterwards.

>State : clean
>  Device UUID : 95afade2:f2ab8e83:b0c764a0:4732827d
> 
> Internal Bitmap : 2 sectors from superblock

And the '6 sectors afterwards' was chosen.
6 sectors has room for 5*512*8 = 20480 bits,
and from your previous email:
>   Bitmap : 19078 bits (chunks), 0 dirty (0.0%)
you have 19078 bits, which is about right (as the bitmap chunk size
must be a power of 2).

So the problem is that "mdadm -G" is putting the bitmap after the
superblock rather than considering the space before
(checks code)

Ahh, I remember now.  There is currently no interface to tell the
kernel where to put the bitmap when creating one on an active array,
so it always puts in the 'safe' place.  Another enhancement waiting
for time.

For now, you will have to live with a smallish bitmap, which probably
isn't a real problem.  With 19078 bits, you will still get a
several-thousand-fold increase in resync speed after a crash
(i.e. hours become seconds) and to some extent, fewer bits are better
and you have to update them less.

I've haven't made any measurements to see what size bitmap is
ideal... maybe someone should :-)

>  Update Time : Fri Nov  2 07:46:38 2007
> Checksum : 4ee307b3 - correct
>   Events : 408088
> 
>   Layout : left-symmetric
>   Chunk Size : 128K
> 
>  Array Slot : 3 (0, 1, failed, 2, 3, 4)
> Array State : uuUuu 1 failed
> 
> This time I'm getting nervous - Array State failed doesn't sound good!

This is nothing to worry about - just a bad message from mdadm.

The superblock has recorded that there was once a device in position 2
which is now failed (See the list in "Array Slot").
This is summarised as "1 failed" in "Array State".

But the array is definitely working OK now.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Superblocks

2007-11-01 Thread Neil Brown
On Tuesday October 30, [EMAIL PROTECTED] wrote:
> Which is the default type of superblock? 0.90 or 1.0?

The default default is 0.90.
However a local default can be set in mdadm.conf with e.g.
   CREATE metadata=1.0
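
Or it can be given explicitly when creating an array, e.g. (device names
are just an example):

   mdadm --create /dev/md0 --metadata=1.0 --level=1 --raid-devices=2 \
 /dev/sda1 /dev/sdb1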

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: stride / stripe alignment on LVM ?

2007-11-01 Thread Neil Brown
On Thursday November 1, [EMAIL PROTECTED] wrote:
> Hello,
> 
> I have raid5 /dev/md1, --chunk=128 --metadata=1.1. On it I have
> created LVM volume called 'raid5', and finally a logical volume
> 'backup'.
> 
> Then I formatted it with command:
> 
>mkfs.ext3 -b 4096 -E stride=32 -E resize=550292480 /dev/raid5/backup
> 
> And because LVM is putting its own metadata on /dev/md1, the ext3
> partition is shifted by some (unknown for me) amount of bytes from
> the beginning of /dev/md1.
> 
> I was wondering, how big is the shift, and would it hurt the
> performance/safety if the `ext3 stride=32` didn't align perfectly
> with the physical stripes on HDD?

It is probably better to ask this question on an ext3 list as people
there might know exactly what 'stride' does.

I *think* it causes the inode tables to be offset in different
block-groups so that they are not all on the same drive.  If that is
the case, then an offset causes by LVM isn't going to make any
difference at all.

NeilBrown


> 
> PS: the resize option is to make sure that I can grow this fs
> in the future.
> 
> PSS: I looked in the archive but didn't find this question asked
> before. I'm sorry if it really was asked.

Thanks for trying!
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Bad drive discovered during raid5 reshape

2007-10-30 Thread Neil Brown
On Tuesday October 30, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> > On Monday October 29, [EMAIL PROTECTED] wrote:
> >> Hi,
> >> I bought two new hard drives to expand my raid array today and
> >> unfortunately one of them appears to be bad. The problem didn't arise
> 
> > Looks like you are in real trouble.  Both the drives seem bad in some
> > way.  If it was just sdc that was failing it would have picked up
> > after the "-Af", but when it tried, sdb gave errors.
> 
> Humble enquiry :)
> 
> I'm not sure that's right?
> He *removed* sdb and sdc when the failure occurred so sdc would indeed be 
> non-fresh.

I'm not sure what point you are making here.
In any case, remove two drives from a raid5 is always a bad thing.
Part of the array was striped over 8 drives by this time.  With only
six still in the array, some data will be missing.

> 
> The key question I think is: will md continue to grow an array even if it 
> enters
> degraded mode during the grow?
> ie grow from a 6 drive array to a 7-of-8 degraded array?
> 
> Technically I guess it should be able to.

Yes, md can grow to a degraded array.  If you get a single failure I
would expect it to abort the growth process, then restart where it
left off (after checking that that made sense).

> 
> In which case should he be able to re-add /dev/sdc and allow md to retry the
> grow? (possibly losing some data due to the sdc staleness)

He only needs one of the two drives in there.  I got the impression
that both sdc and sdb had reported errors.  If not, and sdc really
seems OK, then "--assemble --force" listing all drives except sdb
should make it all work again.
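
i.e. roughly (the member list here is hypothetical - name every member
partition except the ones on sdb):

   mdadm --assemble --force /dev/md0 /dev/sd[c-i]1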

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   5   6   7   8   9   10   >