Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-27 Thread David Chinner
On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
 
 On Friday 25 May 2007 06:55:00 David Chinner wrote:
  Oh, did you look at your logs and find that XFS had spammed them
  about writes that were failing?
 
 The first message after the incident:
 
 May 24 01:53:50 hq kernel: Filesystem loop1: XFS internal error 
 xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c.  Caller 
 0xf8ac14f8
 May 24 01:53:50 hq kernel: f8adae69 xfs_btree_check_sblock+0x4f/0xc2 [xfs]  
 f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]
 May 24 01:53:50 hq kernel: f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]  
 f8b1a9c7 kmem_zone_zalloc+0x1b/0x43 [xfs]
 May 24 01:53:50 hq kernel: f8abe645 xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] 
  f8ac0647 xfs_alloc_vextent+0x3bd/0x53b [xfs]
 May 24 01:53:50 hq kernel: f8ad2f7e xfs_bmapi+0x1ac4/0x23cd [xfs]  
 f8acab97 xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
 May 24 01:53:50 hq kernel: f8b1 xlog_dealloc_log+0x49/0xea [xfs]  
 f8afdaee xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
 May 24 01:53:50 hq kernel: f8afc3ae xfs_iomap+0x60e/0x82d [xfs]  c0113bc8 
 __wake_up_common+0x39/0x59
 May 24 01:53:50 hq kernel: f8b1ae11 xfs_map_blocks+0x39/0x6c [xfs]  
 f8b1bd7b xfs_page_state_convert+0x644/0xf9c [xfs]
 May 24 01:53:50 hq kernel: c036f384 schedule+0x5d1/0xf4d  f8b1c780 
 xfs_vm_writepage+0x0/0xe0 [xfs]
 May 24 01:53:50 hq kernel: f8b1c7d7 xfs_vm_writepage+0x57/0xe0 [xfs]  
 c01830e8 mpage_writepages+0x1fb/0x3bb
 May 24 01:53:50 hq kernel: c0183020 mpage_writepages+0x133/0x3bb  
 f8b1c780 xfs_vm_writepage+0x0/0xe0 [xfs]
 May 24 01:53:50 hq kernel: c0147bb3 do_writepages+0x35/0x3b  c018135c 
 __writeback_single_inode+0x88/0x387
 May 24 01:53:50 hq kernel: c01819b7 sync_sb_inodes+0x1b4/0x2a8  c0181c63 
 writeback_inodes+0x63/0xdc
 May 24 01:53:50 hq kernel: c0147943 background_writeout+0x66/0x9f  
 c01482b3 pdflush+0x0/0x1ad
 May 24 01:53:50 hq kernel: c01483a2 pdflush+0xef/0x1ad  c01478dd 
 background_writeout+0x0/0x9f
 May 24 01:53:50 hq kernel: c012d10b kthread+0xc2/0xc6  c012d049 
 kthread+0x0/0xc6
 May 24 01:53:50 hq kernel: c0100dd5 kernel_thread_helper+0x5/0xb
 
 ...and I've been spammed with such messages. Isn't this internal error a good
 reason to shut down the file system?

Actually, that error does shut the filesystem down in most cases. When you
see that output, the function is returning -EFSCORRUPTED. You've got a corrupted
freespace btree.

The reason why you get spammed is that this is happening during background
writeback, and there is no one to return the -EFSCORRUPTED error to. The
background writeback path doesn't specifically detect shut down filesystems or
trigger shutdowns on errors because that happens in different layers so you
just end up with failed data writes. These errors will occur on the next
foreground data or metadata allocation and that will shut the filesystem down
at that point.

I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in
this case we should be shutting down the filesystem.  That would certainly cut
down on the spamming and would not appear to change any other behaviour.
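
Shutting down there would be a small change in principle. As a purely
illustrative sketch (the wrapper function is hypothetical, xfs_force_shutdown()
and XFS_FORCED_SHUTDOWN() are real XFS interfaces, and the exact flag name is
an assumption):

/* Sketch only: this would live in fs/xfs and use XFS's own headers. */
static int xfs_writeback_corruption_check(struct xfs_mount *mp, int error)
{
        if (error == -EFSCORRUPTED && !XFS_FORCED_SHUTDOWN(mp)) {
                /* stop further writes instead of spamming the log;
                 * the flag name here is an assumption */
                xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
        }
        return error;
}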

 I think if there's a sign of a corrupted file system, the first thing we
 should do is to stop writes (or the entire FS) and let the admin examine
 the situation.

Yes, that's *exactly* what a shutdown does. In this case, your writes are
being stopped - hence the error messages - but the filesystem has not yet
been shutdown.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-27 Thread Neil Brown

Thanks everyone for your input.  There were some very valuable
observations in the various emails.
I will try to pull most of it together and bring out what seem to be
the important points.


1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUPP.

 This is certainly a very attractive position - it makes the interface
 cleaner and makes life easier for filesystems and other clients of
 the block interface.
 Currently filesystems handle -EOPNOTSUPP by
  a/ resubmitting the request without the BARRIER (after waiting for
earlier requests to complete) and
  b/ possibly printing an error message to the kernel logs.

 The block layer can do both of these just as easily and it does make
 sense to do it there.
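
 For reference, the fallback each filesystem currently carries looks roughly
 like the sketch below.  submit_and_wait() and wait_for_prior_io() are
 hypothetical stand-ins for the filesystem's own I/O plumbing; bio->bi_rw and
 BIO_RW_BARRIER are the real 2.6-era interface.

#include <linux/bio.h>
#include <linux/kernel.h>

/* hypothetical stand-ins for the filesystem's buffer/log I/O code */
extern int submit_and_wait(struct bio *bio);
extern void wait_for_prior_io(void);

/* Retry the commit write without the barrier flag if the device (or a
 * stacking driver) rejects barriers with -EOPNOTSUPP. */
static int submit_commit_write(struct bio *bio, int *barriers_enabled)
{
        int error;

        if (*barriers_enabled)
                bio->bi_rw |= (1UL << BIO_RW_BARRIER);

        error = submit_and_wait(bio);
        if (error == -EOPNOTSUPP && *barriers_enabled) {
                printk(KERN_WARNING "barriers not supported, disabling\n");
                *barriers_enabled = 0;
                wait_for_prior_io();    /* drain earlier writes first */
                bio->bi_rw &= ~(1UL << BIO_RW_BARRIER);
                error = submit_and_wait(bio);
        }
        return error;
}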

 md/dm modules could keep count of requests as has been suggested
 (though that would be a fairly big change for raid0 as it currently
 doesn't know when a request completes - bi_end_io goes directly to the
 filesystem). 
 However I think the idea of a zero-length BIO_RW_BARRIER would be a
 good option.  raid0 could send one of these down each device, and
 when they all return, the barrier request can be sent to its target
 device(s).

 I think this is a worthy goal that we should work towards.
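
 A sketch of what the zero-length barrier idea above could look like from
 raid0's side.  bio_alloc(), submit_bio() and the barrier flag are the 2.6-era
 interface; barrier_done() and the exact bi_end_io callback signature (which
 changed between kernel versions) should be read as assumptions.

#include <linux/bio.h>
#include <linux/completion.h>
#include <linux/fs.h>

/* Hypothetical completion callback; bi_end_io signatures changed across
 * 2.6 kernels, this one matches the later two-argument form. */
static void barrier_done(struct bio *bio, int error)
{
        complete(bio->bi_private);
        bio_put(bio);
}

/* Send an empty BIO_RW_BARRIER bio to one member device and wait for it.
 * raid0 would do this for every member before forwarding the real barrier
 * request to its target device(s).  Error handling omitted. */
static void flush_member(struct block_device *bdev)
{
        DECLARE_COMPLETION_ONSTACK(done);
        struct bio *bio = bio_alloc(GFP_NOIO, 0);       /* zero-length: no data */

        bio->bi_bdev = bdev;
        bio->bi_end_io = barrier_done;
        bio->bi_private = &done;
        submit_bio(WRITE | (1 << BIO_RW_BARRIER), bio);
        wait_for_completion(&done);
}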

2/ Maybe barriers provide stronger semantics than are required.

 All write requests are synchronised around a barrier write.  This is
 often more than is required and apparently can cause a measurable
 slowdown.

 Also the FUA for the actual commit write might not be needed.  It is
 important for consistency that the preceding writes are in safe
 storage before the commit write, but it is not so important that the
 commit write is immediately safe on storage.  That isn't needed until
 a 'sync' or 'fsync' or similar.

 One possible alternative is:
   - writes can overtake barriers, but barriers cannot overtake writes.
   - flush before the barrier, not after.

 This is considerably weaker, and hence cheaper. But I think it is
 enough for all filesystems (providing it is still an option to call
 blkdev_issue_flush on 'fsync').
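
 For the fsync side, blkdev_issue_flush() already exists; under these weaker
 rules an fsync path would simply end with something like the sketch below
 (the wrapper function is hypothetical, the call is the existing 2.6-era
 interface).

#include <linux/blkdev.h>

/* Hypothetical fsync tail: once the data and the commit record have been
 * written under the weaker ordering rules, push the volatile device cache
 * once. */
static int example_fsync_flush(struct block_device *bdev)
{
        /* second argument is an optional error-sector pointer */
        return blkdev_issue_flush(bdev, NULL);
}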

 Another alternative would be to tag each bio as being in a
 particular barrier-group.  Then bios in different groups could
 overtake each other in either direction, but a BARRIER request must
 be totally ordered w.r.t. other requests in the barrier group.
 This would require an extra bio field, and would give the filesystem
 more appearance of control.  I'm not yet sure how much it would
 really help...
 It would allow us to set FUA on all bios with a non-zero
 barrier-group.  That would mean we don't have to flush the entire
 cache, just those blocks that are critical, but I'm still not sure
 it's a good idea.
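
 Purely as illustration of that idea (nothing like this exists in the tree):
 the cost is one extra field per bio and an ordering test in the elevator.

/* Hypothetical extra bio state: group 0 means "not in any barrier group".
 * A BARRIER bio is ordered only against bios in the same non-zero group,
 * which is also where FUA could be applied selectively. */
struct bio_order_hint {
        unsigned int    barrier_group;
};

static inline int must_order(const struct bio_order_hint *barrier_bio,
                             const struct bio_order_hint *other_bio)
{
        return barrier_bio->barrier_group != 0 &&
               barrier_bio->barrier_group == other_bio->barrier_group;
}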

 Of course, these weaker rules would only apply inside the elevator.
 Once the request goes to the device we need to work with what the
 device provides, which probably means total-ordering around the
 barrier. 

 I think this requires more discussion before a way forward is clear.

3/ Do we need explicit control of the 'ordered' mode?

  Consider a SCSI device that has NV RAM cache.  mode_sense reports
  that write-back is enabled, so _FUA or _FLUSH will be used.
  But as it is *NV* RAM, QUEUE_ORDERED_DRAIN is really the best mode.
  But it seems there is no way to query this information.
  Using _FLUSH causes the NVRAM to be flushed to media which is a
  terrible performance problem.
  Setting SYNC_NV doesn't work on the particular device in question.
  We currently tell customers to mount with -o nobarriers, but that
  really feels like the wrong solution.  We should be telling the SCSI
  device not to flush.
  An advantage of 'nobarriers' is it can go in /etc/fstab.  Where
  would you record that a SCSI drive should be set to
  QUEUE_ORDERED_DRAIN?
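
  For reference, the ordered mode is already selected per-queue by the
  low-level driver through blk_queue_ordered(); what is missing is a
  trustworthy way to learn that the write-back cache is non-volatile.  A
  sketch with that detection left open (the function below and
  cache_is_nonvolatile are assumptions, the block-layer call and constants
  are the real 2.6-era interface):

#include <linux/blkdev.h>

/* Sketch: if the driver could trust that the write-back cache is
 * non-volatile, plain DRAIN ordering would do; otherwise it registers a
 * flush-based mode as it does today. */
static void example_set_ordered_mode(struct request_queue *q,
                                     int cache_is_nonvolatile,
                                     prepare_flush_fn *flush_fn)
{
        if (cache_is_nonvolatile)
                blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
        else
                blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, flush_fn);
}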


I think the implementation priorities here are:

1/ implement a zero-length BIO_RW_BARRIER option.
2/ Use it (or otherwise) to make all dm and md modules handle
   barriers (and loop?).
3/ Devise and implement appropriate fallbacks within the block layer
   so that -EOPNOTSUPP is never returned.
4/ Remove unneeded cruft from filesystems (and elsewhere).

Comments?

Thanks,
NeilBrown


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-27 Thread Neil Brown
On Friday May 25, [EMAIL PROTECTED] wrote:
 2007/5/25, Neil Brown [EMAIL PROTECTED]:
   - Are there other bits that we could handle better?
  BIO_RW_FAILFAST?  BIO_RW_SYNC?  What exactly do they mean?
 
 BIO_RW_FAILFAST: means the low-level driver shouldn't do much (or no)
 error recovery. Mainly used by multipath targets to avoid long SCSI
 recovery. This should just be propagated when passing requests on.

Is it much or no?
Would it be reasonable to use this for reads from a non-degraded
raid1?  What about writes?

What I would really like is some clarification on what sort of errors
get retried, how often, and how long the timeouts are.

And does the 'error' code returned in ->bi_end_io allow us to
differentiate media errors from other errors yet?

 
 BIO_RW_SYNC: means this is a bio for a synchronous request. I don't
 know whether it has other uses, but it at least causes queues to be
 flushed immediately instead of waiting a short time for more requests.
 It should also just be passed on. Otherwise performance suffers, since
 something above will wait for the current request/bio to complete
 instead of sending more.

Yes, this one is pretty straightforward.  I mentioned it more as a
reminder to myself that I really should support it in raid5 :-(
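
For concreteness, "passing these on" in a stacking driver amounts to copying
the relevant bits onto each bio it builds for a member device; a sketch (the
helper is hypothetical, the flag bits are the real 2.6-era ones):

#include <linux/bio.h>

/* Sketch: carry BIO_RW_FAILFAST and BIO_RW_SYNC from the bio we received
 * down to the bio we are about to send to a member device. */
static void propagate_hint_bits(struct bio *out, const struct bio *in)
{
        unsigned long mask = (1UL << BIO_RW_FAILFAST) | (1UL << BIO_RW_SYNC);

        out->bi_rw |= (in->bi_rw & mask);
}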

NeilBrown


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-27 Thread David Chinner
On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland wrote:
 On Monday 28 May 2007 02:30:11 David Chinner wrote:
  On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
   ...and I've been spammed with such messages. Isn't this internal error a
   good reason to shut down the file system?
 
  Actually, that error does shut the filesystem down in most cases. When you
  see that output, the function is returning -EFSCORRUPTED. You've got a
  corrupted freespace btree.
 
  The reason why you get spammed is that this is happening during background
  writeback, and there is no one to return the -EFSCORRUPTED error to. The
  background writeback path doesn't specifically detect shut down filesystems
  or trigger shutdowns on errors because that happens in different layers so
  you just end up with failed data writes. These errors will occur on the
  next foreground data or metadata allocation and that will shut the
  filesystem down at that point.
 
  I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in
  this case we should be shutting down the filesystem.  That would certainly
  cut down on the spamming and would not appear to change any other
  behaviour.
 If I remember correctly, my file system wasn't shut down at all; it was
 writable for the whole night and yafc slowly wrote files to it. Maybe all
 write operations failed, but yafc didn't warn.

So you never created new files or directories, unlinked files or
directories, did synchronous writes, etc? Just had slowly growing files?

 Spamming is just annoying when we need to find out what went wrong (my
 kernel.log is 300MB), but for data security it's important to react to the
 EFSCORRUPTED error in any case, I think. Please consider this.

The filesystem has responded correctly to the corruption in terms of
data security (i.e. failed the data write and warned noisily about
it), but it probably hasn't done everything it should.

Hmmm. A quick look at the Linux code makes me think that background
writeback on Linux has never been able to cause a shutdown in this
case. However, the same error on Irix will definitely cause a
shutdown.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-27 Thread David Chinner
On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote:
 
 Thanks everyone for your input.  There were some very valuable
 observations in the various emails.
 I will try to pull most of it together and bring out what seem to be
 the important points.
 
 
 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUPP.

Sounds good to me, but how do we test to see if the underlying
device supports barriers? Do we just assume that they do and
only change behaviour if -o nobarrier is specified in the mount
options?
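
One way a filesystem can answer that for itself is a mount-time probe:
rewrite a block it already owns with the barrier flag set and treat
-EOPNOTSUPP as "no barrier support" (XFS does something along these lines
with its superblock).  In the sketch below, submit_block_and_wait() is a
hypothetical stand-in for the filesystem's synchronous buffer write:

#include <linux/bio.h>
#include <linux/fs.h>

/* hypothetical stand-in for the filesystem's synchronous buffer write */
extern int submit_block_and_wait(int rw, struct bio *bio);

/* Mount-time probe: rewrite an already-allocated, already-correct block
 * with BIO_RW_BARRIER set.  -EOPNOTSUPP means "mount as if -o nobarrier".
 * Returns 1 if barriers work, 0 if not, negative errno otherwise. */
static int probe_barrier_support(struct block_device *bdev, struct bio *sb_bio)
{
        int error;

        sb_bio->bi_bdev = bdev;
        error = submit_block_and_wait(WRITE | (1 << BIO_RW_BARRIER), sb_bio);
        if (error == -EOPNOTSUPP)
                return 0;
        return error ? error : 1;
}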

 2/ Maybe barriers provide stronger semantics than are required.
 
  All write requests are synchronised around a barrier write.  This is
  often more than is required and apparently can cause a measurable
  slowdown.
 
  Also the FUA for the actual commit write might not be needed.  It is
  important for consistency that the preceding writes are in safe
  storage before the commit write, but it is not so important that the
  commit write is immediately safe on storage.  That isn't needed until
  a 'sync' or 'fsync' or similar.

The use of barriers in XFS assumes the commit write to be on stable
storage before it returns.  One of the ordering guarantees that we
need is that the transaction (commit write) is on disk before the
metadata block containing the change in the transaction is written
to disk and the current barrier behaviour gives us that.

  One possible alternative is:
- writes can overtake barriers, but barriers cannot overtake writes.

No, that breaks the above usage of a barrier.

- flush before the barrier, not after.
 
  This is considerably weaker, and hence cheaper. But I think it is
  enough for all filesystems (providing it is still an option to call
  blkdev_issue_flush on 'fsync').

No, not enough for XFS.

  Another alternative would be to tag each bio as being in a
  particular barrier-group.  Then bios in different groups could
  overtake each other in either direction, but a BARRIER request must
  be totally ordered w.r.t. other requests in the barrier group.
  This would require an extra bio field, and would give the filesystem
  more appearance of control.  I'm not yet sure how much it would
  really help...

And that assumes the filesystem is tracking exact dependencies
between I/Os.  Such a mechanism would probably require filesystems
to be redesigned to use this, but I can see how it would be useful
for doing things like ensuring ordering between just an inode and
its data writes.  What would the overhead of having to support
several hundred thousand different barrier groups be (i.e. one per
dirty inode in a system)?

 I think the implementation priorities here are:

Depending on the answer to my first question:

0/ implement a specific test for filesystems to run at mount time
   to determine if barriers are supported or not.

 1/ implement a zero-length BIO_RW_BARRIER option.
 2/ Use it (or otherwise) to make all dm and md modules handle
barriers (and loop?).
 3/ Devise and implement appropriate fallbacks within the block layer
so that -EOPNOTSUPP is never returned.
 4/ Remove unneeded cruft from filesystems (and elsewhere).

Sounds like a good start. ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group