Re: Proposal to improve filesystem/block snapshot interaction

2007-11-20 Thread Roger Strassburg
Greg,

Sorry I didn't respond sooner - other things have gotten in the way of reading 
this thread.

See comments below.

Roger

Greg Banks wrote:
 On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
 On Tuesday October 30, [EMAIL PROTECTED] wrote:
 Of course snapshot cow elements may be part of more generic element
 trees.  In general there may be more than one consumer of block usage
 hints in a given filesystem's element tree, and their locations in that
 tree are not predictable.  This means the block extents mentioned in
 the usage hints need to be subject to the block mapping algorithms
 provided by the element tree.  As those algorithms are currently
 implemented using bio mapping and splitting, the easiest and simplest
 way to reuse those algorithms is to add new bio flags.
 So are you imagining that you might have distinct snapshotable
 elements, and that some of these might be combined by e.g. RAID0 into
 a larger device, then a filesystem is created on that?
 
 I was thinking more a concatenation than a stripe, but yes you could
 do such a thing, e.g. to parallelise the COW procedure.  We don't do
 any such thing in our product; the COW element is always inserted at
 the top of the logical element tree.
 
 I ask because my first thought was that the sort of communication you
 want seems like it would be just between a filesystem and the block
 device that it talks directly to, and as you are particularly
 interested in XFS and XVM, you could come up with whatever protocol
 you want for those two to talk to each other, prototype it, iron out
 all the issues, then say "We've got this really cool thing to make
 snapshots much faster - wanna share?" and thus be presenting from a
 position of more strength (the old 'code talks' mantra).
 
 Indeed, code talks ;-)  I was hoping someone else would do that
 talking for me, though.
 
 First we need a mechanism to indicate that a bio is a hint rather
 than a real IO.  Perhaps the easiest way is to add a new flag to
 the bi_rw field:

 #define BIO_RW_HINT 5   /* bio is a hint not a real io; no pages */
 Reminds me of the new approach to issue_flush_fn which is just to have
 a zero-length barrier bio (is that implemented yet? I lost track).
 But different as a zero length barrier has zero length, and your hints
 have a very meaningful length.
 
 Yes.
 
 Next we'll need three bio hint types with the following semantics.

 BIO_HINT_ALLOCATE
 The bio's block extent will soon be written by the filesystem
 and any COW that may be necessary to achieve that should begin
 now.  If the COW is going to fail, the bio should fail.  Note
 that this provides a way for the filesystem to manage when and
 how failures to COW are reported.
 Would it make sense to allow the bi_sector to be changed by the device
 and to have that change honoured?
 i.e. "Please allocate 128 blocks, maybe 'here'."
  "OK, 128 blocks allocated, but they are actually over 'there'."
 
 That wasn't the expectation at all.  Perhaps "allocate" is a poor
 name.  "I have just allocated, deal with it" might be more appropriate.
 Perhaps BIO_HINT_WILLUSE or something.
 
 If the device is tracking what space is and isn't used, it might make
 life easier for it to do the allocation.  Maybe even have a variant
 "Allocate 128 blocks, I don't care where."
 
 That kind of thing might perhaps be useful for flash, but I think
 current filesystems would have conniptions.
 
 Is this bio supposed to block until the copy has happened?  Or only
 until the space of the copy has been allocated and possibly committed?
 
 The latter.  The writes following will block until the COW has
 completed, or might be performed sufficiently later that the COW
 has meanwhile completed (I think this implies an extra state in the
 snapshot metadata to avoid double-COWing).  The point of the hint is
 to allow the snapshot code to test for running out of repo space and
 report that failure at a time when the filesystem is able to handle
 it gracefully.
 
 Or must it return without doing any IO at all?
 
 I would expect it would be a useful optimisation to start the IO but
 not wait for its completion, but that the first implementation would
 just do a space check.
 
 BIO_HINT_RELEASE
 The bio's block extent is no longer in use by the filesystem
 and will not be read in the future.  Any storage used to back
 the extent may be released without any threat to filesystem
 or data integrity.
 If the allocation unit of the storage device (e.g. a few MB) does not
 match the allocation unit of the filesystem (e.g. a few KB) then for
 this to be useful either the storage device must start recording tiny
 allocations, or the filesystem should re-release areas as they grow.
 i.e. when releasing a range of a device, look in the filesystem's usage
 records for the largest surrounding free space, and release all of that.
 
 Good point.  I was planning on ignoring this problem :-/ Given that
 

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-31 Thread David Chinner
On Wed, Oct 31, 2007 at 03:01:58PM +1100, Greg Banks wrote:
 On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote:
  On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
   On Tuesday October 30, [EMAIL PROTECTED] wrote:
BIO_HINT_RELEASE
The bio's block extent is no longer in use by the filesystem
and will not be read in the future.  Any storage used to back
the extent may be released without any threat to filesystem
or data integrity.
   
   If the allocation unit of the storage device (e.g. a few MB) does not
   match the allocation unit of the filesystem (e.g. a few KB) then for
   this to be useful either the storage device must start recording tiny
   allocations, or the filesystem should re-release areas as they grow.
   i.e. when releasing a range of a device, look in the filesystem's usage
   records for the largest surrounding free space, and release all of that.
  
  I figured that the easiest way around this is reporting free space
  extents, not the amount actually freed. e.g.
  
  4k in file A @ block 10
  4k in file B @ block 11
  4k free space @ block 12
  4k in file C @ block 13
  1008k in free space at block 14.
  
  If we free file A, we report that we've released an extent of 4k @ block 10.
  If we then free file B, we report we've released an extent of 12k @ block 10.
  If we then free file C, we report a release of 1024k @ block 10.
  
  Then the underlying device knows what the aggregated free space regions
  are and can easily release large regions without needing to track tiny
  allocations and frees done by the filesystem.
 
 If you could do that in the filesystem, it would certainly solve the problem.
 In which case I'll explicitly allow for the hint's extent to overlap
 extents previously hinted, and define the semantics
 for overlaps.  I think I'll rename the hint to BIO_HINT_RELEASED,
 I think that will make the semantics a little clearer.

I think that can be done - I wouldn't have mentioned it if I didn't
think it was possible to implement ;).

It will require a further btree lookup once the free transaction
hits the disk, but I think that's pretty easy to do. I'd probably
hook xfs_alloc_clear_busy() to do this.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Dongjun Shin
On 10/30/07, Greg Banks [EMAIL PROTECTED] wrote:

 BIO_HINT_RELEASE
 The bio's block extent is no longer in use by the filesystem
 and will not be read in the future.  Any storage used to back
 the extent may be released without any threat to filesystem
 or data integrity.


I'd like to second the proposal, but it would be more useful to bring the hint
down to the physical devices.

There is an ongoing discussion about adding a 'Trim' ATA command for notifying
the drive about the deleted blocks.

http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

This is especially useful for the storage device like Solid State Drive (SSD).

Dongjun


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Arnd Bergmann
On Tuesday 30 October 2007, Dongjun Shin wrote:
 There is an ongoing discussion about adding a 'Trim' ATA command for notifying
 the drive about the deleted blocks.
 
 http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf
 
 This is especially useful for the storage device like Solid State Drive (SSD).
 
This makes me curious: why would t13 want to invent a new command when
there is already the erase command from CFA?

It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf

Arnd 


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Dongjun Shin
On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
 This makes me curious: why would t13 want to invent a new command when
 there is already the erase command from CFA?

 It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE
 should probably be mapped to CFA_ERASE (0xc0) on drives that support it:
 http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf


I'm not sure about the background.
However, it's definitely a sign that passing the deleted block info
to the flash-based storage is useful.

Anyway, BIO_HINT_RELEASE could destroy the content of the blocks
after being passed to the device.  I think that other bios should not be
reordered across that hint (just like a barrier).

Dongjun


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Dongjun Shin
On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:

 Not sure. Why shouldn't you be able to reorder the hints provided that
 they don't overlap with read/write bios for the same block?


You're right. The bios can be reordered if they don't overlap with the hint.


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Jörn Engel
On Tue, 30 October 2007 18:35:08 +0900, Dongjun Shin wrote:
 On 10/30/07, Greg Banks [EMAIL PROTECTED] wrote:
 
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem
  or data integrity.
 
 I'd like to second the proposal, but it would be more useful to bring the hint
 down to the physical devices.

Absolutely.  Logfs would love to have an erase operation for block
devices as well.  However the above doesn't quite match my needs,
because the blocks _will_ be read in the future.

There are two reasons for reading things back later.  The good one is to
determine whether the segment was erased or not.  Reads should return
either valid data or one of (all-0xff, all-0x00, -ESOMETHING).  Having
a dedicated error code would be best.

And getting the device erasesize would be useful as well, for obvious
reasons.

Jörn

-- 
When you close your hand, you own nothing. When you open it up, you
own the whole world.
-- Li Mu Bai in Tiger & Dragon


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Jörn Engel
On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
 On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
 
  Not sure. Why shouldn't you be able to reorder the hints provided that
  they don't overlap with read/write bios for the same block?
 
 You're right. The bios can be reordered if they don't overlap with hint.

I would keep things simpler.  Bios can be reordered, full stop.  If an
erase and a write overlap, the caller (filesystem?) has to add a
barrier.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Arnd Bergmann
On Tuesday 30 October 2007, Jörn Engel wrote:
 On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
  On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
  
   Not sure. Why shouldn't you be able to reorder the hints provided that
   they don't overlap with read/write bios for the same block?
  
  You're right. The bios can be reordered if they don't overlap with hint.
 
 I would keep things simpler.  Bios can be reordered, full stop.  If an
 erase and a write overlap, the caller (filesystem?) has to add a
 barrier.

I thought bios were already ordered if they affect the same blocks.
Either way, I agree that an erase should not be treated specially on
the bio layer; its ordering should be handled the same way we do it
for writes.

Arnd 


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Kyungmin Park
On 10/31/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
 On Tuesday 30 October 2007, Jörn Engel wrote:
  On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote:
   On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote:
   
Not sure. Why shouldn't you be able to reorder the hints provided that
they don't overlap with read/write bios for the same block?
  
   You're right. The bios can be reordered if they don't overlap with hint.
 
  I would keep things simpler. Bios can be reordered, full stop. If an
  erase and a write overlap, the caller (filesystem?) has to add a
  barrier.

 I thought bios were already ordered if they affect the same blocks.
 Either way, I agree that an erase should not be treated special on
 the bio layer, its ordering should be handled the same way we do it
 for writes.


To support the new ATA command (Trim, or Data Set Management), the
suggested hint is not enough.  We have to send the bio with data (at
least one sector) since the new ATA command carries the data set
information as a payload.

We also have to strictly follow the ordering, using a barrier or other
methods at the filesystem level.  For example, consider the delete
operation in ext3:
1. delete some file
2. ext3_delete_inode() is called
3. ... - ext3_free_blocks_sb() releases the free blocks
4. if the hints are sent here, they break the ext3 power-off recovery
scheme, since the blocks have already been trimmed when recovery
replays the journal after reboot
5. after the transaction, all dirty pages are flushed; after this, we
can trim the free blocks safely

Another approach is modifying the block framework.
At  I/O scheduler, it don't merge the hint bio (in my terminology, bio
control info) with general bio. In this case we also consider the
reordering problem.
I'm not sure it is possible at this time.

Thank you,
Kyungmin Park


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread David Chinner
On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
 On Tuesday October 30, [EMAIL PROTECTED] wrote:
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem
  or data integrity.
 
 If the allocation unit of the storage device (e.g. a few MB) does not
 match the allocation unit of the filesystem (e.g. a few KB) then for
 this to be useful either the storage device must start recording tiny
 allocations, or the filesystem should re-release areas as they grow.
 i.e. when releasing a range of a device, look in the filesystem's usage
 records for the largest surrounding free space, and release all of that.

I figured that the easiest way around this is reporting free space
extents, not the amount actually freed. e.g.

4k in file A @ block 10
4k in file B @ block 11
4k free space @ block 12
4k in file C @ block 13
1008k in free space at block 14.

If we free file A, we report that we've released an extent of 4k @ block 10.
If we then free file B, we report we've released an extent of 12k @ block 10.
If we then free file C, we report a release of 1024k @ block 10.

Then the underlying device knows what the aggregated free space regions
are and can easily release large regions without needing to track tiny
allocations and frees done by the filesystem.

 I guess that is equally domain specific, but the difference is that if
 you try to read from the DONTCOW part of the snapshot, you get "bad"
 old data, whereas if you try to access the subordinate device of a
 snapshot, you get an IO error - which is probably safer.

If you read from a DONTCOW region you should get zeros back - it's
a hole in the snapshot.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Greg Banks
On Tue, Oct 30, 2007 at 06:35:08PM +0900, Dongjun Shin wrote:
 On 10/30/07, Greg Banks [EMAIL PROTECTED] wrote:
 
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem
  or data integrity.
 
 
 I'd like to second the proposal, but it would be more useful to bring the hint
 down to the physical devices.
 
  There is an ongoing discussion about adding a 'Trim' ATA command for notifying
  the drive about the deleted blocks.
 
 http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

What an interesting document.  Am I reading the change markup correctly,
did it get *simpler* in the last revision?  Wow.

I agree that BIO_HINT_RELEASE would be a good match for the proposed
Trim command.  But I don't think we'll ever be issuing Trims with
more than a single LBA Range Entry; that feature seems unhelpful.

The Trim proposal doesn't specify what happens when a sector which
is already deallocated is deallocated again, presumably this is
supposed to be harmless?

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-30 Thread Greg Banks
On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote:
 On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
  On Tuesday October 30, [EMAIL PROTECTED] wrote:
   BIO_HINT_RELEASE
   The bio's block extent is no longer in use by the filesystem
   and will not be read in the future.  Any storage used to back
   the extent may be released without any threat to filesystem
   or data integrity.
  
  If the allocation unit of the storage device (e.g. a few MB) does not
  match the allocation unit of the filesystem (e.g. a few KB) then for
  this to be useful either the storage device must start recording tiny
  allocations, or the filesystem should re-release areas as they grow.
  i.e. when releasing a range of a device, look in the filesystem's usage
  records for the largest surrounding free space, and release all of that.
 
 I figured that the easiest way around this is reporting free space
 extents, not the amount actually freed. e.g.
 
   4k in file A @ block 10
   4k in file B @ block 11
   4k free space @ block 12
   4k in file C @ block 13
   1008k in free space at block 14.
 
 If we free file A, we report that we've released an extent of 4k @ block 10.
 If we then free file B, we report we've released an extent of 12k @ block 10.
 If we then free file C, we report a release of 1024k @ block 10.
 
 Then the underlying device knows what the aggregated free space regions
 are and can easily release large regions without needing to track tiny
 allocations and frees done by the filesystem.

If you could do that in the filesystem, it would certainly solve the problem.
In which case I'll explicitly allow for the hint's extent to overlap
extents previously hinted, and define the semantics
for overlaps.  I think I'll rename the hint to BIO_HINT_RELEASED,
I think that will make the semantics a little clearer.

Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
Apparently, I'm Bedevere.  Which MPHG character are you?
I don't speak for SGI.


Proposal to improve filesystem/block snapshot interaction

2007-10-29 Thread Greg Banks
G'day,

A number of people have already seen this; I'm posting for wider
comment and to move some interesting discussion to a public list.

I'll apologise in advance for the talk about SGI technologies (including
proprietary ones), but all the problems mentioned apply to in-tree
technologies too.



This proposal seeks to solve three problems in our NAS server product
due to the interaction of the filesystem (XFS) and the block-based
snapshot feature (XVM snapshot).  It's based on discussions held with
various people over the last few weeks, including Roger Strassburg,
Christoph Hellwig, David Chinner, and Donald Douwsma.

a)  The first problem is the server's behaviour when a filesystem
which is subject to snapshot is written to, and the snapshot
repository runs out of room.

The failure mode can be quite severe.  XFS issues a metadata write
to the block device, triggering a Copy-On-Write operation in the
XVM snapshot element, which because of the full repository fails
with EIO.  When XFS sees the failure it shuts down the filesystem.
All subsequent attempts to perform IO to the filesystem block
indefinitely.  In particular any NFS server thread will block
and never reply to the NFS client.  The NFS client will retry,
causing another NFS server thread to block, and repeat until every
NFS server thread is blocked.  At this point all NFS service for
all filesystems ceases.

See PV 958220 and PV 958140 for a description of this problem and
some of the approaches which have been discussed for resolving it.


b)  The second problem is that certain common combinations of
filesystem operations can cause large wastes of space in the XVM
snapshot repository.

Examples include writing the same file twice with dd, or writing
a new file and deleting it.  The cause is the inability of the
XVM snapshot code to be able to free regions in the snapshot
repository that are no longer in use by the filesystem; this
information is simply not available within the block layer.

Note that problem b) also contributes to problem a) by increasing
repository usage and thus making it easier to encounter an
out-of-space condition on the repository.

c)  The third problem is an unfortunate interaction between an XFS
internal log and block snapshots.

The log is a fixed region of the block device which is written as
a side effect of a great many different filesystem operations.
The information written there has no value and is not even
read until and unless log recovery needs to be performed after
the server has crashed.  This means the log does not need to be
preserved by the block feature snapshot (because at the point in
time when the snapshot is taken, log recovery must have already
happened).  In fact the correct procedure when mounting a read-only
snapshot is to use the norecovery option to prevent any attempt
to read the log (although the NAS server software actually doesn't
do this).

However, because the block device layer doesn't have enough
information to know any better, the first pass of writes to the log
are subjected to Copy-On-Write.  This has two undesirable effects.
Firstly, it increases the amount of snapshot repository space
used by each snapshot, thus contributing to problem a).  Secondly,
it puts a significant performance penalty on filesystem metadata
operations for some time after each snapshot is taken; given
that the NAS server can be configured to take regular frequent
snapshots this may mean all of the time.

An obvious solution is to use an external XFS log, but this quite
inconvenient for the NAS server software to arrange.  For one
thing, we would need to construct a separate external log device
for the main filesystem and one for each mounted snapshot.

Note that these problems are not specific to XVM but will be
encountered by any Linux block-COWing snapshot implementation.
For example the DM snapshot implementation is documented to suffer from
problem a).  From the linux/Documentation/device-mapper/snapshot.txt:

 COW device will often be smaller than the origin and if it
 fills up the snapshot will become useless and be disabled,
 returning errors.  So it is important to monitor the amount of
 free space and expand the COW device before it fills up.

During discussions, it became clear that we could solve all three
of these problems by improving the block device interface to allow a
filesystem to provide the block device with dynamic block usage hints.

For example, when unlinking a file the filesystem could tell the
block device a hint of the form "I'm about to stop using these
blocks".  Most block devices would silently ignore these hints, but
a snapshot COW implementation (the copy-on-write XVM element or
the snapshot-origin dm target) could use them to help avoid these
problems.  For example, the response 

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-29 Thread Greg Banks
On Tue, Oct 30, 2007 at 12:51:47AM +0100, Arnd Bergmann wrote:
 On Monday 29 October 2007, Christoph Hellwig wrote:
  - Forwarded message from Greg Banks [EMAIL PROTECTED] -
  
  Date: Thu, 27 Sep 2007 16:31:13 +1000
  From: Greg Banks [EMAIL PROTECTED]
  Subject: Proposal to improve filesystem/block snapshot interaction
  To: David Chinner [EMAIL PROTECTED], Donald Douwsma [EMAIL PROTECTED],
  Christoph Hellwig [EMAIL PROTECTED], Roger Strassburg [EMAIL 
  PROTECTED]
  Cc: Mark Goodwin [EMAIL PROTECTED],
  Brett Jon Grandbois [EMAIL PROTECTED]
  
  
  
  This proposal seeks to solve three problems in our NAS server product
  due to the interaction of the filesystem (XFS) and the block-based
  snapshot feature (XVM snapshot).  It's based on discussions held with
  various people over the last few weeks, including Roger Strassburg,
  Christoph Hellwig, David Chinner, and Donald Douwsma.
 
 Hi Greg,
 
 Christoph forwarded me your mail, because I mentioned to him that
 I'm trying to come up with a similar change, and it might make sense
 to combine our efforts.

Excellent, thanks Christoph ;-)


 
  For example, when unlinking a file the filesystem could tell the
  block device a hint of the form "I'm about to stop using these
  blocks".  Most block devices would silently ignore these hints, but
  a snapshot COW implementation (the copy-on-write XVM element or
  the snapshot-origin dm target) could use them to help avoid these
  problems.  For example, the response to the "I'm about to stop using
  these blocks" hint could be to free the space used in the snapshot
  repository for unnecessary copies of those blocks.
 
 The case I'm interested in is the more specific case of 'erase',
 which is more of a performance optimization than a space optimization.
 When you have a flash medium, it's useful to erase a block as soon
 as it's becoming unused, so that a subsequent write will be faster.
 Moreover, on an MTD medium, you may not even be able to write to
 a block unless it has been erased before.

Spending the device's time to erase early, when the CPU isn't waiting
for it, instead of later, when it adds to effective write latency.
Makes sense.

  Of course snapshot cow elements may be part of more generic element
  trees.  In general there may be more than one consumer of block usage
  hints in a given filesystem's element tree, and their locations in that
  tree are not predictable.  This means the block extents mentioned in
  the usage hints need to be subject to the block mapping algorithms
  provided by the element tree.  As those algorithms are currently
  implemented using bio mapping and splitting, the easiest and simplest
  way to reuse those algorithms is to add new bio flags.
  
  First we need a mechanism to indicate that a bio is a hint rather
  than a real IO.  Perhaps the easiest way is to add a new flag to
  the bi_rw field:
  
  #define BIO_RW_HINT 5   /* bio is a hint not a real io; no pages */
 
 My first thought was to do this on the request layer, not already
 on bio, but they can easily be combined, I guess.

My first thoughts were along similar lines, but I wasn't expecting
these hint bios to survive deep enough in the stack to need queuing
and thus visibility in struct request; I was expecting their lifetime
to be some passage and splitting through a volume manager and then
conversion to synchronous metadata operations.  Plus, hijacking bios
means not having to modify every single DM target to duplicate its
block mapping algorithm.

Basically, I was thinking of loopback-like block mapping and not
considering flash.  I suppose for flash where there's a real erase
operation, you'd want to be queuing and that means a new request type.

 
  We'll also need a field to tell us which kind of hint the bio
  represents.  Perhaps a new field could be added, or perhaps the top
  16 bits of bi_rw (currently used to encode the bio's priority, which
  has no meaning for hint bios) could be reused.  The latter approach
  may allow hints to be used without modifying the bio structure or
  any code that uses it other than the filesystem and the snapshot
  implementation.  Such a property would have obvious advantages for
  our NAS server software, where XFS and XVM modules are provided but
  the other users of struct bio are stock SLES code.
  
  
  Next we'll need three bio hint types with the following semantics.
  
  BIO_HINT_ALLOCATE
  The bio's block extent will soon be written by the filesystem
  and any COW that may be necessary to achieve that should begin
  now.  If the COW is going to fail, the bio should fail.  Note
  that this provides a way for the filesystem to manage when and
  how failures to COW are reported.
  
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem

Re: Proposal to improve filesystem/block snapshot interaction

2007-10-29 Thread Neil Brown
On Tuesday October 30, [EMAIL PROTECTED] wrote:
 
 Of course snapshot cow elements may be part of more generic element
 trees.  In general there may be more than one consumer of block usage
 hints in a given filesystem's element tree, and their locations in that
 tree are not predictable.  This means the block extents mentioned in
 the usage hints need to be subject to the block mapping algorithms
 provided by the element tree.  As those algorithms are currently
 implemented using bio mapping and splitting, the easiest and simplest
 way to reuse those algorithms is to add new bio flags.

So are you imagining that you might have distinct snapshotable
elements, and that some of these might be combined by e.g. RAID0 into
a larger device, then a filesystem is created on that?

I ask because my first thought was that the sort of communication you
want seems like it would be just between a filesystem and the block
device that it talks directly to, and as you are particularly
interested in XFS and XVM, you could come up with whatever protocol
you want for those two to talk to each other, prototype it, iron out
all the issues, then say "We've got this really cool thing to make
snapshots much faster - wanna share?" and thus be presenting from a
position of more strength (the old 'code talks' mantra).

 
 First we need a mechanism to indicate that a bio is a hint rather
 than a real IO.  Perhaps the easiest way is to add a new flag to
 the bi_rw field:
 
 #define BIO_RW_HINT   5   /* bio is a hint not a real io; no pages */

Reminds me of the new approach to issue_flush_fn which is just to have
a zero-length barrier bio (is that implemented yet? I lost track).
But different as a zero length barrier has zero length, and your hints
have a very meaningful length.

 
 Next we'll need three bio hints types with the following semantics.
 
 BIO_HINT_ALLOCATE
 The bio's block extent will soon be written by the filesystem
 and any COW that may be necessary to achieve that should begin
 now.  If the COW is going to fail, the bio should fail.  Note
 that this provides a way for the filesystem to manage when and
 how failures to COW are reported.

Would it make sense to allow the bi_sector to be changed by the device
and to have that change honoured?
i.e. "Please allocate 128 blocks, maybe 'here'"
 "OK, 128 blocks allocated, but they are actually over 'there'."

If the device is tracking what space is and isn't used, it might make
life easier for it to do the allocation.  Maybe even have a variant
"Allocate 128 blocks, I don't care where".

Is this bio supposed to block until the copy has happened?  Or only
until the space of the copy has been allocated and possibly committed?
Or must it return without doing any IO at all?

 
 BIO_HINT_RELEASE
 The bio's block extent is no longer in use by the filesystem
 and will not be read in the future.  Any storage used to back
 the extent may be released without any threat to filesystem
 or data integrity.

If the allocation unit of the storage device (e.g. a few MB) does not
match the allocation unit of the filesystem (e.g. a few KB) then for
this to be useful either the storage device must start recording tiny
allocations, or the filesystem should re-release areas as they grow.
i.e. when releasing a range of a device, look in the filesystem's usage
records for the largest surrounding free space, and release all of that.

Would this be a burden on the filesystems?
Is my imagined disparity between block sizes valid?
Would it be just as easy for the storage device to track small
allocation/deallocations?

 
 BIO_HINT_DONTCOW
 (the Bart Simpson BIO).  The bio's block extent is not needed
 in mounted snapshots and does not need to be subjected to COW.

This seems like a much more domain-specific function than the other
two which themselves could be more generally useful (I'm imagining
using hints from them to e.g. accelerate RAID reconstruction).

Surely the correct thing to do with the log is to put it on a separate
device which itself isn't snapshotted.

If you have a storage manager that is smart enough to handle these
sorts of things, maybe the functionality you want is "Give me a
subordinate device which is not snapshotted, size X", then journal to
that virtual device.
I guess that is equally domain specific, but the difference is that if
you try to read from the DONTCOW part of the snapshot, you get bad
old data, whereas if you try to access the subordinate device of a
snapshot, you get an IO error - which is probably safer.

 
 Comments?

On the whole it seems reasonably sane, providing you are from the
school which believes that volume managers and filesystems should be
kept separate :-)

NeilBrown

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Proposal to improve filesystem/block snapshot interaction

2007-10-29 Thread Greg Banks
On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:
 On Tuesday October 30, [EMAIL PROTECTED] wrote:
  
  Of course snapshot cow elements may be part of more generic element
  trees.  In general there may be more than one consumer of block usage
  hints in a given filesystem's element tree, and their locations in that
  tree are not predictable.  This means the block extents mentioned in
  the usage hints need to be subject to the block mapping algorithms
  provided by the element tree.  As those algorithms are currently
  implemented using bio mapping and splitting, the easiest and simplest
  way to reuse those algorithms is to add new bio flags.
 
 So are you imagining that you might have distinct snapshotable
 elements, and that some of these might be combined by e.g. RAID0 into
 a larger device, then a filesystem is created on that?

I was thinking more a concatenation than a stripe, but yes you could
do such a thing, e.g. to parallelise the COW procedure.  We don't do
any such thing in our product; the COW element is always inserted at
the top of the logical element tree.

 I ask because my first thought was that the sort of communication you
 want seems like it would be just between a filesystem and the block
 device that it talks directly to, and as you are particularly
 interested in XFS and XVM, you could come up with whatever protocol
 you want for those two to talk to each other, prototype it, iron out
 all the issues, then say "We've got this really cool thing to make
 snapshots much faster - wanna share?" and thus be presenting from a
 position of more strength (the old 'code talks' mantra).

Indeed, code talks ;-)  I was hoping someone else would do that
talking for me, though.

  First we need a mechanism to indicate that a bio is a hint rather
  than a real IO.  Perhaps the easiest way is to add a new flag to
  the bi_rw field:
  
  #define BIO_RW_HINT 5   /* bio is a hint not a real io; no pages */
 
 Reminds me of the new approach to issue_flush_fn which is just to have
 a zero-length barrier bio (is that implemented yet? I lost track).
 But different as a zero length barrier has zero length, and your hints
 have a very meaningful length.

Yes.

  
  Next we'll need three bio hints types with the following semantics.
  
  BIO_HINT_ALLOCATE
  The bio's block extent will soon be written by the filesystem
  and any COW that may be necessary to achieve that should begin
  now.  If the COW is going to fail, the bio should fail.  Note
  that this provides a way for the filesystem to manage when and
  how failures to COW are reported.
 
 Would it make sense to allow the bi_sector to be changed by the device
 and to have that change honoured?
 i.e. "Please allocate 128 blocks, maybe 'here'"
  "OK, 128 blocks allocated, but they are actually over 'there'."

That wasn't the expectation at all.  Perhaps "allocate" is a poor
name.  "I have just allocated, deal with it" might be more
appropriate.  Perhaps BIO_HINT_WILLUSE or something.

 If the device is tracking what space is and isn't used, it might make
 life easier for it to do the allocation.  Maybe even have a variant
 "Allocate 128 blocks, I don't care where".

That kind of thing might perhaps be useful for flash, but I think
current filesystems would have conniptions.

 Is this bio supposed to block until the copy has happened?  Or only
 until the space of the copy has been allocated and possibly committed?

The latter.  The writes following will block until the COW has
completed, or might be performed sufficiently later that the COW
has meanwhile completed (I think this implies an extra state in the
snapshot metadata to avoid double-COWing).  The point of the hint is
to allow the snapshot code to test for running out of repo space and
report that failure at a time when the filesystem is able to handle
it gracefully.

 Or must it return without doing any IO at all?

I would expect it would be a useful optimisation to start the IO but
not wait for its completion, but that the first implementation would
just do a space check.

  
  BIO_HINT_RELEASE
  The bio's block extent is no longer in use by the filesystem
  and will not be read in the future.  Any storage used to back
  the extent may be released without any threat to filesystem
  or data integrity.
 
 If the allocation unit of the storage device (e.g. a few MB) does not
 match the allocation unit of the filesystem (e.g. a few KB) then for
 this to be useful either the storage device must start recording tiny
 allocations, or the filesystem should re-release areas as they grow.
 i.e. when releasing a range of a device, look in the filesystem's usage
 records for the largest surrounding free space, and release all of that.

Good point.  I was planning on ignoring this problem :-/ Given that
current snapshot implementations waste *all* the blocks in deleted
files, it would be an improvement to scavenge the blocks in large
extents.