Re: Proposal to improve filesystem/block snapshot interaction
Greg, sorry I didn't respond sooner - other things have gotten in the way of reading this thread. See comments below. Roger

Greg Banks wrote: On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: On Tuesday October 30, [EMAIL PROTECTED] wrote:

Of course snapshot cow elements may be part of more generic element trees. In general there may be more than one consumer of block usage hints in a given filesystem's element tree, and their locations in that tree are not predictable. This means the block extents mentioned in the usage hints need to be subject to the block mapping algorithms provided by the element tree. As those algorithms are currently implemented using bio mapping and splitting, the easiest and simplest way to reuse those algorithms is to add new bio flags.

So are you imagining that you might have distinct snapshotable elements, and that some of these might be combined by e.g. RAID0 into a larger device, then a filesystem is created on that?

I was thinking more of a concatenation than a stripe, but yes you could do such a thing, e.g. to parallelise the COW procedure. We don't do any such thing in our product; the COW element is always inserted at the top of the logical element tree.

I ask because my first thought was that the sort of communication you want seems like it would be just between a filesystem and the block device that it talks directly to, and as you are particularly interested in XFS and XVM, you could come up with whatever protocol you want for those two to talk to each other, prototype it, iron out all the issues, then say "We've got this really cool thing to make snapshots much faster - wanna share?" and thus be presenting from a position of more strength (the old 'code talks' mantra).

Indeed, code talks ;-) I was hoping someone else would do that talking for me, though.

First we need a mechanism to indicate that a bio is a hint rather than a real IO.
Perhaps the easiest way is to add a new flag to the bi_rw field:

#define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */

Reminds me of the new approach to issue_flush_fn, which is just to have a zero-length barrier bio (is that implemented yet? I lost track). But different, as a zero-length barrier has zero length, and your hints have a very meaningful length.

Yes.

Next we'll need three bio hint types with the following semantics.

BIO_HINT_ALLOCATE The bio's block extent will soon be written by the filesystem and any COW that may be necessary to achieve that should begin now. If the COW is going to fail, the bio should fail. Note that this provides a way for the filesystem to manage when and how failures to COW are reported.

Would it make sense to allow the bi_sector to be changed by the device and to have that change honoured? i.e. "Please allocate 128 blocks, maybe 'here'" / "OK, 128 blocks allocated, but they are actually over 'there'."

That wasn't the expectation at all. Perhaps "allocate" is a poor name. "I have just allocated, deal with it" might be more appropriate. Perhaps BIO_HINT_WILLUSE or something.

If the device is tracking what space is and isn't used, it might make life easier for it to do the allocation. Maybe even have a variant "Allocate 128 blocks, I don't care where."

That kind of thing might perhaps be useful for flash, but I think current filesystems would have conniptions.

Is this bio supposed to block until the copy has happened? Or only until the space of the copy has been allocated and possibly committed?

The latter. The writes following will block until the COW has completed, or might be performed sufficiently later that the COW has meanwhile completed (I think this implies an extra state in the snapshot metadata to avoid double-COWing). The point of the hint is to allow the snapshot code to test for running out of repo space and report that failure at a time when the filesystem is able to handle it gracefully.
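The flag and hint types under discussion can be sketched in plain C. This is only an illustration of the proposal, not kernel code: BIO_RW_HINT and the BIO_HINT_* names come from the proposal text, while struct bio_lite and bio_is_hint() are made-up stand-ins for struct bio and whatever predicate the block layer would actually use.

```c
/* Illustrative sketch of the proposed hint interface.  BIO_RW_HINT and
 * the BIO_HINT_* names come from the proposal; struct bio_lite and
 * bio_is_hint() are hypothetical stand-ins, not real kernel code. */

#define BIO_RW_HINT 5   /* bio is a hint not a real io; no pages */

enum bio_hint_type {
    BIO_HINT_ALLOCATE,  /* extent will soon be written; do any COW now */
    BIO_HINT_RELEASE,   /* extent unused; backing storage may be freed */
    BIO_HINT_DONTCOW,   /* extent need not be preserved in snapshots   */
};

struct bio_lite {               /* simplified stand-in for struct bio */
    unsigned long bi_rw;        /* rw flags, as in the real bio       */
    enum bio_hint_type bi_hint; /* one possible place for the type    */
};

/* A hint bio carries no pages; consumers would check this flag before
 * treating the bio as real IO. */
static int bio_is_hint(const struct bio_lite *bio)
{
    return (bio->bi_rw >> BIO_RW_HINT) & 1;
}
```

A snapshot element receiving such a bio would dispatch on bi_hint instead of queuing any data transfer.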
Or must it return without doing any IO at all?

I would expect it would be a useful optimisation to start the IO but not wait for its completion, but the first implementation would just do a space check.

BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity.

If the allocation unit of the storage device (e.g. a few MB) does not match the allocation unit of the filesystem (e.g. a few KB), then for this to be useful either the storage device must start recording tiny allocations, or the filesystem should re-release areas as they grow; i.e. when releasing a range of a device, look in the filesystem's usage records for the largest surrounding free space, and release all of that.

Good point. I was planning on ignoring this problem :-/ Given that
Re: Proposal to improve filesystem/block snapshot interaction
On Wed, Oct 31, 2007 at 03:01:58PM +1100, Greg Banks wrote: On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote: On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: On Tuesday October 30, [EMAIL PROTECTED] wrote:

BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity.

If the allocation unit of the storage device (e.g. a few MB) does not match the allocation unit of the filesystem (e.g. a few KB), then for this to be useful either the storage device must start recording tiny allocations, or the filesystem should re-release areas as they grow; i.e. when releasing a range of a device, look in the filesystem's usage records for the largest surrounding free space, and release all of that.

I figured that the easiest way around this is reporting free space extents, not the amount actually freed. e.g.

4k in file A @ block 10
4k in file B @ block 11
4k free space @ block 12
4k in file C @ block 13
1008k free space @ block 14

If we free file A, we report that we've released an extent of 4k @ block 10. If we then free file B, we report we've released an extent of 12k @ block 10. If we then free file C, we report a release of 1024k @ block 10. Then the underlying device knows what the aggregated free space regions are and can easily release large regions without needing to track tiny allocations and frees done by the filesystem.

If you could do that in the filesystem, it would certainly solve the problem. In which case I'll explicitly allow for the hint's extent to overlap extents previously hinted, and define the semantics for overlaps. I think I'll rename the hint to BIO_HINT_RELEASED; I think that will make the semantics a little clearer.

I think that can be done - I wouldn't have mentioned it if I didn't think it was possible to implement ;).
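The aggregation Dave describes can be mocked up in userspace C. The bitmap and helper below are purely illustrative (in XFS the lookup would be a btree walk, not a bitmap scan); they just show how freeing a small extent grows into the surrounding free-space report:

```c
/* Toy model of the aggregated-release idea: when the filesystem frees
 * blocks, it reports the largest contiguous free extent surrounding
 * them, so the device never has to track tiny frees.  The bitmap is a
 * stand-in for the filesystem's real allocation records. */

#define NBLOCKS 64

struct fs_map {
    unsigned char used[NBLOCKS];    /* 1 = allocated, 0 = free */
};

struct extent { unsigned start, len; };

/* Free [start, start+len) and return the surrounding free extent that
 * would be reported in a release hint. */
static struct extent fs_free_and_report(struct fs_map *m,
                                        unsigned start, unsigned len)
{
    unsigned lo = start, hi = start + len;

    for (unsigned i = start; i < start + len; i++)
        m->used[i] = 0;

    while (lo > 0 && !m->used[lo - 1])      /* grow left  */
        lo--;
    while (hi < NBLOCKS && !m->used[hi])    /* grow right */
        hi++;

    return (struct extent){ lo, hi - lo };
}
```

Running the thread's example through this (files A, B, C at blocks 10, 11, 13, block 12 already free) reproduces the growing reports: freeing A yields an extent at block 10 of length 1, then freeing B yields length 3, and freeing C yields everything out to the end of the free region.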
It will require a further btree lookup once the free transaction hits the disk, but I think that's pretty easy to do. I'd probably hook xfs_alloc_clear_busy() to do this. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Proposal to improve filesystem/block snapshot interaction
On 10/30/07, Greg Banks [EMAIL PROTECTED] wrote: BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity.

I'd like to second the proposal, but it would be more useful to bring the hint down to the physical devices. There is an ongoing discussion about adding a 'Trim' ATA command for notifying the drive about deleted blocks. http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf This is especially useful for storage devices like Solid State Drives (SSDs). Dongjun
Re: Proposal to improve filesystem/block snapshot interaction
On Tuesday 30 October 2007, Dongjun Shin wrote: There is an ongoing discussion about adding a 'Trim' ATA command for notifying the drive about deleted blocks. http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf This is especially useful for storage devices like Solid State Drives (SSDs).

This makes me curious: why would t13 want to invent a new command when there is already the erase command from CFA? It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE should probably be mapped to CFA_ERASE (0xc0) on drives that support it: http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf Arnd
Re: Proposal to improve filesystem/block snapshot interaction
On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote: This makes me curious: why would t13 want to invent a new command when there is already the erase command from CFA? It's not exactly the same, but close enough that the proposed BIO_HINT_RELEASE should probably be mapped to CFA_ERASE (0xc0) on drives that support it: http://t13.org/Documents/UploadedDocuments/technical/d97116r1.pdf

I'm not sure about the background. However, it's definitely a sign that passing the deleted-block info to flash-based storage is useful. Anyway, BIO_HINT_RELEASE could destroy the content of the blocks after being passed to the device. I think that other bios should not be reordered across that hint (just like a barrier). Dongjun
Re: Proposal to improve filesystem/block snapshot interaction
On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote: Not sure. Why shouldn't you be able to reorder the hints provided that they don't overlap with read/write bios for the same block?

You're right. The bios can be reordered if they don't overlap with the hint.
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, 30 October 2007 18:35:08 +0900, Dongjun Shin wrote: On 10/30/07, Greg Banks [EMAIL PROTECTED] wrote: BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity. I'd like to second the proposal, but it would be more useful to bring the hint down to the physical devices.

Absolutely. Logfs would love to have an erase operation for block devices as well. However, the above doesn't quite match my needs, because the blocks _will_ be read in the future. There are two reasons for reading things back later. The good one is to determine whether the segment was erased or not. Reads should return either valid data or one of (all-0xff, all-0x00, -ESOMETHING). Having a dedicated error code would be best. And getting the device erasesize would be useful as well, for obvious reasons.

Jörn -- When you close your hand, you own nothing. When you open it up, you own the whole world. -- Li Mu Bai in Tiger & Dragon
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote: On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote: Not sure. Why shouldn't you be able to reorder the hints provided that they don't overlap with read/write bios for the same block? You're right. The bios can be reordered if they don't overlap with the hint.

I would keep things simpler. Bios can be reordered, full stop. If an erase and a write overlap, the caller (filesystem?) has to add a barrier.

Jörn -- My second remark is that our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed. -- Edsger W. Dijkstra
Re: Proposal to improve filesystem/block snapshot interaction
On Tuesday 30 October 2007, Jörn Engel wrote: On Tue, 30 October 2007 23:19:48 +0900, Dongjun Shin wrote: On 10/30/07, Arnd Bergmann [EMAIL PROTECTED] wrote: Not sure. Why shouldn't you be able to reorder the hints provided that they don't overlap with read/write bios for the same block? You're right. The bios can be reordered if they don't overlap with the hint. I would keep things simpler. Bios can be reordered, full stop. If an erase and a write overlap, the caller (filesystem?) has to add a barrier.

I thought bios were already ordered if they affect the same blocks. Either way, I agree that an erase should not be treated specially at the bio layer; its ordering should be handled the same way we do it for writes. Arnd
Re: Proposal to improve filesystem/block snapshot interaction
On 10/31/07, Arnd Bergmann [EMAIL PROTECTED] wrote: On Tuesday 30 October 2007, Jörn Engel wrote: I would keep things simpler. Bios can be reordered, full stop. If an erase and a write overlap, the caller (filesystem?) has to add a barrier. I thought bios were already ordered if they affect the same blocks. Either way, I agree that an erase should not be treated specially at the bio layer; its ordering should be handled the same way we do it for writes.

To support the new ATA command (trim, or dataset), the suggested hint is not enough. We have to send the bio with data (at least one sector or more) since the new ATA command requests the dataset information. And we also have to strictly follow the ordering, using a barrier or other methods, at the filesystem level. For example, the delete operation in ext3:

1. delete some file
2. ext3_delete_inode() is called
3. ... - ext3_free_blocks_sb() releases the free blocks
4. If it sends the hints here, it breaks the ext3 power-off recovery scheme, since the data is trimmed based on the given information after reboot
5. after the transaction, all dirty pages are flushed; after this work, we can trim the free blocks safely

Another approach is modifying the block framework: the I/O scheduler doesn't merge the hint bio (in my terminology, bio control info) with general bios. In this case we must also consider the reordering problem. I'm not sure it is possible at this time. Thank you, Kyungmin Park
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: On Tuesday October 30, [EMAIL PROTECTED] wrote:

BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity.

If the allocation unit of the storage device (e.g. a few MB) does not match the allocation unit of the filesystem (e.g. a few KB), then for this to be useful either the storage device must start recording tiny allocations, or the filesystem should re-release areas as they grow; i.e. when releasing a range of a device, look in the filesystem's usage records for the largest surrounding free space, and release all of that.

I figured that the easiest way around this is reporting free space extents, not the amount actually freed. e.g.

4k in file A @ block 10
4k in file B @ block 11
4k free space @ block 12
4k in file C @ block 13
1008k free space @ block 14

If we free file A, we report that we've released an extent of 4k @ block 10. If we then free file B, we report we've released an extent of 12k @ block 10. If we then free file C, we report a release of 1024k @ block 10. Then the underlying device knows what the aggregated free space regions are and can easily release large regions without needing to track tiny allocations and frees done by the filesystem.

I guess that is equally domain specific, but the difference is that if you try to read from the DONTCOW part of the snapshot, you get bad old data, whereas if you try to access the subordinate device of a snapshot, you get an IO error - which is probably safer.

If you read from a DONTCOW region you should get zeros back - it's a hole in the snapshot. Cheers, Dave.
-- Dave Chinner Principal Engineer SGI Australian Software Group
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, Oct 30, 2007 at 06:35:08PM +0900, Dongjun Shin wrote: On 10/30/07, Greg Banks [EMAIL PROTECTED] wrote: BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity. I'd like to second the proposal, but it would be more useful to bring the hint down to the physical devices. There is an ongoing discussion about adding a 'Trim' ATA command for notifying the drive about deleted blocks. http://www.t13.org/Documents/UploadedDocuments/docs2007/e07154r3-Data_Set_Management_Proposal_for_ATA-ACS2.pdf

What an interesting document. Am I reading the change markup correctly - did it get *simpler* in the last revision? Wow.

I agree that BIO_HINT_RELEASE would be a good match for the proposed Trim command. But I don't think we'll ever be issuing Trims with more than a single LBA Range Entry; that feature seems unhelpful.

The Trim proposal doesn't specify what happens when a sector which is already deallocated is deallocated again; presumably this is supposed to be harmless?

Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI.
Re: Proposal to improve filesystem/block snapshot interaction
On Wed, Oct 31, 2007 at 10:56:52AM +1100, David Chinner wrote: On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote: On Tuesday October 30, [EMAIL PROTECTED] wrote:

BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem or data integrity.

If the allocation unit of the storage device (e.g. a few MB) does not match the allocation unit of the filesystem (e.g. a few KB), then for this to be useful either the storage device must start recording tiny allocations, or the filesystem should re-release areas as they grow; i.e. when releasing a range of a device, look in the filesystem's usage records for the largest surrounding free space, and release all of that.

I figured that the easiest way around this is reporting free space extents, not the amount actually freed. e.g.

4k in file A @ block 10
4k in file B @ block 11
4k free space @ block 12
4k in file C @ block 13
1008k free space @ block 14

If we free file A, we report that we've released an extent of 4k @ block 10. If we then free file B, we report we've released an extent of 12k @ block 10. If we then free file C, we report a release of 1024k @ block 10. Then the underlying device knows what the aggregated free space regions are and can easily release large regions without needing to track tiny allocations and frees done by the filesystem.

If you could do that in the filesystem, it would certainly solve the problem. In which case I'll explicitly allow for the hint's extent to overlap extents previously hinted, and define the semantics for overlaps. I think I'll rename the hint to BIO_HINT_RELEASED; I think that will make the semantics a little clearer.

Greg. -- Greg Banks, R&D Software Engineer, SGI Australian Software Group. Apparently, I'm Bedevere. Which MPHG character are you? I don't speak for SGI.
Proposal to improve filesystem/block snapshot interaction
G'day, A number of people have already seen this; I'm posting for wider comment and to move some interesting discussion to a public list. I'll apologise in advance for the talk about SGI technologies (including proprietary ones), but all the problems mentioned apply to in-tree technologies too.

This proposal seeks to solve three problems in our NAS server product due to the interaction of the filesystem (XFS) and the block-based snapshot feature (XVM snapshot). It's based on discussions held with various people over the last few weeks, including Roger Strassburg, Christoph Hellwig, David Chinner, and Donald Douwsma.

a) The first problem is the server's behaviour when a filesystem which is subject to snapshot is written to, and the snapshot repository runs out of room. The failure mode can be quite severe. XFS issues a metadata write to the block device, triggering a Copy-On-Write operation in the XVM snapshot element, which because of the full repository fails with EIO. When XFS sees the failure it shuts down the filesystem. All subsequent attempts to perform IO to the filesystem block indefinitely. In particular any NFS server thread will block and never reply to the NFS client. The NFS client will retry, causing another NFS server thread to block, and so on until every NFS server thread is blocked. At this point all NFS service for all filesystems ceases. See PV 958220 and PV 958140 for a description of this problem and some of the approaches which have been discussed for resolving it.

b) The second problem is that certain common combinations of filesystem operations can cause large wastes of space in the XVM snapshot repository. Examples include writing the same file twice with dd, or writing a new file and deleting it. The cause is the inability of the XVM snapshot code to free regions in the snapshot repository that are no longer in use by the filesystem; this information is simply not available within the block layer.
Note that problem b) also contributes to problem a) by increasing repository usage and thus making it easier to encounter an out-of-space condition on the repository.

c) The third problem is an unfortunate interaction between an XFS internal log and block snapshots. The log is a fixed region of the block device which is written as a side effect of a great many different filesystem operations. The information written there has no value and is not even read until and unless log recovery needs to be performed after the server has crashed. This means the log does not need to be preserved by the block snapshot feature (because at the point in time when the snapshot is taken, log recovery must have already happened). In fact the correct procedure when mounting a read-only snapshot is to use the norecovery option to prevent any attempt to read the log (although the NAS server software actually doesn't do this).

However, because the block device layer doesn't have enough information to know any better, the first pass of writes to the log is subjected to Copy-On-Write. This has two undesirable effects. Firstly, it increases the amount of snapshot repository space used by each snapshot, thus contributing to problem a). Secondly, it puts a significant performance penalty on filesystem metadata operations for some time after each snapshot is taken; given that the NAS server can be configured to take regular frequent snapshots, this may mean all of the time.

An obvious solution is to use an external XFS log, but this is quite inconvenient for the NAS server software to arrange. For one thing, we would need to construct a separate external log device for the main filesystem and one for each mounted snapshot.

Note that these problems are not specific to XVM but will be encountered by any Linux block-COWing snapshot implementation. For example the DM snapshot implementation is documented to suffer from problem a).
From linux/Documentation/device-mapper/snapshot.txt:

"COW device will often be smaller than the origin and if it fills up the snapshot will become useless and be disabled, returning errors. So it is important to monitor the amount of free space and expand the COW device before it fills up."

During discussions, it became clear that we could solve all three of these problems by improving the block device interface to allow a filesystem to provide the block device with dynamic block usage hints. For example, when unlinking a file the filesystem could tell the block device a hint of the form "I'm about to stop using these blocks". Most block devices would silently ignore these hints, but a snapshot COW implementation (the copy-on-write XVM element or the snapshot-origin dm target) could use them to help avoid these problems. For example, the response
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, Oct 30, 2007 at 12:51:47AM +0100, Arnd Bergmann wrote: On Monday 29 October 2007, Christoph Hellwig wrote:

- Forwarded message from Greg Banks [EMAIL PROTECTED] -
Date: Thu, 27 Sep 2007 16:31:13 +1000
From: Greg Banks [EMAIL PROTECTED]
Subject: Proposal to improve filesystem/block snapshot interaction
To: David Chinner [EMAIL PROTECTED], Donald Douwsma [EMAIL PROTECTED], Christoph Hellwig [EMAIL PROTECTED], Roger Strassburg [EMAIL PROTECTED]
Cc: Mark Goodwin [EMAIL PROTECTED], Brett Jon Grandbois [EMAIL PROTECTED]

This proposal seeks to solve three problems in our NAS server product due to the interaction of the filesystem (XFS) and the block-based snapshot feature (XVM snapshot). It's based on discussions held with various people over the last few weeks, including Roger Strassburg, Christoph Hellwig, David Chinner, and Donald Douwsma.

Hi Greg, Christoph forwarded me your mail, because I mentioned to him that I'm trying to come up with a similar change, and it might make sense to combine our efforts.

Excellent, thanks Christoph ;-)

For example, when unlinking a file the filesystem could tell the block device a hint of the form "I'm about to stop using these blocks". Most block devices would silently ignore these hints, but a snapshot COW implementation (the copy-on-write XVM element or the snapshot-origin dm target) could use them to help avoid these problems. For example, the response to the "I'm about to stop using these blocks" hint could be to free the space used in the snapshot repository for unnecessary copies of those blocks.

The case I'm interested in is the more specific case of 'erase', which is more of a performance optimization than a space optimization. When you have a flash medium, it's useful to erase a block as soon as it becomes unused, so that a subsequent write will be faster. Moreover, on an MTD medium, you may not even be able to write to a block unless it has been erased before.
It spends the device's time erasing early, when the CPU isn't waiting for it, instead of later, when it adds to effective write latency. Makes sense.

Of course snapshot cow elements may be part of more generic element trees. In general there may be more than one consumer of block usage hints in a given filesystem's element tree, and their locations in that tree are not predictable. This means the block extents mentioned in the usage hints need to be subject to the block mapping algorithms provided by the element tree. As those algorithms are currently implemented using bio mapping and splitting, the easiest and simplest way to reuse those algorithms is to add new bio flags. First we need a mechanism to indicate that a bio is a hint rather than a real IO. Perhaps the easiest way is to add a new flag to the bi_rw field:

#define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */

My first thought was to do this on the request layer, not already on bio, but they can easily be combined, I guess.

My first thoughts were along similar lines, but I wasn't expecting these hint bios to survive deep enough in the stack to need queuing and thus visibility in struct request; I was expecting their lifetime to be some passage and splitting through a volume manager and then conversion to synchronous metadata operations. Plus, hijacking bios means not having to modify every single DM target to duplicate its block mapping algorithm. Basically, I was thinking of loopback-like block mapping and not considering flash. I suppose for flash, where there's a real erase operation, you'd want to be queuing, and that means a new request type.

We'll also need a field to tell us which kind of hint the bio represents. Perhaps a new field could be added, or perhaps the top 16 bits of bi_rw (currently used to encode the bio's priority, which has no meaning for hint bios) could be reused.
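Greg's suggestion of reusing the priority bits could be sketched as follows. The shift and mask values here are assumptions for illustration only; the real bi_rw layout would have to match the kernel's actual priority encoding, which this sketch does not attempt:

```c
/* Sketch of packing a hint type into the priority bits of bi_rw, so
 * struct bio needs no new field.  BIO_RW_HINT comes from the proposal;
 * the shift, mask, and helper names are hypothetical. */

#define BIO_RW_HINT     5         /* from the proposal */
#define BIO_HINT_SHIFT  16        /* assumed: reuse the prio bits */
#define BIO_HINT_MASK   0xffffUL

enum { BIO_HINT_ALLOCATE = 1, BIO_HINT_RELEASE, BIO_HINT_DONTCOW };

/* Build a bi_rw value marking the bio as a hint of the given type. */
static unsigned long bio_pack_hint(unsigned type)
{
    return (1UL << BIO_RW_HINT) |
           ((unsigned long)type << BIO_HINT_SHIFT);
}

/* Recover the hint type; zero means "not a hint". */
static unsigned bio_unpack_hint(unsigned long bi_rw)
{
    return (bi_rw >> BIO_HINT_SHIFT) & BIO_HINT_MASK;
}
```

A snapshot element would test the BIO_RW_HINT bit first, then dispatch on the unpacked type, leaving ordinary read/write bios untouched.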
The latter approach may allow hints to be used without modifying the bio structure or any code that uses it other than the filesystem and the snapshot implementation. Such a property would have obvious advantages for our NAS server software, where XFS and XVM modules are provided but the other users of struct bio are stock SLES code.

Next we'll need three bio hint types with the following semantics.

BIO_HINT_ALLOCATE The bio's block extent will soon be written by the filesystem and any COW that may be necessary to achieve that should begin now. If the COW is going to fail, the bio should fail. Note that this provides a way for the filesystem to manage when and how failures to COW are reported.

BIO_HINT_RELEASE The bio's block extent is no longer in use by the filesystem and will not be read in the future. Any storage used to back the extent may be released without any threat to filesystem
Re: Proposal to improve filesystem/block snapshot interaction
On Tuesday October 30, [EMAIL PROTECTED] wrote:

Of course snapshot cow elements may be part of more generic element trees. In general there may be more than one consumer of block usage hints in a given filesystem's element tree, and their locations in that tree are not predictable. This means the block extents mentioned in the usage hints need to be subject to the block mapping algorithms provided by the element tree. As those algorithms are currently implemented using bio mapping and splitting, the easiest and simplest way to reuse those algorithms is to add new bio flags.

So are you imagining that you might have distinct snapshotable elements, and that some of these might be combined by e.g. RAID0 into a larger device, then a filesystem is created on that?

I ask because my first thought was that the sort of communication you want seems like it would be just between a filesystem and the block device that it talks directly to, and as you are particularly interested in XFS and XVM, you could come up with whatever protocol you want for those two to talk to each other, prototype it, iron out all the issues, then say "We've got this really cool thing to make snapshots much faster - wanna share?" and thus be presenting from a position of more strength (the old 'code talks' mantra).

First we need a mechanism to indicate that a bio is a hint rather than a real IO. Perhaps the easiest way is to add a new flag to the bi_rw field:

#define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */

Reminds me of the new approach to issue_flush_fn, which is just to have a zero-length barrier bio (is that implemented yet? I lost track). But different, as a zero-length barrier has zero length, and your hints have a very meaningful length.

Next we'll need three bio hint types with the following semantics.

BIO_HINT_ALLOCATE The bio's block extent will soon be written by the filesystem and any COW that may be necessary to achieve that should begin now.
>     If the COW is going to fail, the bio should fail. Note that this
>     provides a way for the filesystem to manage when and how failures
>     to COW are reported.

Would it make sense to allow the bi_sector to be changed by the device and to have that change honoured? i.e. "Please allocate 128 blocks, maybe here" / "OK, 128 blocks allocated, but they are actually over there". If the device is tracking what space is and isn't used, it might make life easier for it to do the allocation. Maybe even have a variant: "Allocate 128 blocks, I don't care where".

Is this bio supposed to block until the copy has happened? Or only until the space for the copy has been allocated and possibly committed? Or must it return without doing any IO at all?

> BIO_HINT_RELEASE
>     The bio's block extent is no longer in use by the filesystem and
>     will not be read in the future. Any storage used to back the
>     extent may be released without any threat to filesystem or data
>     integrity.

If the allocation unit of the storage device (e.g. a few MB) does not match the allocation unit of the filesystem (e.g. a few KB), then for this to be useful either the storage device must start recording tiny allocations, or the filesystem should re-release areas as they grow. i.e. when releasing a range of a device, look in the filesystem's usage records for the largest surrounding free space, and release all of that.

Would this be a burden on the filesystems? Is my imagined disparity between block sizes valid? Would it be just as easy for the storage device to track small allocations/deallocations?

> BIO_HINT_DONTCOW (the Bart Simpson BIO)
>     The bio's block extent is not needed in mounted snapshots and
>     does not need to be subjected to COW.

This seems like a much more domain-specific function than the other two, which themselves could be more generally useful (I'm imagining using hints from them to e.g. accelerate RAID reconstruction).
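To make the proposed semantics concrete, here is a small userspace sketch of a COW element dispatching the three hint types. Everything below (the chunk map, `REPO_SIZE`, the function names) is invented for illustration and is not the real bio interface; it only models the behaviour described above, including failing an ALLOCATE hint when the snapshot repository is full:

```c
#include <assert.h>

/* Toy model of the three proposed hint types. */
enum bio_hint { BIO_HINT_ALLOCATE, BIO_HINT_RELEASE, BIO_HINT_DONTCOW };

/* Per-chunk snapshot state in this toy COW element. */
enum chunk_state { SHARED, COPIED, RELEASED, NOCOW };

#define NCHUNKS   8
#define REPO_SIZE 2   /* COW repository holds only 2 chunks */

struct snap {
    enum chunk_state chunk[NCHUNKS];
    int repo_used;
};

/* Returns 0 on success, -1 if the hint must fail (e.g. repo full). */
static int snap_handle_hint(struct snap *s, enum bio_hint hint, int chunk)
{
    switch (hint) {
    case BIO_HINT_ALLOCATE:
        /* Begin COW now, so a later real write cannot hit ENOSPC. */
        if (s->chunk[chunk] != SHARED)
            return 0;               /* already copied/exempt: nothing to do */
        if (s->repo_used >= REPO_SIZE)
            return -1;              /* out of repo space: fail the hint */
        s->repo_used++;
        s->chunk[chunk] = COPIED;
        return 0;
    case BIO_HINT_RELEASE:
        /* Extent is dead to the filesystem: backing store may be reused. */
        s->chunk[chunk] = RELEASED;
        return 0;
    case BIO_HINT_DONTCOW:
        /* Extent is not needed in mounted snapshots: exempt from COW. */
        s->chunk[chunk] = NOCOW;
        return 0;
    }
    return -1;
}
```

Note how the ALLOCATE path is where the filesystem gets its graceful failure report, while RELEASE and DONTCOW merely update state and can never fail.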
Surely the correct thing to do with the log is to put it on a separate device which itself isn't snapshotted. If you have a storage manager that is smart enough to handle these sorts of things, maybe the functionality you want is "Give me a subordinate device which is not snapshotted, size X", then journal to that virtual device. I guess that is equally domain-specific, but the difference is that if you try to read from the DONTCOW part of the snapshot, you get bad old data, whereas if you try to access the subordinate device of a snapshot, you get an IO error - which is probably safer.

> Comments?

On the whole it seems reasonably sane, providing you are from the school which believes that volume managers and filesystems should be kept separate :-)

NeilBrown
Re: Proposal to improve filesystem/block snapshot interaction
On Tue, Oct 30, 2007 at 03:16:06PM +1100, Neil Brown wrote:

> On Tuesday October 30, [EMAIL PROTECTED] wrote:
> > Of course snapshot cow elements may be part of more generic element
> > trees. In general there may be more than one consumer of block
> > usage hints in a given filesystem's element tree, and their
> > locations in that tree are not predictable. This means the block
> > extents mentioned in the usage hints need to be subject to the
> > block mapping algorithms provided by the element tree. As those
> > algorithms are currently implemented using bio mapping and
> > splitting, the easiest and simplest way to reuse those algorithms
> > is to add new bio flags.
>
> So are you imagining that you might have distinct snapshotable
> elements, and that some of these might be combined by e.g. RAID0 into
> a larger device, with a filesystem then created on that?

I was thinking more a concatenation than a stripe, but yes, you could do such a thing, e.g. to parallelise the COW procedure. We don't do any such thing in our product; the COW element is always inserted at the top of the logical element tree.

> I ask because my first thought was that the sort of communication you
> want seems like it would be just between a filesystem and the block
> device that it talks directly to. As you are particularly interested
> in XFS and XVM, you could come up with whatever protocol you want for
> those two to talk to each other, prototype it, iron out all the
> issues, then say "We've got this really cool thing to make snapshots
> much faster - wanna share?" and thus be presenting from a position of
> more strength (the old 'code talks' mantra).

Indeed, code talks ;-) I was hoping someone else would do that talking for me, though.

> > First we need a mechanism to indicate that a bio is a hint rather
> > than a real IO.
> > Perhaps the easiest way is to add a new flag to the bi_rw field:
> >
> >     #define BIO_RW_HINT 5 /* bio is a hint not a real io; no pages */
>
> Reminds me of the new approach to issue_flush_fn, which is just to
> have a zero-length barrier bio (is that implemented yet? I lost
> track). But different, as a zero-length barrier has zero length, and
> your hints have a very meaningful length.

Yes.

> > Next we'll need three bio hint types with the following semantics.
> >
> > BIO_HINT_ALLOCATE
> >     The bio's block extent will soon be written by the filesystem
> >     and any COW that may be necessary to achieve that should begin
> >     now. If the COW is going to fail, the bio should fail. Note
> >     that this provides a way for the filesystem to manage when and
> >     how failures to COW are reported.
>
> Would it make sense to allow the bi_sector to be changed by the
> device and to have that change honoured? i.e. "Please allocate 128
> blocks, maybe here" / "OK, 128 blocks allocated, but they are
> actually over there".

That wasn't the expectation at all. Perhaps "allocate" is a poor name; "I have just allocated, deal with it" might be more appropriate. Perhaps BIO_HINT_WILLUSE or something.

> If the device is tracking what space is and isn't used, it might make
> life easier for it to do the allocation. Maybe even have a variant:
> "Allocate 128 blocks, I don't care where".

That kind of thing might perhaps be useful for flash, but I think current filesystems would have conniptions.

> Is this bio supposed to block until the copy has happened? Or only
> until the space for the copy has been allocated and possibly
> committed?

The latter. The writes following will block until the COW has completed, or might be performed sufficiently later that the COW has meanwhile completed (I think this implies an extra state in the snapshot metadata to avoid double-COWing). The point of the hint is to allow the snapshot code to test for running out of repo space and report that failure at a time when the filesystem is able to handle it gracefully.
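The "extra state in the snapshot metadata to avoid double-COWing" could look roughly like the toy state machine below. All names are hypothetical; a real implementation would live in the snapshot driver's exception store. The key point is the intermediate reserved state: the hint reserves repository space (and may start the copy), and the later real write completes the COW exactly once:

```c
#include <assert.h>

/* Toy per-chunk state machine; purely illustrative names. */
enum cow_state {
    COW_SHARED,     /* still shared with the snapshot, no repo space held */
    COW_RESERVED,   /* hint reserved repo space; copy not yet done */
    COW_DONE        /* chunk copied into the repo; writes need no COW */
};

struct chunk { enum cow_state state; };

/* BIO_HINT_ALLOCATE path: reserve space (and maybe start the copy).
 * Failure here is the graceful, early ENOSPC report. */
static int hint_allocate(struct chunk *c, int *repo_free)
{
    if (c->state != COW_SHARED)
        return 0;                   /* already reserved or copied */
    if (*repo_free == 0)
        return -1;                  /* out of repo space */
    (*repo_free)--;
    c->state = COW_RESERVED;
    return 0;
}

/* Write path: perform the COW at most once, even if the hint ran first. */
static int write_chunk(struct chunk *c, int *repo_free, int *copies_done)
{
    switch (c->state) {
    case COW_DONE:
        return 0;                   /* COW already happened: plain write */
    case COW_RESERVED:
        c->state = COW_DONE;        /* space already held: just copy */
        (*copies_done)++;
        return 0;
    case COW_SHARED:                /* no hint was sent first */
        if (*repo_free == 0)
            return -1;
        (*repo_free)--;
        c->state = COW_DONE;
        (*copies_done)++;
        return 0;
    }
    return -1;
}
```

Without the COW_RESERVED state, a write arriving after an asynchronous hint-triggered copy could not tell whether the chunk still needed copying, which is exactly the double-COW hazard mentioned above.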
> Or must it return without doing any IO at all?

I would expect it would be a useful optimisation to start the IO but not wait for its completion, but that the first implementation would just do a space check.

> > BIO_HINT_RELEASE
> >     The bio's block extent is no longer in use by the filesystem
> >     and will not be read in the future. Any storage used to back
> >     the extent may be released without any threat to filesystem or
> >     data integrity.
>
> If the allocation unit of the storage device (e.g. a few MB) does not
> match the allocation unit of the filesystem (e.g. a few KB), then for
> this to be useful either the storage device must start recording tiny
> allocations, or the filesystem should re-release areas as they grow.
> i.e. when releasing a range of a device, look in the filesystem's
> usage records for the largest surrounding free space, and release all
> of that.

Good point. I was planning on ignoring this problem :-/ Given that current snapshot implementations waste *all* the blocks in deleted files, it would be an improvement to scavenge the blocks in large extents.
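The re-release idea above - grow the freed range to the largest surrounding free run before hinting, so a device that allocates in big units can actually drop something - can be sketched in userspace like this. The block counts and names are made up; a real filesystem would consult its extent records rather than a flat bitmap:

```c
#include <assert.h>
#include <stdbool.h>

#define FS_BLOCKS 32
#define DEV_UNIT   8   /* device allocation unit, in filesystem blocks */

/* free_map[i] == true means filesystem block i is free.
 * Expand [start, end) to the maximal surrounding run of free blocks. */
static void expand_to_free_run(const bool *free_map, int *start, int *end)
{
    while (*start > 0 && free_map[*start - 1])
        (*start)--;
    while (*end < FS_BLOCKS && free_map[*end])
        (*end)++;
}

/* The device can only release whole DEV_UNIT-aligned units lying
 * entirely inside [start, end). Returns how many units it drops. */
static int device_release(int start, int end)
{
    int first = (start + DEV_UNIT - 1) / DEV_UNIT;  /* round up   */
    int last  = end / DEV_UNIT;                     /* round down */
    return last > first ? last - first : 0;
}
```

With this toy model, releasing a 4-block hole on its own frees nothing at the device level, but expanding it to the surrounding free run lets the device drop a whole allocation unit - which is the disparity being discussed.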