Re: [patch 0/6][RFC] Cleanup FIBMAP
Zach Brown wrote: The second use case is to look at the physical layout of blocks on disk for a specific file, use Mark Lord's write_long patches to inject a disk error and then read that file to make sure that we are handling disk IO errors correctly. A bit obscure, but really quite useful. Hmm, yeah, that's interesting. It would be even better if we could poke holes in metadata, etc, but this gives us a reasonable test case. We have also used FIBMAP a few times to try and map an observed IO error back to a file. Really slow and painful to do, but should work on any file system when a better method is not supported. We're getting off of this FIBMAP topic, but this interests me. Can we explore this a little? How did you find out about the error without having a file to associate with it? Drive scrubbing, or some such? - z Vladimir extended debugreiserfs to do an optimized reverse mapping scan of the disk sector to file/metadata/etc. Definitely worth having that ability for any file system. We also do drive scrubbing, looking for bad sectors. The list of those sectors is fed into the reverse mapping code to enable us to gage the impact of the IO errors, start recovering the user files, etc. The scrub code we use takes advantage of the read-verify command (to avoid data transfer from the drive to the page cache). ric - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
> The second use case is to look at the physical layout of blocks on disk > for a specific file, use Mark Lord's write_long patches to inject a disk > error and then read that file to make sure that we are handling disk IO > errors correctly. A bit obscure, but really quite useful. Hmm, yeah, that's interesting. > We have also used FIBMAP a few times to try and map an observed IO error > back to a file. Really slow and painful to do, but should work on any > file system when a better method is not supported. We're getting off of this FIBMAP topic, but this interests me. Can we explore this a little? How did you find out about the error without having a file to associate with it? Drive scrubbing, or some such? - z - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
Zach Brown wrote: Can you clarify what you mean above with an example? I don't really follow. Sure, take 'tar' as an example. It'll read files in the order that their names are returned from directory listing. This can produce bad IO patterns because the order in which the file names are returned doesn't match the order of the file's blocks on disk. (htree, I'm looking at you!) People have noticed that tar-like loads can be sped up greatly just by sorting the files by their inode number as returned by stat(), never mind the file blocks themselves. One example of this is Chris Mason's 'acp'. http://oss.oracle.com/~mason/acp/ The logical extension of that is to use FIBMAP to find the order of file blocks on disk and then doing IO on blocks in sorted order. It'd take work to write an app that does this reliably, sure. In this use the application doesn't actually care what the absolute numbers are. It cares about their ordering. File systems would be able to chose whatever scheme they wanted for the actual values of the results from a FIBMAP-alike as long as the sorting resulted in the right IO patterns. Arguing that this use is significant enough to justify an addition to the file system API is a stretch. I'm just sharing the observation. - z I use FIBMAP support for a few different things. The first is to exactly the case that you describe above where we can use the first block of a file extracted by FIBMAP to produce an optimal sorting for the read order. My testing showed that the cost of the extra fibmap was not too high compared to the speedup, but it was not a huge gain over the speedup gained when the read was done in inode sorted order. The second use case is to look at the physical layout of blocks on disk for a specific file, use Mark Lord's write_long patches to inject a disk error and then read that file to make sure that we are handling disk IO errors correctly. A bit obscure, but really quite useful. We have also used FIBMAP a few times to try and map an observed IO error back to a file. Really slow and painful to do, but should work on any file system when a better method is not supported. ric - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
> But, we shouldn't inflict all of this on fibmap/fiemapwe'll get > lost trying to make the one true interface for all operations. > > For grouping operations on files, I think a read_tree syscall with > hints for what userland will do (read, stat, delete, list > filenames), and a better cookie than readdir should do it. Agreed, on both points. - z - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
> Can you clarify what you mean above with an example? I don't really > follow. Sure, take 'tar' as an example. It'll read files in the order that their names are returned from directory listing. This can produce bad IO patterns because the order in which the file names are returned doesn't match the order of the file's blocks on disk. (htree, I'm looking at you!) People have noticed that tar-like loads can be sped up greatly just by sorting the files by their inode number as returned by stat(), never mind the file blocks themselves. One example of this is Chris Mason's 'acp'. http://oss.oracle.com/~mason/acp/ The logical extension of that is to use FIBMAP to find the order of file blocks on disk and then doing IO on blocks in sorted order. It'd take work to write an app that does this reliably, sure. In this use the application doesn't actually care what the absolute numbers are. It cares about their ordering. File systems would be able to chose whatever scheme they wanted for the actual values of the results from a FIBMAP-alike as long as the sorting resulted in the right IO patterns. Arguing that this use is significant enough to justify an addition to the file system API is a stretch. I'm just sharing the observation. - z - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Mon, 29 Oct 2007 12:18:22 -0700 Mike Waychison <[EMAIL PROTECTED]> wrote: > Zach Brown wrote: > >>> And another of my pet peeves with ->bmap is that it uses 0 to > >>> mean "sparse" which causes a conflict on NTFS at least as block > >>> zero is part of the $Boot system file so it is a real, valid > >>> block... NTFS uses -1 to denote sparse blocks internally. > >> Reiserfs and Btrfs also use 0 to mean packed. It would be nice if > >> there was a way to indicate your-data-is-here-but-isn't-alone. > >> But that's more of a feature for the FIEMAP stuff. > > > > And maybe we can step back and see what the callers of FIBMAP are > > doing with the results they're getting. > > > > One use is to discover the order in which to read file data that > > will result in efficient IO. > > > > If we had an interface specifically for this use case then perhaps a > > sparse block would be better reported as the position of the inode > > relative to other data blocks. Maybe the inode block number in > > ext* land. > > > > Can you clarify what you mean above with an example? I don't really > follow. This is a larger topic of helping userland optimize access to groups of files. For example, during a readdir if we knew the next step was to delete all the files found, we could do one top of readahead (or even ordering the returned values). If we knew the next step would be to read all the files found, a different type of readahead would be useful. But, we shouldn't inflict all of this on fibmap/fiemapwe'll get lost trying to make the one true interface for all operations. For grouping operations on files, I think a read_tree syscall with hints for what userland will do (read, stat, delete, list filenames), and a better cookie than readdir should do it. -chris - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Oct 29, 2007 12:16 -0700, Mike Waychison wrote: > Chris Mason wrote: >> Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there >> was a way to indicate your-data-is-here-but-isn't-alone. But that's >> more of a feature for the FIEMAP stuff. > > I hadn't heard of FIEMAP, so I went back and read the thread from > April/May. It seems that this is a much better approach than introducing a > FIBMAP64. > > What ever happened with this proposal? We made a patch for ext4 that we need to update and push upstream. I've just resent the spec we used in a separate email (attached to old thread) for reference. Cheers, Andreas -- Andreas Dilger Sr. Software Engineer, Lustre Group Sun Microsystems of Canada, Inc. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
Zach Brown wrote: And another of my pet peeves with ->bmap is that it uses 0 to mean "sparse" which causes a conflict on NTFS at least as block zero is part of the $Boot system file so it is a real, valid block... NTFS uses -1 to denote sparse blocks internally. Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there was a way to indicate your-data-is-here-but-isn't-alone. But that's more of a feature for the FIEMAP stuff. And maybe we can step back and see what the callers of FIBMAP are doing with the results they're getting. One use is to discover the order in which to read file data that will result in efficient IO. If we had an interface specifically for this use case then perhaps a sparse block would be better reported as the position of the inode relative to other data blocks. Maybe the inode block number in ext* land. Can you clarify what you mean above with an example? I don't really follow. Thanks, Mike Waychison - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
Chris Mason wrote: On Sat, 27 Oct 2007 18:57:06 +0100 Anton Altaparmakov <[EMAIL PROTECTED]> wrote: Hi, ->bmap is ugly and horrible! If you have to do this at the very least please cause ->bmap64 to be able to return error values in case the file system failed to get the information or indeed such information does not exist as is the case for compressed and encrypted files for example and also for small files that are inside the on-disk inode (NTFS resident files and reiserfs packed tails are examples of this). And another of my pet peeves with ->bmap is that it uses 0 to mean "sparse" which causes a conflict on NTFS at least as block zero is part of the $Boot system file so it is a real, valid block... NTFS uses -1 to denote sparse blocks internally. Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there was a way to indicate your-data-is-here-but-isn't-alone. But that's more of a feature for the FIEMAP stuff. I hadn't heard of FIEMAP, so I went back and read the thread from April/May. It seems that this is a much better approach than introducing a FIBMAP64. What ever happened with this proposal? Mike Waychison - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
>> And another of my pet peeves with ->bmap is that it uses 0 to mean >> "sparse" which causes a conflict on NTFS at least as block zero is >> part of the $Boot system file so it is a real, valid block... NTFS >> uses -1 to denote sparse blocks internally. > > Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there > was a way to indicate your-data-is-here-but-isn't-alone. But that's > more of a feature for the FIEMAP stuff. And maybe we can step back and see what the callers of FIBMAP are doing with the results they're getting. One use is to discover the order in which to read file data that will result in efficient IO. If we had an interface specifically for this use case then perhaps a sparse block would be better reported as the position of the inode relative to other data blocks. Maybe the inode block number in ext* land. This use is interesting to me because it can be useful for any file system, particularly networked file systems. - z - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Sat, 27 Oct 2007 18:57:06 +0100 Anton Altaparmakov <[EMAIL PROTECTED]> wrote: > Hi, > > ->bmap is ugly and horrible! If you have to do this at the very > least please cause ->bmap64 to be able to return error values in case > the file system failed to get the information or indeed such > information does not exist as is the case for compressed and > encrypted files for example and also for small files that are inside > the on-disk inode (NTFS resident files and reiserfs packed tails are > examples of this). > > And another of my pet peeves with ->bmap is that it uses 0 to mean > "sparse" which causes a conflict on NTFS at least as block zero is > part of the $Boot system file so it is a real, valid block... NTFS > uses -1 to denote sparse blocks internally. Reiserfs and Btrfs also use 0 to mean packed. It would be nice if there was a way to indicate your-data-is-here-but-isn't-alone. But that's more of a feature for the FIEMAP stuff. -chris - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
Mike Waychison wrote: The following series is meant to clean up FIBMAP paths with the eventual goal of allowing users to be able to FIBMAP their data. Keep in mind FIBMAP is currently extremely expensive on some filesystems, e.g. ext3. Therefore, additional filesystem-level work would have to be done in order for this not become a DoS issue. -hpa - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
On Sat, 27 Oct 2007, Anton Altaparmakov wrote: > And another of my pet peeves with ->bmap is that it uses 0 to mean "sparse" > which causes a conflict on NTFS at least as block zero is part of the $Boot > system file so it is a real, valid block... NTFS uses -1 to denote sparse > blocks internally. In practice, the meaning of 0 is file system [driver] dependent. For example in case of NTFS-3G it means that the block is sparse or the file is encrypted or compressed, or resident, or it's the $Boot file, or an error happened. Thankfully the widely used FIBMAP users (swapon and the ever less used lilo) are only interested in the non-zero values and they report an error if the driver returns 0 for some reason. Which is perfectly ok since both swaping and Linux booting would fail using a sparse, encrypted, compressed, resident, or the NTFS $Boot file. But in real, both swap files and lilo work fine with NTFS if the needed files were created the way these softwares expect. If not then swapon or lilo will catch and report the file creation error. Afair, somebody is doing (has done?) an indeed much needed, better alternative. Bmap is legacy, thank you Mike for maintaining it. Szaka -- NTFS-3G Lead Developer: http://ntfs-3g.org - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/6][RFC] Cleanup FIBMAP
Hi, ->bmap is ugly and horrible! If you have to do this at the very least please cause ->bmap64 to be able to return error values in case the file system failed to get the information or indeed such information does not exist as is the case for compressed and encrypted files for example and also for small files that are inside the on-disk inode (NTFS resident files and reiserfs packed tails are examples of this). And another of my pet peeves with ->bmap is that it uses 0 to mean "sparse" which causes a conflict on NTFS at least as block zero is part of the $Boot system file so it is a real, valid block... NTFS uses -1 to denote sparse blocks internally. Best regards, Anton On 27 Oct 2007, at 00:37, Mike Waychison wrote: The following series is meant to clean up FIBMAP paths with the eventual goal of allowing users to be able to FIBMAP their data. I'm sending this as an RFC as I've only tested this on a x86_64 kernel with a 32bit binary on ext2 and I've noticed a couple ext2_warnings already. I'm unsure of the locking in [4/6] fix_race_with_truncate.patch. Any help here would greatly be appreciated. The last patch, [6/6] drop_cap_sys_rawio_for_fibmap.patch, is of course, not to be applied until any remaining issues are fixed :) Thanks, Mike Waychison -- Anton Altaparmakov (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html