Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Neil Brown writes:

> [...]
> Thus the general sequence might be:
>
>  a/ issue all "preceding writes".
>  b/ issue the commit write with BIO_RW_BARRIER
>  c/ wait for the commit to complete.
>     If it was successful - done.
>     If it failed other than with EOPNOTSUPP, abort
>     else continue
>  d/ wait for all 'preceding writes' to complete
>  e/ call blkdev_issue_flush
>  f/ issue commit write without BIO_RW_BARRIER
>  g/ wait for commit write to complete
>     if it failed, abort
>  h/ call blkdev_issue
>  DONE
>
> steps b and c can be left out if it is known that the device does not
> support barriers. The only way to discover this is to try and see if it
> fails.
>
> I don't think any filesystem follows all these steps.

It seems that steps b/ -- h/ are quite generic, and can be implemented once in generic code (with some synchronization mechanism, like a wait-queue, at d/).

[...]

> Thank you for your attention.
>
> NeilBrown

Nikita.

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] TileFS - a proposal for scalable integrity checking
Valerie Henson writes:

> [...]
> You're right about needing to read the equivalent data-structure - for
> other reasons, each continuation inode will need an easily accessible
> list of byte ranges covered by that inode. (Sounds like, hey,
> extents!) The important part is that you don't have to go walk all the
> indirect blocks or check your bitmap.
>
> -VAL

I see. I was under the impression that the idea was to use the indirect blocks themselves as that data-structure, e.g., block number 0 to mark holes, block number 1 to mark "block not in this continuation", and all other block numbers for real blocks.

Nikita.
Re: [RFC] TileFS - a proposal for scalable integrity checking
Valerie Henson writes:

> [...]
> Hm, I'm not sure that everyone understands a particular subtlety of
> how the fsck algorithm works in chunkfs. A lot of people seem to
> think that you need to check *all* cross-chunk links, every time an
> individual chunk is checked. That's not the case; you only need to
> check the links that go into and out of the dirty chunk. You also
> don't need to check the other parts of the file outside the chunk,
> except for perhaps reading the byte range info for each continuation
> node and making sure no two continuation inodes think they both have
> the same range, but you don't check the indirect blocks, block
> bitmaps, etc.

I guess I am missing something. If chunkfs maintains the "at most one continuation per chunk" invariant, then a continuation inode might end up with multiple byte ranges, and to check that they do not overlap one has to read its indirect blocks (or some equivalent data-structure).

Nikita.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
David Lang writes:

> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > [...]
> >
> > Maybe I failed to describe the problem precisely.
> >
> > Suppose that all chunks have been checked. After that, for every inode
> > I0 having continuations I1, I2, ... In, one has to check that every
> > logical block is presented in at most one of these inodes. For this one
> > has to read I0, with all its indirect (double-indirect, triple-indirect)
> > blocks, then read I1 with all its indirect blocks, etc. And to repeat
> > this for every inode with continuations.
> >
> > In the worst case (every inode has a continuation in every chunk) this
> > obviously is as bad as un-chunked fsck. But even in the average case,
> > the total amount of IO necessary for this operation is proportional to
> > the _total_ file system size, rather than to the chunk size.
>
> actually, it should be proportional to the number of continuation nodes. The
> expectation (and design) is that they are rare.

Indeed, but the total size of the meta-data pertaining to all continuation inodes is still proportional to the total file system size, and so is fsck time: O(total_file_system_size). What is more important, the design puts (as far as I can see) no upper limit on the number of continuation inodes, and hence, even if the _average_ fsck time is greatly reduced, occasionally fsck can take more time than on an ext2 of the same size. This is clearly unacceptable in many situations (HA, etc.).

Nikita.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
David Lang writes:

> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > Amit Gud writes:
> >
> > > Hello,
> > >
> > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > at: http://lwn.net/Articles/190222 and
> > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> >
> > I have a couple of questions about the chunkfs repair process.
> >
> > First, as I understand it, each continuation inode is a sparse file,
> > mapping some subset of logical file blocks into block numbers. [...]
>
> not quite.
>
> this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
> the process. with chunkfs you divide the problem by a large constant (100 or
> more) for the checks of individual chunks. after those are done then the final
> pass checking the cross-chunk links doesn't have to keep track of everything, it
> only needs to check those links and what they point to
>
> any ability to mark a filesystem as 'clean' and then not have to check it on
> reboot is a bonus on top of this.
>
> David Lang

Maybe I failed to describe the problem precisely.

Suppose that all chunks have been checked. After that, for every inode I0 having continuations I1, I2, ... In, one has to check that every logical block is presented in at most one of these inodes. For this one has to read I0, with all its indirect (double-indirect, triple-indirect) blocks, then read I1 with all its indirect blocks, etc. And to repeat this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this obviously is as bad as un-chunked fsck. But even in the average case, the total amount of IO necessary for this operation is proportional to the _total_ file system size, rather than to the chunk size.

Nikita.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Amit Gud writes:

> Hello,
>
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about the chunkfs repair process.

First, as I understand it, each continuation inode is a sparse file, mapping some subset of logical file blocks into block numbers. Then it seems that during the "final phase" fsck has to check that these partial mappings are consistent, for example, that no two different continuation inodes for a given file contain a block number for the same offset. This check requires a scan of all chunks (rather than of only those "active during the crash"), which seems to return us back to the scalability problem chunkfs tries to address.

Second, it is not clear how, under the assumption of bugs in the file system code (which the paper makes at the very beginning), fsck can limit itself only to the chunks that were active at the moment of the crash.

[...]

> Best,
> AG

Nikita.
Re: Finding hardlinks
Mikulas Patocka writes:

> > > BTW. How does ReiserFS find that a given inode number (or object ID in
> > > ReiserFS terminology) is free before assigning it to new file/directory?
> >
> > reiserfs v3 has an extent map of free object identifiers in
> > super-block.
>
> Inode free space can have at most 2^31 extents --- if inode numbers
> alternate between "allocated", "free". How do you pack it to superblock?

In the worst case, when free/used extents are small, some free oids are "leaked", but this has never been a problem in practice. In fact, there was a patch for reiserfs v3 to store this map in a special hidden file, but it wasn't included in mainline, as nobody ever complained about oid map fragmentation.

> > reiser4 used 64 bit object identifiers without reuse.
>
> So you are going to hit the same problem as I did with SpadFS --- you
> can't export 64-bit inode number to userspace (programs without
> -D_FILE_OFFSET_BITS=64 will have stat() randomly failing with EOVERFLOW
> then) and if you export only 32-bit number, it will eventually wrap-around
> and colliding st_ino will cause data corruption with many userspace
> programs.

Indeed, this is a fundamental problem. Reiser4 tries to ameliorate it by using a hash function that starts colliding only when there are billions of files, in which case a 32-bit inode number is screwed anyway.

Note that none of the above problems invalidates the reasons for having long in-kernel inode identifiers that I outlined in another message.

> Mikulas

Nikita.
Re: Finding hardlinks
Mikulas Patocka writes:

> [...]
> BTW. How does ReiserFS find that a given inode number (or object ID in
> ReiserFS terminology) is free before assigning it to new file/directory?

reiserfs v3 has an extent map of free object identifiers in the super-block. reiser4 used 64-bit object identifiers without reuse.

> Mikulas

Nikita.
Re: Finding hardlinks
Mikulas Patocka writes:

> On Fri, 29 Dec 2006, Trond Myklebust wrote:
>
> > On Thu, 2006-12-28 at 19:14 +0100, Mikulas Patocka wrote:
> >> Why don't you rip off the support for colliding inode number from the
> >> kernel at all (i.e. remove iget5_locked)?
> >>
> >> It's reasonable to have either no support for colliding ino_t or full
> >> support for that (including syscalls that userspace can use to work with
> >> such filesystem) --- but I don't see any point in having half-way support
> >> in kernel as is right now.
> >
> > What would ino_t have to do with inode numbers? It is only used as a
> > hash table lookup. The inode number is set in the ->getattr() callback.
>
> The question is: why does the kernel contain the iget5 function that looks up
> according to a callback, if the filesystem cannot have more than a 64-bit
> inode identifier?

Generally speaking, a file system might have two different identifiers for files:

- one that makes it easy to tell whether two files are the same one;

- one that makes it easy to locate the file on the storage.

According to POSIX, the inode number should always work as an identifier of the first class, but not necessarily as one of the second. For example, in reiserfs something called "a key" is used to locate the on-disk inode, which, in turn, contains the inode number.

Identifiers of the second class tend to live in directory entries, and during lookup we want to consult the inode cache _before_ reading the inode from the disk (otherwise the cache is mostly useless), right? This means that some file systems want to index inodes in a cache by something different than the inode number.

There is another reason why I, personally, would like to have the ability to index inodes by things other than inode numbers: delayed inode number allocation. Strictly speaking, a file system has to assign an inode number to a file only when it is just about to report it to user space (either through stat, or, ugh... readdir). If the location of an inode on disk depends on its inode number (as it does in inode-table based file systems like ext[23]) then delayed inode number allocation has the same advantages as delayed block allocation.

> This lookup callback just induces writing bad filesystems with colliding
> inode numbers. Either remove coda, smb (and possibly other) filesystems
> from the kernel or make proper support for userspace for them.
>
> The situation is that current coreutils 6.7 fail to recursively copy
> directories if some two directories in the tree have colliding inode
> number, so you get random data corruption with these filesystems.
>
> Mikulas

Nikita.
Re: NFSv4/pNFS possible POSIX I/O API standards
Christoph Hellwig writes:

> I'd like to Cc Ulrich Drepper in this thread because he's going to decide
> what APIs will be exposed at the C library level in the end, and he also
> has quite a lot of experience with the various standardization bodies.
>
> Ulrich, this is in reply to these API proposals:
>
> http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf
> http://www.opengroup.org/platform/hecewg/uploads/40/10898/POSIX-stat-manpages.pdf

What is readdirplus() supposed to return in the ->d_stat field for a name "foo" in directory "bar" when "bar/foo" is a mount-point? Note that in the case of a distributed file system, the server has no idea about client mount-points, which implies some form of local post-processing.

Nikita.
Re: readdir behaviour
Jan Blunck writes:

> This was also a topic on lkml 2 weeks ago.
>
> Zitat von Tomas Hruby <[EMAIL PROTECTED]>:
>
> > First of all I would like to know what exactly is the meaning of the
> > 'offset' parameter of filldir and whether it is used somewhere? Unlike
> > ext2, our directories are not easily read sequentially and this value
> > (copied by filldir to dirent->d_off) seems to be quite useless outside
> > our fs code.
>
> The offset of the dirent has no common meaning. Think of it as a cookie or
> something like that. It should not be interpreted either by the VFS or by
> user-space.

->d_off is remembered by glibc, and returned to the user as the result of telldir(3). As such it is a valid argument for a following seekdir(3).

> > Related question is what is the correct behaviour of readdir in case
> > of user's seeking in the directory? If I understand correctly, in case
> > of ext3 (indexed directories), when seeking is detected, readdir
> > starts reading from the directory beginning again.
>
> On different archs the libc is seeking (to d_off) after each call to
> getdents(). Therefore the implementation should honor it.
>
> > The last is about concurrency. How is the problem solved when a directory
> > is read by readdir and between two readdir calls the same directory is
> > changed?

The Single UNIX Specification (http://www.opengroup.org/onlinepubs/007904875/functions/readdir.html) is vague about whether directory entries added asynchronously should be returned.

> It is the filesystem's duty to seek to the next valid dentry. Although it is
> not defined if the new directory contents is returned or the one of
> opendir().
>
> Although I think it would be nice (and convenient to the "everything is a file"
> paradigm) when the directory is presented like a sequential file this is not
> the common practice. Due to the fact that there are no applications which are
> reading and seeking the directories directly this is a good tradeoff to achieve
> high performance for readdir().

Unfortunately, seekdir and telldir are standard (albeit optional) interfaces, and libc translates seekdir into lseek. File systems have to support this.

> Jan

Nikita.
Re: Lazy block allocation and block_prepare_write?
Mingming Cao <[EMAIL PROTECTED]> writes:

> On Tue, 2005-04-19 at 19:55 +0400, Nikita Danilov wrote:
>
> > [...]
> >
> > Just keep in mind that filesystem != ext3. :-) Generic support makes
> > sense only when it is usable by multiple file systems. This is not
> > always possible, e.g., there is no "generic block allocator" for
> > precisely the same reason: disk space allocation policies are tightly
> > intertwined with the rest of file system internals.
>
> This generic support should be useful for ext2 and xfs. From delayed
> [...]

But it won't work for reiser4, which allocates blocks _across_ multiple files. E.g., if many files were created in the same directory, allocation (performed just before write-out) will assign block numbers so that the files are ordered on disk according to the readdir order (with each file body being an interval in that ordering). This is done by arranging all dirty blocks of a given transaction according to some "ideal" ordering and then trying to map this ordering onto disk blocks.

As you see, in this case allocation is not done on an inode-by-inode basis at all: instead, delayed allocation is done at the transaction level of granularity, and I am trying to point out that this is the natural thing for a journalled file system to do.

The same goes for write-out: in ext3 there is only one "active" transaction at any moment, and this means that ->writepages() calls can go in arbitrary order, but for a file system type with multiple active transactions that can be committed separately, the order of ->writepages() calls has to follow the ordering between transactions. Again, this means that write-out should be transaction rather than inode based.

If we want really generic support for journalling and delayed allocation, mpage_* functions are the wrong level. Instead, a proper notion of transaction has to be introduced, and file system IO and disk space allocation interfaces adjusted appropriately.

Nikita.
Re: Lazy block allocation and block_prepare_write?
Badari Pulavarty <[EMAIL PROTECTED]> writes:

> On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
>
> > [...]
> >
> > As you most likely already know, Alex Thomas already implemented delayed
> > block allocation for ext3.
>
> Yep. I reviewed Alex Thomas's patches for delayed allocation. He handled
> all the cases in his code and did NOT use any mpage* routines to do
> the work. I was hoping to change the mpage infrastructure to handle
> these, so that every filesystem doesn't have to do their own thing.

Just keep in mind that filesystem != ext3. :-) Generic support makes sense only when it is usable by multiple file systems. This is not always possible, e.g., there is no "generic block allocator" for precisely the same reason: disk space allocation policies are tightly intertwined with the rest of file system internals.

> > > In order to do the correct accounting, we need to mark a page
> > > to indicate if we reserved a block or not. One way to do this is
> > > to use page->private to indicate this. But then, all the generic
> >
> > I believe one can use the PG_mappedtodisk bit in page->flags for this
> > purpose. There was an old Andrew Morton patch that introduced a new bit
> > (PG_delalloc?) for this purpose.
>
> That would be good. But I don't feel like asking for a bit in the page
> if there is a way to get around it.

Clarification: PG_mappedtodisk is already here; it seems you can reuse this already existing bit to implement delayed allocation support.

> [...]
>
> Need to think some more. I guess you thought about this more than I do :)
>
> Thanks,
> Badari

Nikita.
Re: Lazy block allocation and block_prepare_write?
Badari Pulavarty <[EMAIL PROTECTED]> writes:

> [...]
>
> Yes. It's possible to do what you want to. I am currently working on
> adding "delayed allocation" support to ext3.

As you most likely already know, Alex Thomas already implemented delayed block allocation for ext3.

> [...]
>
> In order to do the correct accounting, we need to mark a page
> to indicate if we reserved a block or not. One way to do this is
> to use page->private to indicate this.

I believe one can use the PG_mappedtodisk bit in page->flags for this purpose. There was an old Andrew Morton patch that introduced a new bit (PG_delalloc?) for this purpose.

> But then, all the generic
> routines will fail - since they assume that page->private represents
> bufferheads. So we need a better way to do this.

They are not generic then. Some file systems store things completely different from a buffer head ring in page->private.

> 3) We need to add hooks into filesystem specific calls from these
> generic routines to handle "journaling mode" requirements
> (for ext3 and maybe others).

Please don't. There is no such thing as "generic journalling". The traditional WAL used by ext3, the phase-trees of Tux2, and the wandering logs of reiser4 are so different that there is no hope for a single API to accommodate them all. Adding such an API will only force more workarounds and hacks into non-ext3 file systems.

What _is_ common to all journalling file systems, on the other hand, is the notion of transaction as the natural unit of caching and write-out. Currently in Linux, write-out is inode-based (->writepages()). Reiser4 already has a patch that replaces the sync_sb_inodes() function with a super-block operation. In reiser4's case, this operation scans the list of transactions (instead of the list of inodes) and writes some of them out, which is the natural thing to do for a journalled file system.

Similarly, a transaction is a unit of caching: it's often necessary to scan all pages of a given transaction, all dirty pages of a given transaction, or to check whether a given page belongs to a given transaction. That is, a transaction plays a role similar to struct address_space. But currently there is a 1-to-1 relation between inodes and address_spaces, and this forces the file system to implement additional data structures that duplicate functionality already present in address_space.

> So, what are your requirements ? I am looking for a common
> way to combine all the requirements and come out with
> saner "generic" routines to handle these.

I think that one reasonable way to add generic support for journalling is to split struct address_space into two objects: a lower layer that represents a "file" (say, struct vm_file), in which pages are linearly ordered, and on top of this a vm_cache (representing a transaction) that keeps track of pages from various vm_file's. vm_file is embedded into the inode, and vm_cache has a pointer to (the analog of) struct address_space_operations. vm_cache's are created by the file system back-end as necessary (and can be embedded into the inode for non-journalled file systems). The VM scanner and balance_dirty_pages() call vm_cache operations to do write-out.

> Thanks,
> Badari

Nikita.
Re: Lilo requirements (Was: Re: Address space operations questions)
Martin Jambor writes:

> Thanks for your reply, I found the following thing interesting on its
> own:
>
> On 4/7/05, Nikita Danilov <[EMAIL PROTECTED]> wrote:
>
> > Consider tools like LILO that want stable block numbers for certain
> > files. In reiserfs (both v3 and v4) there is an ioctl that disables
> > relocation for a given file. Besides, I do not think ->bmap() is useless
> > even when block numbers are volatile, for one thing it allows user level
> > to track how a file is laid out (for example, to measure fragmentation).
>
> I tried to google out what behaviour lilo requires filesystems to
> exhibit without much success... is that information available
> somewhere I didn't look? Is it simple enough to be explained here?

As opposed to, say, GRUB, LILO doesn't parse the file system layout at boot time. Instead it remembers in what blocks the kernel image is stored. This assumes the following properties of the file system:

- the unit of disk space allocation for the kernel image file is the block. That is, optimizations like UFS fragments or reiserfs tails are not applied, and

- the blocks that the kernel image is stored in are real disk blocks (i.e., there is a way to disable "delayed allocation"), and

- the kernel image file is not relocated, i.e., data are not moved into other blocks on the fly.

Currently the only file system that doesn't satisfy these requirements is reiserfs, and it has a special ioctl, REISERFS_IOC_UNPACK, that forces LILO-friendly behaviour for a specified file: no tails, no delayed allocation, and no relocation. LILO detects when the kernel image is on reiserfs and calls that ioctl.

> TIA
>
> Martin

Nikita.
Re: Address space operations questions
Martin Jambor writes:

> Thank you very much for your reply.
>
> On Mar 30, 2005 3:55 PM, Nikita Danilov <[EMAIL PROTECTED]> wrote:
>
> > > 1. What is bmap for and what is it supposed to do?
> >
> > ->bmap() maps a logical block offset within an "object" to a physical block
> > number. It is used in a few places, notably in the implementation of the
> > FIBMAP ioctl.
>
> We are about to start implementing a fs where data can move around the
> device and so a physical block address is not really useful. I have
> understood from other postings to this list that reiserfs and ntfs
> don't implement this method so I suppose we'll do the same. I'll just
> find some nice error to return.

Consider tools like LILO that want stable block numbers for certain files. In reiserfs (both v3 and v4) there is an ioctl that disables relocation for a given file. Besides, I do not think ->bmap() is useless even when block numbers are volatile; for one thing, it allows user level to track how a file is laid out (for example, to measure fragmentation).

> [...]
>
> OK, so if I understand it well, sync_page does not actually write the
> page anywhere, it only waits until the device driver finishes all
> previous requests with that page, right? Does block_sync_page do
> exactly that? (I would read the source but all it does is that it
> calls a callback function) BTW, does it wait also for metadata?

No. ->sync_page() doesn't wait for anything. It simply tells the underlying storage layer "start executing all queued IO requests". If your file system uses a block device as its storage, use block_sync_page as your ->sync_page() method.

As for metadata: there is no difference between data and meta-data at this level.

> Martin

Nikita.
Re: [RFC] Add support for semaphore-like structure with support for asynchronous I/O
Trond Myklebust writes:

> to den 31.03.2005 Klokka 12:02 (+0400) skreiv Nikita Danilov:
>
> > As I understand it, in the blocking path IOSEM_LOCK_EXCLUSIVE is set by
> > iosem_lock_wake_function() called by the waker thread. But this is
> > asking for convoy formation: iosem_unlock() transfers ownership of the
> > lock to the thread that is currently sleeping. This means that all
> > threads _running_ on other processors and bumping into that lock will
> > go to sleep too (i.e., the lock is owned but unused), thus forming a
> > "convoy" that has a tendency to grow over time when there is at least
> > the smallest contention. This is a known problem with all "early
> > ownership transfer" lock designs (except that maybe in your case
> > contention is not supposed to happen).
>
> You are assuming that all the waiters on the queue are tasks that must
> sleep if they cannot take the lock. That is not the case here. Whereas
> some users will indeed fall in this category, I expect that most will
> rather want to use the non-blocking mode in which the caller is free to
> go off and do other useful work.

Ah, I see... But then this doesn't look like a semaphore _at_ _all_. Semaphores have no call-backs, and in the iosem case it's the callback (in the form of struct work_struct) that is central to the interface. I believe the naming should reflect this; it's utterly confusing as it is. Maybe struct work_queue_token and schedule_work_{with,end}_token()?

[...]

> Cheers,
> Trond

Nikita.
Re: [RFC] Add support for semaphore-like structure with support for asynchronous I/O
Trond Myklebust writes:

 > In NFSv4 we often want to serialize asynchronous RPC calls with ordinary

[...]

 > +
 > +void fastcall iosem_lock(struct iosem *lk)
 > +{
 > +	struct iosem_wait waiter;
 > +
 > +	might_sleep();
 > +
 > +	init_iosem_waiter(&waiter);
 > +	waiter.wait.func = iosem_lock_wake_function;
 > +
 > +	set_current_state(TASK_UNINTERRUPTIBLE);
 > +	if (__iosem_lock(lk, &waiter))
 > +		schedule();
 > +	__set_current_state(TASK_RUNNING);
 > +
 > +	BUG_ON(!list_empty(&waiter.wait.task_list));
 > +}
 > +EXPORT_SYMBOL(iosem_lock);

As I understand it, in the blocking path IOSEM_LOCK_EXCLUSIVE is set by iosem_lock_wake_function() called by the waker thread. But this is asking for convoy formation: iosem_unlock() transfers ownership of the lock to the thread that is currently sleeping. This means that all threads _running_ on other processors and bumping into that lock will go to sleep too (i.e., the lock is owned but unused), thus forming a "convoy" that has a tendency to grow over time when there is even the smallest contention. This is a known problem with all "early ownership transfer" lock designs (except maybe in your case contention is not supposed to happen).

And as a nitpick: struct iosem is emphatically _not_ a semaphore, it doesn't even have a counter. :) Can it be named iomutex or iolock or async_lock or something? We have enough confusion going on with struct semaphore that is mostly used as a mutex.

[...]

 > +
 > +int fastcall iosem_lock_and_schedule_work(struct iosem *lk, struct iosem_work *wk)
 > +{
 > +	int ret;
 > +
 > +	init_iosem_waiter(&wk->waiter);
 > +	wk->waiter.wait.func = iosem_lock_and_schedule_function;
 > +	ret = __iosem_lock(lk, &wk->waiter);
 > +	if (ret == 0)
 > +		ret = schedule_work(&wk->work);
 > +	return ret;
 > +}

This is actually a trylock, right? If iosem_lock_and_schedule_work() returns -EINPROGRESS, the lock is not acquired on return and the caller has to call schedule().

[...]

 > --
 > Trond Myklebust <[EMAIL PROTECTED]>

Nikita.
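The convoy argument can be illustrated with a toy Python model (our construction, not anything from the patch): after an unlock, a woken sleeper needs a few scheduler ticks to actually start running. Under early ownership transfer the lock stays owned for that whole window, so every running thread that arrives during it must sleep; if unlock merely marked the lock free, the first arrival would take it immediately. The tick values and WAKE_LATENCY are made-up parameters:

```python
WAKE_LATENCY = 3  # ticks until a woken sleeper is actually scheduled (assumed)

def sleepers_with_handoff(arrivals, unlock_time=0):
    """Early ownership transfer: the lock is owned-but-unused during the
    wake-up latency, so every arrival in that window must sleep (convoy)."""
    return sorted(t for t in arrivals
                  if unlock_time < t < unlock_time + WAKE_LATENCY)

def sleepers_with_free_unlock(arrivals, unlock_time=0):
    """Unlock just marks the lock free: in this toy model the first arrival
    in the window wins the race and keeps running; only later ones sleep."""
    window = sorted(t for t in arrivals
                    if unlock_time < t < unlock_time + WAKE_LATENCY)
    return window[1:]

arrivals = [1, 2]          # two running threads hit the lock before tick 3
print(sleepers_with_handoff(arrivals))      # -> [1, 2]: both join the convoy
print(sleepers_with_free_unlock(arrivals))  # -> [2]: tick-1 thread took the lock
```

The model ignores hold times and repeated wake-ups, but it shows why the handoff variant accumulates sleepers whenever arrivals outpace the wake-up latency.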
Re: Address space operations questions
Martin Jambor writes: > Hi, > > I have problems understanding the purpose of different entries of > struct address_space_operations in 2.6 kernels: > > 1. What is bmap for and what is it supposed to do? ->bmap() maps logical block offset within "object" to physical block number. It is used in a few places, notably in the implementation of the FIBMAP ioctl. > > 2. What is the difference between sync_page and write_page? (It is spelt ->writepage(), by the way.) ->sync_page() is an awful misnomer. Usually, when a page IO operation is requested by calling ->writepage() or ->readpage(), the file system queues an IO request (e.g., a disk-based file system may do this by calling submit_bio()), but the underlying device driver does not proceed with this IO immediately, because IO scheduling is more efficient when there are multiple requests in the queue. Only when something really wants to wait for IO completion (wait_on_page_{locked,writeback}() are used to wait for read and write completion respectively) is the IO queue processed. To do this, wait_on_page_bit() calls ->sync_page() (see block_sync_page(), the standard implementation of ->sync_page() for disk-based file systems). So, the semantics of ->sync_page() are roughly "kick the underlying storage driver to actually perform all IO queued for this page, and, maybe, for other pages on this device too". > > 3. What exactly (fs independent) is the relation in between > write_page, prepare_write and commit_write? Does prepare make sure a > page can be written (like allocating space), commit mark it dirty, and > write write it sometime later on? ->prepare_write() and ->commit_write() are only used by generic_file_write() (so, one may argue that they shouldn't be placed into struct address_space at all). 
generic_file_write() has a loop over each page overlapping the portion of the file that the write goes into:

	a_ops->prepare_write(file, page, from, to);
	copy_from_user(...);
	a_ops->commit_write(file, page, from, to);

If a page is partially overwritten, ->prepare_write() has to read the parts of the page that are not covered by the write. ->commit_write() is expected to mark the page (or buffers) and the inode dirty, and to update the inode size if the write extends the file. As for block allocation and transaction handling, this is up to the file system back end. Usually ->commit_write() doesn't start IO by itself, it just marks pages dirty. Write-out is done by balance_dirty_pages_ratelimited(): when the number of dirty pages in the system exceeds some threshold, the kernel calls ->writepages() of dirty inodes. ->writepage() is used in two places:

 - by the VM scanner to write out dirty pages from the tail of the inactive list. This is a "rare" path, because balance_dirty_pages() is supposed to keep the amount of dirty pages under control.

 - by mpage_writepages(): the default implementation of the ->writepages() method.

> > Thank you very much for any insight, > > Martin

Hope this helps.

Nikita.
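The per-page walk that this loop performs is easy to model in userspace. Below is a toy Python sketch (assuming 4 KiB pages; the helper name and the list-of-tuples output are ours, not the kernel's) computing which (from, to) byte range of each page-cache page a given write touches:

```python
PAGE_SIZE = 4096

def write_chunks(pos, count):
    """Return (page_index, from_, to) for every page-cache page that a write
    of `count` bytes at file offset `pos` touches -- the walk that
    generic_file_write() performs around ->prepare_write()/->commit_write()."""
    end = pos + count
    chunks = []
    while pos < end:
        index = pos // PAGE_SIZE           # which page-cache page
        from_ = pos % PAGE_SIZE            # first byte written in this page
        to = min(from_ + (end - pos), PAGE_SIZE)  # one past the last byte
        chunks.append((index, from_, to))
        pos += to - from_
    return chunks

# a 200-byte write at offset 4000 straddles pages 0 and 1; page 0 is only
# partially overwritten, so ->prepare_write() would have to read its head
print(write_chunks(4000, 200))  # -> [(0, 4000, 4096), (1, 0, 104)]
```

Any chunk with from_ > 0 or to < PAGE_SIZE is the partial-overwrite case where ->prepare_write() must first read the uncovered parts of the page.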
Re: [RFC] [PATCH] Generic mpage_writepage() support
Badari Pulavarty writes: > On Tue, 2005-02-15 at 09:54, Andrew Morton wrote: > > Badari Pulavarty <[EMAIL PROTECTED]> wrote: > > > > > > Yep. nobh_prepare_write() doesn't add any bufferheads. But > > > we call block_write_full_page() even for "nobh" case, which > > > does create bufferheads, attaches to the page and operates > > > on them.. > > > > hmm, yeah, OK, we'll attach bh's in that case. It's a rare case though - > > when a dirty page falls off the end of the LRU. There's no particular > > reason why we cannot have a real mpage_writepage() which doesn't use bh's > > and employ that. > > > > I coulda sworn we used to have one. > > Hi Andrew, > > Here is my first version of the mpage_writepage() patch. > I haven't handled the "confused" case yet - I need to > pass a function pointer to handle it. Just for > initial code review. I am still testing it. > > Thanks, > Badari

 > diff -Narup -X dontdiff linux-2.6.10/fs/ext2/inode.c linux-2.6.10.nobh/fs/ext2/inode.c
 > --- linux-2.6.10/fs/ext2/inode.c	2004-12-24 13:33:51.0 -0800

[...]

 > 	return ret;
 > }
 > +
 > +/*
 > + * The generic ->writepage function for address_spaces
 > + */

This function doesn't look generic. It only works correctly with file systems that store a pointer to a buffer head ring in page->private (at least temporarily); otherwise the code after the page_has_buffers(page) check in __mpage_writepage() will corrupt page->private.

Actually, this looks confusing. I thought that the main idea of mpage.c was to get rid of buffer heads and switch everything to bios. But looking at the current code it seems that buffer heads are striking back: the code simply assumes that PG_private means "buffers in page->private", making mpage.c effectively useless for file systems using page->private for something else.

There is another reason why mpage_writepage() is a problematic choice for ->writepage: __mpage_writepage() calls page->mapping->a_ops->writepage() in the "confused" case, which sounds like infinite recursion.

[...] 
 > +	if (page->index >= end_index+1 || !offset) {
 > +		/*
 > +		 * The page may have dirty, unmapped buffers. For example,
 > +		 * they may have been added in ext3_writepage(). Make them
 > +		 * freeable here, so the page does not leak.
 > +		 */
 > +		block_invalidatepage(page, 0);

Shouldn't this be page->mapping->a_ops->invalidatepage(page, 0)? To preserve the external appearance of "genericity", that is. :)

 > +		unlock_page(page);
 > +		return 0;	/* don't care */
 > +	}
 > +
 > +	/*
 > +	 * The page straddles i_size. It must be zeroed out on each and every
 > +	 * writepage invokation because it may be mmapped. "A file is mapped

Typo: should be invocation (at least beyond Australia).

Nikita.
Re: Bufferheads & page-cache reference
Andrew Morton <[EMAIL PROTECTED]> writes: > Badari Pulavarty <[EMAIL PROTECTED]> wrote: >> >> Yep. nobh_prepare_write() doesn't add any bufferheads. But >> we call block_write_full_page() even for "nobh" case, which >> does create bufferheads, attaches to the page and operates >> on them.. > > hmm, yeah, OK, we'll attach bh's in that case. It's a rare case though - > when a dirty page falls off the end of the LRU. There's no particular Maybe DB2 dirties pages through mmap? Nikita.