Re: [patch] [possible race in ext2] Re: how to write get_block?
From: "Stephen C. Tweedie" [EMAIL PROTECTED]
Date: Mon, 11 Oct 1999 17:34:36 +0100 (BST)

> The _fast_ quick fix is to maintain a per-inode list of dirty buffers
> and to invalidate that list when we do a delete. This works for
> directories if we only support truncate back to zero --- it obviously
> gets things wrong if we allow partial truncates of directories (but
> why would anyone want to allow that?!) This would have minimal
> performance implication and would also allow fast fsync() of indirect
> block metadata for regular files.

I've actually had patches to do fast fsync for some time, but I
thought you said you had some changes in the queue which would make
the need for this obsolete, so I didn't bother to submit this, since
it is a bit of a hack. Here it is though, for folks to comment upon.
Code to invalidate the list of dirty buffers is missing from this
code, but it wouldn't be hard to add.

					- Ted

Patch generated: on Wed Aug 18 15:01:49 EDT 1999 by [EMAIL PROTECTED]
against Linux version 2.2.10

===
RCS file: fs/ext2/RCS/fsync.c,v
retrieving revision 1.1
diff -u -r1.1 fs/ext2/fsync.c
--- fs/ext2/fsync.c	1999/07/21 10:45:22	1.1
+++ fs/ext2/fsync.c	1999/07/21 10:45:26
@@ -250,36 +250,83 @@
 	return err;
 }
 
-/*
- * File may be NULL when we are called. Perhaps we shouldn't
- * even pass file to fsync ?
- */
-
-int ext2_sync_file(struct file * file, struct dentry *dentry)
+static int sync_inode(struct inode *inode)
 {
-	int wait, err = 0;
-	struct inode *inode = dentry->d_inode;
-
+	int i, wait, err = 0;
+	__u32 *blklist;
+
 	if (S_ISLNK(inode->i_mode) && !(inode->i_blocks))
-		/*
-		 * Don't sync fast links!
-		 */
-		goto skip;
+		return 0;
 
-	for (wait=0; wait<=1; wait++)
-	{
-		err |= sync_direct (inode, wait);
-		err |= sync_indirect (inode,
+	if (!S_ISDIR(inode->i_mode) && inode->u.ext2_i.i_ffsync_flag) {
+		blklist = inode->u.ext2_i.i_ffsync_blklist;
+		for (wait = 0; wait <= 1; wait++) {
+			for (i = 0; i < inode->u.ext2_i.i_ffsync_ptr; i++) {
+#if 0 /* Debugging */
+				if (!wait)
+					printk("Fast sync: %d\n", blklist[i]);
+#endif
+				err |= sync_block(inode, blklist[i], wait);
+			}
+		}
+	} else {
+		for (wait = 0; wait <= 1; wait++) {
+			err |= sync_direct (inode, wait);
+			err |= sync_indirect (inode,
 				inode->u.ext2_i.i_data+EXT2_IND_BLOCK,
-				wait);
+				wait);
-		err |= sync_dindirect (inode,
+			err |= sync_dindirect (inode,
 				inode->u.ext2_i.i_data+EXT2_DIND_BLOCK,
-				wait);
+				wait);
-		err |= sync_tindirect (inode,
+			err |= sync_tindirect (inode,
 				inode->u.ext2_i.i_data+EXT2_TIND_BLOCK,
-				wait);
+				wait);
+		}
 	}
-skip:
+	inode->u.ext2_i.i_ffsync_flag = 1;
+	inode->u.ext2_i.i_ffsync_ptr = 0;
 	err |= ext2_sync_inode (inode);
+	return err;
+}
+
+/*
+ * File may be NULL when we are called by msync on a vma.  In the
+ * future, the VFS layer should be changed to not pass the struct file
+ * parameter to the fsync function, since it's not used by any of the
+ * implementations (and the dentry parameter is all that we need).
+ */
+int ext2_sync_file(struct file * file, struct dentry *dentry)
+{
+	int err = 0;
+
+	err = sync_inode(dentry->d_inode);
+	if (dentry->d_parent && dentry->d_parent->d_inode)
+		err |= sync_inode(dentry->d_parent->d_inode);
+	return err ? -EIO : 0;
+}
+
+/*
+ * This function adds a list of blocks to be written out by fsync.  If
+ * it exceeds NUM_EXT2_FFSYNC_BLKS, then we turn off the fast fsync flag.
+ */
+void ext2_ffsync_add_blk(struct inode *inode, __u32 blk)
+{
+	int i;
+	__u32 *blklist;
+
+	if (inode->u.ext2_i.i_ffsync_flag == 0)
+		return;
+#if 0 /* Debugging */
+	printk("Add fast sync: %d\n", blk);
+#endif
+	blklist = inode->u.ext2_i.i_ffsync_blklist;
+	for (i = 0; i < inode->u.ext2_i.i_ffsync_ptr; i++)
+		if (blklist[i] == blk)
+			return;
+	if (inode->u.ext2_i.i_ffsync_ptr < NUM_EXT2_FFSYNC_BLKS) {
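The bookkeeping that ext2_ffsync_add_blk() does above is easy to model
outside the kernel. The following is a minimal stand-alone sketch of the
same idea (FFSYNC_BLKS, ffsync_add and ffsync_invalidate are illustrative
names of mine, not the kernel's): record dirty blocks up to a fixed bound,
skip duplicates, disable the fast path on overflow, and reset the list on
delete so a later fsync cannot write blocks that no longer belong to the
inode.

```c
#include <assert.h>

/* Hypothetical bound on the per-inode fast-fsync list (the real patch
 * uses NUM_EXT2_FFSYNC_BLKS). */
#define FFSYNC_BLKS 8

struct ffsync_list {
	unsigned int blk[FFSYNC_BLKS];
	int nr;		/* number of recorded blocks */
	int valid;	/* cleared when the list overflows */
};

/* Record a dirty block; fall back to a full-tree sync on overflow,
 * mirroring the way the patch turns off i_ffsync_flag. */
static void ffsync_add(struct ffsync_list *l, unsigned int blk)
{
	int i;

	if (!l->valid)
		return;
	for (i = 0; i < l->nr; i++)
		if (l->blk[i] == blk)
			return;			/* already recorded */
	if (l->nr < FFSYNC_BLKS)
		l->blk[l->nr++] = blk;
	else
		l->valid = 0;			/* too many blocks: disable fast path */
}

/* On delete (truncate back to zero) the list must be dropped so fsync
 * does not write blocks that have been freed; afterwards it is empty
 * and usable again. */
static void ffsync_invalidate(struct ffsync_list *l)
{
	l->nr = 0;
	l->valid = 1;
}
```

The overflow rule is what keeps the structure cheap: a handful of slots
cover the common fsync-after-small-write case, and anything bigger simply
falls back to walking the whole block tree.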
Re: [patch] [possible race in ext2] Re: how to write get_block?
On 11 Oct 1999 17:58:54 -0500, [EMAIL PROTECTED] (Eric W. Biederman) said:

>> Ultimately we really want to have indirect blocks, and the
>> directory, in the page cache, as it should result in more uniform
>> code, and faster partial truncates (as well as faster syncs).

Stephen C. Tweedie [EMAIL PROTECTED] wrote:

> There is one major potential future problem with moving this to the
> page cache. At some point I want to be able to extend the large (64G)
> memory support on Intel to include the page cache in high memory. The
> buffer cache would still live in low memory. If we do that, then
> moving filesystem metadata out of permanently-mapped buffer memory
> and into the page cache is going to complicate directory and indirect
> operations significantly.

Because of write ordering, I presume -- the dirty buffers you most
want to be able to write are also the ones you would most want to move
out to large memory?

Is there a problem with the simple answer (don't move dirty buffers
there)? It seems like the only problem there would be memory pressure,
nothing algorithmic. If so, I'm guessing that the trick is how best to
preserve write order for dirty blocks in large memory?

Unless I'm way off base, the second simplest answer is: don't move
dirty metadata blocks out to large memory. That's going to cost you an
if () -- and I suppose it may mean some extra L1 cache pressure for
machines with large memory support compiled in -- but otherwise it
doesn't seem any worse than the current situation.

-- Raul
Re: [patch] [possible race in ext2] Re: how to write get_block?
Hi,

On Sat, 9 Oct 1999 23:53:01 +0200 (CEST), Andrea Arcangeli
[EMAIL PROTECTED] said:

> What I said about bforget in my old email is still true. The _only_
> reason for using bforget instead of brelse is to get buffer
> performance (which in 2.3.x is not as interesting as in 2.2.x, since
> in 2.3.x flushpage is doing the interesting stuff with the real
> data). The current design bug in 2.3.20pre2 and previous has nothing
> to do with bforget.

Nope. We spent a fair amount of effort on this with the page cache
changes. The ext2 truncate code is really, really careful to provide
bforget() with the correct conditions to get rid of the buffer: it is
a closely designed interaction between delete and bforget. Remember,
we have exactly the same potential problems when freeing dirty
indirect blocks as we have when freeing directory blocks. There's a
big comment near the top of fs/ext2/truncate.c about the way in which
this works.

> The right fix is to do a query on the hash every time you overlap a
> buffer on the page cache.

Ouch --- that pays the penalty for normal data blocks every time you
pull them into cache. No way.

The correct way to extend the current rules cleanly is to make the
truncate code do a bforget() on data blocks as well as indirect blocks
if, and only if, the file is not a regular file. That will deal with
the symlink case too, and will mean that we are using the same
mechanism for all of our dynamically-allocatable metadata.

--Stephen
Re: [patch] [possible race in ext2] Re: how to write get_block?
Hi,

On 11 Oct 1999 17:58:54 -0500, [EMAIL PROTECTED] (Eric W. Biederman) said:

> What about adding to the end of ext2_alloc_block:
>
> 	bh = get_hash_table(inode->i_dev, result, inode->i_sb->s_blocksize);
> 	/* something is playing with our fresh block, make them stop. ;-) */
> 	if (bh) {
> 		if (buffer_dirty(bh)) {
> 			mark_buffer_clean(bh);
> 			wait_on_buffer(bh);
> 		}
> 		bforget(bh);
> 	}

Again, it's a lot of extra unnecessary lookups. The advantages of
having a dirty buffer list include being fast, and also massively
speeding up the metadata update part of fsync.

> Ultimately we really want to have indirect blocks, and the directory,
> in the page cache, as it should result in more uniform code, and
> faster partial truncates (as well as faster syncs).

There is one major potential future problem with moving this to the
page cache. At some point I want to be able to extend the large (64G)
memory support on Intel to include the page cache in high memory. The
buffer cache would still live in low memory. If we do that, then
moving filesystem metadata out of permanently-mapped buffer memory and
into the page cache is going to complicate directory and indirect
operations significantly.

--Stephen
Re: [patch] [possible race in ext2] Re: how to write get_block?
On Tue, 12 Oct 1999, Stephen C. Tweedie wrote:

> changes. The ext2 truncate code is really, really careful to provide

I was _not_ talking about ext2 at all. I was talking about the bforget
and brelse semantics. As bforget falls back to brelse, you shouldn't
expect bforget to really destroy the buffer. It may do that, but only
if it's possible. You are using a _trick_ to let bforget do the thing
you want. If every filesystem needs such a thing, it's ugly to have to
duplicate the current ext2 tricks all over the place.

> The correct way to extend the current rules cleanly is to make the
> truncate code do a bforget() on data blocks as well as indirect
> blocks if, and only if, the file is not a regular file. That will
> deal with the symlink case too, and will mean that we are using the
> same mechanism for all of our dynamically-allocatable metadata.

IMHO we should drop all these race-prone tricks instead of adding new
ones. If you use bforget to avoid fs corruption you are asking for
trouble, and IMHO you are going in the wrong direction.

You may change bforget to do the right thing cleanly, but it will
block badly. You shouldn't block unless you are the one who wants to
reuse the buffer that is currently under I/O or dirty. So if you
change bforget, you'll block in the wrong place.

Andrea
Re: [patch] [possible race in ext2] Re: how to write get_block?
On 11 Oct 1999, Eric W. Biederman wrote:

> What about adding to the end of ext2_alloc_block:

It's _equally_ slow. Did you see my patch? I prefer to do the query at
the higher level to save CPU cache.

Andrea
Re: [patch] [possible race in ext2] Re: how to write get_block?
Hi,

On Tue, 12 Oct 1999 15:39:35 +0200 (CEST), Andrea Arcangeli
[EMAIL PROTECTED] said:

> On Tue, 12 Oct 1999, Stephen C. Tweedie wrote:
>> changes. The ext2 truncate code is really, really careful to provide
>
> I was _not_ talking about ext2 at all. I was talking about the
> bforget and brelse semantics. As bforget falls back to brelse, you
> shouldn't expect bforget to really destroy the buffer. It may do
> that, but only if it's possible. You are using a _trick_ to let
> bforget do the thing you want. If every filesystem needs such a
> thing, it's ugly to have to duplicate the current ext2 tricks all
> over the place.

Andrea, you are just trying to relax carefully designed buffer cache
semantics which are relied upon by the current filesystems. Saying it
is a trick doesn't help matters much.

> If you use bforget to avoid fs corruption you are asking for trouble,
> and IMHO you are going in the wrong direction.

Umm, your last proposal was to do a hash lookup on each new page cache
buffer mapping. That is a significant performance cost, which IMHO is
not exactly the right direction either. :)

> You may change bforget to do the right thing cleanly, but it will
> block badly. You shouldn't block unless you are the one who wants to
> reuse the buffer that is currently under I/O or dirty. So if you
> change bforget, you'll block in the wrong place.

Unfortunately, the plain fact is that the current page cache relies on
data blocks never being in the buffer cache hash lists, and certainly
never being dirty in that cache. _That_ is the invariant we need to
preserve. truncate already does that for regular files. Please, let's
try to keep that clean invariant rather than increase the cost of
normal page cache operations.

--Stephen
Re: [patch] [possible race in ext2] Re: how to write get_block?
On Tue, 12 Oct 1999, Stephen C. Tweedie wrote:

> Umm, your last proposal was to do a hash lookup on each new page
> cache buffer mapping. That is a significant performance cost, which
> IMHO is not exactly the right direction either. :)

It's not obvious that performance is the only thing to consider. It's
not obvious to me that `less performance' == `wrong way', as you state
above. That was the simpler way I had to solve the design bug, in a
way that looks obviously right to me (no room for mistakes or races my
way), and without breaking all the lowlevel filesystems.

That said, I will soon think about the better way you are proposing,
which will break all filesystems again (except NFS and SMBFS, where
the metadata lives on the server). I had not thought carefully about
this metadata-in-page-cache way yet. (I wanted to do that yesterday,
but I destroyed all my filesystems with a cp /dev/zero /var/tmp due to
an fdisk bug, and so I wasted some good time ...) If there is some
code available to look at, I'd like to see it (as it will save me
reading lots of emails to understand what's going on).

Thanks!

Andrea
Re: [patch] [possible race in ext2] Re: how to write get_block?
G'day!

On Tue, 12 Oct 1999, Stephen C. Tweedie wrote:
> ...
> Andrea, you are just trying to relax carefully designed buffer cache
> semantics which are relied upon by the current filesystems. Saying it
> is a trick doesn't help matters much.

Andrea's right in that the semantics make it far too easy to introduce
bugs. How about making bforget set a flag so that the buffer head is
destroyed as soon as it's released? If a filesystem re-uses the
buffer, mark_buffer_dirty can clear this destroyed flag (to handle the
case where a buffer head becomes legitimately reused).

-ben
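Ben's flag idea can be sketched in a few lines. The model below is
illustrative only (the state bits, the toy struct and the exact call names
are mine, not the kernel's): bforget() marks the buffer "doomed", the
buffer head is freed when the last reference is dropped while the doom bit
is still set, and a later mark_buffer_dirty() rescinds the doom bit for a
legitimately reused block.

```c
#include <assert.h>

/* Hypothetical buffer-head state bits for the destroy-on-release scheme. */
#define BH_DIRTY	0x1
#define BH_DOOMED	0x2	/* destroy as soon as refcount hits zero */

struct toy_buf {
	int state;
	int refcount;
	int destroyed;		/* stands in for freeing the buffer head */
};

/* Reuse of the block cancels a pending destroy. */
static void mark_buffer_dirty(struct toy_buf *b)
{
	b->state |= BH_DIRTY;
	b->state &= ~BH_DOOMED;
}

static void put_and_maybe_destroy(struct toy_buf *b)
{
	if (--b->refcount == 0 && (b->state & BH_DOOMED))
		b->destroyed = 1;
}

/* bforget(): drop the dirty bit, doom the buffer, release our ref. */
static void bforget(struct toy_buf *b)
{
	b->state &= ~BH_DIRTY;
	b->state |= BH_DOOMED;
	put_and_maybe_destroy(b);
}

/* brelse(): plain release; destroys only if someone doomed the buffer. */
static void brelse(struct toy_buf *b)
{
	put_and_maybe_destroy(b);
}
```

The point of the scheme is that a filesystem no longer has to set up the
"correct conditions" before calling bforget(): the decision to destroy is
deferred to the last release, and an intervening reuse simply clears it.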
Re: [patch] [possible race in ext2] Re: how to write get_block?
"Stephen C. Tweedie" [EMAIL PROTECTED] writes:

> Hi,
>
> On Sun, 10 Oct 1999 16:57:18 +0200 (CEST), Andrea Arcangeli
> [EMAIL PROTECTED] said:
>
>> My point was that even being forced to do a lookup before creating
>> each empty buffer will still be faster than 2.2.x, as in 2.3.x the
>> hash will contain only metadata. Fewer elements means faster
>> lookups.
>
> The _fast_ quick fix is to maintain a per-inode list of dirty buffers
> and to invalidate that list when we do a delete. This works for
> directories if we only support truncate back to zero --- it obviously
> gets things wrong if we allow partial truncates of directories (but
> why would anyone want to allow that?!) This would have minimal
> performance implication and would also allow fast fsync() of indirect
> block metadata for regular files.

What about adding to the end of ext2_alloc_block:

	bh = get_hash_table(inode->i_dev, result, inode->i_sb->s_blocksize);
	/* something is playing with our fresh block, make them stop. ;-) */
	if (bh) {
		if (buffer_dirty(bh)) {
			mark_buffer_clean(bh);
			wait_on_buffer(bh);
		}
		bforget(bh);
	}

This is in a relatively slow/uncommon path and also catches races if
indirect blocks are allocated to a file, instead of just directory
syncs. Ultimately we really want to have indirect blocks, and the
directory, in the page cache, as it should result in more uniform
code, and faster partial truncates (as well as faster syncs).

Eric
Re: [patch] [possible race in ext2] Re: how to write get_block?
On Mon, 11 Oct 1999, Stephen C. Tweedie wrote:

> Hi,
>
> On Sun, 10 Oct 1999 16:57:18 +0200 (CEST), Andrea Arcangeli
> [EMAIL PROTECTED] said:
>
>> My point was that even being forced to do a lookup before creating
>> each empty buffer will still be faster than 2.2.x, as in 2.3.x the
>> hash will contain only metadata. Fewer elements means faster
>> lookups.
>
> The _fast_ quick fix is to maintain a per-inode list of dirty buffers
> and to invalidate that list when we do a delete. This works for
> directories if we only support truncate back to zero --- it obviously
> gets things wrong if we allow partial truncates of directories (but
> why would anyone want to allow that?!)

AFFS. OTOH, there we are dropping the known bh every time.
Re: [patch] [possible race in ext2] Re: how to write get_block?
Hi,

On Sun, 10 Oct 1999 16:57:18 +0200 (CEST), Andrea Arcangeli
[EMAIL PROTECTED] said:

> My point was that even being forced to do a lookup before creating
> each empty buffer will still be faster than 2.2.x, as in 2.3.x the
> hash will contain only metadata. Fewer elements means faster lookups.

The _fast_ quick fix is to maintain a per-inode list of dirty buffers
and to invalidate that list when we do a delete. This works for
directories if we only support truncate back to zero --- it obviously
gets things wrong if we allow partial truncates of directories (but
why would anyone want to allow that?!) This would have minimal
performance implication and would also allow fast fsync() of indirect
block metadata for regular files.

--Stephen
Re: [patch] [possible race in ext2] Re: how to write get_block?
On Sat, 9 Oct 1999, Andrea Arcangeli wrote:

> On Fri, 8 Oct 1999, Mikulas Patocka wrote:
>
>> Here goes a quick'n'dirty patch. It does bforget(). It should
>> prevent file corruption.
>
> Wrong patch. bforget gives you no guarantee at all; bforget always
> falls back to brelse if necessary. What I said about bforget in my
> old email is still true. The _only_ reason for using bforget instead
> of brelse is to get buffer performance (which in 2.3.x is not as
> interesting as in 2.2.x, since in 2.3.x flushpage is doing the
> interesting stuff with the real data).

Yes, there is a race when bforget is called while the buffer is being
flushed. mark_buffer_clean(); wait_on_buffer(); should be used.

Mikulas Patocka
Re: [patch] [possible race in ext2] Re: how to write get_block?
On Fri, 8 Oct 1999, Mikulas Patocka wrote:

> Here goes a quick'n'dirty patch. It does bforget(). It should prevent
> file corruption.

Wrong patch. bforget gives you no guarantee at all; bforget always
falls back to brelse if necessary. What I said about bforget in my old
email is still true. The _only_ reason for using bforget instead of
brelse is to get buffer performance (which in 2.3.x is not as
interesting as in 2.2.x, since in 2.3.x flushpage is doing the
interesting stuff with the real data). The current design bug in
2.3.20pre2 and previous has nothing to do with bforget.

The right fix is to do a query on the hash every time you overlap a
buffer on the page cache. If you find the buffer you are going to
overlap in the page cache in the hash, then you'll have to flush it
exactly as flushpage does. My way has no cache coherency downside and
it's completely safe. It will impact the CPU a bit, though -- still
far better than 2.2.x, because the hash will be almost empty in 2.3.x
compared with the huge one of 2.2.x. I just had to do something like
that to fix the ramdisk device. Now I'll implement the thing.

Andrea
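Andrea's proposal reduces to a lookup-before-map rule. The toy model below
(all names and the trivially small direct-mapped "hash" are mine, purely
for illustration) shows the shape of it: before a page-cache page takes
ownership of a disk block, probe the buffer-cache hash, and if a stale
alias is found, flush it first, exactly as flushpage does in the other
direction. The common case pays only the extra lookup.

```c
#include <assert.h>
#include <string.h>

#define TOY_HASH 32

/* Toy stand-in for the buffer-cache hash (0 marks an empty slot). */
struct toy_bcache {
	long blk[TOY_HASH];
	int evicted;
};

static void bcache_add(struct toy_bcache *c, long blk)
{
	c->blk[blk % TOY_HASH] = blk;
}

/* Called before mapping block `blk` into the page cache.  Returns 1 if
 * a stale buffer-cache alias had to be flushed, 0 in the common case
 * where the probe finds nothing. */
static int overlap_block(struct toy_bcache *c, long blk)
{
	long *slot = &c->blk[blk % TOY_HASH];

	if (*slot == blk) {	/* alias found: flush it before mapping */
		*slot = 0;
		c->evicted++;
		return 1;
	}
	return 0;
}
```

This makes Stephen's objection concrete as well: overlap_block() runs for
every data block pulled into the page cache, and almost always returns 0,
which is exactly the per-block cost he refuses to pay.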
Re: [possible race in ext2] Re: how to write get_block?
On Fri, 8 Oct 1999, Alexander Viro wrote:

>> yep we knew about this problem ... it's not quite an easy hack
>> though. Putting the directory block cache (and symlink block cache)
>> into the page cache would be the preferred method - this would also
>> clean up the code a lot i think.
>
> More or less random comments/questions:
>
> 1) Is ext2_bread() vs. bread() separation in fs/ext2/* a first step
>    to that? (I.e. data blocks vs. inode/bitmap/indirect ones)
>
> 2) Some time ago there was a pretty impressive patch excluding a
>    large part of ext2_find_entry() (it caches the offset into the
>    dentry and invalidates it if needed) and I have code for cyclic
>    lookups in a directory (cheap hack, but it really speeds things
>    up for ls -l and friends). Both would go nicely into this scheme.
>
> 3) We ought to switch from (bh+pointer) to (page+pointer) in namei.c.
>
> 4) I still think that indirect blocks may go into the page cache -
>    that would further simplify truncate(). OTOH the method used by
>    BSD (the indirect block covering addresses from n to
>    n+block_size*pointers_per_block gets address -n, the double
>    indirect block covering the area from n gets -n-block_size, the
>    triple indirect -n-2*block_size) will be wasteful for situations
>    when the block size is smaller than the page.

yes, all these cleanups would be very nice to have. I'm not aware of
any rathole in that area - even those small steps we took in early 2.3
proved how much cleaner things could get.

The memory usage point on low memory machines - i dont think it holds.
Small directories (well, even larger ones ...) will use 4k, but since
the 2.0 kernel moved to the page cache, small files already used up 4k
as well, and there are more small files than small directories. In the
case of indirect pointers the 2.3 page cache only needs the indirect
pointer once; then it can be deallocated by the cache if there is
memory pressure - while 2.2 needed the indirect pointer for every
write(). So i suspect 2.4 might even do better on small-memory boxes
than 2.2.

> 5) Who works on that? I would try to do the thing, but if there
>    already is some code I'd like to look at it, indeed.

nobody AFAIK.

-- mingo
[patch] [possible race in ext2] Re: how to write get_block?
> Sheesh... Yes, you are right. It was not a problem with the old
> buffer cache, but now... Arrgh. Races in truncate(), film at 11.
>
> Pheeewww... Unless I'm seriously misunderstanding what happens,
> you've just dug out a race in ext2. It doesn't do bforget() on the
> data blocks in directories. If I'm not mistaken the same scenario
> applies here just fine.

Here goes a quick'n'dirty patch. It does bforget(). It should prevent
file corruption.

Mikulas Patocka

--- linux-2.3.19.tar.gz#utar/linux/fs/ext2/truncate.c	Tue Jun 29 18:22:08 1999
+++ linux/fs/ext2/truncate.c	Fri Oct  8 18:40:36 1999
@@ -101,6 +101,14 @@
  * -- WSH, 1998
  */
 
+/* M.P.: quick dirty patch to prevent data corruption */
+
+static void brelse_blocks(struct inode *i, unsigned long block_to_free, unsigned long count)
+{
+	while (count--)
+		bforget(get_hash_table(i->i_dev, block_to_free++, i->i_sb->s_blocksize));
+}
+
 /*
  * Check whether any of the slots in an indirect block are
  * still in use, and if not free the block.
@@ -184,14 +192,17 @@
 		else if (block_to_free == tmp - free_count)
 			free_count++;
 		else {
+			if (!S_ISREG(inode->i_mode)) brelse_blocks(inode, block_to_free, free_count);
 			ext2_free_blocks (inode, block_to_free, free_count);
 		free_this:
 			block_to_free = tmp;
 			free_count = 1;
 		}
 	}
-	if (free_count > 0)
+	if (free_count > 0) {
+		if (!S_ISREG(inode->i_mode)) brelse_blocks(inode, block_to_free, free_count);
 		ext2_free_blocks (inode, block_to_free, free_count);
+	}
 	return retry;
 }
@@ -247,14 +258,17 @@
 		else if (block_to_free == tmp - free_count)
 			free_count++;
 		else {
+			if (!S_ISREG(inode->i_mode)) brelse_blocks(inode, block_to_free, free_count);
 			ext2_free_blocks (inode, block_to_free, free_count);
 		free_this:
 			block_to_free = tmp;
 			free_count = 1;
 		}
 	}
-	if (free_count > 0)
+	if (free_count > 0) {
+		if (!S_ISREG(inode->i_mode)) brelse_blocks(inode, block_to_free, free_count);
 		ext2_free_blocks (inode, block_to_free, free_count);
+	}
 	/*
 	 * Check the block and dispose of the ind_bh buffer.
 	 */
Re: [possible race in ext2] Re: how to write get_block?
On Fri, 8 Oct 1999, Ingo Molnar wrote:

> On Fri, 8 Oct 1999, Alexander Viro wrote:
>
>> Stephen, Ingo, could you look at the stuff above? Methinks it means
>> that we either must separate ext2_truncate() for directories (doing
>> bforget() on the data blocks) _or_ put the directory blocks into
>> the page cache and do block_flush(). I'd rather prefer the latter.
>> Moreover, we can stuff the indirect blocks into the page cache too
>> (using negative offsets). That would leave us with pure buffer
>> cache only for static structures.
>
> yep we knew about this problem ... it's not quite an easy hack
> though. Putting the directory block cache (and symlink block cache)
> into the page cache would be the preferred method - this would also
> clean up the code a lot i think.

More or less random comments/questions:

1) Is ext2_bread() vs. bread() separation in fs/ext2/* a first step to
   that? (I.e. data blocks vs. inode/bitmap/indirect ones)

2) Some time ago there was a pretty impressive patch excluding a large
   part of ext2_find_entry() (it caches the offset into the dentry and
   invalidates it if needed) and I have code for cyclic lookups in a
   directory (cheap hack, but it really speeds things up for ls -l and
   friends). Both would go nicely into this scheme.

3) We ought to switch from (bh+pointer) to (page+pointer) in namei.c.

4) I still think that indirect blocks may go into the page cache -
   that would further simplify truncate(). OTOH the method used by BSD
   (the indirect block covering addresses from n to
   n+block_size*pointers_per_block gets address -n, the double
   indirect block covering the area from n gets -n-block_size, the
   triple indirect -n-2*block_size) will be wasteful for situations
   when the block size is smaller than the page.

5) Who works on that? I would try to do the thing, but if there
   already is some code I'd like to look at it, indeed.

Comments?
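The BSD-style encoding in point 4 is just a fixed mapping from a file
region to a negative page-cache offset. A one-line sketch (meta_offset is
an illustrative name; `level` is 1/2/3 for single/double/triple indirect,
and `n` is the first file address the block covers):

```c
#include <assert.h>

/* Metadata blocks share the file's own page-cache address space by
 * living at negative offsets: -n for the single indirect block
 * covering addresses from n, shifted down by one block size per extra
 * level of indirection so the three kinds never collide. */
static long meta_offset(long n, int level, long block_size)
{
	return -n - (long)(level - 1) * block_size;
}
```

With 1k blocks and 4k pages, each of these negative offsets still claims a
whole page-cache page, which is presumably the waste being pointed at when
the block size is smaller than the page.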
Re: how to write get_block?
Alexander Viro wrote:

>> I think it's wrong to fix synchronization on a problem-by-problem
>> basis: VFS synchronization for read()/write()/truncate()/file locks
>> is broken, and I would try to find one solution instead of
>> patch-work.
>
> There are different things to protect. E.g. i_sem is (ab)used for
> dcache race-prevention. Lumping everything together is not a good
> idea, IMO.

Of course. But it's wrong to start fixing one end of the problem
before you see the complete problem: if you start with a rw-semaphore
(called ERESOURCE in the WinNT kernel, if you search for a name), then
I'm 99% sure that this code will die before 2.4.

> Wait a minute. It's exactly the reason why we should not do it in
> sys_write() - synchronization issues vary depending on _what_ you
> are accessing.

Think about f_pos: VFS hides the difference between sys_pread() and
sys_read(). Filesystems can't implement the locking. I think VFS
should (and must) offer 3 or 4 locking options through i_flags,
otherwise every filesystem must reinvent the wheel. (Do nothing;
protect f_pos; EOF and NFSv2 protection; mandatory file locks.)

> broken so badly, that... well, for example there is a nice race with
> mmap().

Mandatory file locks and mmap() are incompatible; there is no chance
to fix that. E.g. under Windows, all file locks are mandatory locks.
This means that no one can use mmap() for databases; that's a known
and unsolvable problem. I'll check the POSIX standard, perhaps we can
just reject mmap() or file locks in this situation.

Btw, look at default_llseek(): it's not POSIX compliant because it
does not check for filesystem-specific file size limits. As a result,
ext2 has its own llseek function. A few weeks ago I noticed that UDF
has a special llseek function. That's the wrong approach - I posted a
patch which fixes default_llseek(), but it was ignored.

-- Manfred
Re: how to write get_block?
Andrea Arcangeli wrote:

> I really don't think it makes sense to let write(2) be SMP parallel,
> as it would be unreliable and thus useless. So basically allowing
> more writes to enter the critical section looks like a useless
> improvement and a loss of robustness to me.

a) we must allow concurrent write operations for pipes (O_NONBLOCK).
b) are you sure that these statements are true for a large database?

I think the right approach would be new functions [public, because
needed by nfsd and arch/*/kernel/sys*.c]:

	int inode_lock_write(inode, filp, offset, len);
	...
	int inode_lock_rename(inode_source, inode_dest);
	...

and flags so that the f_ops implementation can choose the amount of
synchronization it needs. Or you could add new function pointers to
f_ops/i_ops for synchronization.

-- Manfred
Re: how to write get_block?
>> Hi. I'm porting my HPFS driver to 2.3.18 and I don't know how to
>> write get_block properly. When the user does lseek(far beyond file
>> end) and write(), get_block must allocate a lot of sectors and
>> clear them (HPFS doesn't support holes). The question is: what call
>> should I use to clear these sectors?
>
> Use the (fixed) variant of the FAT code, i.e. block_write_cont_page()
> + an equivalent of fat_write_partial_page().

The only thing this function needs from the fat filesystem is
i_realsize. What about replacing i_realsize with i_size (I guess they
have identical values at the beginning) and moving it to /fs/buffer.c
to let other filesystems use it?

>> What should I do with it? Is there any function that I didn't
>> notice? Of course I could write routines to directly touch the page
>> cache and attach buffers there, but I think it's dirty. Especially
>> when other filesystems also need to do the same (fat currently
>> crashes when writing beyond file end, other filesystems aren't
>> ported yet)
>
> FAT should be OK now.

How do you manage freeing buffers on rmdir? I don't see any bforget
there. If you don't free buffers pointing to deallocated space, it can
cause data corruption.

Mikulas Patocka
Re: how to write get_block?
On Fri, 8 Oct 1999, Manfred Spraul wrote:

> Alexander Viro wrote:
>> All stuff with truncate()/write() is racey right now, AFAICS. The
>> next thing I'm going to do is a (dumb) implementation of blocking
>> rwlocks, write() being a reader and truncate() a writer.
>
> IMHO that's the wrong approach:
> 1) overlapping concurrent write operations violate NFSv2.

Yes, that is serious.

> 2) the pipe code needs O_NONBLOCK characteristics, i.e. no
>    synchronization is allowed. The 2.2 code uses an ugly hack:
>    sys_write() acquires i_sem, and inside the pipe code the
>    semaphore is dropped.

Huh? It's not sys_write(), it's generic_file_write(). Pipes do not
come even close to that place. And I'm talking about 2.3 - 2.2 is
_way_ too different.

> 3) do not forget f_pos when you implement synchronization: even
>    concurrent reads are not allowed if both use the same file
>    pointer.

Ditto. It's a separate issue.

> What about these ideas:
> a) rename i_sem (*), and add public functions such as
>    inode_lock_exclusive(), inode_lock_truncate(), inode_lock_write()
>    etc. Currently, everyone directly accesses the VFS
>    synchronization (nfsd, drivers/char/tty_io.c,
>    arch/*/kernel/sys*.c).
> b) the requirements of the VFS users differ completely, add flags:
>    0: synchronize nothing, e.g. pipes.
>    I_FPOS_SYNC: synchronize f_pos access, nothing else.
>    I_FILE_SYNC: multiple read, non-overlapping write, truncate.
>    Since it will be hidden by inode_lock_??(), it will be easy to
>    optimize the implementation as needed. E.g. I'm pretty sure that
>    Stephen will not like it if raw-io is slowed down by a multiple
>    reader-single writer synchronization.

Erm? Writer access is for truncate(), and I don't see where the raw-io
comes into the picture.
Re: how to write get_block?
On Fri, 8 Oct 1999, Manfred Spraul wrote:

> I couldn't find it in the POSIX standard, but at least some UNIXes
> don't allow it:

Sure. We are kinda-sorta preventing it too. But there is a window when
both the shared mmap() and lockf() may pass (if mmap() comes first,
overlaps the memory area that was the last mmap to the unlinked file,
and fcntl() happens before the file is finally removed).

OK, folks, I'm going down - 18 hours of tube time have some unpleasant
effects.
Re: how to write get_block?
Alexander Viro wrote:

> All stuff with truncate()/write() is racey right now, AFAICS. The
> next thing I'm going to do is a (dumb) implementation of blocking
> rwlocks, write() being a reader and truncate() a writer.

IMHO that's the wrong approach:

1) overlapping concurrent write operations violate NFSv2.
2) the pipe code needs O_NONBLOCK characteristics, i.e. no
   synchronization is allowed. The 2.2 code uses an ugly hack:
   sys_write() acquires i_sem, and inside the pipe code the semaphore
   is dropped.
3) do not forget f_pos when you implement synchronization: even
   concurrent reads are not allowed if both use the same file pointer.

What about these ideas:

a) rename i_sem (*), and add public functions such as
   inode_lock_exclusive(), inode_lock_truncate(), inode_lock_write()
   etc. Currently, everyone directly accesses the VFS synchronization
   (nfsd, drivers/char/tty_io.c, arch/*/kernel/sys*.c).
b) the requirements of the VFS users differ completely, add flags:
   0: synchronize nothing, e.g. pipes.
   I_FPOS_SYNC: synchronize f_pos access, nothing else.
   I_FILE_SYNC: multiple read, non-overlapping write, truncate.

Since it will be hidden by inode_lock_??(), it will be easy to
optimize the implementation as needed. E.g. I'm pretty sure that
Stephen will not like it if raw-io is slowed down by a multiple
reader-single writer synchronization.

-- Manfred

(*) I prefer compiler errors to missing synchronization. You could
rename it back to i_sem later.
Re: how to write get_block?
> I'm porting my HPFS driver to 2.3.18 and I don't know how to write
> get_block properly. When the user does lseek(far beyond file end) and
> write(), get_block must allocate a lot of sectors and clear them
> (HPFS doesn't support holes). The question is: what call should I use
> to clear these sectors? Using bread/mark_buffer_dirty/brelse is a bad
> idea - when the block is read into the page cache later, there are
> two buffers pointing to the same block => data corruption. Calling
> page cache functions to create buffers? get_block is called from the
> page cache layer, thus calling anything from there makes a deadlock
> possible.

I think that if you first allocate and fill in references to all the
needed blocks without initializing them, and then initialize the data
through block_write_full_page() or block_write_partial_page(), you are
not going to have problems with deadlocks. I think that the only lock
held during the call of get_block() is the lock on the page to be
read/written (but I'm not sure on this issue).

Honza
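For a filesystem without holes, the first phase of Honza's suggestion
boils down to working out how many blocks must exist (and later be zeroed)
before the far-beyond-EOF write can proceed. A minimal helper for just
that arithmetic (the function name is made up; this is a sketch, not
anything in the kernel):

```c
#include <assert.h>

/* Number of new blocks a hole-less filesystem must allocate when a
 * write at byte offset `pos` extends a file of size `old_size` with
 * block size `bs`: every block from the old last block up to and
 * including the block containing `pos` must exist and be cleared. */
static long blocks_to_allocate(long old_size, long pos, long bs)
{
	long old_blocks = (old_size + bs - 1) / bs;	/* blocks already present */
	long new_blocks = pos / bs + 1;			/* blocks needed up to pos */

	return new_blocks > old_blocks ? new_blocks - old_blocks : 0;
}
```

Allocating all of these up front, and only then pushing the zeroed data
through the page-cache write path, is what keeps get_block() itself from
having to call back into the page cache while a page lock is held.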