cont_prepare_write fix
Hi,

While trying to get affs working again, I noticed that the current
cont_prepare_write is hardly working: if you write to a file in steps of
1024 bytes, the bytes pointer is never updated, resulting in an endless
loop. Below is a fixed version, which also removes the bogus tricks played
on the bytes ptr. I also "fixed" the comment - any reason we have to be
insulting like this? Anyway, does anyone see a problem with the fix before
I send it to Linus?

bye, Roman

Index: fs/buffer.c
===================================================================
RCS file: /dump/cvs/linux-2.4/fs/buffer.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 buffer.c
--- fs/buffer.c	2000/07/17 10:46:49	1.1.1.3
+++ fs/buffer.c	2000/07/24 11:45:54
@@ -1585,7 +1585,7 @@
 }
 
 /*
- * For moronic filesystems that do not allow holes in file.
+ * For filesystems that do not allow holes in file.
  * We may have to extend the file.
  */
 
@@ -1600,7 +1600,15 @@
 	unsigned blocksize = inode->i_sb->s_blocksize;
 	char *kaddr;
 
-	while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
+restart:
+	zerofrom = offset;
+	pgpos = *bytes;
+	if (page->index < (pgpos >> PAGE_CACHE_SHIFT))
+		goto skip_extend;
+
+	zerofrom = pgpos & ~PAGE_CACHE_MASK;
+	pgpos >>= PAGE_CACHE_SHIFT;
+	while (page->index > pgpos) {
 		status = -ENOMEM;
 		new_page = grab_cache_page(mapping, pgpos);
 		if (!new_page)
@@ -1609,12 +1617,7 @@
 		if (*bytes >> PAGE_CACHE_SHIFT != pgpos) {
 			UnlockPage(new_page);
 			page_cache_release(new_page);
-			continue;
-		}
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
-		if (zerofrom & (blocksize-1)) {
-			*bytes |= (blocksize-1);
-			(*bytes)++;
+			goto restart;
 		}
 		status = __block_prepare_write(inode, new_page, zerofrom,
 						PAGE_CACHE_SIZE, get_block);
@@ -1626,35 +1629,26 @@
 		kunmap(new_page);
 		UnlockPage(new_page);
 		page_cache_release(new_page);
+		zerofrom = 0;
+		*bytes = ++pgpos << PAGE_CACHE_SHIFT;
 	}
 
-	if (page->index < pgpos) {
-		/* completely inside the area */
-		zerofrom = offset;
-	} else {
-		/* page covers the boundary, find the boundary offset */
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
-
-		/* if we will expand the thing last block will be filled */
-		if (to > zerofrom && (zerofrom & (blocksize-1))) {
-			*bytes |= (blocksize-1);
-			(*bytes)++;
-		}
+	pgpos <<= PAGE_CACHE_SHIFT;
+	if (zerofrom < to)
+		*bytes = pgpos + to;
 
-		/* starting below the boundary? Nothing to zero out */
-		if (offset <= zerofrom)
-			zerofrom = offset;
-	}
+skip_extend:
 	status = __block_prepare_write(inode, page, zerofrom, to, get_block);
-	if (status)
-		goto out1;
-	kaddr = (char*)page_address(page);
-	if (zerofrom < offset) {
-		memset(kaddr+zerofrom, 0, offset-zerofrom);
-		__block_commit_write(inode, page, zerofrom, offset);
+	if (!status) {
+		if (zerofrom < offset) {
+			kaddr = (char*)page_address(page);
+			memset(kaddr+zerofrom, 0, offset-zerofrom);
+			__block_commit_write(inode, page, zerofrom, offset);
+			*bytes = pgpos + to;
+		}
+		return 0;
 	}
-	return 0;
-out1:
+	ClearPageUptodate(page);
 	kunmap(page);
 	return status;
Re: cont_prepare_write fix
On Mon, 24 Jul 2000, Roman Zippel wrote:

> Hi,
>
> While trying to get affs working again, I noticed that the current
> cont_prepare_write is hardly working: if you write to a file in steps of
> 1024 bytes, the bytes pointer is never updated, resulting in an endless
> loop.

Testcase, please? It certainly seems to be working on FAT and HPFS - both
are using the same thing.

> Below is a fixed version, which also removes the bogus tricks played on
> the bytes ptr. I also "fixed" the comment - any reason we have to be
> insulting like this?

Sheer frustration from the braindead design? _Nothing_ can insult persons
responsible for AFFS and FAT "designs". Sorry.

> Anyway, does anyone see a problem with the fix before I send it to
> Linus?

Yes. In particular, adding the whole PAGE_CACHE_SIZE to *bytes is
_definitely_ wrong. The whole "round up to the block boundary" thing is
there for a purpose - in their native environment all of these filesystems
do not care to clean the tail of the last allocated block, so we have to
do it for them. And we don't want to repeat such cleaning on _every_
write() - the first one should normalize the situation.

If you have problems with affs_get_block() - feel free to ask, I probably
can help. What _is_ the problem? Notice that you can assume that
affs_get_block() will only have to create blocks immediately past the
end-of-file.

Please check fat_get_block() or hpfs_get_block() - they seem to work. I
don't claim that I didn't fsck up there, but I would certainly like to
hear more before I decide that I did. If you need a description of the
cont_... design - just ask.
Re: cont_prepare_write fix
Hi,

> > While trying to get affs working again, I noticed that the current
> > cont_prepare_write is hardly working: if you write to a file in steps
> > of 1024 bytes, the bytes pointer is never updated, resulting in an
> > endless loop.
>
> Testcase, please? It certainly seems to be working on FAT and HPFS -
> both are using the same thing.

Ok, now I see it: they are updating it themselves. Why isn't that done by
the vfs?

> > Below is a fixed version, which also removes the bogus tricks played
> > on the bytes ptr. I also "fixed" the comment - any reason we have to
> > be insulting like this?
>
> Sheer frustration from the braindead design? _Nothing_ can insult
> persons responsible for AFFS and FAT "designs". Sorry.

$ grep cont_prepare_write fs/*/*.c
fs/adfs/inode.c:	prepare_write:	cont_prepare_write,
fs/affs/file.c:		prepare_write:	cont_prepare_write,
fs/hfs/inode.c:		prepare_write:	cont_prepare_write,
fs/hpfs/file.c:		prepare_write:	cont_prepare_write,
fs/ntfs/fs.c:		prepare_write:	cont_prepare_write,
fs/qnx4/inode.c:	prepare_write:	cont_prepare_write,

Just because a fs wasn't designed for UNIX doesn't mean it's shit. Sorry,
for me it's a simple matter of respect. A fs isn't "moronic" only because
it doesn't support holes; the real design issues are somewhere else.

Anyway, as I only see mmu_private updates in fat and hpfs, it looks like
adfs, hfs, ntfs and qnx4 are broken as well. BTW, wouldn't it be better to
introduce an i_realsize field for that? It could be useful for other fs
later as well, e.g. it could be used to fix the read/truncate race without
the BKL (i_realsize only grows while the file is open, and truncate can be
delayed until the file is closed).

> Yes. In particular, adding the whole PAGE_CACHE_SIZE to *bytes is
> _definitely_ wrong. The whole "round up to the block boundary" thing is
> there for a purpose - in their native environment all of these
> filesystems do not care to clean the tail of the last allocated block,
> so we have to do it for them. And we don't want to repeat such cleaning
> on _every_ write() - the first one should normalize the situation.

Rounding it up to the block size isn't a problem; the question is why we
give the bytes ptr to cont_prepare_write at all, if get_block has to
update it in the first place. That's probably the point that confused me
here.

> If you have problems with affs_get_block() - feel free to ask, I
> probably can help. What _is_ the problem? Notice that you can assume
> that affs_get_block() will only have to create blocks immediately past
> the end-of-file.

Yes, the whole get_block/getblock stuff in affs needs a real cleanup
because of this.

bye, Roman
Re: cont_prepare_write fix
On Mon, 24 Jul 2000, Roman Zippel wrote:

> Ok, now I see it: they are updating it themselves. Why isn't that done
> by the vfs?

Because the actual _allocation_ unit may have nothing to do with the
blocksize and be invisible to VFS. Check fatfs - it uses the thing to
trigger allocation, but the actual zero-out goes on a per-sector basis.

> > Sheer frustration from the braindead design? _Nothing_ can insult
> > persons responsible for AFFS and FAT "designs". Sorry.
>
> $ grep cont_prepare_write fs/*/*.c
> fs/adfs/inode.c:	prepare_write:	cont_prepare_write,

Not on this box. What version are you using?

> fs/affs/file.c:	prepare_write:	cont_prepare_write,
> fs/hfs/inode.c:	prepare_write:	cont_prepare_write,

Ditto.

> fs/hpfs/file.c:	prepare_write:	cont_prepare_write,

Ditto.

> fs/ntfs/fs.c:		prepare_write:	cont_prepare_write,

Ditto.

> fs/qnx4/inode.c:	prepare_write:	cont_prepare_write,

Ditto.

They are _calling_ the thing, but they would simply refuse to compile if
they had such lines - different prototypes. BTW, when that comment went in
it was FAT, HPFS and AFFS.

> Just because a fs wasn't designed for UNIX doesn't mean it's shit.
> Sorry, for me it's a simple matter of respect. A fs isn't "moronic" only
> because it doesn't support holes; the real design issues are somewhere
> else.

And you _bet_ that they are. No, lack of holes per se is not horrible. But
there is a _lot_ of things about AFFS and FAT (and to some extent HPFS)
design that are past "horrible" - "Lovecraftian" would be more accurate.
But that's another rant...

> Anyway, as I only see mmu_private updates in fat and hpfs, it looks like
> adfs, hfs, ntfs and qnx4 are broken as well.

From adfs_get_block():
	...
	/* don't support allocation of blocks yet */
	return -EIO;

NTFS is not just broken, it eats filesystems in rw mode. QNX4 does not
allocate blocks ("// to be done" says the driver). HFS refuses to allocate
anything. Probably broken...

> BTW, wouldn't it be better to introduce an i_realsize field for that? It
> could be useful for other fs later as well, e.g. it could be used to fix
> the read/truncate race without the BKL (i_realsize only grows while the
> file is open, and truncate can be delayed until the file is closed).

Not needed. That race is taken care of: readpage() never accesses past
i_size and has the page locked, while vmtruncate() starts with setting
i_size (on a shrinking truncate(), that is), then goes through all pages
in the affected area and grabs the lock on all of them (one by one, not
simultaneously, indeed) and only then calls ->truncate(). So a readpage()
in the affected area that began before we've set ->i_size will be
completed prior to the call of ->truncate(), and any attempt after setting
->i_size will see the new ->i_size and will not bother ->truncate() at
all. _That_ one is OK. There are very scary dragons around ->truncate(),
but a) they are unrelated to that b) knowledge of realsize won't help
c) let's take that to email.

> Rounding it up to the block size isn't a problem; the question is why we
> give the bytes ptr to cont_prepare_write at all, if get_block has to
> update it in the first place. That's probably the point that confused me
> here.

cont_prepare_write() has to zero the tail of the last block out ;-)

> Yes, the whole get_block/getblock stuff in affs needs a real cleanup
> because of this.

Well, it will get simpler - all the logics of growing the chain shrinks a
lot, AFAICS...
Re: cont_prepare_write fix
On Mon, 24 Jul 2000, Alexander Viro wrote:

> Not needed. That race is taken care of: readpage() never accesses past
> i_size and has the page locked, while vmtruncate() starts with setting
> i_size (on a shrinking truncate(), that is), then goes through all pages
> in the affected area and grabs the lock on all of them (one by one, not
> simultaneously, indeed) and only then calls ->truncate().

That assumes that i_size is atomically updated, which currently is not the
case on 32 bit platforms.

		-ben
Re: cont_prepare_write fix
Hi,

> Because the actual _allocation_ unit may have nothing to do with the
> blocksize and be invisible to VFS. Check fatfs - it uses the thing to
> trigger allocation, but the actual zero-out goes on a per-sector basis.

I tried vfat on an MO drive with 1KByte sectors - it doesn't work. I also
don't know how you want to do it: a fs with a block size smaller than the
sector size can't possibly write directly into the page cache, and I don't
see anywhere where that is taken care of.

> > fs/adfs/inode.c:	prepare_write:	cont_prepare_write,
>
> Not on this box. What version are you using?
>
> They are _calling_ the thing, but they would simply refuse to compile if
> they had such lines - different prototypes.

Oops, sorry, wrong source tree, it's a 2.4 with some other changes. :)
What I tried was to change the prototypes so the small caller routines in
_every_ fs can be removed - seems to work, except that cont_prepare_write
needs that extra argument (another reason I'd like to move it to
i_realsize). It seems to work; is there any reason not to do this?

> Not needed. That race is taken care of: readpage() never accesses past
> i_size and has the page locked, while vmtruncate() starts with setting
> i_size (on a shrinking truncate(), that is), then goes through all pages
> in the affected area and grabs the lock on all of them (one by one, not
> simultaneously, indeed) and only then calls ->truncate().

Hmm, then I don't understand the BKL in the get_block functions.

> > Rounding it up to the block size isn't a problem; the question is why
> > we give the bytes ptr to cont_prepare_write at all, if get_block has
> > to update it in the first place. That's probably the point that
> > confused me here.
>
> cont_prepare_write() has to zero the tail of the last block out ;-)

IMO we could simplify that if we did the preallocation business in the
page cache. We only need to support a dummy page structure in the page
cache to delay the allocation of the real page (we need that anyway if we
want to do direct io into/from e.g. pci space), but that's more something
for 2.5 - or is there any reason why it shouldn't work?

bye, Roman
Re: cont_prepare_write fix
On Mon, Jul 24, 2000 at 12:04:39PM -0400, Benjamin C.R. LaHaise wrote:
> On Mon, 24 Jul 2000, Alexander Viro wrote:
> > Not needed. That race is taken care of: readpage() never accesses past
> > i_size and has the page locked, while vmtruncate() starts with setting
> > i_size (on a shrinking truncate(), that is), then goes through all
> > pages in the affected area and grabs the lock on all of them (one by
> > one, not simultaneously, indeed) and only then calls ->truncate().
>
> That assumes that i_size is atomically updated, which currently is not
> the case on 32 bit platforms.
>
> 		-ben

I found that i_size is in most cases reasonably protected by i_sem.
Except for Coda, where inode->i_size != inode->i_mapping->host->i_size,
because we `borrow' the mapping of the underlying container file.

At the moment I believe that the introduction of i_mapping->host might
have been a mistake. The readpage/writepage operations use
i_mapping->host->i_size, but sys_stat and generic_file_write (for a file
with O_APPEND) are using inode->i_size. So for Coda we are copying the
i_size back to the Coda inode in a wrapper around generic_file_write.

AFAIK Coda is the only FS that has an address space mapping associated
with another inode, and we need to lock the i_sem of that inode to avoid
problems. Maybe the Coda file write should become something like:

coda_file_write()
{
	down(&container_inode->i_sem);
	n = generic_file_write(file, ...);
	container_inode->i_size = file->f_dentry->d_inode->i_size;
	up(&container_inode->i_sem);
	return n;
}

That way all writes are protected by the semaphore on the container
file's inode, and the use of i_mapping->host, along with the nasty cast
to (struct inode *), could be removed.

Jan
Re: cont_prepare_write fix
On Mon, Jul 24, 2000 at 02:24:42PM -0400, Jan Harkes wrote:
> I found that i_size is in most cases reasonably protected by i_sem.
> Except for Coda, where inode->i_size != inode->i_mapping->host->i_size,
> because we `borrow' the mapping of the underlying container file.
> [...]
> That way all writes are protected by the semaphore on the container
> file's inode, and the use of i_mapping->host, along with the nasty cast
> to (struct inode *), could be removed.

This might work for Coda, but I think in general there are a couple of
problems with always using the struct file to get at the inode structure.
IIRC either writepage or prepare_write is called with file == NULL in some
circumstances (when creating a symlink, I think). I discovered this when I
got oopsen in my patch which adds space/inode/dentry usage limits to
ramfs.

There's another problem too: for ramfs limits, I obviously also need to
keep track of when a page is removed from an inode. The only sane way I've
seen to do this is to add a hook to address_space_operations that is
called when the page is removed (currently I'm calling it from
remove_page_from_inode_queue(), but that may not be the best place). At
this point only the struct page is available, so from that I need to find
the inode (well, actually inode->i_sb, where I store the limits). Easy
with mapping->host, otherwise impossible (AFAICT; it's entirely possible
I'm missing something blindingly obvious).

-- 
David Gibson, Technical Support Engineer, Linuxcare, Inc.
+61 2 6262 8990
[EMAIL PROTECTED], http://www.linuxcare.com/
Linuxcare. Support for the revolution.