Re: The INN/mmap bug
Daniel Phillips wrote: > Alexander Viro wrote: > > On Tue, 19 Sep 2000, Daniel Phillips wrote: > > > > > !Mapped, !Uptodate, Dirty: pending map and write > > > > Wrong. It's an instant BUG at line 711 in ll_rw_blk.c - remember these > > reports? Yes, correct, this state should not escape from prepare_write, though that's only a limitation of the current code. Here's the corrected, corrected table: Mapped, Uptodate, Dirty: pending write !Mapped, Uptodate, Dirty: not possible Mapped, !Uptodate, Dirty: not possible !Mapped, !Uptodate, Dirty: not possible Mapped, Uptodate, !Dirty: regular block !Mapped, Uptodate, !Dirty: hole of zeroes Mapped, !Uptodate, !Dirty: unread !Mapped, !Uptodate, !Dirty: pending map And here's the punchline: why not have 5 states instead of 3 bits and set the state with a single function call, instead of the mutiple bit sets we see now. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
Alexander Viro wrote: > > On Tue, 19 Sep 2000, Daniel Phillips wrote: > > > The more I think about it the less clear and ambiguous I find it. > > When you add the dirty bit into the pot you get: > > > >Mapped, Uptodate, Dirty: not possible > > Sure, it is possible - that's how the write happens > > > !Mapped, Uptodate, Dirty: not possible > >Mapped, !Uptodate, Dirty: pending write > > s/pending write/obvious bug/, damnit. Daniel, just think for a second: we > have the buffer read to be picked by bdflush and written to disk, while > the _contents_ _is_ _not_ _uptodate_. Just what can you expect when write > succeeds? Junk on disk, right? You know, GIGO queue - garbage in, garbage > out... > > > !Mapped, !Uptodate, Dirty: pending map and write > > Wrong. It's an instant BUG at line 711 in ll_rw_blk.c - remember these > reports? It wasn't a conceptual error, it was a typo: here's the corrected table with the wrongly tagged states you noticed interchanged: Mapped, Uptodate, Dirty: pending write !Mapped, Uptodate, Dirty: not possible Mapped, !Uptodate, Dirty: not possible !Mapped, !Uptodate, Dirty: pending map and write Mapped, Uptodate, !Dirty: regular block !Mapped, Uptodate, !Dirty: hole of zeroes Mapped, !Uptodate, !Dirty: unread !Mapped, !Uptodate, !Dirty: pending map > Sure, the dirty bit is not orthogonal to the rest. You don't need to do > any complex analysis - it's as simple as > * if I don't know _where_ to write the data - I'ld better not feed > the request to ll_rw_block(), or it may get PO'd > * if I know that data is junk - I don't want it hitting the disk. > > Dirty bit == request fed into the funnel and can be on disk any moment now. > Locked == already in IO subsystem. > > That's it - completely independent from the rest, except that you don't > want the whole write mechanism applied to non-uptodate or non-mapped > pieces. Right, two obvious bugs, which is the same thing as saying two uneeded states. Now, I'm just trying to be tidy and fit this all into a nice regular model that I can represent with state transition diagrams. I don't know for sure why it's good to do that, but it seems good. I like to imagine that if I could just get them all down on paper in a regular form some possible optimizations would just jump right out at me. Mind you, this is not necessary for the filesystem work I'm doing, this is more like a side interest. I get along just fine with the existing mechanism. > Think about IO as a memory bus - cache controller deals with writing the > cache line to RAM, but you don't want it to try that on lines that don't > have PA already calculated or have invalid contents. "mark dirty" == point > the write-behind mechanism to it and let it decide when the thing must be > written. I'd also like it to be able to decide on its own when to map the block. If I'm fully replacing the contents of a buffer on a page I would like to be able to just mark the buffer dirty without mapping it and let bdflush map it later. Right now it doesn't work because bdflush tries to feed a null block to ll_rw_block, but a small change would fix that: the flush daemon just has to notice the buffer is on a page, then it can call get_block to map it. Does this accomplish anything useful? I *think* so but I'm not sure. It seems to me that if you handled this properly you could have, for example, a temporary file written, read back and deleted, all without ever touching the disk, or even going into get_block to check for mappings, let alone fussing around with the allocation bitmaps. I'm also trying to determine if a 'don't know' mapping state would be useful for optimizing certain I/O paths. > Think how to make the cache indexed by virtual address (not by > physical, as in case of x86) work correctly. That's what pagecache is - > software MMU with cache-by-VA architecture. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Tue, 19 Sep 2000, Daniel Phillips wrote: > The more I think about it the less clear and ambiguous I find it. > When you add the dirty bit into the pot you get: > >Mapped, Uptodate, Dirty: not possible Sure, it is possible - that's how the write happens > !Mapped, Uptodate, Dirty: not possible >Mapped, !Uptodate, Dirty: pending write s/pending write/obvious bug/, damnit. Daniel, just think for a second: we have the buffer read to be picked by bdflush and written to disk, while the _contents_ _is_ _not_ _uptodate_. Just what can you expect when write succeeds? Junk on disk, right? You know, GIGO queue - garbage in, garbage out... > !Mapped, !Uptodate, Dirty: pending map and write Wrong. It's an instant BUG at line 711 in ll_rw_blk.c - remember these reports? >Mapped, Uptodate, !Dirty: regular block > !Mapped, Uptodate, !Dirty: hole of zeroes >Mapped, !Uptodate, !Dirty: unread > !Mapped, !Uptodate, !Dirty: pending map > Those two 'not possible' states bother me, they show that the three > bits are not orthogonal. We can make arbitrary assignments of meaning > to them as you did with !Mapped, !Uptodate (pending map) but in that > case why not just enumerate all the states we need and lose the bit > assignments? I think the deep problem here is trying to look at this > as a bit-flipping problem instead of a state-transition problem. Sure, the dirty bit is not orthogonal to the rest. You don't need to do any complex analysis - it's as simple as * if I don't know _where_ to write the data - I'ld better not feed the request to ll_rw_block(), or it may get PO'd * if I know that data is junk - I don't want it hitting the disk. Dirty bit == request fed into the funnel and can be on disk any moment now. Locked == already in IO subsystem. That's it - completely independent from the rest, except that you don't want the whole write mechanism applied to non-uptodate or non-mapped pieces. Think about IO as a memory bus - cache controller deals with writing the cache line to RAM, but you don't want it to try that on lines that don't have PA already calculated or have invalid contents. "mark dirty" == point the write-behind mechanism to it and let it decide when the thing must be written. Think how to make the cache indexed by virtual address (not by physical, as in case of x86) work correctly. That's what pagecache is - software MMU with cache-by-VA architecture. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
Linus Torvalds wrote: > > On Mon, 18 Sep 2000, Alexander Viro wrote: > > * we have several bh state components and the thing is a big, > > fscking mess. If we look at the areas outside of the page lock we have: > > It's not a mess at all. > > I don't see why you call things "first stage" etc. It's perfectly > straightforward. There are two bits that matter: > - buffer uptodate > - buffer mapped. > > And the state is very clear and unambiguous: > > Mapped, Uptodate: regular block > !Mapped, Uptodate: hole of zeroes > Mapped, !Uptodate: unread > !Mapped, !Uptodate: "pending map" > > No "several states" at all. The more I think about it the less clear and ambiguous I find it. When you add the dirty bit into the pot you get: Mapped, Uptodate, Dirty: not possible !Mapped, Uptodate, Dirty: not possible Mapped, !Uptodate, Dirty: pending write !Mapped, !Uptodate, Dirty: pending map and write Mapped, Uptodate, !Dirty: regular block !Mapped, Uptodate, !Dirty: hole of zeroes Mapped, !Uptodate, !Dirty: unread !Mapped, !Uptodate, !Dirty: pending map Those two 'not possible' states bother me, they show that the three bits are not orthogonal. We can make arbitrary assignments of meaning to them as you did with !Mapped, !Uptodate (pending map) but in that case why not just enumerate all the states we need and lose the bit assignments? I think the deep problem here is trying to look at this as a bit-flipping problem instead of a state-transition problem. I'm not saying that the current design can't be made to work. It can, it's practically working now - it's just confusing as Al said, and there are reasons for that. It gets more confusing when you add in the less-than-simple relationship between buffers and pages and then try to make sense of the combination of the two sets of states. The motivation for making this less confusing is to be able to think about optimization possibilities instead of just making it work. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Tue, 19 Sep 2000, Rik van Riel wrote: > IMHO it would be really nice if this problem was solved > on the /PAGE/ level, so it will work for non-buffer > filesystems as well ;) It would be nice if we separated the buffer-cache storage pages from the pagecache buffer-using ones and from the swap ones - logics is seriously different and using the same code for all of them... Not good. We can distinguish between them just by looking at ->mapping, AFAICS. BTW, I suspect that combination of partial block_flushpage() with block_truncate_page() should be address_space method (different for NFS, etc., indeed). Then truncate_inode_pages() would become much simpler ("get rid of pages that went off-limits" rather than "... and do some part of the work on the partial page". Oh, and block_flushpage() would be simpler that way. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Alexander Viro wrote: > On Sun, 17 Sep 2000, Linus Torvalds wrote: > > > And as we've seen, simplifying this area would not necessarily be a bad > > thing ;) > > > > Right now I'll just do the minimal fix, though, I think. > > Fine with me. I'll do the full analysis tonight, anyway, and try > to write down the rules for that stuff. One thing that makes me > seriously uneasy is the fact that VM knows about ->buffers - This doesn't make me happy either. It doesn't work for network-based filesystems, won't work with KIOBUF and will mean extra difficulties with filesystems that have write ordering constraints IMHO it would be really nice if this problem was solved on the /PAGE/ level, so it will work for non-buffer filesystems as well ;) (at least, for those problems which aren't specific to the disk-based filesystems only) > I'm not proposing it for immediate inclusion, but I don't feel > good about all this code - current rules are too complex and > rely on details of VM/buffer interaction too much for my taste. > It may be correct, but it's not obviously correct... Indeed... regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
Andreas Dilger wrote: > > Daniel writes: > > Alexander Viro wrote: > > > On Mon, 18 Sep 2000, Andreas Dilger wrote: > > > > This may actually be a problem in the future... what about shared access > > > > block devices like FCAL or a distributed filesystem? It has to be > > > > possible for pages to become non-uptodate in a sane way. > > > > > > So what the heck do you do when something modifies mmaped page when you > > > get the change of on-disk one? Say it, writer is notified that write had > > > been completed, sends packet to you and you flip a bit on a page that > > > happens to be mmaped on the place where write had happened. > > > > > > Write-through-pagecache is OK, but write straight to disk bypassing the > > > cache? Welcome to the fun with aliases... > > > > Yes, I think that's one rule we can write down right now: to update a > > block on disk you have to go through the buffer. Not going through > > the buffer is about the same as accessing a semaphore-protected > > resource without bothering with the semaphore. > > What I'm getting at is that the local system didn't make the change at all, > so there is _no way_ to make the write go through the local cache. > The write to disk happens on a remote system. The only thing that happens > on the local system is that it gets a message that a buffer is invalid. > > You don't want to have to re-send each buffer to each system that ever > read it. It is much better to simply invalidate the cache, and only if > the client accesses it again will it be re-read. > > It should be possible to mark a page non-uptodate, so that the next time > it is accessed locally it will re-read the _new_ data from disk. Think > of the fs "revalidate" method. Yes, this is a cache coherency protocol. I should have said you have to go through the *buffer head*, it's what I meant. You can't just invalidate the buffer any time you want, you have to wait until there are no readers/writers first. Ooh, my head hurts, I was just trying to write a simple filesystem extension and now I'm getting sucked into distributed filesystems... *** goes back to hunting down tailmerge races -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
Linus Torvalds wrote: > > On Mon, 18 Sep 2000, Alexander Viro wrote: > > That's what makes me unhappy about the current situation + obvious > > fixes. It works, but the proof is... well, not pretty. > > Oh, agreed. I think we should clean up the code. I looked at it yesterday, > and it didn't look all that horribly bad, but I lost interest It is horribly bad, just not unmanageably horribly bad. > I don't know if it is worth doing before 2.4.x, as the current code certainly > should work with the small changes already proposed. Emphatically agreed. The buffer/cache states are in need of repair, that's pretty obvious. Getting them to where they are seen to be correct will take some time and will hit a lot of code. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
Daniel writes: > Alexander Viro wrote: > > On Mon, 18 Sep 2000, Andreas Dilger wrote: > > > This may actually be a problem in the future... what about shared access > > > block devices like FCAL or a distributed filesystem? It has to be > > > possible for pages to become non-uptodate in a sane way. > > > > So what the heck do you do when something modifies mmaped page when you > > get the change of on-disk one? Say it, writer is notified that write had > > been completed, sends packet to you and you flip a bit on a page that > > happens to be mmaped on the place where write had happened. > > > > Write-through-pagecache is OK, but write straight to disk bypassing the > > cache? Welcome to the fun with aliases... > > Yes, I think that's one rule we can write down right now: to update a > block on disk you have to go through the buffer. Not going through > the buffer is about the same as accessing a semaphore-protected > resource without bothering with the semaphore. What I'm getting at is that the local system didn't make the change at all, so there is _no way_ to make the write go through the local cache. The write to disk happens on a remote system. The only thing that happens on the local system is that it gets a message that a buffer is invalid. You don't want to have to re-send each buffer to each system that ever read it. It is much better to simply invalidate the cache, and only if the client accesses it again will it be re-read. It should be possible to mark a page non-uptodate, so that the next time it is accessed locally it will re-read the _new_ data from disk. Think of the fs "revalidate" method. Cheers, Andreas -- Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto, \ would they cancel out, leaving him still hungry?" http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
Alexander Viro wrote: > > On Mon, 18 Sep 2000, Andreas Dilger wrote: > > > Alexander writes: > > > * uptodate pages should never become non-uptodate. > > > uptodate .. pages ... never have data _older_ than on disk > > > > This may actually be a problem in the future... what about shared access > > block devices like FCAL or a distributed filesystem? It has to be > > possible for pages to become non-uptodate in a sane way. > > So what the heck do you do when something modifies mmaped page when you > get the change of on-disk one? Say it, writer is notified that write had > been completed, sends packet to you and you flip a bit on a page that > happens to be mmaped on the place where write had happened. > > Write-through-pagecache is OK, but write straight to disk bypassing the > cache? Welcome to the fun with aliases... Yes, I think that's one rule we can write down right now: to update a block on disk you have to go through the buffer. Not going through the buffer is about the same as accessing a semaphore-protected resource without bothering with the semaphore. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Alexander Viro wrote: > > > > but should instead be > > > > static int make_buffer_uptodate(struct page *page, struct buffer_head * bh) > > { > > if (Page_Uptodate(page)) { > ^^^ > > set_bit(BH_Uptodate, &bh->b_state); > > return; > > } > > ll_rw_block(READ, 1, &bh); > > } > > > Forget about your "stage 1" and "stage 2" complications. They shouldn't > ^^ > > exist. > > Erm... No, but that's just because you _think_ about the problem the wrong way around. If you think about it like the above, it's not a "stage" at all. It's just part of getting the buffer to be up-to-date. Very logical, very simple, no "fscking mess" at all. That's my argument. You're trying to make it a problem. It's not. It's something new with the page cache, yes - the fact that the page cache drives the logic means that the buffer "uptodate" logic is different, but if you think about it some, you'll realize that that's actually just a natural outgrowth of the fact that the buffer cache is slaved to the page cache these days. It's not a "stage 2". It's a basic fact of life - we don't care _how_ the page got to be up-to-date, because for all we know there are other ways to get it to be up-to-date than your "stage 1". For example, a recvfile() implementation could mark the page up-to-date by virtue of getting the information off the network. No "stage 1" or "stage 2" there at all: the make_buffer_uptodate() logic does _not_ mean that we lost the buffers from "stage 1", it might equally well mean that we got the page up-to-date through some other means. Just change how you think about the problem, and suddenly it's not a mess, it's a design. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Re: The INN/mmap bug
That's to fs/buffer.c, generic_file_write() part will go separately (it's a separate bug actually - if copy_from_user() in generic_file_write() fails we mark page as not-uptodate; if memory pressure will drop the buffer_heads attempts to read will blow the page contents away). Linus, does this look OK with you? --- buffer.cMon Sep 18 18:56:29 2000 +++ buffer.c.newMon Sep 18 19:11:06 2000 @@ -1495,6 +1495,8 @@ err = get_block(inode, block, bh, 1); if (err) goto out; + if (Page_Uptodate(page)) + mark_buffer_uptodate(bh, 1); if (buffer_new(bh)) { unmap_underlying_metadata(bh); if (block_end > to) @@ -1600,8 +1602,10 @@ continue; if (!buffer_mapped(bh)) { - if (iblock < lblock) - get_block(inode, iblock, bh, 0); + if (iblock < lblock) { + if (get_block(inode, iblock, bh, 0)) + continue; + } if (!buffer_mapped(bh)) { if (!kaddr) kaddr = kmap(page); @@ -1789,7 +1793,11 @@ /* Hole? Nothing to do */ if (buffer_uptodate(bh)) goto unlock; - get_block(inode, iblock, bh, 0); + err = get_block(inode, iblock, bh, 0); + if (err) + goto unlock; + if (Page_Uptodate(page)) + mark_buffer_uptodate(bh, 1); /* Still unmapped? Nothing to do */ if (!buffer_mapped(bh)) goto unlock; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Alexander Viro wrote: > Except that the only reason why check for page being uptodate is correct > is that for such pages we are guaranteed that buffer ring will not be > dropped when page contatins data newer than on disk. Ugh. Translation: The only reason why the simple check for page being uptodate is enough to avoid killing data with bogus rereads is that for pages that are NOT uptodate we are guaranteed that buffer ring will not be dropped when data in page is newer than on disk. Sorry. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Linus Torvalds wrote: > It's not a mess at all. > > I don't see why you call things "first stage" etc. It's perfectly > straightforward. There are two bits that matter: > - buffer uptodate > - buffer mapped. first stage == page still not uptodate. second stage == page is already uptodate and if it reverts to not-uptade state - it's a bug. Why? Because then all nice logics about "no read requests for mapped pages" goes to hell. We have such bug, BTW (in generic_file_write()). > And the state is very clear and unambiguous: > > Mapped, Uptodate: regular block > !Mapped, Uptodate: hole of zeroes > Mapped, !Uptodate: unread > !Mapped, !Uptodate: "pending map" > > No "several states" at all. Except that the only reason why check for page being uptodate is correct is that for such pages we are guaranteed that buffer ring will not be dropped when page contatins data newer than on disk. Yes, it works. But I'ld really prefer to avoid these extra reads at all - if the rule is that buffer is read only once during the whole life of page the life becomes somewhat easier to explain. Currently (even with the fixes discussed) we _do_ reread them. Yes, it's harmless (see above). > It so happens that we forgot an important part of the "read a buffer" > equation. The "read a buffer" function is NOT just > > static int make_buffer_uptodate(struct buffer_head * bh) > { > ll_rw_block(READ, 1, &bh); > } > > but should instead be > > static int make_buffer_uptodate(struct page *page, struct buffer_head * bh) > { > if (Page_Uptodate(page)) { ^^^ > set_bit(BH_Uptodate, &bh->b_state); > return; > } > ll_rw_block(READ, 1, &bh); > } > Forget about your "stage 1" and "stage 2" complications. They shouldn't ^^ > exist. Erm... > > 1st stage, uptodate, !mappedhole. Contents is all-zeroes. It may also > > be a result of failed attempt to map - we > > have no way to tell. > > The "uptodate !mapped" case is entirely clear: it's a hole. No ifs, no > buts, no nothing. > > If a map() failed, that map() will have returned an error, and if > something turns the buffer uptodate that's a BUG. Yes. In block_read_full_page(), for one thing. Oops. Sorry, I didn't mark it. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Alexander Viro wrote: > * we have several bh state components and the thing is a big, > fscking mess. If we look at the areas outside of the page lock we have: It's not a mess at all. I don't see why you call things "first stage" etc. It's perfectly straightforward. There are two bits that matter: - buffer uptodate - buffer mapped. And the state is very clear and unambiguous: Mapped, Uptodate: regular block !Mapped, Uptodate: hole of zeroes Mapped, !Uptodate: unread !Mapped, !Uptodate: "pending map" No "several states" at all. It so happens that we forgot an important part of the "read a buffer" equation. The "read a buffer" function is NOT just static int make_buffer_uptodate(struct buffer_head * bh) { ll_rw_block(READ, 1, &bh); } but should instead be static int make_buffer_uptodate(struct page *page, struct buffer_head * bh) { if (Page_Uptodate(page)) { set_bit(BH_Uptodate, &bh->b_state); return; } ll_rw_block(READ, 1, &bh); } Forget about your "stage 1" and "stage 2" complications. They shouldn't exist. > 1st stage, uptodate, !mapped hole. Contents is all-zeroes. It may also > be a result of failed attempt to map - we > have no way to tell. The "uptodate !mapped" case is entirely clear: it's a hole. No ifs, no buts, no nothing. If a map() failed, that map() will have returned an error, and if something turns the buffer uptodate that's a BUG. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
--On 09/18/00 13:19:27 -0400 Alexander Viro <[EMAIL PROTECTED]> wrote: > > > On Mon, 18 Sep 2000, Chris Mason wrote: > >> I'm not trying to put it all into a single get_block call, we have >> different get_block funcs for different purposes. What I'm really trying >> to do is squeeze into block_prepare_write, as a generic setup function >> for file modifications. > > It is not. Period. Full stop. Check cont_prepare_write() and you will see > a counterexample in the same fs/buffer.c. Yes, they are used in exactly > the same manner. > generic setup function. Not the function that does the writes, but the function that gets the page ready. No, block_prepare_write does not have to be the generic setup function, and it should not be for some filesystems, and situations. Right now, for what I need, it is usually enough (except for writepage ;-) In other words, I agree with you, even if we are saying it differently. Clearly, when we go to writing directly to the tail for file_write, we'll need a completely different prepare_write function. >Generic function for file modifications is generic_file_write(). > block_prepare_write() lives several layers below and is used only for some > of the local filesystems (ones where it fits). > >> The other reason why I convert in get_block is that is when I've got the >> most information about the internals of the file, including a path >> through the btree to the item that may or may not need converting. In >> reiserfs, the only way to find out if the last item of the file is >> packed is to search the tree for it, and the same searching code is used >> if it is a packed or an unpacked item. > >? Chris, I hope you've noticed that e.g. ext2_prepare_write() > lives in fs/ext2/inode.c, not in fs/buffer.c. So I don't see what stops > you from pulling _any_ data you want. > Sorry, not sure what you mean. When prepare_write is called, reiserfs doesn't know if it needs to convert until it has searched the internal btree. Once the search is done, 90% of prepare_write is also done, so we might as well do the conversion there (otherwise we end up searching twice). The searching is done in reiserfs_get_block, so the conversion is done in reiserfs_get_block. Yes, I can pull almost anything I need out of buffer.c. I'm trying to make it so I don't need to, just to keep down the replication (and bugs). But if it makes things too nasty, I pull the code out and make it reiserfs specific. -chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
OK, let's see. I've tried to describe what we have now (marking the bugs) + proposal that would give somewhat saner logics (in the end of posting). Comments are more than welcome. * life of the page is clearly divided in two parts - before it can be mapped to a user context and after that. * no read requests can be issued in the second stage, by that time page should be permanently up-to-date [ClearPageUptodate() in generic_file_write() is a bug] * buffer ring can be dropped only under the page lock. * on the first stage buffer ring can be dropped only if the page doesn't contain data that would be more recent than data in fs. * on the second stage buffer ring can be dropped if there is no IO scheduled or in progress [note: buffer_tied() might be handy, meaning "dirty or under IO"; i.e. when ll_rw_block() does or will need this bh] * we have several bh state components and the thing is a big, fscking mess. If we look at the areas outside of the page lock we have: 1st stage, !uptodate, !mapped contents is either the same as on disk or it's a junk. 1st stage, !uptodate, mappedfailed attempt to read. contents may be the same as the last data on disk or it may be junk 1st stage, uptodate, !mappedhole. Contents is all-zeroes. It may also be a result of failed attempt to map - we have no way to tell. 1st stage, uptodate, mapped data. Same as on-disk or newer. 2nd stage, !uptodate, !mapped contents is same or newer than on disk, mapping unknown. [Recipe for disaster, since the current code may try to read it; should be map-and-be-done-with-that] 2nd stage, !uptodate, mappedfailed attempt to read that should never happen. [Bug] 2nd stage, uptodate, !mappedhole. Contents may be nonzero due to access via mmap(). May be a result of failed attempt to map [we don't handle such errors] 2nd stage, uptodate, mapped data. Same as on disk or newer. Something should be done about that - aside of forced uptodate on the second stage we ought to flag the errors (possibly by BH_Error?) In __block_read_full_page() page being uptodate is a bug - we are going to lose data. Additionally, we have "tied by IO" state - can happen only for mapped uptodate bh. We seriously rely on the fact that on stage 1 we drop buffer ring only if the page has no data in need of write. We can't do that on stage 2 (mmap() and dirtying the page from processes). Due to the above we can afford re-reading the data on stage 1. However, I'ld rather see a bitmap (PAGE_CACHE_SIZE/blocksize bits) describing what is not up-to-date. Then we could use it for deciding what should not be read. Comments? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Chris Mason wrote: > I'm not trying to put it all into a single get_block call, we have > different get_block funcs for different purposes. What I'm really trying > to do is squeeze into block_prepare_write, as a generic setup function for > file modifications. It is not. Period. Full stop. Check cont_prepare_write() and you will see a counterexample in the same fs/buffer.c. Yes, they are used in exactly the same manner. Generic function for file modifications is generic_file_write(). block_prepare_write() lives several layers below and is used only for some of the local filesystems (ones where it fits). > The other reason why I convert in get_block is that is when I've got the > most information about the internals of the file, including a path through > the btree to the item that may or may not need converting. In reiserfs, > the only way to find out if the last item of the file is packed is to > search the tree for it, and the same searching code is used if it is a > packed or an unpacked item. ? Chris, I hope you've noticed that e.g. ext2_prepare_write() lives in fs/ext2/inode.c, not in fs/buffer.c. So I don't see what stops you from pulling _any_ data you want. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
--On 09/18/00 12:23:43 -0400 Alexander Viro <[EMAIL PROTECTED]> wrote: > On Mon, 18 Sep 2000, Chris Mason wrote: > >> When reiserfs_get_block is called on the packed data (with create == 1), >> we unpack it to a full block, so the generic functions can handle >> dirties/writes from then on. If the data in the page cache isn't up to >> date, we need to copy the packed data into the page cache so it doesn't >> get lost in the conversion. > > why bother with bh in that case? You can do all this work on the > ->writepage()/->readpage()/->prepare_write()/->commit_write() level. > For us, the conversion only needs to be done during prepare_write (readpage reads directly from the tail into the page). We chose to do it in get_block because it seemed to fit...prepare_write gives us an offset in the file, and we give prepare_write a block. Sometimes that means a tail conversion is involved, sometimes not. prepare_write just doesn't care. >> This isn't for writing directly to the tail, I'm only doing that in >> writepage, where I always overwrite the tail data with page data. > >Well, duh. Now I can see where the complaints about VFS being > ext2-friendly at expense of other filesystems come from. > Sure thing, if > you bend your code so that it would use helper functions written for > different layout model... It will hurt. The question being: what for? > These functions do not belong to VFS API - they are convenient to > implement the real thing (address_space_operations) for a class of > filesystems, but that's it. > >Look how 'no-holes' filesystems are doing that - yup, different > layout, different helper functions. It would hurt like hell if we would > try to do e.g. FAT on get_block() level. So that stuff was done on > ...page() level, originally in fs/fat/inode.c. When it became obvious that > there are other filesystems that might share the code - well, into > fs/buffer.c it went. You are in the same situation. > Yes, that's why I posted the reiserfs specific code for writepage and truncate. If we can pull a generic function out of that, and the work done for other fragment supporting filesystems, perfect. >Moreover, right now I'm testing yet another class (fragment > support). Currently it's UFS only, but ext2 also might eventually win > from that. Yup, also work on ..._page() level, doing that in > ufs_get_frag() would be a horrible PITA. > >I don't see why are you trying to squeeze everything into > get_block() - it's not even a method anymore... > I'm not trying to put it all into a single get_block call, we have different get_block funcs for different purposes. What I'm really trying to do is squeeze into block_prepare_write, as a generic setup function for file modifications. The other reason why I convert in get_block is that is when I've got the most information about the internals of the file, including a path through the btree to the item that may or may not need converting. In reiserfs, the only way to find out if the last item of the file is packed is to search the tree for it, and the same searching code is used if it is a packed or an unpacked item. In other words, even if reiserfs_prepare_write didn't use block_prepare_write, I would probably still convert in a function similar to reiserfs_get_block. -chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Chris Mason wrote: > When reiserfs_get_block is called on the packed data (with create == 1), we > unpack it to a full block, so the generic functions can handle > dirties/writes from then on. If the data in the page cache isn't up to > date, we need to copy the packed data into the page cache so it doesn't get > lost in the conversion. why bother with bh in that case? You can do all this work on the ->writepage()/->readpage()/->prepare_write()/->commit_write() level. > This isn't for writing directly to the tail, I'm only doing that in > writepage, where I always overwrite the tail data with page data. Well, duh. Now I can see where the complaints about VFS being ext2-friendly at expense of other filesystems come from. Sure thing, if you bend your code so that it would use helper functions written for different layout model... It will hurt. The question being: what for? These functions do not belong to VFS API - they are convenient to implement the real thing (address_space_operations) for a class of filesystems, but that's it. Look how 'no-holes' filesystems are doing that - yup, different layout, different helper functions. It would hurt like hell if we would try to do e.g. FAT on get_block() level. So that stuff was done on ...page() level, originally in fs/fat/inode.c. When it became obvious that there are other filesystems that might share the code - well, into fs/buffer.c it went. You are in the same situation. Moreover, right now I'm testing yet another class (fragment support). Currently it's UFS only, but ext2 also might eventually win from that. Yup, also work on ..._page() level, doing that in ufs_get_frag() would be a horrible PITA. I don't see why are you trying to squeeze everything into get_block() - it's not even a method anymore... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
--On 09/18/00 08:52:12 -0700 Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Mon, 18 Sep 2000, Chris Mason wrote: >> >> ReiserFS depends on the buffer head up to date flag being correct when it >> is sent to get_block. When unpacking the tail, we have to know if the >> packed data on disk should be copied over the data in the page. >> >> So, the above should work for us if you tested Page_Uptodate before >> testing buffer_mapped. > > Can't be done. > > A up-to-date but non-mapped buffer is a perfectly valid thing to have - > it's just a hole. > > So we can't mark the buffer uptodate before marking it mapped - that would > be a totally different state. > Ok, I can also check to see if the page is up to date during the tail conversion. Mostly I wanted to make everyone aware that nastiness like this exists, and that I haven't found a clean solution yet. -chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Chris Mason wrote: > > ReiserFS depends on the buffer head up to date flag being correct when it > is sent to get_block. When unpacking the tail, we have to know if the > packed data on disk should be copied over the data in the page. > > So, the above should work for us if you tested Page_Uptodate before testing > buffer_mapped. Can't be done. A up-to-date but non-mapped buffer is a perfectly valid thing to have - it's just a hole. So we can't mark the buffer uptodate before marking it mapped - that would be a totally different state. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
Alexander Viro wrote: > > On Sun, 17 Sep 2000, Linus Torvalds wrote: > > > On Sun, 17 Sep 2000, Alexander Viro wrote: > > > > > > Looks sane. But I really wonder if we could just do it in > > > create_page_buffers() if page is up-to-date. OTOH it would require attempt > > > to map them all. Comments? > > > > That would certainly simplify a lot. > > > > And as we've seen, simplifying this area would not necessarily be a bad > > thing ;) > > > > Right now I'll just do the minimal fix, though, I think. > > Fine with me. I'll do the full analysis tonight, anyway, and try to write > down the rules for that stuff. One thing that makes me seriously uneasy > is the fact that VM knows about ->buffers - I would be much happier if all > this stuff would be contained in fs/buffer.c. IOW, I'm not sure that > block_flushpage()/try_to_free_buffers() is a happy camper. As you know, I've begun to analyse it; Canonic buffer states and transitions http://marc.theaimsgroup.com/?l=linux-fsdevel&m=96893011609105&w=2 Since that post I've found one additional useful buffer state (*) bringing the total to 5: 0: undef - Neither data on disk or in cache known to be valid 1: inval - Data on disk known to be invalid (*) 2: known - Data on disk known to be valid 3: dirty - Data in cache known to be valid 4: clean - Data in cache known to match data on disk This is still a lot less that the current 16 states. I'm still waiting for comments on this analysis one way or the other. I think it strikes at the heart of the problem. I already used this way of thinking to find and fix a similar bug in my tailmerging code where __block_prepare_write was wrongly clearing part of a dirty buffer just because ext2_get_block reported it had created a new block. The quick fix was to suppress the buffer_new bit coming back from ext2_get_block if the buffer is dirty. A better fix would be to eliminate the buffer_new bit entirely, but of course this can wait. My contribution to this will be to update the buffer states post and if nobody thinks I'm completely off-base, do a similar one for page states. It seems to me that you and Linus have already found exactly the crevice that this bug lives in. > I'm not proposing it for immediate inclusion, but I don't feel good about > all this code - current rules are too complex and rely on details of > VM/buffer interaction too much for my taste. It may be correct, but it's > not obviously correct... Amen. -- Daniel - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
--On 09/18/00 09:16:22 -0400 Alexander Viro <[EMAIL PROTECTED]> wrote: > > > On Mon, 18 Sep 2000, Chris Mason wrote: > >> ReiserFS depends on the buffer head up to date flag being correct when it >> is sent to get_block. When unpacking the tail, we have to know if the >> packed data on disk should be copied over the data in the page. > > ??? Details, please. What the hell are you doing with buffer-heads for a > piece of data that can be completely unaligned? > When reiserfs_get_block is called on the packed data (with create == 1), we unpack it to a full block, so the generic functions can handle dirties/writes from then on. If the data in the page cache isn't up to date, we need to copy the packed data into the page cache so it doesn't get lost in the conversion. This isn't for writing directly to the tail, I'm only doing that in writepage, where I always overwrite the tail data with page data. -chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Chris Mason wrote: > ReiserFS depends on the buffer head up to date flag being correct when it > is sent to get_block. When unpacking the tail, we have to know if the > packed data on disk should be copied over the data in the page. ??? Details, please. What the hell are you doing with buffer-heads for a piece of data that can be completely unaligned? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
--On 09/17/00 20:30:29 -0700 Linus Torvalds <[EMAIL PROTECTED]> wrote: > > Basically, both "truncate()" and "write()" have this bug where they can > end up re-reading stuff from disk even though the in-memory copy is newer. > > And because write() had this bug, the bug also got into > block_write_full_page(). Not because block_write_full_page() was buggy in > itself, but because it used a buggy routine. > > And your patch fixes the corruption, not by fixing the bug, but by > avoiding the buggy routing in block_write_full_page(). > > We need to fix the real bug - otherwise anybody doing both write() and > shared mmap's to the same file is going to be bit by it later on... > > The easy fix is probably to do something like > >/* Map the buffer */ >if (!buffer_mapped(bh)) { >... >} > + /* If the page is up-to-date, so is the buffer */ > + if (Page_Uptodate(page)) > + set_bit(BH_Uptodate, &bh->b_state); > >/* Ok, now it was _truly_ not uptodate */ >if (!buffer_uptodate(bh)) >ll_rw_block(READ, 1, &bh); > > Comments? The above should fix block_write_full_page() automatically, as > well as fixing the other cases too - and actually improve performance at > the same time by getting rid of unnecessary re-reads. > ReiserFS depends on the buffer head up to date flag being correct when it is sent to get_block. When unpacking the tail, we have to know if the packed data on disk should be copied over the data in the page. So, the above should work for us if you tested Page_Uptodate before testing buffer_mapped. -chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Alexander Viro wrote: > On Sun, 17 Sep 2000, Linus Torvalds wrote: > > > At that point, block_write_full_page() would become something like > > > > unsigned long index = inode->i_size >> PAGE_CACHE_SHIFT; > > > > offset = inode->i_size & (PAGE_CACHE_SIZE-1); > > if (index > page->index) > > offset = PAGE_CACHE_SIZE; > > > > create_empty_buffers(page, 0, offset); > > for_each_buffer(page) { > > if (buffer_mapped(bh)) > > mark_buffer_dirty(bh); > > } > > > > which would certainly be simpler than the current code. Hmm. > > At that point it will become prepare_write+commit_write. Which may be a > good thing in many other respects. I wonder if we need the > ->writepage() at all - I seriosuly suspect that the reason why I didn't > kill it back when I added ->prepare_write() was that I _did_ stumble on > that bug with reads and didn't realize that. Nope. NFS does things differently in cases of write and pageout. Ouch. On the write path for local filesystems we generally ignore O_SYNC for data. And IMO that's firmly on the "must be fixed before 2.4-final" side. Possibly by sync/async flag for ->commit_write()... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Linus Torvalds wrote: > At that point, block_write_full_page() would become something like > > unsigned long index = inode->i_size >> PAGE_CACHE_SHIFT; > > offset = inode->i_size & (PAGE_CACHE_SIZE-1); > if (index > page->index) > offset = PAGE_CACHE_SIZE; > > create_empty_buffers(page, 0, offset); > for_each_buffer(page) { > if (buffer_mapped(bh)) > mark_buffer_dirty(bh); > } > > which would certainly be simpler than the current code. Hmm. At that point it will become prepare_write+commit_write. Which may be a good thing in many other respects. I wonder if we need the ->writepage() at all - I seriosuly suspect that the reason why I didn't kill it back when I added ->prepare_write() was that I _did_ stumble on that bug with reads and didn't realize that. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Linus Torvalds wrote: > On Sun, 17 Sep 2000, Alexander Viro wrote: > > > > Looks sane. But I really wonder if we could just do it in > > create_page_buffers() if page is up-to-date. OTOH it would require attempt > > to map them all. Comments? > > That would certainly simplify a lot. > > And as we've seen, simplifying this area would not necessarily be a bad > thing ;) > > Right now I'll just do the minimal fix, though, I think. Fine with me. I'll do the full analysis tonight, anyway, and try to write down the rules for that stuff. One thing that makes me seriously uneasy is the fact that VM knows about ->buffers - I would be much happier if all this stuff would be contained in fs/buffer.c. IOW, I'm not sure that block_flushpage()/try_to_free_buffers() is a happy camper. I'm not proposing it for immediate inclusion, but I don't feel good about all this code - current rules are too complex and rely on details of VM/buffer interaction too much for my taste. It may be correct, but it's not obviously correct... [mingo, Daniel and Chris added to cc for obvious reasons] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Mon, 18 Sep 2000, Alexander Viro wrote: > > We might as well do it in create_empty_buffers(), FWIW. Do you see any > case when it would mean extra work? I'm thinking about the following: > > if (page is uptodate) > do get_block(...,0) for all buffers and mark them uptodate > > in the end of buffer-ring creation. Comments? The above _will_ cause extra work, notably for the case of writing to a hole (first create_empty_buffers() will do the get_block( .., 0) and get a zero, then the writing will do a get_block( .., 1) to actually allocate the block. We could push the "create" thing into create_empty_buffers(), though. We'd give it a "begin,end" the way writing wants, and a reader (or truncate) would just pass in 0,0. Hmmm.. I like that approach - because I suspect that eventually we want to change the FS mapping function away from the current buffer-based one, and make it a page-based one to let the low-level filesystem handle the tail etc issues itself. And when we do that, we'd have to pass in that "I'm going to want to dirty this range" information anyway, and the current "create_empty_buffers()" would in fact become the new "map this page, please" function. At that point, block_write_full_page() would become something like unsigned long index = inode->i_size >> PAGE_CACHE_SHIFT; offset = inode->i_size & (PAGE_CACHE_SIZE-1); if (index > page->index) offset = PAGE_CACHE_SIZE; create_empty_buffers(page, 0, offset); for_each_buffer(page) { if (buffer_mapped(bh)) mark_buffer_dirty(bh); } which would certainly be simpler than the current code. Hmm. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Alexander Viro wrote: > > Looks sane. But I really wonder if we could just do it in > create_page_buffers() if page is up-to-date. OTOH it would require attempt > to map them all. Comments? That would certainly simplify a lot. And as we've seen, simplifying this area would not necessarily be a bad thing ;) Right now I'll just do the minimal fix, though, I think. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Linus Torvalds wrote: > Yeah. See my other mail - I don't think the patch is needed at all, > although I'm convinced that it would actually hide the problem for the > particular case of INN (zeroing out the partial block is not an issue > there, as truncate will have done it - it's really only protection against > badly behaved software and shouldn't matter to any "regular" use). True, but that way we also get the consistent rules wrt. async/sync. > > We know that ->writepage() is called and we know that data doesn't get > > to the disk. How can it happen? If we can... Oh, fsck. Linus, we could > > very well lose the buffers since the moment when page had been read. > > See what happens? We recreate the buffer ring for the page and it's > > nowhere near "everything uptodate" state. > > Yeah. See my suggested fix in the other mail. We just must not read in > buffers for pages that are already up-to-date. We might as well do it in create_empty_buffers(), FWIW. Do you see any case when it would mean extra work? I'm thinking about the following: if (page is uptodate) do get_block(...,0) for all buffers and mark them uptodate in the end of buffer-ring creation. Comments? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Linus Torvalds wrote: > The basic problem is that right now we have code that does > > if (!page->buffers) > create_empty_buffers(page, inode, inode->i_sb->s_blocksize); > ... > if (!buffer_uptodate(bh)) > ll_rw_block(READ, 1, &bh); > > And this is WRONG. > > If the whole page is up-to-date, we must NOT try to read the buffer in > from disk, because the in-memory copy is always more up-to-date. Yep. > We need to fix the real bug - otherwise anybody doing both write() and > shared mmap's to the same file is going to be bit by it later on... > > The easy fix is probably to do something like > > /* Map the buffer */ > if (!buffer_mapped(bh)) { > ... > } > + /* If the page is up-to-date, so is the buffer */ > + if (Page_Uptodate(page)) > + set_bit(BH_Uptodate, &bh->b_state); > > /* Ok, now it was _truly_ not uptodate */ > if (!buffer_uptodate(bh)) > ll_rw_block(READ, 1, &bh); > > Comments? The above should fix block_write_full_page() automatically, as > well as fixing the other cases too - and actually improve performance at > the same time by getting rid of unnecessary re-reads. > > Looks fairly simple. It only happens in __block_prepare_write() and in > block_truncate_page(), so there's just two places to fix. > > Can you se anything else wrong? Looks sane. But I really wonder if we could just do it in create_page_buffers() if page is up-to-date. OTOH it would require attempt to map them all. Comments? That way we restore the bh state if buffer ring is recreated - IMO somewhat simpler... - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Alexander Viro wrote: > > Ugh. OK, first of all, patch is _not_ correct. Doesn't zero out the end of > partial block. > Yeah. See my other mail - I don't think the patch is needed at all, although I'm convinced that it would actually hide the problem for the particular case of INN (zeroing out the partial block is not an issue there, as truncate will have done it - it's really only protection against badly behaved software and shouldn't matter to any "regular" use). > We know that ->writepage() is called and we know that data doesn't get > to the disk. How can it happen? If we can... Oh, fsck. Linus, we could > very well lose the buffers since the moment when page had been read. > See what happens? We recreate the buffer ring for the page and it's > nowhere near "everything uptodate" state. Yeah. See my suggested fix in the other mail. We just must not read in buffers for pages that are already up-to-date. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Linus Torvalds wrote: > > I bet your patch fixes the corruption, but I want to understand _why_. > Call me dense, but __block_commit_write() seems to do everything we want > done.. Ok, I may be dense, but I see the bug. And no, your patch was not the right thing either. It _will_ make that particular corruption go away, but it's a more insidious problem, and your patch only addreses part of the whole problem. The basic problem is that right now we have code that does if (!page->buffers) create_empty_buffers(page, inode, inode->i_sb->s_blocksize); ... if (!buffer_uptodate(bh)) ll_rw_block(READ, 1, &bh); And this is WRONG. If the whole page is up-to-date, we must NOT try to read the buffer in from disk, because the in-memory copy is always more up-to-date. Normally this bug shows itself only as a small performance issue - if we only use read() and write(), all changes to the page will be done "synchronously" with "page->buffers" being held, and as such the page will never have contents that are newer than the on-disk image. But whenever somebody writes to the page through a shared mapping, that is no longer true - we MUST NOT do the buffer read, because it's going to overwrite data that is newer than the disk contents. The bug again didn't show up in the trivial test-cases, because it depends on us "losing" the page buffers and having to re-create them in order to show up, and that only happens under memory pressure. Basically, both "truncate()" and "write()" have this bug where they can end up re-reading stuff from disk even though the in-memory copy is newer. And because write() had this bug, the bug also got into block_write_full_page(). Not because block_write_full_page() was buggy in itself, but because it used a buggy routine. And your patch fixes the corruption, not by fixing the bug, but by avoiding the buggy routing in block_write_full_page(). We need to fix the real bug - otherwise anybody doing both write() and shared mmap's to the same file is going to be bit by it later on... The easy fix is probably to do something like /* Map the buffer */ if (!buffer_mapped(bh)) { ... } + /* If the page is up-to-date, so is the buffer */ + if (Page_Uptodate(page)) + set_bit(BH_Uptodate, &bh->b_state); /* Ok, now it was _truly_ not uptodate */ if (!buffer_uptodate(bh)) ll_rw_block(READ, 1, &bh); Comments? The above should fix block_write_full_page() automatically, as well as fixing the other cases too - and actually improve performance at the same time by getting rid of unnecessary re-reads. Looks fairly simple. It only happens in __block_prepare_write() and in block_truncate_page(), so there's just two places to fix. Can you se anything else wrong? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Linus Torvalds wrote: > Hmm.. Ok, I see what you're saying. And I looked at the diff the wrong way > around - we _lost_ two articles rather than getting two new ones. > > However, what's wrong with the block_write_full_page() logic? Your patch > looks fine to me, but so does the old logic, I have to admit. > > Sure, __block_prepare_write() does the "optimize away reads" stuff, but > this all can only happen for pages that are up-to-date, so that shouldn't > matter. __block_commit_write() will certainly mark the buffers dirty, so I > don't see how the data would get lost anywhere. > > I bet your patch fixes the corruption, but I want to understand _why_. > Call me dense, but __block_commit_write() seems to do everything we want > done.. > > I wonder if we have some cache aliasing problem somewhere. Or what's the > bug I overlooked? Ugh. OK, first of all, patch is _not_ correct. Doesn't zero out the end of partial block. We know that ->writepage() is called and we know that data doesn't get to the disk. How can it happen? If we can... Oh, fsck. Linus, we could very well lose the buffers since the moment when page had been read. See what happens? We recreate the buffer ring for the page and it's nowhere near "everything uptodate" state. So yes, reading the buffers is the problem. I'll fix the zero-out part of that and send the corrected variant. From my reading of __block_write_full_page() it looks like we are OK wrt assumptions about the bh state there, but we may need to check other places where we use prepare_write(). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Alexander Viro wrote: > > It _is_ a bug, unfortunately. Look: he had taken a copy when the thing was > running. I.e. took the copy of in-core page. After that he stopped innd > and that had triggered block_write_full_page(). So far, so good, right? > Except that the page contents didn't reach the disk. Hmm.. Ok, I see what you're saying. And I looked at the diff the wrong way around - we _lost_ two articles rather than getting two new ones. However, what's wrong with the block_write_full_page() logic? Your patch looks fine to me, but so does the old logic, I have to admit. Sure, __block_prepare_write() does the "optimize away reads" stuff, but this all can only happen for pages that are up-to-date, so that shouldn't matter. __block_commit_write() will certainly mark the buffers dirty, so I don't see how the data would get lost anywhere. I bet your patch fixes the corruption, but I want to understand _why_. Call me dense, but __block_commit_write() seems to do everything we want done.. I wonder if we have some cache aliasing problem somewhere. Or what's the bug I overlooked? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Linus Torvalds wrote: > > > On Sun, 17 Sep 2000, Marco d'Itri wrote: > > > > I added to block_write_full_page() the debugging code suggested by > > Alexander Viro: > > > > if (inode->i_dev == 0x305 && inode->i_ino == 48991) // Md > > printk("block_write_full_page: writing page %d, size %Ld\n", > > page->index, inode->i_size); > > > > and I have just been able to trigger the bug (two times): > > Bug? > > How do you think updates get written back to disk when innd has made > changes to its in-core file? > > Right. With block_write_full_page. > > > root@wonderland:vc/2:/var/lib/news$diff -u active active.ok > > --- active Sun Sep 17 16:09:01 2000 > > +++ active.ok Sun Sep 17 21:03:16 2000 > > @@ -1,11 +1,11 @@ > > control 004774 004775 y > > control.cancel 021917 021889 n > > junk 001777 001768 y > > -fido.ita.ridere 014632 014600 y > > +fido.ita.ridere 014634 014600 y > > So? innd got two new articles. > > The above is what you want. If the active file wouldn't get updated, you'd > never see any new articles ever again. > > Doesn't look like a bug to me. It _is_ a bug, unfortunately. Look: he had taken a copy when the thing was running. I.e. took the copy of in-core page. After that he stopped innd and that had triggered block_write_full_page(). So far, so good, right? Except that the page contents didn't reach the disk. How about the following: --- buffer.cMon Sep 18 01:13:42 2000 +++ buffer.c.newMon Sep 18 01:28:45 2000 @@ -1411,9 +1411,12 @@ */ static int __block_write_full_page(struct inode *inode, struct page *page, get_block_t *get_block) { - int err, i, need_balance_dirty = 0; + int err = -EIO, i, need_balance_dirty = 0; unsigned long block; struct buffer_head *bh, *head; + unsigned long end_block = (inode->i_size+inode->i_sb->s_blocksize-1) >> + inode->i_sb->s_blocksize_bits; + unsigned offset; if (!PageLocked(page)) BUG(); @@ -1423,6 +1426,9 @@ head = page->buffers; block = page->index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits); + /* Are we completely out? */ + if (block > end_block) + goto out; bh = head; i = 0; @@ -1451,6 +1457,8 @@ bh = bh->b_this_page; block++; + if (block > end_block) + break; } while (bh != head); if (need_balance_dirty) @@ -1823,31 +1831,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block) { struct inode *inode = (struct inode*)page->mapping->host; - unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT; - unsigned offset; - int err; - - /* easy case */ - if (page->index < end_index) - return __block_write_full_page(inode, page, get_block); - - /* things got complicated... */ - offset = inode->i_size & (PAGE_CACHE_SIZE-1); - /* OK, are we completely out? */ - if (page->index >= end_index+1 || !offset) - return -EIO; - /* Sigh... will have to work, then... */ - err = __block_prepare_write(inode, page, 0, offset, get_block); - if (!err) { - memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset); - flush_dcache_page(page); - __block_commit_write(inode,page,0,offset); -done: - kunmap(page); - return err; - } - ClearPageUptodate(page); - goto done; + return __block_write_full_page(inode, page, get_block); } int generic_block_bmap(struct address_space *mapping, long block, get_block_t *get_block) Unlike the current variant it does the right thing with ->b_end_io and is less intrusive (and shorter, BTW). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The INN/mmap bug
On Sun, 17 Sep 2000, Marco d'Itri wrote: > > I added to block_write_full_page() the debugging code suggested by > Alexander Viro: > > if (inode->i_dev == 0x305 && inode->i_ino == 48991) // Md > printk("block_write_full_page: writing page %d, size %Ld\n", > page->index, inode->i_size); > > and I have just been able to trigger the bug (two times): Bug? How do you think updates get written back to disk when innd has made changes to its in-core file? Right. With block_write_full_page. > root@wonderland:vc/2:/var/lib/news$diff -u active active.ok > --- active Sun Sep 17 16:09:01 2000 > +++ active.ok Sun Sep 17 21:03:16 2000 > @@ -1,11 +1,11 @@ > control 004774 004775 y > control.cancel 021917 021889 n > junk 001777 001768 y > -fido.ita.ridere 014632 014600 y > +fido.ita.ridere 014634 014600 y So? innd got two new articles. The above is what you want. If the active file wouldn't get updated, you'd never see any new articles ever again. Doesn't look like a bug to me. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/