Re: [PATCH] mm: fix page_mkclean_one
On Fri, Feb 02, 2007 at 07:42:52PM +1100, Nick Piggin ([EMAIL PROTECTED]) wrote: > Anyway, I had a look at your bugzilla test-case and managed to slim it > down to something that easily shows what the problem is (available on > request) -- the problem is that recipient of the sendfile is seeing > modifications that occur to the source file _after_ the sender has > completed the sendfile, because the file pages are not copied but > queued. > > I think the usual approach to what you are trying to do is to set TCP_CORK, > then write(2) the header into the socket, then sendfile directly from the > file you want. > > Another approach I guess is to implement an ack in your userland protocol > so you do not modify the sendfile source file until the client acks that > it has all the data. Mark, don't you use e1000 or other scatter-gather capable nic with checksum offload? Likely yes. Actual data sucking in that case happens when packet is supposed to be transmitted by the NIC, not when sendfile() is returned. The same applies to the case, when you have fancy egress filtering. It is not allowed to modify pages until they are really transmitted, if you want data integrity. There are _no_ bugs in network or VFS cache in this test case. > I'm not sure if there are any other usual ways to do this (ie. a barrier > for sendfile, to ensure it will not pick up "future" modifications to the > file). netdev cc'ed, someone there might have additional comments. > > Please close this bug if/when you are satisfied it is not a kernel problem. > > Thanks, > Nick > > -- > SUSE Labs, Novell Inc. > Send instant messages to your online friends http://au.messenger.yahoo.com -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
Mark Groves wrote: Hi, I have been been seeing a problem when using sendfile repeatedly on an SMP server, which I believe is related to the problem that was discovered recently with marking dirty pages. The bug, as well as a test script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650. Currently, we're experiencing errors where part of a previous packet is being sent out rather than the current packet. I have applied the patch Linus posted to a 2.6.19 kernel but am still getting the problem. So I am wondering if there are any other places in the kernel which mark pages as dirty which might require a similar patch? Your issue is not related, firstly because the page_mkclean bug did not exist before 2.6.19 kernels. Anyway, I had a look at your bugzilla test-case and managed to slim it down to something that easily shows what the problem is (available on request) -- the problem is that recipient of the sendfile is seeing modifications that occur to the source file _after_ the sender has completed the sendfile, because the file pages are not copied but queued. I think the usual approach to what you are trying to do is to set TCP_CORK, then write(2) the header into the socket, then sendfile directly from the file you want. Another approach I guess is to implement an ack in your userland protocol so you do not modify the sendfile source file until the client acks that it has all the data. I'm not sure if there are any other usual ways to do this (ie. a barrier for sendfile, to ensure it will not pick up "future" modifications to the file). netdev cc'ed, someone there might have additional comments. Please close this bug if/when you are satisfied it is not a kernel problem. Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
Hi, I have been been seeing a problem when using sendfile repeatedly on an SMP server, which I believe is related to the problem that was discovered recently with marking dirty pages. The bug, as well as a test script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650. Currently, we're experiencing errors where part of a previous packet is being sent out rather than the current packet. I have applied the patch Linus posted to a 2.6.19 kernel but am still getting the problem. So I am wondering if there are any other places in the kernel which mark pages as dirty which might require a similar patch? Regards, Mark Groves Researcher University of Waterloo Waterloo, Ontario, Canada - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 7 Jan 2007 12:36:18 +1030 "Tom Lanyon" <[EMAIL PROTECTED]> wrote: > On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: > > What would also actually be interesting is whether somebody can reproduce > > this on Reiserfs, for example. I _think_ all the reports I've seen are on > > ext2 or ext3, and if this is somehow writeback-related, it could be some > > bug that is just shared between the two by virtue of them still having a > > lot of stuff in common. > > > > Linus > > I've been following this thread for a while now as I started > experiencing file corruption in rtorrent when I upgraded to 2.6.19. I > am using reiserfs. reiserfs defaults to data=ordered, so it's quite possibly the same bug. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 1/7/07, Tom Lanyon <[EMAIL PROTECTED]> wrote: I've been following this thread for a while now as I started experiencing file corruption in rtorrent when I upgraded to 2.6.19. I am using reiserfs. However, moving to 2.6.20-rc3 does indeed seem to fix the issue thus far... -- Tom Lanyon - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: What would also actually be interesting is whether somebody can reproduce this on Reiserfs, for example. I _think_ all the reports I've seen are on ext2 or ext3, and if this is somehow writeback-related, it could be some bug that is just shared between the two by virtue of them still having a lot of stuff in common. Linus I've been following this thread for a while now as I started experiencing file corruption in rtorrent when I upgraded to 2.6.19. I am using reiserfs. -- Tom Lanyon - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 16:58:41 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Fri, 29 Dec 2006, Andrew Morton wrote: > > > > > > Somewhat nastily, but as ext3 directories are metadata it is appropriate > > > > that modifications to them be done in terms of buffer_heads (ie: > > > > blocks). > > > > > > No. There is nothing "appropriate" about using buffer_heads for metadata. > > > > I said "modification". > > You said "metadata". > > Why do you think directories are any different from files? Yes, they are > metadata. So what? What does that have to do with anything? We journal the contents of directories. Fully. So we handle their dirty data at the block (ie: buffer_head) level. When someone tries to dirty part of a directory we need to cheat and not mark that part of the page as dirty and we need to then write the block to the journal and then mark the block as really dirty for checkpointing (but still attached to the journal) and all that goop. The regular page-based writeback doesn't apply until the block has been written to the journal. At that stage the block is considered dirty against its real position on disk. It will then be written back by pdflush via the blockdev inode -> blkdev_writepage(). Unless kjournald needs to do an early flush to reclaim the journal space, in which case kjournald will write the block itself. > > So I really don't understand why you make excuses for ext3 and talk about > "modifications" and "metadata". It was a fine design ten years ago. It's > not really very good any longer. > As I said in another apparently-neglected email: : We could possibly move ext3/4 directories out of the blockdev pagecache and : into per-directory pagecache, but that wouldn't change anything - the : journalling would still be block-based. We already have all the code in place to journal blocks which are cached in an address_space other than the blockdev inode's: ext3_journalled_aops. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: > > > > Somewhat nastily, but as ext3 directories are metadata it is appropriate > > > that modifications to them be done in terms of buffer_heads (ie: blocks). > > > > No. There is nothing "appropriate" about using buffer_heads for metadata. > > I said "modification". You said "metadata". Why do you think directories are any different from files? Yes, they are metadata. So what? What does that have to do with anything? They should still use virtual indexes, the way files do. That doesn't preclude them from using buffer-heads to mark their (partial-page) modifications and for keeping the virtual->physical translations cached. I mean, really. Look at ext2. It does exactly that. It keeps the directories in the page cache - virtually indexed. And it even modifies them there. Exactly the same way it modifies regular file data. It all works exactly the same way it works for regular files. It uses page->mapping->a_ops->prepare_write(NULL, page, from, to); ... do modification ... ext2_commit_chunk(page, from, to); exactly the way regular file data works. That's why I'm saying there is absolutely _zero_ thing about "metadata" here, or even about "modifications". It all works better in a virtual cache, because you get all the support that we give to page caches. So I really don't understand why you make excuses for ext3 and talk about "modifications" and "metadata". It was a fine design ten years ago. It's not really very good any longer. I suspect we're stuck with the design, but that doesn't make it any _better_. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: > > Adam Richter spent considerable time a few years ago trying to make the > mpage code go direct-to-BIO in all cases and we eventually gave up. The > conceptual layering of page<->blocks<->bio is pretty clean, and it is hard > and ugly to fully optimise away the "block" bit in the middle. Using the buffer cache as a translation layer to the physical address is fine. That's what _any_ block device will do. I'm not at all sayign that "buffer heads must go away". They work fine. What I'm saying is that - if you index by buffer heads, you're screwed. - if you do IO by starting at buffer heads, you're screwed. Both indexing and writeback decisions should be done at the page cache layer. Then, when you actually need to do IO, you look at the buffers. But you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance. So by all means keep the buffer heads as a way to keep the "virtual->physical" translation. It's what they were designed for. But they were _originally_ also designed for "lookup" and "driving the start of IO", and that is wrong, and has been wrong for a long time now, because - lookup based on physical address is fundamentally slow and inefficient. You have to look up the virtual->physical translation somewhere else, so it's by design an unnecessary indirection _and_ that "somewere else" is also by definition filesystem-specific, so you can't do any of these things at the VFS layer. Ergo: anything that needs to look up the physical address in order to find the buffer head is BROKEN in this day and age. We look up the _virtual_ page cache page, and then we can trivially find the buffer heads within that page thanks to page->buffers. Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't. - starting IO based on the physical entity is insane. It's insane exactly _because_ the VM doesn't actually think in physical addresses, or in buffer-sized blocks. The VM only really knows about whole pages, and all the VM decisions fundamentally have to be page-based. We don't ever "free a buffer". We free a whole page, and as such, doing writeback based on buffers is pointless, because it doesn't actually say anything about the "page state" which is what the VM tracks. But neither of these means that "buffer_head" itself has to go away. They both really boil down to the same thing: you should never KEY things by the buffer head. All actions should be based on virtual indexes as far as at all humanly possible. Once you do lookup and locking and writeback _starting_ from the page, it's then easy to look up the actual buffer head within the page, and use that as a way to do the actual _IO_ on the physical address. So the buffer heads still exist in ext2, for example, but they don't drive the show quite as much. (They still do in some areas: the allocation bitmaps, the xattr code etc. But as long as none of those have big VM footprints, and as long as no _common_ operations really care deeply, and as long as those data structures never need to be touched by the VM or VFS layer, nobody will ever really care). The directory case comes up just because "readdir()" actually is very common, and sometimes very slow. And it can have a big VM working set footprint ("find"), so trying to be page-based actually really helps, because it all drives things like writeback on the _right_ issues, and we can do things like LRU's and writeback decisions on the level that really matters. I actually suspect that the inode tables could benefit from being in the page cache too (although I think that the inode buffer address is actually "physical", so there's no indirection for inode tables, which means that the virtual vs physical addressing doesn't matter). For directories, there definitely is a big cost to continually doing the virtual->physical translation all the time. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 16:11:44 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > > JBD implements physical block-based journalling, so it is 100% appropriate > > that JBD deal with these disk blocks using their buffer_head > > representation. > > And as long as it does that, you just have to face the fact that it's > going to perform like crap, including what you call "extra" writes, and > what I call "deal with it". It is quite tiresome to delete things which your interlocutor said and to then restate them as if it were some sort of relevation. > > Somewhat nastily, but as ext3 directories are metadata it is appropriate > > that modifications to them be done in terms of buffer_heads (ie: blocks). > > No. There is nothing "appropriate" about using buffer_heads for metadata. I said "modification". > [stuff about directory reads elided] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: > > They're extra. As in "can be optimised away". Sure. Don't use buffer heads. > The buffer_head is not an IO container. It is the kernel's core > representation of a disk block. Please come back from the 90's. The buffer heads are nothing but a mapping of where the hardware block is. If you use it for anything else, you're basically screwed. > JBD implements physical block-based journalling, so it is 100% appropriate > that JBD deal with these disk blocks using their buffer_head > representation. And as long as it does that, you just have to face the fact that it's going to perform like crap, including what you call "extra" writes, and what I call "deal with it". Btw, you can make pages be physically indexed too, but they obviously (a) won't be coherent with any virtual mapping laid on top of it (b) will be _physical_, so any readahead etc will be based on physical addresses too. > I thought I fixed the performance problem? No, you papered over it, for the reasonably common case where things were physically contiguous - exactly by using a physical page cache, so now it can do read-ahead based on that. Then, because the pages contain buffer heads, the directory accesses can look up buffers, and if it was all physically contiguous, it all works fine. But if you actually want virtualluy indexed caching (and all _users_ want it), it really doesn't work. > Somewhat nastily, but as ext3 directories are metadata it is appropriate > that modifications to them be done in terms of buffer_heads (ie: blocks). No. There is nothing "appropriate" about using buffer_heads for metadata. It's quite proper - and a hell of a lot more efficient - to use virtual page-caching for metadata too. Look at the ext2 readdir() implementation, and compare it to the crapola horror that is ext3. Guess what? ext2 uses virtually indexed metadata, and as a result it is both simpler, smaller and a LOT faster than ext3 in accessing that metadata. Face it, Andrew, you're wrong on this one. Really. Just take a look at ext2_readdir(). [ I'm not saying that ext2_readdir() is _beautiful_. If it had been written with the page cache in mind, it would probably have been done very differently. And it doesn't do any readahead, probably because nobody cared enough, but it should be trivial to add, and it would automatically "do the right thing" just because it's much easier at the page cache level. But I _am_ saying that compared to ext3, the ext2 readdir is a work of art. ] "metadata" has _zero_ to do with "physically indexed". There is no correlation what-so-ever. If you think there is a correlation, it's all in your mind. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 18:32:07 -0500 Theodore Tso <[EMAIL PROTECTED]> wrote: > On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote: > > I think ext3 is terminally crap by now. It still uses buffer heads in > > places where it really really shouldn't, and as a result, things like > > directory accesses are simply slower than they should be. Sadly, I don't > > think ext4 is going to fix any of this, either. > > Not just ext3; ocfs2 is using the jbd layer as well. I think we're > going to have to put this (a rework of jbd2 to use the page cache) on > the ext4 todo list, and work with the ocfs2 folks to try to come up > with something that suits their needs as well. Fortunately we have > this filesystem/storage summit thing coming up in the next few months, > and we can try to get some discussion going on the linux-ext4 mailing > list in the meantime. Unfortunately, I don't think this is going to > be trivial. I suspect it would be insane to move any part of JBD (apart from the ordered-data flush) to use pagecache. The whole thing is fundamentally block-based. But only for metadata - there's no strong reason why ext3/4 needs to manipulate file data via buffer_heads if data=journal and chattr +j aren't in use. We could possibly move ext3/4 directories out of the blockdev pagecache and into per-directory pagecache, but that wouldn't change anything - the journalling would still be block-based. Adam Richter spent considerable time a few years ago trying to make the mpage code go direct-to-BIO in all cases and we eventually gave up. The conceptual layering of page<->blocks<->bio is pretty clean, and it is hard and ugly to fully optimise away the "block" bit in the middle. buffer_heads become more important with large PAGE_CACHE_SIZE. I'd expect nobh mode to be quite inefficient with some workloads on 64k pages. We need that representation of the state (and location) of the block-sized hunks which make up the page. > If we do get this fixed for ext4, one interesting question is whether > people would accept a patch to backport the fixes to ext3, given the > the grief this is causing the page I/O and VM routines. OTOH, reiser3 > probably has the same problems, and I suspect the changes to ext3 to > cause it to avoid buffer heads, especially in order to support for > filesystem blocksizes < pagesize, are going to be sufficiently risky > in terms of introducing regressions to ext3 that they would probably > be rejected on those grounds. So unfortunately, we probably are going > to have to support flushes via buffer heads for the foreseeable > future. We'll see. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Theodore Tso wrote: > > If we do get this fixed for ext4, one interesting question is whether > people would accept a patch to backport the fixes to ext3, given the > the grief this is causing the page I/O and VM routines. I don't think backporting is the smartest option (unless it's done _way_ later), but the real problem with it isn't actually the VM behaviour, but simply the fact that cached performance absoluely _sucks_ with the buffer cache. With the physically indexed buffer cache thing, you end up always having to do these complicated translations into block numbers for every single access, and at some point when I benchmarked it, it was a huge overhead for doing simple things like readdir. It's also a major pain for read-ahead, exactly partly due to the high cost of translation - because you can't cheaply check whether the next block is there, the cost of even asking the question "should I try to read ahead?" is much much higher. As a result, read-ahead is seriously limited, because it's so expensive for the cached case (which is still hopefully the _common_ case). So because read-ahead is limited, the non-cached case then _really_ sucks. It was somewhat fixed in a really god-awful fashion by having ext3_readdir() actually do _readahead_ though the page cache, even though it does everything else through the buffer cache. And that just happens to work because we hopefully have physically contiguous blocks, but when that isn't true, the readahead doesn't do squat. It's really quite fundamentally broken. But none of that causes any problems for the VM, since directories cannot be mmap'ed anyway. But it's really pitiful, and it really doesn't work very well. Of course, other filesystems _also_ suck at this, and other operating systems haev even MORE problems, so people don't always seem to realize how horribly horribly broken this all is. I really wish somebody would write a filesystem that did large cold-cache directories well. Open some horrible file manager on /usr/bin with cold caches, and weep. The biggest problem is the inode indirection, but at some point when I looked at why it sucked, it was doing basically synchronous single-buffer reads on the directory too, because readahead didn't work properly. I was hoping that something like SpadFS would actually take off, because it seemed to do a lot of good design choices (having inodes in-line in the directory for when there are no hardlinks is probably a requirement for a good filesystem these days. The separate inode table had its uses, but indirection in a filesystem really does suck, and stat information is too important to be indirect unless it absolutely has to). But I suspect it needs more than somebody who just wants to get his thesis written ;) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 14:42:51 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Fri, 29 Dec 2006, Andrew Morton wrote: > > > > - The above change means that we do extra writeout. If a page is dirtied > > once, kjournald will write it and then pdflush will come along and > > needlessly write it again. > > There's zero extra writeout for any flushing that flushes BY PAGES. > > Only broken flushers that flush by buffer heads (which really really > really shouldn't be done any more: welcome to the 21st century) will cause > extra writeouts. And those extra writeouts are obviously required for all > the dirty state to actually hit the disk - which is the point of the > patch. > > So they're not "extra" - they are "required for correct working". They're extra. As in "can be optimised away". > But I can't stress the fact enough that people SHOULD NOT do writeback by > buffer heads. The buffer head has been purely an "IO entity" for the last > several years now, and it's not a cache entity. The buffer_head is not an IO container. It is the kernel's core representation of a disk block. Usually (but not always) it is backed by some memory which is in pagecache. We can feed buffer_heads into IO containers via submit_bh(), but that's far from the only thing we use buffer_heads for. We should have done s/buffer_head/block/g years ago. JBD implements physical block-based journalling, so it is 100% appropriate that JBD deal with these disk blocks using their buffer_head representation. That being said, ordered-data mode isn't really part of the JBD journalling system at all (the data doesn't get journalled!) - ordered-mode is an add-on to the JBD journal to make the metadata which we're about to journal point at more-likely-to-be-correct data. JBD's ordered-mode writeback is just a sync and I see no conceptual problems with killing its old buffer_head based sync and moving it into the 21st century. > Anybody who does writeback > by buffer heads is basically bypassing the real cache (the page cache), > and that's why all the problems happen. > > I think ext3 is terminally crap by now. It still uses buffer heads in > places where it really really shouldn't, The ordered-data mode flush: sure. The rest of JBD's use of buffer_heads is quite appropriate. > and as a result, things like > directory accesses are simply slower than they should be. Sadly, I don't > think ext4 is going to fix any of this, either. I thought I fixed the performance problem? Somewhat nastily, but as ext3 directories are metadata it is appropriate that modifications to them be done in terms of buffer_heads (ie: blocks). > It's all just too inherently wrongly designed around the buffer head > (which was correct in 1995, but hasn't been correct for a long time in the > kernel any more). > > > - Poor old IO accounting broke again. > > No. That's why I used "set_page_dirty()" and did it that strange ugly way > ("set page dirty, even though it's already dirty, and even though the very > next thing we will do is TestClearPageDirty???"). nfs_set_page_dirty() and reiserfs_set_page_dirty() should now bail if PageDirty() to avoid needless work. > > - For a long time I've wanted to nuke the current ext3/jbd ordered-data > > implementation altogether, and just make kjournald call into the > > standard writeback code to do a standard suberblock->inodes->pages walk. > > I really would like to see less of the buffer-head-based stuff, and yes, > more of the normal inode page walking. I don't think you can "order" > accesses within a page anyway, exactly because of memory mapping issues, > so any page ordering is not about buffer heads on the page itself, it > should be purely about metadata. In this context ext3's "ordered" mode means "sync the file contents before journalling the metadata which points at it". > > - It's pretty obnoxious that the VM now sets a clean page "dirty" and > > then proceeds to modify its contents. It would be nice to stop doing > > that. > > No. I think this really the fundamental confusion people had. People > thought that setting the page dirty meant that it was no longer being > modified. No. Setting a page (or bh, or inode) dirty means "this is known to have been modified". ie: this cached entity is now out of sync with backing store. Ho hum. I don't care much, really. But then, I understand how all this stuff works. Try explaining to someone the relationship between pte-dirtiness, page-dirtiness, radix-tree-dirtiness and buffer_head-dirtiness. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote: > I think ext3 is terminally crap by now. It still uses buffer heads in > places where it really really shouldn't, and as a result, things like > directory accesses are simply slower than they should be. Sadly, I don't > think ext4 is going to fix any of this, either. Not just ext3; ocfs2 is using the jbd layer as well. I think we're going to have to put this (a rework of jbd2 to use the page cache) on the ext4 todo list, and work with the ocfs2 folks to try to come up with something that suits their needs as well. Fortunately we have this filesystem/storage summit thing coming up in the next few months, and we can try to get some discussion going on the linux-ext4 mailing list in the meantime. Unfortunately, I don't think this is going to be trivial. If we do get this fixed for ext4, one interesting question is whether people would accept a patch to backport the fixes to ext3, given the the grief this is causing the page I/O and VM routines. OTOH, reiser3 probably has the same problems, and I suspect the changes to ext3 to cause it to avoid buffer heads, especially in order to support for filesystem blocksizes < pagesize, are going to be sufficiently risky in terms of introducing regressions to ext3 that they would probably be rejected on those grounds. So unfortunately, we probably are going to have to support flushes via buffer heads for the foreseeable future. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: > > - The above change means that we do extra writeout. If a page is dirtied > once, kjournald will write it and then pdflush will come along and > needlessly write it again. There's zero extra writeout for any flushing that flushes BY PAGES. Only broken flushers that flush by buffer heads (which really really really shouldn't be done any more: welcome to the 21st century) will cause extra writeouts. And those extra writeouts are obviously required for all the dirty state to actually hit the disk - which is the point of the patch. So they're not "extra" - they are "required for correct working". But I can't stress the fact enough that people SHOULD NOT do writeback by buffer heads. The buffer head has been purely an "IO entity" for the last several years now, and it's not a cache entity. Anybody who does writeback by buffer heads is basically bypassing the real cache (the page cache), and that's why all the problems happen. I think ext3 is terminally crap by now. It still uses buffer heads in places where it really really shouldn't, and as a result, things like directory accesses are simply slower than they should be. Sadly, I don't think ext4 is going to fix any of this, either. It's all just too inherently wrongly designed around the buffer head (which was correct in 1995, but hasn't been correct for a long time in the kernel any more). > - Poor old IO accounting broke again. No. That's why I used "set_page_dirty()" and did it that strange ugly way ("set page dirty, even though it's already dirty, and even though the very next thing we will do is TestClearPageDirty???"). That code looks strange as a result, which is why it now has more comments on it than actual code ;) > - People were saying that ext2 and ext3,data=writeback were also showing > corruption. What's up with that? I thought the "ext3,data=writeback" case was reported to be fine by several people? I'm not sure about ext2. I didn't look at what it did based on buffer heads. I would have expected it to be ok. That said, at least one report was later shown to be bogus (errors due to out of disk, not due to actual errors ;). > - For a long time I've wanted to nuke the current ext3/jbd ordered-data > implementation altogether, and just make kjournald call into the > standard writeback code to do a standard suberblock->inodes->pages walk. I really would like to see less of the buffer-head-based stuff, and yes, more of the normal inode page walking. I don't think you can "order" accesses within a page anyway, exactly because of memory mapping issues, so any page ordering is not about buffer heads on the page itself, it should be purely about metadata. > - It's pretty obnoxious that the VM now sets a clean page "dirty" and > then proceeds to modify its contents. It would be nice to stop doing > that. No. I think this really the fundamental confusion people had. People thought that setting the page dirty meant that it was no longer being modified. It hasn't meant that in a LONG time - ever since the whole DIRTY_TAG thing, the most important part of the PG_dirty thing has really been that it's now efficiently findable by the writeout logic. And that is very much what the whole page accounting _depends_ on. When we mmap a page, we need to mark it "findable" as dirty _before_ people actually start writing to it, because it's too late afterwards. > We could stop marking the page dirty in do_wp_page() and create a new > VM counter "NR_PTE_DIRTY", which means > > "number of mapping_cap_account_dirty() pages which have a dirty pte > pointing at them". Well, then you need to change what PAGE_MAPPING_TAG_DIRTY means too. That's very fundamental. That DIRTY _tag_ is now even more important than the PG_dirty bit itself, since that's what we actually use to _access_ those things. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 14:16:32 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: > - Poor old IO accounting broke again. No it didn't - we're relying upon the behaviour of __set_page_dirty_buffers() against an already-dirty page. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 02:48:35 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > + if (mapping && mapping_cap_account_dirty(mapping)) { > + /* > + * Yes, Virginia, this is indeed insane. > + * > + * We use this sequence to make sure that > + * (a) we account for dirty stats properly > + * (b) we tell the low-level filesystem to > + * mark the whole page dirty if it was > + * dirty in a pagetable. Only to then > + * (c) clean the page again and return 1 to > + * cause the writeback. > + * > + * This way we avoid all nasty races with the > + * dirty bit in multiple places and clearing > + * them concurrently from different threads. > + * > + * Note! Normally the "set_page_dirty(page)" > + * has no effect on the actual dirty bit - since > + * that will already usually be set. But we > + * need the side effects, and it can help us > + * avoid races. > + * > + * We basically use the page "master dirty bit" > + * as a serialization point for all the different > + * threds doing their things. > + * > + * FIXME! We still have a race here: if somebody > + * adds the page back to the page tables in > + * between the "page_mkclean()" and the "TestClearPageDirty()", > + * we might have it mapped without the dirty bit set. > + */ > + if (page_mkclean(page)) > + set_page_dirty(page); > + if (TestClearPageDirty(page)) { > dec_zone_page_state(page, NR_FILE_DIRTY); > + return 1; > } - Presumably reiser3's ordered-data mode has the same problem. And ext4, of course. Dunno about other filesytems. - The above change means that we do extra writeout. If a page is dirtied once, kjournald will write it and then pdflush will come along and needlessly write it again. But otoh, if a mapping is being repeatedly dirtied, kjournald will write the page once per 30 seconds (dirty_expire_centisecs) and pdflush will write the page once per 30 seconds as well. But we _should_ be writing it once per five seconds (kjournald commit interval). So we're still ahead ;) - Poor old IO accounting broke again. - People were saying that ext2 and ext3,data=writeback were also showing corruption. What's up with that? - For a long time I've wanted to nuke the current ext3/jbd ordered-data implementation altogether, and just make kjournald call into the standard writeback code to do a standard suberblock->inodes->pages walk. I think it'd be fairly straightforward to do. We'd need to teach the writeback code to be able to skip dirty pages which don't have a disk mapping, so that kjournald doesn't end up waiting for kjournald to free up journal space.. Would need to avoid possible deadlocks where someone calls ext3_force_commit() or otherwise does a synchronous commit while holding VFS locks. reiser3 and ext4 could be converted too. Not a short-term project, but this would avoid the problem. - It's pretty obnoxious that the VM now sets a clean page "dirty" and then proceeds to modify its contents. It would be nice to stop doing that. We could stop marking the page dirty in do_wp_page() and create a new VM counter "NR_PTE_DIRTY", which means "number of mapping_cap_account_dirty() pages which have a dirty pte pointing at them". Or, perhaps "number of dirty ptes which point at mapping_cap_account_dirty() pages". Which can be larger, but the writeout code will probably cope. Then we take NR_PTE_DIRTY into account in the dirty-page balancing act. So - do_wp_page() will still run balance_dirty_pages() - but it would no longer run set_page_dirty(). - But it needs to run mark_inode_dirty() so the fs-writeback code notices the file. - And mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) becomes insufficient. The tricky part here is "how do we do the writeback"? The pte-dirty,!PageDirty pages aren't tagged as dirty in the radix-tree and writeback needs to find them so that it can effectively do an msync() on them. Walking all the mm's and vma's would be insane. Visiting all the pages in the file would also probably be insane. Perhaps this can be solved by adding a new radix-tree tag which means "this page might have dirty ptes pointing at it". For each file writeback would do a radix-tree walk of these pages, cleaning-and-write-protecting ptes, marking the corresponding pages dirty and clearing their PAGECACHE_TAG_PTE_DIRTY tags. Then we can fix the mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) problem by doing mapping_tagg
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Theodore Tso wrote: > > I'm confused. Does this mean that if "fs blocksize"=="VM pagesize" > this bug can't trigger? No. Even if there is just a single buffer-head, if the filesystem ever writes out that _single_ buffer-head out of turn (ie before the VM actually asks it to, with "->writepage()"), then the same issue will happen. In fact, a bigger fs blocksize will likely just make this easier to trigger (although I doubt it makes a big difference), since any out-of-order buffer flushback will happen for the whole page, rather than just a part of the page. So the "problem" really ends up being that the filesystem does flushing that the VM isn't aware of, so when the VM did "set_page_dirty()" at an earlier time, the VM _expected_ the "->writepages()" call that happened much later to write the whole page - but because the FS had flushed things behind it backs even _before_ the "->writepage" happens, by the time the VM actually asks for the page to be written out, the FS layer won't actually write it all out any more. Blocksize doesn't matter, the only thing that matters is whether something writes out data on a buffer-cache level, not on a "page cache" level. Ext3 apparently does this in "ordered" data more at least (and hey, I suspect that the code that tries to release buffer head data might try to do it on its own too). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Nick Piggin wrote: > > > It still has a tiny tiny race (see the comment), but I bet nobody can really > > hit it in real life anyway, and I know several ways to fix it, so I'm not > > really _that_ worried about it. > > Well the race isn't a data loss one, is it? Just a case where the > pte may be dirty but the page dirty state not accounted for. Right. We should be picking it up eventually, since it's still in the page tables, but if we've lost sight of the page dirtyness we won't react correctly to msync() and/or fdatasync(). So we don't _lose_ the data, we just might not write it out in a timely manner if we ever hit the race. > Can we fix it by just putting the page_mkclean back inside the > TestClearPageDirty check, and re-clearing PG_dirty after redoing > the set_page_dirty? I considered it, but quite frankly, if we did it that way, I'd really like to just fix the whole insane "set_page_dirty()" instead. I think set_page_dirty() should be split up. One thing that confused me mentally was that almost all of the dirty handling was actualyl done only if PG_dirty wasn't already set, so the _bulk_ of set_page_dirty() really ends up being if (!TestSetPageDirty(page)) { .. we just marked the page dirty, it was clean before, so we need to add it to the queues etc .. } and that's the part that I (and probably others) always really thought about. But then we have the _one_ thing that runs outside of that "do only once per dirty bit" logic, and it's the buffer dirtying. If we had had two separate operations for this all: "set_dirty_every_time()" and the regular "set_dirty()", I don't think this would have been nearly as confusing. (And then the difference between "__set_page_dirty_nobuffers()" and "__set_page_dirty_buffers()" really boils down to one doing the "everytime" _and_ the "once per dirty" checks and the other one doing just the "once per dirty bit" act - and we could rename the damn things to something saner too). If we split it up that way, then the whole clear_page_dirty_for_io() logic would boil down to if (TestClearPageDirty(page)) { if (page_mkclean(page)) set_dirty_every_time(); return 1; } return 0; and we wouldn't even need to do any of the "clear dirty again" kind of idiocy, because the "set_dirty_every_time()" stuff is the one that doesn't even care about the state of the PG_dirty bit - it's done regardless, and doesn't really touch it. That's what I wanted to do, but with the current "set_page_dirty()" setup, I think my patch makes reasonable sense. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
* Stephen Clark <[EMAIL PROTECTED]> [2006-12-29 10:17]: > >It works for me now, both your testcase as well as an installation of > >Debian on this ARM device. I manually applied the patch to 2.6.19. > > Can you post a diff against 2.6.19? --- a/mm/page-writeback.c 2006-11-29 21:57:37.0 + +++ b/mm/page-writeback.c 2006-12-29 11:02:55.555147896 + @@ -893,16 +893,45 @@ { struct address_space *mapping = page_mapping(page); - if (mapping) { + if (mapping && mapping_cap_account_dirty(mapping)) { + /* +* Yes, Virginia, this is indeed insane. +* +* We use this sequence to make sure that +* (a) we account for dirty stats properly +* (b) we tell the low-level filesystem to +* mark the whole page dirty if it was +* dirty in a pagetable. Only to then +* (c) clean the page again and return 1 to +* cause the writeback. +* +* This way we avoid all nasty races with the +* dirty bit in multiple places and clearing +* them concurrently from different threads. +* +* Note! Normally the "set_page_dirty(page)" +* has no effect on the actual dirty bit - since +* that will already usually be set. But we +* need the side effects, and it can help us +* avoid races. +* +* We basically use the page "master dirty bit" +* as a serialization point for all the different +* threds doing their things. +* +* FIXME! We still have a race here: if somebody +* adds the page back to the page tables in +* between the "page_mkclean()" and the "TestClearPageDirty()", +* we might have it mapped without the dirty bit set. +*/ + if (page_mkclean(page)) + set_page_dirty(page); if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } + dec_zone_page_state(page, NR_FILE_DIRTY); return 1; } return 0; - } + } return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, Dec 29, 2006 at 12:58:12AM -0800, Linus Torvalds wrote: > Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE > "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as > dirty. That may _sound_ like what we want, but it really isn't. Because by > the time "writepage()" is actually called (which can be MUCH MUCH later), > some internal filesystem activity may actually have cleaned one or more of > those buffers in the meantime, and now we call "writepage()" (which really > wants to write them _all_), and it will write only part of them, or none > at all. I'm confused. Does this mean that if "fs blocksize"=="VM pagesize" this bug can't trigger? But I thought at least one of people reporting corruption was using a filesystem with a 4k block size on an i386? - Ted - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Martin Michlmayr wrote: * Linus Torvalds <[EMAIL PROTECTED]> [2006-12-29 02:48]: Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. It works for me now, both your testcase as well as an installation of Debian on this ARM device. I manually applied the patch to 2.6.19. Thanks. Hi Martin, Can you post a diff against 2.6.19? Thanks, Steve -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-29 02:48]: > Can anybody get corruption with this thing applied? It goes on top > of plain v2.6.20-rc2. It works for me now, both your testcase as well as an installation of Debian on this ARM device. I manually applied the patch to 2.6.19. Thanks. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Linus Torvalds wrote: [...] The patch is mostly a comment. The "real" meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. No corruption with the testcase here. Will check with rtorrent too later today but I suppose it will work just fine. Nice work! It has been interesting (and educating) to follow this bug-hunt :) /Martin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > > Hmm? I'd love it if somebody else wrote the patch and tested it, > > because I'm getting sick and tired of this bug ;) > > Who the hell am I kidding? I haven't been able to sleep right for the > last few days over this bug. It was really getting to me. > > And putting on the thinking cap, there's actually a fairly simple an > nonintrusive patch. [...] ok, your patch seems to fix the testcase here too on -rc2-rt. [ Damn, i should have slept a bit more, that would have saved me a ~4 hour debug and tracing session today to analyze your testcase, just to find your patch and your explanation on lkml, right after i sent my analysis and workaround patch ;-) At least now we know it from two independent tracing results that the suspect code is the same. ] Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Hey nice work Linus! Linus Torvalds wrote: On Fri, 29 Dec 2006, Linus Torvalds wrote: Hmm? I'd love it if somebody else wrote the patch and tested it, because I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. Yeah *this* makes more sense. And in retrospect it was simple, we can't just throw out pte dirtiness information if the page doesn't have all buffers dirtied. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. Well the race isn't a data loss one, is it? Just a case where the pte may be dirty but the page dirty state not accounted for. Can we fix it by just putting the page_mkclean back inside the TestClearPageDirty check, and re-clearing PG_dirty after redoing the set_page_dirty? The patch is mostly a comment. The "real" meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. Linus diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b3a198c..ec01da1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); - if (!mapping) - return TestClearPageDirty(page); - - if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (mapping && mapping_cap_account_dirty(mapping)) { + /* +* Yes, Virginia, this is indeed insane. +* +* We use this sequence to make sure that +* (a) we account for dirty stats properly +* (b) we tell the low-level filesystem to +* mark the whole page dirty if it was +* dirty in a pagetable. Only to then +* (c) clean the page again and return 1 to +* cause the writeback. +* +* This way we avoid all nasty races with the +* dirty bit in multiple places and clearing +* them concurrently from different threads. +* +* Note! Normally the "set_page_dirty(page)" +* has no effect on the actual dirty bit - since +* that will already usually be set. But we +* need the side effects, and it can help us +* avoid races. +* +* We basically use the page "master dirty bit" +* as a serialization point for all the different +* threds doing their things. +* +* FIXME! We still have a race here: if somebody +* adds the page back to the page tables in +* between the "page_mkclean()" and the "TestClearPageDirty()", +* we might have it mapped without the dirty bit set. +*/ + if (page_mkclean(page)) + set_page_dirty(page); + if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + return 1; } - return 1; + return 0; } - return 0; + return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 2006-12-29 at 02:48 -0800, Linus Torvalds wrote: > > On Fri, 29 Dec 2006, Linus Torvalds wrote: > > > > Hmm? I'd love it if somebody else wrote the patch and tested it, because > > I'm getting sick and tired of this bug ;) > > Who the hell am I kidding? I haven't been able to sleep right for the last > few days over this bug. It was really getting to me. > > And putting on the thinking cap, there's actually a fairly simple an > nonintrusive patch. It still has a tiny tiny race (see the comment), but I > bet nobody can really hit it in real life anyway, and I know several ways > to fix it, so I'm not really _that_ worried about it. > > The patch is mostly a comment. The "real" meat of it is actually just a > few lines. > > Can anybody get corruption with this thing applied? It goes on top of > plain v2.6.20-rc2. Tested with rtorrent and there is no corruption. > > Linus > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index b3a198c..ec01da1 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) > { > struct address_space *mapping = page_mapping(page); > > - if (!mapping) > - return TestClearPageDirty(page); > - > - if (TestClearPageDirty(page)) { > - if (mapping_cap_account_dirty(mapping)) { > - page_mkclean(page); > + if (mapping && mapping_cap_account_dirty(mapping)) { > + /* > + * Yes, Virginia, this is indeed insane. > + * > + * We use this sequence to make sure that > + * (a) we account for dirty stats properly > + * (b) we tell the low-level filesystem to > + * mark the whole page dirty if it was > + * dirty in a pagetable. Only to then > + * (c) clean the page again and return 1 to > + * cause the writeback. > + * > + * This way we avoid all nasty races with the > + * dirty bit in multiple places and clearing > + * them concurrently from different threads. > + * > + * Note! Normally the "set_page_dirty(page)" > + * has no effect on the actual dirty bit - since > + * that will already usually be set. But we > + * need the side effects, and it can help us > + * avoid races. > + * > + * We basically use the page "master dirty bit" > + * as a serialization point for all the different > + * threds doing their things. > + * > + * FIXME! We still have a race here: if somebody > + * adds the page back to the page tables in > + * between the "page_mkclean()" and the "TestClearPageDirty()", > + * we might have it mapped without the dirty bit set. > + */ > + if (page_mkclean(page)) > + set_page_dirty(page); > + if (TestClearPageDirty(page)) { > dec_zone_page_state(page, NR_FILE_DIRTY); > + return 1; > } > - return 1; > + return 0; > } > - return 0; > + return TestClearPageDirty(page); > } > EXPORT_SYMBOL(clear_page_dirty_for_io); > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Linus Torvalds wrote: > > Hmm? I'd love it if somebody else wrote the patch and tested it, because > I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. The patch is mostly a comment. The "real" meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. Linus diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b3a198c..ec01da1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); - if (!mapping) - return TestClearPageDirty(page); - - if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (mapping && mapping_cap_account_dirty(mapping)) { + /* +* Yes, Virginia, this is indeed insane. +* +* We use this sequence to make sure that +* (a) we account for dirty stats properly +* (b) we tell the low-level filesystem to +* mark the whole page dirty if it was +* dirty in a pagetable. Only to then +* (c) clean the page again and return 1 to +* cause the writeback. +* +* This way we avoid all nasty races with the +* dirty bit in multiple places and clearing +* them concurrently from different threads. +* +* Note! Normally the "set_page_dirty(page)" +* has no effect on the actual dirty bit - since +* that will already usually be set. But we +* need the side effects, and it can help us +* avoid races. +* +* We basically use the page "master dirty bit" +* as a serialization point for all the different +* threds doing their things. +* +* FIXME! We still have a race here: if somebody +* adds the page back to the page tables in +* between the "page_mkclean()" and the "TestClearPageDirty()", +* we might have it mapped without the dirty bit set. +*/ + if (page_mkclean(page)) + set_page_dirty(page); + if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + return 1; } - return 1; + return 0; } - return 0; + return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Thu, 28 Dec 2006, Linus Torvalds wrote: > > So everything I have ever seen says that the VM layer is actually doing > everything right. That was true, but at the same time, it's not. Let me explain. > That to me says: "somebody didn't actually write out out". The VM layer > asked the filesystem to do the write, but the filesystem just didn't do > it. I personally think it's because some buffer-head BH_dirty bit got > scrogged Ok, I have proof of this now. Here's a trace (with cycle counts), and with a new trace event added: this is for another corrupted page. I have: 49105 PG 15d8 (14800): mm/page-writeback.c:872 clean_for_io 49106 PG 15d8 (6900): mm/rmap.c:451 cleaning PTE b7fa6000 49107 PG 15d8 (9900): mm/page-writeback.c:914 set writeback 49108 PG 15d8 (6480): mm/page-writeback.c:916 setting TAG_WRITEBACK 49109 PG 15d8 (7110): mm/page-writeback.c:922 clearing TAG_DIRTY 49110 PG 15d8 (7190): fs/buffer.c:1713 no IO underway 49111 PG 15d8 (6180): mm/page-writeback.c:891 end writeback 49112 PG 15d8 (6460): mm/page-writeback.c:893 clearing TAG_WRITEBACK where that first column is the trace event number again, and the "PG 15d8" is that corrupted page. The thing in the parenthesis is "CPU cycles since last event), and the important part to note is that this is indeed all one single thing with no actual IO anywhere (~7000 CPU cycles may sound like a lot, but (a) it's not that many cache misses and (b) a lot of it is the logging overhead - back-to-back log events will take about 3500 cycles) just because it does the actual ASCII printk() etc. Also, the new event is: fs/buffer.c:1713 no IO underway which is just the if (nr_underway == 0) case in fs/buffer.c And I now finally really believe that I fully understand the corruption, but I don't have a simple solution, much less a patch. What the problem basically boils down to is that "set_page_dirty()" is supposed to be a mark for dirtying THE WHOLE PAGE, but it really is not "the whole page when the 'set_page_dirty()' itself happens", but more of a "the next writepage() needs to write back the whole page" thing. And that's not what "__set_page_dirty_buffers()" really does. Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as dirty. That may _sound_ like what we want, but it really isn't. Because by the time "writepage()" is actually called (which can be MUCH MUCH later), some internal filesystem activity may actually have cleaned one or more of those buffers in the meantime, and now we call "writepage()" (which really wants to write them _all_), and it will write only part of them, or none at all. So the VM thought that since it did a "writepage()", all the dirty state at that point got written back. But it didn't - the filesystem could have written back part or all of the page much earlier, and the writepage() actually does nothing at all. Both filesystem and VM actually _think_ they do the right thing, because they simply have totally different expectations. The filesystem thinks that it should care about dirty buffers (that got marked dirty _after_ they were dirtied), while the filesystem thinks that it cares about dirty _pages_ (that got dirted at any time _before_ "writepage()" was called). Neither is really "wrong", per se, it's just that the two parts have different expectations, and the _combination_ just doesn't work. "set_page_dirty()" at some point meant "the writes have been done", but these days it really means something else. Now, the reason there is no trivial patch is not that this is conceptually really hard to fix. I can see several different approaches to fixing it, but they all really boil down to two alternatives: (a) splitting the one "PG_dirty" bit up into two bits: the "PG_writescheduled" bit and the "PG_alldirty" bit. The "PG_write_scheduled" bit would be the bit that the filesystem would set when it has pending dirty data that it wrote itself (and that may not cover the whole page), and is the part of PG_dirty that sets the PAGECACHE_TAG_DIRTY. It's also what forces "writepage()" to be called. The "PG_alldirty" bit is just an additional "somebody else dirtied random parts of this page, and we don't know what" flag, which is set by "set_page_dirty()" in addition to doing the PG_write_scheduled stuff. We would test-and-clear it at "writepage()" time, and pass it in to "writepages()" to tell the writepage() function that it can't just write out its own small limited notion of what is dirty. (There are various variations on this whole theme: instead of having a flag to "writepage()", we could split the "whole page" case out as a separate callback or similar) (b) making sure that all "set_page_dirty()" calls are _after_ the page
Re: [PATCH] mm: fix page_mkclean_one
On Fri, 29 Dec 2006, Segher Boessenkool wrote: > > > I think what might be happening is that pdflush writes them out fine, > > however we don't trap writes by the application _during_ that writeout. > > Yeah. I believe that more exactly it happens if the very last > write to the page causes a writeback (due to dirty balancing) > while another writeback for the page is already happening. > > As usual in these cases, I have zero proof. I actually have proof to the contrary, ie I have traces that say "the write was started" after the last write. And the VM layer in this area is actually fairly sane and civilized. It has a bit that says "writeback in progress", and if that bit is set, it simply _will_not_ start a new write. It even has various BUG_ON()'s to that effect. So everything I have ever seen says that the VM layer is actually doing everything right. > It's the do_wp_page -> balance_dirty_pages -> generic_writepages > path for sure. Maybe it's enough to change > > if (wbc->sync_mode != WB_SYNC_NONE) > wait_on_page_writeback(page); > > if (PageWriteback(page) || > !clear_page_dirty_for_io(page)) { > unlock_page(page); > continue; > } Notive how this one basically says: - if it's under writeback, don't even clear the page dirty flag. Your suggested change: > if (wbc->sync_mode != WB_SYNC_NONE) > wait_on_page_writeback(page); > > if (PageWriteback(page)) { > redirty_page_for_writepage(wbc, page); makes no sense, because we simply never _did_ the "clear_page_dirty()" if the thing was under writeback in the first place. That's how C conditionals work. So there's no reason to "redirty" it, because it wasn't cleaned in the first place. I've double- and triple-checked the dirty bits, including having traces that actually say that the IO was started (from a VM perspective) _after_ the last write was done. The IO just didn't hit the disk. I'm personally fairly convinced that it's not a VM issue, but a "IO issue". Either in a low-level filesystem or in some of the fs/buffer.c helper routines. But I'd love to be proven wrong. I do have a few interesting details from the trace I haven't really analyzed yet. Here's the trace for events on one of the pages that was corrupted. Note how the events are numbered (there were 171640 events total), so the thing you see is just a small set of events from the whole big trace, but it's the ones that talk about _that_ particular page. I've grouped them so hat "consecutive" events group together. That just means that no events on any other pages happened in between those events, and it is usually a sign that it's really one single call-chain that causes all the events. For example, for the first group of three events (44366-44368), it's the page fault that brings in the page, and since it's a write-fault, it will not only map the page, it will mark the page itself dirty and then also set the TAG_DIRTY on the mapping. So the "group" is just really a result of one single event happening, which causes several things to happen to that page. That's exactly what you'd expect. Anyway, here is the list of events that page went through: 44366 PG 0f6d: mm/memory.c:2254 mapping at b789fc54 (write) 44367 PG 0f6d: mm/page-writeback.c:817 setting dirty 44368 PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY 64231 PG 0f6d: mm/page-writeback.c:872 clean_for_io 64232 PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000 64233 PG 0f6d: mm/page-writeback.c:914 set writeback 64234 PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK 64235 PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY 67570 PG 0f6d: mm/page-writeback.c:891 end writeback 67571 PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK 76705 PG 0f6d: mm/page-writeback.c:817 setting dirty 76706 PG 0f6d: fs/buffer.c:725 dirtied buffers 76707 PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY 105267 PG 0f6d: mm/page-writeback.c:872 clean_for_io 105268 PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000 105269 PG 0f6d: mm/page-writeback.c:914 set writeback 105270 PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK 105271 PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY 105272 PG 0f6d: mm/page-writeback.c:891 end writeback 105273 PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK 128032 PG 0f6d: mm/memory.c:670 unmapped at b789f000 132662 PG 0f6d: mm/filemap.c:119 Removing page cache 148278 PG 0f6d: mm/memory.c:2254 mapping at b789f000 (read) 166326 PG 0f6d: mm/memory.c:670 unmapped at b789f000 171958 PG
Re: [PATCH] mm: fix page_mkclean_one
I think what might be happening is that pdflush writes them out fine, however we don't trap writes by the application _during_ that writeout. Yeah. I believe that more exactly it happens if the very last write to the page causes a writeback (due to dirty balancing) while another writeback for the page is already happening. As usual in these cases, I have zero proof. It's something that will only occur with writeback and MAP_SHARED writable access to the file pages. It's the do_wp_page -> balance_dirty_pages -> generic_writepages path for sure. Maybe it's enough to change if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || !clear_page_dirty_for_io(page)) { unlock_page(page); continue; } to if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page)) { redirty_page_for_writepage(wbc, page); unlock_page(page); continue; } if (!clear_page_dirty_for_io(page)) { unlock_page(page); continue; } or something along those lines. Completely untested of course :-) Segher - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
From: Linus Torvalds <[EMAIL PROTECTED]> Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST) > I get corruption - but the whole point is that it's very much pdflush that > should be writing these pages out. I think what might be happening is that pdflush writes them out fine, however we don't trap writes by the application _during_ that writeout. These corruptions look exactly as if: 1) pdflush begins writeback of page X 2) page goes to disk 3) application writes a chunk to the page 4) pdflush et al. think the page is clean, so it gets tossed, losing the writes done in #3 So there's a missing PTE change in there, so that we never get proper re-dirtying of the page if the application tries to write to the page during the writeback. It's something that will only occur with writeback and MAP_SHARED writable access to the file pages. That's why we never see this with normal filesystem writes, since those explicitly manage the page dirty state. I think the dirty balancing logic etc. isn't where the problems are, to me it's a PTE state update issue for sure. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 2006-12-28 at 11:45 -0800, Andrew Morton wrote: > On Thu, 28 Dec 2006 11:28:52 -0800 (PST) > Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > > > > > On Thu, 28 Dec 2006, Guillaume Chazarain wrote: > > > > > > The attached patch fixes the corruption for me. > > > > Well, that's a good hint, but it's really just a symptom. You effectively > > just made the test-program not even try to flush the data to disk, so the > > page cache would stay in memory, and you'd not see the corruption as well. > > > > So you basically disabled the code that tried to trigger the bug more > > easily. > > > > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much > > implicated in mm/page-writeback.c, and if there is a bug triggered by > > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also > > fails.. > > > > It would be interesting to convert your app to do fsync() before > FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well > (apart from pdflush activity). I did fdatasync(), tried remapping before unmapping... nogo here. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Andrew Morton wrote: > > It would be interesting to convert your app to do fsync() before > FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well > (apart from pdflush activity). I get corruption - but the whole point is that it's very much pdflush that should be writing these pages out. Andrew - give my test-program a try. It can run in about 1 minute if you have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems to very consistently cause corruption. What I do is: # Make sure we write back aggressively echo 5 > /proc/sys/vm/dirty_ratio as root, and then just run the thing. Tons of corruption. But the corruption goes away if I just leave the default dirty ratio alone (but then I can increse the file size to trigger it, of course - but that also makes the test run a lot slower). Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as easily (at least with the "fsync()"), but that's less to do with anything new, and probably just because then you simply won't have any pdflushing going on - since the kernel won't even notice that you have tons of dirty pages ;) It might also depend on the speed of your disk drive - the machine I test this on has a slow 4200 rpm laptop drive in it, and that probably makes things go south more easily. That's _especially_ true if this is related to any "bdi_write_congested()" logic. Now, it could also be related to various code snippets like ... if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || !clear_page_dirty_for_io(page)) { unlock_page(page); continue; } ... where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do the writeback at all (but it also won't clear the dirty bit, so it's certainly not an *OBVIOUS* bug). We also have code like this ("pageout()"): if (clear_page_dirty_for_io(page)) { int res; struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, .. } ... res = mapping->a_ops->writepage(page, &wbc); and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call won't do anything at all because of congestion, then that would be a _bad_ thing, and would certainly explain how something didn't get written out. But that particular path should only trigger for the "shrink_page_list()" case, and it's not the case I seem to be testing with my "low dirty_ratio" testing. Linus#include #include #include #include #include #include #include #define TARGETSIZE (22 << 20) #define CHUNKSIZE (1460) #define NRCHUNKS (TARGETSIZE / CHUNKSIZE) #define SIZE (NRCHUNKS * CHUNKSIZE) static void fillmem(void *start, int nr) { memset(start, nr, CHUNKSIZE); } #define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping)) static int chunkorder[NRCHUNKS]; static char *mapping; static int order(int nr) { int i; if (nr < 0 || nr >= NRCHUNKS) return -1; for (i = 0; i < NRCHUNKS; i++) if (chunkorder[i] == nr) return i; return -2; } static void checkmem(void *buf, int nr) { unsigned int start = ~0u, end = 0; unsigned char c = nr, *p = buf, differs = 0; int i; for (i = 0; i < CHUNKSIZE; i++) { unsigned char got = *p++; if (got != c) { if (i < start) start = i; if (i > end) end = i; differs = got; } } if (start < end) { printf("Chunk %d corrupted (%u-%u) (%x-%x)\n", nr, start, end, page_offset(buf, start), page_offset(buf, end)); printf("Expected %u, got %u\n", c, differs); printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1)); } } static char *remap(int fd, char *mapping) { if (mapping) { munmap(mapping, SIZE); // fsync(fd); posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED); } return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); } int main(int argc, char **argv) { int fd, i; /* * Make some random ordering of writing the chunks to the * memory map.. * * Start with fully ordered.. */ for (i = 0; i < NRCHUNKS; i++) chunkorder[i] = i; /* ..and then mix it up randomly */ srandom(time(NULL)); for (i = 0; i < NRCHUNKS; i++) { int index = (unsigned int) random() % NRCH
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006 11:28:52 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Thu, 28 Dec 2006, Guillaume Chazarain wrote: > > > > The attached patch fixes the corruption for me. > > Well, that's a good hint, but it's really just a symptom. You effectively > just made the test-program not even try to flush the data to disk, so the > page cache would stay in memory, and you'd not see the corruption as well. > > So you basically disabled the code that tried to trigger the bug more > easily. > > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much > implicated in mm/page-writeback.c, and if there is a bug triggered by > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also > fails.. > It would be interesting to convert your app to do fsync() before FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well (apart from pdflush activity). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Guillaume Chazarain wrote: > > The attached patch fixes the corruption for me. Well, that's a good hint, but it's really just a symptom. You effectively just made the test-program not even try to flush the data to disk, so the page cache would stay in memory, and you'd not see the corruption as well. So you basically disabled the code that tried to trigger the bug more easily. But the reason I say it's interesting is that "WB_SYNC_NONE" is very much implicated in mm/page-writeback.c, and if there is a bug triggered by WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also fails.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
Guillaume Chazarain a écrit : I get this kind of corruption: http://guichaz.free.fr/linux-bug/corruption.png Actually in qemu, I get three different behaviours: - no corruption at all : with linux-2.4 - corruption only on the first chunks: before [PATCH] mm: balance dirty pages as identified by Kenneth - corruption of all chunks: after the balance dirty pages patch Bisecting in linux-2.5 land I found http://kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm3/broken-out/fadvise-flush-data.patch to cause the corruption for me. The attached patch fixes the corruption for me. -- Guillaume diff -r 3859b1144d3a mm/fadvise.c --- a/mm/fadvise.c Sun Dec 24 05:00:03 2006 + +++ b/mm/fadvise.c Thu Dec 28 19:53:40 2006 +0100 @@ -96,9 +96,6 @@ asmlinkage long sys_fadvise64_64(int fd, case POSIX_FADV_NOREUSE: break; case POSIX_FADV_DONTNEED: - if (!bdi_write_congested(mapping->backing_dev_info)) - filemap_flush(mapping); - /* First and last FULL page! */ start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT; end_index = (endbyte >> PAGE_CACHE_SHIFT);
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Russell King wrote: > > Yup, but I have nothing to do with glibc because I refuse to do that > silly copyright assignment FSF thing. Hopefully someone else can > resolve it, but... Yeah, me too. > _is_ a fix whether _you_ like it or not to work around the issue so > people can at least run your test program. I'm not saying it's a > proper fix though. My point was that it wasn't a "fix", it's a "workaround". The _fix_ would be in glibc. Nothing more.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, Dec 28, 2006 at 09:27:12AM -0800, Linus Torvalds wrote: > On Thu, 28 Dec 2006, Russell King wrote: > > and if you look at glibc's memset() function, you'll notice that's exactly > > what you expect if you pass a non-8bit value to it. Ergo, what you're > > seeing is utterly expected given glibc's memset() implementation on ARM. > > Guys, you _really_ should fix memset(). What you describe is a _bug_. Yup, but I have nothing to do with glibc because I refuse to do that silly copyright assignment FSF thing. Hopefully someone else can resolve it, but... > > Fixing Linus' test program to pass nr & 255 to memset > > No. I'm almost certain that that is not a "fix", it's a workaround for a > serious bug in your glibc crap. _is_ a fix whether _you_ like it or not to work around the issue so people can at least run your test program. I'm not saying it's a proper fix though. Of course, if you prefer to be mislead by incorrect bug reports... -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Russell King wrote: > > and if you look at glibc's memset() function, you'll notice that's exactly > what you expect if you pass a non-8bit value to it. Ergo, what you're > seeing is utterly expected given glibc's memset() implementation on ARM. Guys, you _really_ should fix memset(). What you describe is a _bug_. "memset()" takes an "int" as its argument (always has), and has to convert it to a byte _itself_. It may not be common, but it's perfectly normal, to pass it values outside 0-255 (negative values that still fit in a "signed char" in particular are very normal, but my usage of "let the thing truncate it itself" is also quite fine). > Fixing Linus' test program to pass nr & 255 to memset No. I'm almost certain that that is not a "fix", it's a workaround for a serious bug in your glibc crap. But it does explain all the unexpected strange behaviour (and the really small writeback size - now it doesn't need any /proc/sys/vm/dirty_ratio assumptions to be explicable. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Zhang, Yanmin wrote: > > The test program is a process to write/read data. pdflush might write data > to disk asynchronously. After pdflush writes a page to disk, it will call > (either by > softirq) clear_page_dirty to clear the dirty bit after getting the interrupt > notification. That would indeed be a horrible bug. However, we don't do "clear_page_dirty()" _after_ the IO has completed, we do it _before_ the IO starts. If you can actually find a place that does clear_page_dirty _after_ IO, then yes, you've just found the bug. But I haven't found it. So the rule is _always_: - call "clear_page_dirty_for_io()" with the page lock held, and _before_ the IO starts. - do "set_page_writeback()" before unlocking the page again - do a "end_page_writeback()" when the IO actually finishes. and any code sequence that doesn't honor those rules would be buggy. A beer for anybody that finds it.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] mm: fix page_mkclean_one
On Wed, 27 Dec 2006, Chen, Kenneth W wrote: > > > > Running the test code, git bisect points its finger at this commit. > > Reverting > > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. > > > > [PATCH] mm: balance dirty pages > > > > Now that we can detect writers of shared mappings, throttle them. > > Avoids OOM > > by surprise. > > Oh, never mind :-( I just didn't create enough write out pressure when > test this. I just saw bug got triggered on a kernel I previously thought > was OK. Btw, this is an important point - people have long felt that the new page balancing in 2.6.19 was to blame, but you've just confirmed the long-held suspicion (at least by me) that it's not actually a new bug at all, it's just that the dirty page balancing causes writeback to happen _earlier_, and thus is better able to _show_ a bug that we've likely had for a long long time. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, 27 Dec 2006, Gordon Farquharson wrote: > > 100kB and 200kB files always succeed on the ARM system. 400kB and > larger always seem to fail. Oh, wow. Yeah, I've just repressed how tiny 32MB is. And especially if you lowered the /proc/sys/vm/dirty_ratio to a smaller percentage, I guess 400kB should be enough to cause writeback. Ugh. I tested a 128MB machine a few weeks ago, and found it painful. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu Dec 28 15:09 , Guillaume Chazarain sent: >I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/ >I have corruption with every Fedora release kernel except the first, that is >2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit >some corruption. Confirm that I see corruption with Fedora kernel 2.6.18-1.2239.fc5: ... Chunk 142969 corrupted (0-1459) (2580-4039) Expected 121, got 0 Written as (89632)127652(124721) Chunk 142976 corrupted (0-1459) (512-1971) Expected 128, got 0 Written as (121734)128324(108589) Checking chunk 143639/143640 (99%) $ uname -a Linux 2.6.18-1.2239.fc5 #1 Fri Nov 10 13:04:06 EST 2006 i686 athlon i386 GNU/Linux /Martin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-28 07:15]: > Thanks for the fix, Russell. > > I can now trigger the (real) problem by using a 25 MB file (100 << 18) > and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM). Me too (using 100 << 18). Interestingly, I don't seem to get any corruption on a different ARM board, an IOP32x based machine with 128 MB RAM. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Re: [PATCH] mm: fix page_mkclean_one
I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/ I have corruption with every Fedora release kernel except the first, that is 2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit some corruption. Command line to test: qemu root_fs -snapshot -kernel FC-kernels/FC2-vmlinuz-2.6.5-1.358 -append 'rw root=/dev/hda' I get this kind of corruption: http://guichaz.free.fr/linux-bug/corruption.png -- Guillaume - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Russell King <[EMAIL PROTECTED]> [2006-12-28 10:49]: > > By the way, I just tried it with TARGETSIZE (100 << 12) on a different > > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of > > memory) and I see similar results to that from Gordon: > > Work around the glibc memset() problem by passing nr & 255, and re-run > the test. You're getting false positives. Yeah, I saw your message about this problem after I sent mine. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On 12/28/06, Russell King <[EMAIL PROTECTED]> wrote: Fixing Linus' test program to pass nr & 255 to memset results in clean passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM machines (as would be expected.) Thanks for the fix, Russell. I can now trigger the (real) problem by using a 25 MB file (100 << 18) and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM). $ ./linus-test Writing chunk 17954/17955 (99%) Chunk 514 corrupted (0-1459) (872-2331) Expected 2, got 0 Written as (8479)11160(10312) Chunk 516 corrupted (0-303) (3792-4095) Expected 4, got 0 Written as (10312)10569(4426) Chunk 959 corrupted (0-691) (3404-4095) Expected 191, got 0 Written as (687)4881(1522) Chunk 1895 corrupted (0-1459) (1900-3359) Expected 103, got 0 Written as (7746)8389(6231) Chunk 2702 corrupted (0-1459) (472-1931) Expected 142, got 0 Written as (4866)7103(2409) Chunk 3314 corrupted (0-1459) (1064-2523) Expected 242, got 0 Written as (4287)7064(1730) Chunk 4043 corrupted (0-1459) (444-1903) Expected 203, got 0 Written as (6495)8509(4464) Chunk 5180 corrupted (0-1459) (1584-3043) Expected 60, got 0 Written as (11056)12826(10797) Chunk 5672 corrupted (0-991) (3104-4095) Expected 40, got 0 Written as (9944)4872(41) Chunk 5793 corrupted (460-1459) (0-999) Expected 161, got 0 Written as (7059)5038(4377) Chunk 6089 corrupted (0-1459) (1620-3079) Expected 201, got 0 Written as (4672)5230(4403) Chunk 6545 corrupted (268-1459) (0-1191) Expected 145, got 0 Written as (3701)5969(4668) Chunk 7578 corrupted (0-1459) (584-2043) Expected 154, got 0 Written as (10015)5082(1648) Chunk 7880 corrupted (864-1459) (0-595) Expected 200, got 0 Written as (17869)5064(4745) Chunk 8086 corrupted (0-1459) (888-2347) Expected 150, got 0 Written as (10206)11050(10374) Chunk 8749 corrupted (0-1459) (2212-3671) Expected 45, got 0 Written as (15263)7132(4825) Chunk 9068 corrupted (0-1459) (1008-2467) Expected 108, got 0 Written as (5557)7571(6771) Chunk 9193 corrupted (812-1459) (0-647) Expected 233, got 0 Written as (9238)7277(4757) Chunk 10032 corrupted (576-1459) (0-883) Expected 48, got 0 Written as (15741)10012(1753) Chunk 10056 corrupted (0-1459) (1696-3155) Expected 72, got 0 Written as (5379)7431(262) Chunk 10395 corrupted (0-1459) (1020-2479) Expected 155, got 0 Written as (21)7442(5902) Chunk 10791 corrupted (0-1459) (1644-3103) Expected 39, got 0 Written as (4753)5925(5926) Chunk 10792 corrupted (0-991) (3104-4095) Expected 40, got 0 Written as (5925)5926(8555) Chunk 11036 corrupted (0-1103) (2992-4095) Expected 28, got 0 Written as (13755)14449(7458) Chunk 11387 corrupted (644-1459) (0-815) Expected 123, got 0 Written as (10853)11459(9445) Chunk 11586 corrupted (920-1459) (0-539) Expected 66, got 0 Written as (3769)11691(11123) Chunk 11882 corrupted (0-1459) (1160-2619) Expected 106, got 0 Written as (10736)11696(2788) Chunk 12397 corrupted (0-603) (3492-4095) Expected 109, got 0 Written as (2352)7515(2437) Chunk 12669 corrupted (0-795) (3300-4095) Expected 125, got 0 Written as (1191)7661(5266) Chunk 13162 corrupted (0-1459) (2184-3643) Expected 106, got 0 Written as (9383)13662(11544) Chunk 14653 corrupted (0-27) (4068-4095) Expected 61, got 0 Written as (8100)9456(1275) Chunk 17332 corrupted (0-367) (3728-4095) Expected 180, got 0 Written as (760)12247(1244) Chunk 17445 corrupted (0-1459) (772-2231) Expected 37, got 0 Written as (8007)16481(14439) Chunk 17556 corrupted (0-1007) (3088-4095) Expected 148, got 0 Written as (10113)10657(10477) Chunk 17859 corrupted (0-995) (3100-4095) Expected 195, got 0 Written as (14472)14767(11426) Checking chunk 17954/17955 (99%) Gordon -- Gordon Farquharson - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote: > [ Modified test-program that tells you where the corruption happens (and > when the missing parts were supposed to be written out) appended, in > case people care. ] Hi 2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots between them but they hung before shell (2.6.18-git11, 2.6.18-git16, 2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken. (UP, no preempt) -Petri - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, Dec 28, 2006 at 11:16:59AM +0100, Martin Michlmayr wrote: > * Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]: > > >> #define TARGETSIZE (100 << 12) > > > > > >That's just 400kB! > > > > > >There's no way you should see corruption with that kind of value. It > > >should all stay solidly in the cache. > > > > > >Is this perhaps with ARM nommu or something else strange? It may be that > > >the program just doesn't work at all if mmap() is faked out with a malloc > > >or similar. > > > > Definitely a question for the ARM gurus. I'm out of my depth. > > By the way, I just tried it with TARGETSIZE (100 << 12) on a different > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of > memory) and I see similar results to that from Gordon: Work around the glibc memset() problem by passing nr & 255, and re-run the test. You're getting false positives. -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]: > >> #define TARGETSIZE (100 << 12) > > > >That's just 400kB! > > > >There's no way you should see corruption with that kind of value. It > >should all stay solidly in the cache. > > > >Is this perhaps with ARM nommu or something else strange? It may be that > >the program just doesn't work at all if mmap() is faked out with a malloc > >or similar. > > Definitely a question for the ARM gurus. I'm out of my depth. By the way, I just tried it with TARGETSIZE (100 << 12) on a different ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of memory) and I see similar results to that from Gordon: Writing chunk 279/280 (99%) Chunk 256 corrupted (1-1455) (1025-2479) Expected 0, got 1 Written as (199)43(184) Chunk 258 corrupted (1-1455) (3945-1303) Expected 2, got 3 Written as (184)74(145) Chunk 260 corrupted (1-1455) (2769-127) Expected 4, got 5 Written as (145)89(237) Chunk 262 corrupted (1-1455) (1593-3047) Expected 6, got 7 Written as (237)168(174) Chunk 264 corrupted (1-1455) (417-1871) Expected 8, got 9 Written as (174)135(161) Chunk 266 corrupted (1-1455) (3337-695) Expected 10, got 11 Written as (161)123(180) Chunk 268 corrupted (1-1455) (2161-3615) Expected 12, got 13 Written as (180)13(19) Chunk 270 corrupted (1-1455) (985-2439) Expected 14, got 15 Written as (19)172(106) Chunk 272 corrupted (1-1455) (3905-1263) Expected 16, got 17 Written as (106)212(140) Chunk 274 corrupted (1-1455) (2729-87) Expected 18, got 19 Written as (140)124(73) Chunk 276 corrupted (1-1455) (1553-3007) Expected 20, got 21 Written as (73)151(52) Chunk 278 corrupted (1-1455) (377-1831) Expected 22, got 23 Written as (52)215(99) Checking chunk 279/280 (99%) -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote: > I have run the program a few times, and the output is pretty > consistent. However, when I increase the target size, the difference > between the expected and actual values is larger. > > Written as (749)935(738) > Chunk 1113 corrupted (1-1455) (2965-323) > Expected 89, got 93 This is not the corruption Linus is after. Note that the corruption starts at offset '1'. Also note that: 89 = 1113 & 255 93 = 1113 & 255 | (1113 >> 8) and if you look at glibc's memset() function, you'll notice that's exactly what you expect if you pass a non-8bit value to it. Ergo, what you're seeing is utterly expected given glibc's memset() implementation on ARM. Fixing Linus' test program to pass nr & 255 to memset results in clean passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM machines (as would be expected.) -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]: > >That's just 400kB! > > > >There's no way you should see corruption with that kind of value. It > >should all stay solidly in the cache. > > > >Is this perhaps with ARM nommu or something else strange? It may be that > >the program just doesn't work at all if mmap() is faked out with a malloc > >or similar. > > Definitely a question for the ARM gurus. I'm out of my depth. The CPU has a MMU. For reference, it's a IXP4xx based device with 32 MB of memory. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote: > > On Wed, 27 Dec 2006, David Miller wrote: > > > > > > I still don't see _why_, though. But maybe smarter people than me can see > > > it.. > > > > FWIW this program definitely triggers the bug for me. > > Ok, now that I have something simple to do repeatable stuff with, I can > say what the pattern is.. It's not all that surprising, but it's still > worth just stating for the record. > > What happens is that when I do the "packetized writes" in random order, > the _last_ write to a page occasionally just goes missing. It's not always > at the end of a page, as shown by for example: > > - A whole chunk got dropped: > > Chunk 2094 corrupted (0-1459) (1624-3083) > Expected 46, got 0 > Written as (30912)55414(1) > >That "Written as (x)y(z)" line means that the corrupted chunk was >written as chunk #y, and the preceding and following chunks (that were >_not_ corrupt) on the page was written as #x and #z respectively. > >In other words, the missing chunk (which is still zero) was written >much later than the ones that were ok, and never hit the disk. It's a >contiguous chunk in the middle of the page (chunks are 1460 bytes in >size) > >The first line means that all bytes of the chunk (0-1459) were >corrupted, and the values in parenthesis are the offsets within a page. >In other words, this was a chunk in the _middle_ of a page. > > - The missing data can also be at the beginning or ends of pages: > >Beginning of the chunk missing, it was at the end of a page (page >offsets 3288-4095) and the _next_ page got written out fine: > > Chunk 2126 corrupted (0-807) (3288-4095) > Expected 78, got 0 > Written as (32713)55573(14301) > >End of a chunk missing, it was the beginning of a page (and the >_previous_ page that contained the beginning of the chunk was written >out fine) > > Chunk 2179 corrupted (1252-1459) (0-207) > Expected 131, got 0 > Written as (45189)55489(15515) > > Now, the reason I say this isn't surprising is that this is entirely > consistent with the dirty bit being dropped on the floor somewhere, and > likely through some interaction with the previous changes being in the > process of being written out. > > Something (incorrectly) ends up deciding that it doesn't need to write the > page, since it's already written, or alternatively clears the dirty bit > too late (clears it because an _earlier_ write finished, never mind that > the new dirty data didn't make it). There might be a narrow race between set_page_dirty and clear_page_dirty. The test program is a process to write/read data. pdflush might write data to disk asynchronously. After pdflush writes a page to disk, it will call (either by softirq) clear_page_dirty to clear the dirty bit after getting the interrupt notification. But just after the page is written to disk and before the interrupt reports the result, the test process might change the data and unmap the area. When the area is unmapped, the page is marked as dirty again, but just after that, the interrupt arrives and the dirty bit is cleared, so the late data will not be written to disk. Function zap_pte_range checks pte to set page dirty if needed, but it doesn't hold page lock. If the page lock is held before set page dirty, the race might be avoided. Yanmin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
From: "Chen, Kenneth W" <[EMAIL PROTECTED]> Date: Wed, 27 Dec 2006 22:10:52 -0800 > Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM > > Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM > > > On Wed, 27 Dec 2006, David Miller wrote: > > > > > > > > > > I still don't see _why_, though. But maybe smarter people than me can > > > > > see > > > > > it.. > > > > > > > > FWIW this program definitely triggers the bug for me. > > > > > > Ok, now that I have something simple to do repeatable stuff with, I can > > > say what the pattern is.. It's not all that surprising, but it's still > > > worth just stating for the record. > > > > > > Running the test code, git bisect points its finger at this commit. > > Reverting > > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. > > > > edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit > > commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f > > Author: Peter Zijlstra <[EMAIL PROTECTED]> > > Date: Mon Sep 25 23:30:58 2006 -0700 > > > > [PATCH] mm: balance dirty pages > > > > Now that we can detect writers of shared mappings, throttle them. > > Avoids OOM > > by surprise. > > > Oh, never mind :-( I just didn't create enough write out pressure when > test this. I just saw bug got triggered on a kernel I previously thought > was OK. Besides, I'm pretty sure that from the Debian bug entry it's been established that the dirty-page tracking changes from a few releases ago introduced this problem. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] mm: fix page_mkclean_one
Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM > Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM > > On Wed, 27 Dec 2006, David Miller wrote: > > > > > > > > I still don't see _why_, though. But maybe smarter people than me can > > > > see > > > > it.. > > > > > > FWIW this program definitely triggers the bug for me. > > > > Ok, now that I have something simple to do repeatable stuff with, I can > > say what the pattern is.. It's not all that surprising, but it's still > > worth just stating for the record. > > > Running the test code, git bisect points its finger at this commit. Reverting > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. > > edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit > commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f > Author: Peter Zijlstra <[EMAIL PROTECTED]> > Date: Mon Sep 25 23:30:58 2006 -0700 > > [PATCH] mm: balance dirty pages > > Now that we can detect writers of shared mappings, throttle them. Avoids > OOM > by surprise. Oh, never mind :-( I just didn't create enough write out pressure when test this. I just saw bug got triggered on a kernel I previously thought was OK. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: That's just 400kB! There's no way you should see corruption with that kind of value. It should all stay solidly in the cache. 100kB and 200kB files always succeed on the ARM system. 400kB and larger always seem to fail. Does the following help interpret the results on ARM at all ? $ free total used free sharedbuffers cached Mem: 3 23620 6380 0808 15676 -/+ buffers/cache: 7136 22864 Swap:88316 3664 84652 Gordon -- Gordon Farquharson - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] mm: fix page_mkclean_one
Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM > On Wed, 27 Dec 2006, David Miller wrote: > > > > > > I still don't see _why_, though. But maybe smarter people than me can see > > > it.. > > > > FWIW this program definitely triggers the bug for me. > > Ok, now that I have something simple to do repeatable stuff with, I can > say what the pattern is.. It's not all that surprising, but it's still > worth just stating for the record. Running the test code, git bisect points its finger at this commit. Reverting this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f Author: Peter Zijlstra <[EMAIL PROTECTED]> Date: Mon Sep 25 23:30:58 2006 -0700 [PATCH] mm: balance dirty pages Now that we can detect writers of shared mappings, throttle them. Avoids OOM by surprise. Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]> Cc: Hugh Dickins <[EMAIL PROTECTED]> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
Hi David On 12/27/06, David Miller <[EMAIL PROTECTED]> wrote: Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this. That works for me. Thanks for the tip. Gordon -- Gordon Farquharson - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
From: "Gordon Farquharson" <[EMAIL PROTECTED]> Date: Wed, 27 Dec 2006 22:20:20 -0700 > and for some reason I get > > linus-test.c: In function 'remap': > linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in > this function) > > when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4 > as defined in /usr/include/bits/fcntl.h. Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this. Perhaps Linus's GCC sets that by default and our's doesn't. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: On Wed, 27 Dec 2006, Gordon Farquharson wrote: > > I don't think so. I did reduce the target size > > #define TARGETSIZE (100 << 12) That's just 400kB! There's no way you should see corruption with that kind of value. It should all stay solidly in the cache. Is this perhaps with ARM nommu or something else strange? It may be that the program just doesn't work at all if mmap() is faked out with a malloc or similar. Definitely a question for the ARM gurus. I'm out of my depth. Gordon -- Gordon Farquharson - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
[Oops - forgot to hit "Reply to All" first time round.] Hi Linus On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: For all I know, my test-program is buggy wrt the ordering printouts, though. Did you perhaps change the logic in any way? I don't think so. I did reduce the target size #define TARGETSIZE (100 << 12) to make the program finish a little quicker, and for some reason I get linus-test.c: In function 'remap': linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in this function) when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4 as defined in /usr/include/bits/fcntl.h. Other than these two changes, the program is identical to the version you posted. I have run the program a few times, and the output is pretty consistent. However, when I increase the target size, the difference between the expected and actual values is larger. Written as (749)935(738) Chunk 1113 corrupted (1-1455) (2965-323) Expected 89, got 93 Written as (935)738(538) Chunk 1114 corrupted (1-1455) (329-1783) Expected 90, got 94 Written as (738)538(678) Chunk 1115 corrupted (1-1455) (1789-3243) Expected 91, got 95 Written as (538)678(989) Chunk 1120 corrupted (1-1455) (897-2351) Expected 96, got 100 Written as (537)265(1005) Chunk 1121 corrupted (1-1455) (2357-3811) Expected 97, got 101 Written as (265)1005(-1) --- linus-test.c.orig 2006-12-28 06:17:24.0 +0100 +++ linus-test.c2006-12-28 06:18:24.0 +0100 @@ -6,7 +6,7 @@ #include #include -#define TARGETSIZE (100 << 20) +#define TARGETSIZE (100 << 14) #define CHUNKSIZE (1460) #define NRCHUNKS (TARGETSIZE / CHUNKSIZE) #define SIZE (NRCHUNKS * CHUNKSIZE) @@ -61,7 +61,7 @@ { if (mapping) { munmap(mapping, SIZE); -posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED); +posix_fadvise(fd, 0, SIZE, 4); } return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); Gordon -- Gordon Farquharson - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, 27 Dec 2006, Gordon Farquharson wrote: > > It is at all suprising that the second offset within a page can be > less than the first offset within a page ? e.g. > > Chunk 260 corrupted (1-1455) (2769-127) No, that just means that it went over to the next page (so you actually had two consecutive pages that weren't written out). That said, your output is very different from mine in another way. You don't have zeroes in your pages, rather the thing seems to have data from the next block (ie the chunk that should have 20 is reported as having 21 etc). You also have your offsets shifted up by one (ie offset 0 looks ok for you, and then you have a strange pattern of corruption at bytes 1...1455 instead of 0..1459. You also seem to have an example of the _earlier_ writes being corrupted, rather than the later ones. For example (but it's also a page-crosser, so maybe that's part of it): Chunk 274 corrupted (1-1455) (2729-87) Expected 18, got 19 Written as (154)11(85) says that block chunk 274 is the corrupt one, but it was written fairly early as #11, and the blocks around it (chunks 273 and 275) were actually written later. For all I know, my test-program is buggy wrt the ordering printouts, though. Did you perhaps change the logic in any way? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: [ Modified test-program that tells you where the corruption happens (and when the missing parts were supposed to be written out) appended, in case people care. ] For the record, this is the output from a run on our ARM machine (32 MB RAM) with 2.6.18 + the following patches: mm: tracking shared dirty pages mm: balance dirty pages mm: optimize the new mprotect() code a bit mm: small cleanup of install_page() mm: fixup do_wp_page() mm: msync() cleanup It is at all suprising that the second offset within a page can be less than the first offset within a page ? e.g. Chunk 260 corrupted (1-1455) (2769-127) $ ./linus-test Writing chunk 279/280 (99%) Chunk 256 corrupted (1-1455) (1025-2479) Expected 0, got 1 Written as (82)175(56) Chunk 258 corrupted (1-1455) (3945-1303) Expected 2, got 3 Written as (56)51(20) Chunk 260 corrupted (1-1455) (2769-127) Expected 4, got 5 Written as (20)30(18) Chunk 262 corrupted (1-1455) (1593-3047) Expected 6, got 7 Written as (18)196(158) Chunk 264 corrupted (1-1455) (417-1871) Expected 8, got 9 Written as (158)133(146) Chunk 266 corrupted (1-1455) (3337-695) Expected 10, got 11 Written as (146)43(77) Chunk 268 corrupted (1-1455) (2161-3615) Expected 12, got 13 Written as (77)251(211) Chunk 270 corrupted (1-1455) (985-2439) Expected 14, got 15 Written as (211)257(231) Chunk 272 corrupted (1-1455) (3905-1263) Expected 16, got 17 Written as (231)254(154) Chunk 274 corrupted (1-1455) (2729-87) Expected 18, got 19 Written as (154)11(85) Chunk 276 corrupted (1-1455) (1553-3007) Expected 20, got 21 Written as (85)230(134) Chunk 278 corrupted (1-1455) (377-1831) Expected 22, got 23 Written as (134)233(103) Checking chunk 279/280 (99%) Gordon -- Gordon Farquharson - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, 27 Dec 2006, David Miller wrote: > > > > I still don't see _why_, though. But maybe smarter people than me can see > > it.. > > FWIW this program definitely triggers the bug for me. Ok, now that I have something simple to do repeatable stuff with, I can say what the pattern is.. It's not all that surprising, but it's still worth just stating for the record. What happens is that when I do the "packetized writes" in random order, the _last_ write to a page occasionally just goes missing. It's not always at the end of a page, as shown by for example: - A whole chunk got dropped: Chunk 2094 corrupted (0-1459) (1624-3083) Expected 46, got 0 Written as (30912)55414(1) That "Written as (x)y(z)" line means that the corrupted chunk was written as chunk #y, and the preceding and following chunks (that were _not_ corrupt) on the page was written as #x and #z respectively. In other words, the missing chunk (which is still zero) was written much later than the ones that were ok, and never hit the disk. It's a contiguous chunk in the middle of the page (chunks are 1460 bytes in size) The first line means that all bytes of the chunk (0-1459) were corrupted, and the values in parenthesis are the offsets within a page. In other words, this was a chunk in the _middle_ of a page. - The missing data can also be at the beginning or ends of pages: Beginning of the chunk missing, it was at the end of a page (page offsets 3288-4095) and the _next_ page got written out fine: Chunk 2126 corrupted (0-807) (3288-4095) Expected 78, got 0 Written as (32713)55573(14301) End of a chunk missing, it was the beginning of a page (and the _previous_ page that contained the beginning of the chunk was written out fine) Chunk 2179 corrupted (1252-1459) (0-207) Expected 131, got 0 Written as (45189)55489(15515) Now, the reason I say this isn't surprising is that this is entirely consistent with the dirty bit being dropped on the floor somewhere, and likely through some interaction with the previous changes being in the process of being written out. Something (incorrectly) ends up deciding that it doesn't need to write the page, since it's already written, or alternatively clears the dirty bit too late (clears it because an _earlier_ write finished, never mind that the new dirty data didn't make it). I also figured out that it's not the low-memory situation that does it, it really must be the "page_mkclean()" triggering. Becuase I can do echo 5 > /proc/sys/vm/dirty_ratio echo 3 > /proc/sys/vm/dirty_background_ratio to make it clean the pages much more aggressively than the default, and I can see corruption on my 256MB machine with just a 40MB shared file, and 70MB of memory consistently free. So this thing is definitely giving some answers. It's NOT about low memory, and it very much seems to be about the whole "balance_dirty_ratio" thing. I don't think I triggered the actual low-memory stuff once in that situation.. So I have some more data on the behaviour, but I _still_ don't see the reason behind it. It's probably something really obvious once it's pointed out.. [ Modified test-program that tells you where the corruption happens (and when the missing parts were supposed to be written out) appended, in case people care. ] Linus --- #include #include #include #include #include #include #include #define TARGETSIZE (100 << 20) #define CHUNKSIZE (1460) #define NRCHUNKS (TARGETSIZE / CHUNKSIZE) #define SIZE (NRCHUNKS * CHUNKSIZE) static void fillmem(void *start, int nr) { memset(start, nr, CHUNKSIZE); } #define page_offset(buf, off) (0xfff & ((unsigned)(unsigned long)(buf)+(off))) static int chunkorder[NRCHUNKS]; static int order(int nr) { int i; if (nr < 0 || nr >= NRCHUNKS) return -1; for (i = 0; i < NRCHUNKS; i++) if (chunkorder[i] == nr) return i; return -2; } static void checkmem(void *buf, int nr) { unsigned int start = ~0u, end = 0; unsigned char c = nr, *p = buf, differs = 0; int i; for (i = 0; i < CHUNKSIZE; i++) { unsigned char got = *p++; if (got != c) { if (i < start) start = i; if (i > end) end = i; differs = got; } } if (start < end) { printf("Chunk %d corrupted (%u-%u) (%u-%u)\n", nr, start, end, page_offset(buf, start), page_offset(buf, end)); printf("Expected %u, got %u\n", c, differs); printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1)); } } static char
Re: [PATCH] mm: fix page_mkclean_one
From: Linus Torvalds <[EMAIL PROTECTED]> Date: Wed, 27 Dec 2006 16:39:43 -0800 (PST) > > > On Wed, 27 Dec 2006, Linus Torvalds wrote: > > > > I think the test-case could probably be improved by having a munmap() and > > page-cache flush in between the writing and the checking, to see whether > > that shows the corruption easier (and possibly without having to start > > paging in order to throw the pages out, which would simplify testing a > > lot). > > I think the page-writeout is implicated, because I do seem to need it, but > the page-cache flush does seem to make corruption _easier_ to see. I now > seem about to trigger it with a 100MB file on a 256MB machine in a minute > or so, with this slight modification. > > I still don't see _why_, though. But maybe smarter people than me can see > it.. FWIW this program definitely triggers the bug for me. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
From: Linus Torvalds <[EMAIL PROTECTED]> Date: Wed, 27 Dec 2006 16:42:40 -0800 (PST) > That's fine. In that situation, you shouldn't need any atomic ops at all, > I think all our sw page-table operations are already done under the pte > lock. This is true, but there is one subtlety to this I want to point out in passing. That lock can possibly only protect a page of PTEs. When NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS, the locking is done per page of PTEs, not for all of the page tables of an address space at once. What this means is that it's really difficult to forcefully block out all page table operations for a given mm, and I actually needed to do something like this on sparc64 (when growing the TLB lookup hash table, you can't let any PTEs change state while the table is changing). For my case, I added a spinlock to mm->context since actually what I need is to block modifications to the hash table itself during PTE changes. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 28 Dec 2006, Martin Schwidefsky wrote: > > For s390 there are two aspects to consider: > 1) the pte values are 100% software controlled. That's fine. In that situation, you shouldn't need any atomic ops at all, I think all our sw page-table operations are already done under the pte lock. The reason x86 needs to be careful is exactly the fact that the hardware will obviously do a lot on its own, and the hardware is _not_ going to honor our page table locking ;) In an all-sw situation, a lot of this should be easier. S390 has _other_ things that are inconvenient (the strange "dirty bit is not in the page tables" thing that makes it look different from everybody else), but hey, it's a balance.. So for s390, ptep_exchange() in my example should be able to be a simple "load old value and store new one", assuming everybody honors the pte lock (and they _should_). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, 27 Dec 2006, Linus Torvalds wrote: > > I think the test-case could probably be improved by having a munmap() and > page-cache flush in between the writing and the checking, to see whether > that shows the corruption easier (and possibly without having to start > paging in order to throw the pages out, which would simplify testing a > lot). I think the page-writeout is implicated, because I do seem to need it, but the page-cache flush does seem to make corruption _easier_ to see. I now seem about to trigger it with a 100MB file on a 256MB machine in a minute or so, with this slight modification. I still don't see _why_, though. But maybe smarter people than me can see it.. Linus --- #include #include #include #include #include #include #include #define TARGETSIZE (100 << 20) #define CHUNKSIZE (1460) #define NRCHUNKS (TARGETSIZE / CHUNKSIZE) #define SIZE (NRCHUNKS * CHUNKSIZE) static void fillmem(void *start, int nr) { memset(start, nr, CHUNKSIZE); } static void checkmem(void *start, int nr) { unsigned char c = nr, *p = start; int i; for (i = 0; i < CHUNKSIZE; i++) { if (*p++ != c) { printf("Chunk %d corrupted \n", nr); return; } } } static char *remap(int fd, char *mapping) { if (mapping) { munmap(mapping, SIZE); posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED); } return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); } int main(int argc, char **argv) { char *mapping; int fd, i; static int chunkorder[NRCHUNKS]; /* * Make some random ordering of writing the chunks to the * memory map.. * * Start with fully ordered.. */ for (i = 0; i < NRCHUNKS; i++) chunkorder[i] = i; /* ..and then mix it up randomly */ srandom(time(NULL)); for (i = 0; i < NRCHUNKS; i++) { int index = (unsigned int) random() % NRCHUNKS; int nr = chunkorder[index]; chunkorder[index] = chunkorder[i]; chunkorder[i] = nr; } fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666); if (fd < 0) return -1; if (ftruncate(fd, SIZE) < 0) return -1; mapping = remap(fd, NULL); if (-1 == (int)(long)mapping) return -1; for (i = 0; i < NRCHUNKS; i++) { int chunk = chunkorder[i]; printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS); fillmem(mapping + chunk * CHUNKSIZE, chunk); } printf("\n"); /* Unmap, drop, and remap.. */ mapping = remap(fd, mapping); /* .. and check */ for (i = 0; i < NRCHUNKS; i++) { int chunk = i; printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS); checkmem(mapping + chunk * CHUNKSIZE, chunk); } printf("\n"); return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Tue, 26 Dec 2006, David Miller wrote: > > I've seen it on sparc64, UP kernel, no preempt. Ok, I still don't have a clue, but I think I at least have a new test-case. It can probably be improved upon, but this would _seem_ to trigger the problem. Can people check? You'd want to make sure you get page-put activity, by making TARGETSIZE be big enough to cause memory pressure (and rather than making it bigger, you might want to make your memory smaller instead, to make it run more quickly. Either using "mem=128M" or big compiles or something...). If it finds corruption, you'll see something like Writing chunk 183858/183859 (99%) Chunk .. Chunk 120887 corrupted Chunk 122372 corrupted Chunk ... Checking chunk 183858/183859 (99%) otherwise it will just say Writing chunk 183858/183859 (99%) Checking chunk 183858/183859 (99%) and exit. I didn't spend a lot of time verifying this, but I _was_ able to cause those "Chunk xxx corrupted" messages with this. There's probably a more efficient better way to do it, but this is better than trying to use rtorrent, and also makes any worries about what rtorrent does go away. Of course, maybe it's this test-program that is buggy now, although it looks trivial enough that I don't think it is. I think my earlier stress-tester may not have triggered this, because it just did all its writing in a linear order, so any LRU logic will happen to write back old pages that we are no longer touching. The randomization (and using a chunksize that isn't a multiple of a page-size) makes sure that we're actually going to have lots of rewriting going on. I think the test-case could probably be improved by having a munmap() and page-cache flush in between the writing and the checking, to see whether that shows the corruption easier (and possibly without having to start paging in order to throw the pages out, which would simplify testing a lot). But I haven't tested. I decided to post this asap, now that I've recreated the corruption with something else, and something that is possibly easier to analyze.. Linus #include #include #include #include #include #include #include #define TARGETSIZE (256 << 20) #define CHUNKSIZE (1460) #define NRCHUNKS (TARGETSIZE / CHUNKSIZE) #define SIZE (NRCHUNKS * CHUNKSIZE) static void fillmem(void *start, int nr) { memset(start, nr, CHUNKSIZE); } static void checkmem(void *start, int nr) { unsigned char c = nr, *p = start; int i; for (i = 0; i < CHUNKSIZE; i++) { if (*p++ != c) { printf("Chunk %d corrupted \n", nr); return; } } } int main(int argc, char **argv) { char *mapping; int fd, i; static int chunkorder[NRCHUNKS]; /* * Make some random ordering of writing the chunks to the * memory map.. * * Start with fully ordered.. */ for (i = 0; i < NRCHUNKS; i++) chunkorder[i] = i; /* ..and then mix it up randomly */ srandom(time(NULL)); for (i = 0; i < NRCHUNKS; i++) { int index = (unsigned int) random() % NRCHUNKS; int nr = chunkorder[index]; chunkorder[index] = chunkorder[i]; chunkorder[i] = nr; } fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666); if (fd < 0) return -1; if (ftruncate(fd, SIZE) < 0) return -1; mapping = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (-1 == (int)(long)mapping) return -1; for (i = 0; i < NRCHUNKS; i++) { int chunk = chunkorder[i]; printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS); fillmem(mapping + chunk * CHUNKSIZE, chunk); } printf("\n"); for (i = 0; i < NRCHUNKS; i++) { int chunk = i; printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS); checkmem(mapping + chunk * CHUNKSIZE, chunk); } printf("\n"); return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Thu, 2006-12-21 at 12:01 -0800, Linus Torvalds wrote: > What do you guys think? Does something like this work out for S/390 too? I > tried to make that "ptep_flush_dirty()" concept work for architectures > that hide the dirty bit somewhere else too, but.. For s390 there are two aspects to consider: 1) the pte values are 100% software controlled. They only change because a cpu stored a value to it or issued one of the specialized instructions (csp, ipte and idte). The ptep_flush_dirty would be a nop for s390. 2) ptep_exchange is a bit dangerous. For s390 we need a lock that protects the software controlled updates of the ptes. The reason is the ipte instruction. It is implemented by the machine microcode in a non-atomic way in regard to the memory. It reads the byte of the pte that contains the invalid bit, flushes the tlb entries for it and then writes back the byte with the invalid bit set. The microcode makes sure that this pte cannot be used for form a new tlb on any cpu while the ipte is in progress. That means a compare-and-swap semantics on ptes won't work together with the ipte optimization. As long as there is the pte lock that protects all software accesses to the pte we are fine. But if any code expects that ptep_exchange does something like an xchg things break. -- blue skies, Martin. Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH "Reality continues to ruin my life." - Calvin. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/27/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: I do get this error on reiserfs ( old one, didn't try on reiser4 ). Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the debian bts. I've had reports of corrupted data on earlier kernel releases with reiserfs3, which were fixed by upgrading to reiserfs4. Jari Sundell - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Tue, Dec 26, 2006 at 11:26:50AM -0800, Linus Torvalds wrote: > What would also actually be interesting is whether somebody can reproduce > this on Reiserfs, for example. I _think_ all the reports I've seen are on > ext2 or ext3, and if this is somehow writeback-related, it could be some > bug that is just shared between the two by virtue of them still having a > lot of stuff in common. > > Linus I do get this error on reiserfs ( old one, didn't try on reiser4 ). Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the debian bts. flo attenberger --- Linux master 2.6.19 #1 PREEMPT Thu Dec 21 10:55:34 CET 2006 x86_64 GNU/Linux # # Automatically generated make config: don't edit # Linux kernel version: 2.6.19 # Thu Dec 21 10:45:05 2006 # CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_ZONE_DMA32=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_EARLY_PRINTK=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_DMI=y CONFIG_AUDIT_ARCH=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set # CONFIG_TASKSTATS is not set # CONFIG_UTS_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y # CONFIG_RELAY is not set CONFIG_INITRAMFS_SOURCE="" # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_MODVERSIONS=y # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y # # Block layer # CONFIG_BLOCK=y # CONFIG_LBD is not set # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_LSF is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=m CONFIG_IOSCHED_DEADLINE=m CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" # # Processor type and features # CONFIG_X86_PC=y # CONFIG_X86_VSMP is not set CONFIG_MK8=y # CONFIG_MPSC is not set # CONFIG_GENERIC_CPU is not set CONFIG_X86_L1_CACHE_BYTES=64 CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_INTERNODE_CACHE_BYTES=64 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_MICROCODE=m CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=m CONFIG_X86_CPUID=m CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y # CONFIG_SMP is not set # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_BKL=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y # CONFIG_DISCONTIGMEM_MANUAL is not set # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y # CONFIG_SPARSEMEM_STATIC is not set CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_RESOURCES_64BIT=y CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y CONFIG_HPET_TIMER=y CONFIG_IOMMU=y # CONFIG_CALGARY_IOMMU is not set CONFIG_SWIOTLB=y CONFIG_X86_MCE=y # CONFIG_X86_MCE_INTEL is not set CONFIG_X86_MCE_AMD=y CONFIG_KEXEC=y # CONFIG_CRASH_DUMP is not set CONFIG_PHYSICAL_START=0x20 CONFIG_SECCOMP=y # CONFIG_CC_STACKPROTECTOR is not set # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set CONFIG_HZ_1000=y CONFIG_HZ=1000 CONFIG_REORDER=y CONFIG_K8_NB=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_ISA_DMA_API=y # # Power management options # CONFIG_PM=y CONFIG_PM_LEGACY=y # CONFIG_PM_DEBUG is not set CONFIG_PM_SYSFS_DEPRECATED=y # CONFIG_SOFTWARE_SUSPEND is not set # # ACPI (Advanced Configuration and Power Interface) Support # CONFIG_ACPI=y CONFIG_ACPI_SLEEP=y CONFIG_ACPI_SLEEP_PROC_FS=y # CONFIG_ACPI_SLEEP_PROC_SLEEP is not set CONFIG_ACPI_AC=m # CONFIG_ACPI_BATTERY is not set CONFIG_ACPI_BUTTON=m CONFIG_ACPI_VIDEO=m CONFIG_ACPI_HOTKEY=m CONFIG_ACPI_FAN=m # CONFIG_ACPI_DOCK is not set CONFIG_ACPI_PROCESSOR=m CONFIG_ACPI_THERMAL=m # CONFIG_ACPI_ASUS is not set # CONFIG_ACPI_IBM is not set # CONFIG_ACPI_TOSHIBA is not set CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set CONFIG_ACPI_EC=y CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_P
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: - It never uses mprotect on the shared mappings, but it _does_ do: "mincore()" - but the return values don't much matter (it's used as a heuristic on which parts to hash, apparently) I double- and triple-checked this one, because I did make changes to "mincore()", but those didn't go into the affected kernels anyway (ie they are not in plain 2.6.19, nor in 2.6.18.3 either) Correct, mincore is only used to check if it should delay the hash checking. "madvise(MADV_WILLNEED)" "msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag) "munmap()" of course - it never seems to mix mmap() and write() - it does _only_ mmap. - it seems to mmap/munmap the shared files in nice 64-page chunks, all 64-page aligned in the file (ie it does NOT create one big mapping, it has some kind of LRU of thse 64-page chunks). The only exception being the last chunk, which it maps byte-accurate to the size. The length of the chunks is only page aligned on single file torrents, not so on multi-file torrents. I've attached a patch for rtorrent that will extend the length to the page boundary. - I haven't checked whether it only ever has the same chunk mapped once at a time. This should be the case, but two mapped chunks may share a page, sometimes with different r/w permissions. Jari Sundell extend_mapping.diff Description: Binary data
Re: [PATCH] mm: fix page_mkclean_one
I have corrupted files... > --- > diff --git a/fs/buffer.c b/fs/buffer.c > index 263f88e..4652ef1 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode > *inode, struct page *page, > do { > if (!buffer_mapped(bh)) > continue; > - /* > - * If it's a fully non-blocking write attempt and we cannot > - * lock the buffer then redirty the page. Note that this can > - * potentially cause a busy-wait loop from pdflush and kswapd > - * activity, but those code paths have their own higher-level > - * throttling. > - */ > - if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) { > - lock_buffer(bh); > - } else if (test_set_buffer_locked(bh)) { > - redirty_page_for_writepage(wbc, page); > - continue; > - } > + lock_buffer(bh); > if (test_clear_buffer_dirty(bh)) { > mark_buffer_async_write(bh); > } else { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Tue, 26 Dec 2006, David Miller wrote: > > I've seen it on sparc64, UP kernel, no preempt. Btw, having tried to debug the writeback code, there's one very special case that just makes me go "hmm". If we have a buffer that is "busy" when we try to write back a page, we have this magic "wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking" mode, where we won't wait for it, but instead we'll redirty the page and redo the whole thing. Looking at the code, that should all work, but at the same time, it triggers some of my debug messages about having a dirty page during writeback, and one way to trigger that debug message is to try to run rtorrent on the machine.. I dunno. Witht he writeback being suspicious, and the normal "block_write_full_page()" path being implicated in at least ext2, I just wonder. This is one of those "let's see if behaviour changes" patches, that I'm just throwing out there.. Linus --- diff --git a/fs/buffer.c b/fs/buffer.c index 263f88e..4652ef1 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page, do { if (!buffer_mapped(bh)) continue; - /* -* If it's a fully non-blocking write attempt and we cannot -* lock the buffer then redirty the page. Note that this can -* potentially cause a busy-wait loop from pdflush and kswapd -* activity, but those code paths have their own higher-level -* throttling. -*/ - if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) { - lock_buffer(bh); - } else if (test_set_buffer_locked(bh)) { - redirty_page_for_writepage(wbc, page); - continue; - } + lock_buffer(bh); if (test_clear_buffer_dirty(bh)) { mark_buffer_async_write(bh); } else { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
From: Tobias Diedrich <[EMAIL PROTECTED]> Date: Tue, 26 Dec 2006 17:17:00 +0100 > Linus Torvalds wrote: > > I don't think it's a page table issue any more, it just doesn't look > > likely with the ARM UP corruption. It's also not apparently even on a > > cacheline boundary, so it probably is really a dirty bit that got cleared > > wrogn due to some race with IO. > > So, until now it's only been reported for SMP on i386? > I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over > here, UP kernel, no preempt. I've seen it on sparc64, UP kernel, no preempt. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Tue, 26 Dec 2006, Nick Piggin wrote: > Linus Torvalds wrote: > > > > Ok, so how about this diff. > > > > I'm actually feeling good about this one. It really looks like > > "do_no_page()" was simply buggy, and that this explains everything. > > Still trying to catch up here, so I'm not going to reply to any old > stuff and just start at the tip of the thread... Other than to say > that I really like cancel_page_dirty ;) Yeah, I think that part is a bit clearer about what's going on now. > I think your patch is quite right so that's a good catch. Actually, since people told me it didn't matter, I went back and looked at _why_ - the thing is, "vma->vm_page_prot" should always be read-only anyway, except for mappings that don't do dirty accounting at all, so I think my patch only found cases that are unimportant (ie pages that get faulted on on filesystems like ramfs that doesn't do any dirty page accounting because they're all dirty anyway). > But I'm not too surprised that it does not help the problem, because I > don't think we have started shedding any old pte_dirty tests at > unmap/reclaim-time, have we? So the dirty bit isn't going to get lost, > as such. True. We should no longer _need_ those dirty bit reclaims at unmap/reclaim, but we still do them, so you're right, even if we were buggy in this area, it should only really matter for the dirty page counting, not for any lost data. > I was hoping that you've almost narrowed it down to the filesystem > writeback code, with the last few mails? I think so, yes. However, I've checked, and "rtorrent" really does seem to be fairly well-behaved wrt any filesystem activity. It does - no threading. It's 100% single-threaded, and doesn't even appear to use signals. - exactly _one_ "ftruncate()", and it does it at the beginning, for the full final size. IOW, it's not anything subtle with truncate and dirty page cancel. - It never uses mprotect on the shared mappings, but it _does_ do: "mincore()" - but the return values don't much matter (it's used as a heuristic on which parts to hash, apparently) I double- and triple-checked this one, because I did make changes to "mincore()", but those didn't go into the affected kernels anyway (ie they are not in plain 2.6.19, nor in 2.6.18.3 either) "madvise(MADV_WILLNEED)" "msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag) "munmap()" of course - it never seems to mix mmap() and write() - it does _only_ mmap. - it seems to mmap/munmap the shared files in nice 64-page chunks, all 64-page aligned in the file (ie it does NOT create one big mapping, it has some kind of LRU of thse 64-page chunks). The only exception being the last chunk, which it maps byte-accurate to the size. - I haven't checked whether it only ever has the same chunk mapped once at a time. Anyway, the _one_ half-way interesting thing is the fact that it doesn't allocate any backing store at all for the file, and as such the page writeback needs to create all the underlying buffers on the filesystem. I really don't see why that would be a problem either, but I could imagine that if we have some writeback bug where we can end up writing back the _same_ page concurrently, we'd actually end up racing in the kernel, and allocating two different backing stores, and then maybe the other one would effectively "get lost" (and the earlier writeback would win the race, explaining why we'd end up with zeroes at the end of a block). Or something. However, all the codepaths _seem_ to test for PG_writeback, and not even try to start another writeback while the first one is still active. What would also actually be interesting is whether somebody can reproduce this on Reiserfs, for example. I _think_ all the reports I've seen are on ext2 or ext3, and if this is somehow writeback-related, it could be some bug that is just shared between the two by virtue of them still having a lot of stuff in common. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Tue, Dec 26, 2006 at 05:51:55PM +, Al Viro wrote: > On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote: > > > > > > On Sun, 24 Dec 2006, Andrei Popa wrote: > > > > > > Hash check on download completion found bad chunks, consider using > > > "safe_sync". > > > > Dang. Did you get any warning messages from the kernel? > > > > Linus > > BTW, rmap.c patch is broken - needs at least ... but that doesn't affect most of the architectures - only sparc64 and some of powerpc. So it's definitely not enough. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote: > > > On Sun, 24 Dec 2006, Andrei Popa wrote: > > > > Hash check on download completion found bad chunks, consider using > > "safe_sync". > > Dang. Did you get any warning messages from the kernel? > > Linus BTW, rmap.c patch is broken - needs at least Signed-off-by: Al Viro <[EMAIL PROTECTED]> --- diff --git a/mm/rmap.c b/mm/rmap.c index 57306fa..669acb2 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -452,7 +452,7 @@ static int page_mkclean_one(struct page entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); - set_pte_at(vma, address, pte, entry); + set_pte_at(mm, address, pte, entry); lazy_mmu_prot_update(entry); ret = 1; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
Linus Torvalds wrote: On Sun, 24 Dec 2006, Linus Torvalds wrote: Peter, tell me I'm crazy, but with the new rules, the following condition is a bug: - shared mapping - writable - not already marked dirty in the PTE Ok, so how about this diff. I'm actually feeling good about this one. It really looks like "do_no_page()" was simply buggy, and that this explains everything. Still trying to catch up here, so I'm not going to reply to any old stuff and just start at the tip of the thread... Other than to say that I really like cancel_page_dirty ;) I think your patch is quite right so that's a good catch. But I'm not too surprised that it does not help the problem, because I don't think we have started shedding any old pte_dirty tests at unmap/reclaim-time, have we? So the dirty bit isn't going to get lost, as such. I was hoping that you've almost narrowed it down to the filesystem writeback code, with the last few mails? Nick Please please please test. Throw all the other patches away (with the possible exception of the "update_mmu_cache()" sanity checker, which is still interesting in case some _other_ place does this too). Don't do the "wait_on_page_writeback()" thing, because it changes timings and might hide thngs for the wrong reasons. Just apply this on top of a known failing kernel, and test. Linus --- diff --git a/mm/memory.c b/mm/memory.c index 563792f..cf429c4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2247,21 +2249,23 @@ retry: if (pte_none(*page_table)) { flush_icache_page(vma, new_page); entry = mk_pte(new_page, vma->vm_page_prot); - if (write_access) - entry = maybe_mkwrite(pte_mkdirty(entry), vma); - set_pte_at(mm, address, page_table, entry); if (anon) { inc_mm_counter(mm, anon_rss); lru_cache_add_active(new_page); page_add_new_anon_rmap(new_page, vma, address); + if (write_access) + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } else { inc_mm_counter(mm, file_rss); page_add_file_rmap(new_page); + entry = pte_wrprotect(entry); if (write_access) { dirty_page = new_page; get_page(dirty_page); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } } + set_pte_at(mm, address, page_table, entry); } else { /* One of our sibling threads was faster, back out. */ page_cache_release(new_page); -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-24 11:35]: > And if this doesn't fix it, I don't know what will.. Sorry, but it still fails (on top of plain 2.6.19). -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
> Quoting Linus Torvalds <[EMAIL PROTECTED]>: > Subject: Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content > corruption on ext3) > > Peter, tell me I'm crazy, but with the new rules, the following condition > is a bug: > > - shared mapping > - writable > - not already marked dirty in the PTE > > because that combination means that the hardware can mark the PTE dirty > without us even realizing (and thus not marking the "struct page *" > dirty). Er. Sorry about bumping in, and I'm not sure I understand all of the discussion, but this reminded me of an old issue with COW that created what looks like a vaguely similiar data corruption on infiniband. We solved this for infiniband with MADV_DONTFORK, but I always wondered why does it not affect other parts of kernel. Small reminder from that discussion: down mmap sem get user pages up mmap sem page becomes shared, and COW (e.g. fork) process writes to first byte of page <- gets a copy Now we had a problem: struct page that we got from get user pages does not point to a correct page in our process. For example: if at some point we map this page for DMA, and hardware writes to last byte of page -> process does not see this data. So for infiniband, what we do is a combination of - prevent page from becoming COW while hardware might DMA to this page, and - ask users not to write to page if hardware might DMA to same page (even if its using different bytes). I just wandered - is there some chance something like this could be happening in the fs code? HTH, -- MST - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/24/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: Ok, so how about this diff. I'm actually feeling good about this one. It really looks like "do_no_page()" was simply buggy, and that this explains everything. I tested with just this patch and 2.6.19 and no change. Sorry Linus, no early Christmas present :-( Gordon -- Gordon Farquharson - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 12:24 -0800, Linus Torvalds wrote: > > On Sun, 24 Dec 2006, Andrei Popa wrote: > > > > Hash check on download completion found bad chunks, consider using > > "safe_sync". > > Dang. Did you get any warning messages from the kernel? > only these: ACPI: EC: evaluating _Q80 ACPI: EC: evaluating _Q80 ACPI: EC: evaluating _Q80 but I don't think has anything to do with... > Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Andrei Popa wrote: > > Hash check on download completion found bad chunks, consider using > "safe_sync". Dang. Did you get any warning messages from the kernel? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 11:35 -0800, Linus Torvalds wrote: > > On Sun, 24 Dec 2006, Gordon Farquharson wrote: > > > > The apt cache files (/var/cache/apt/*.bin) still get corrupted with > > this patch and 2.6.19. > > Yeah, if my guess about do_no_page() is right, _none_ of the previous > patches should have ANY effect what-so-ever. In fact, I'd say that even > the "ext3 works in writeback mode" thing that Andrei reports is probably a > total fluke brought on by timing changes rather than anything else. > > So please try the latest patch instead (on top of anything that shows > corruption reliably - the patch should be _totally_ independent of all the > other issues, and I think it will apply cleanly on top of 2.6.18.3 and > 2.6.19 too, so anything that shows corruption is a fine target - but try > to choose something that has been the "best" at corrupting things for you, > to make the testing as good as possible). > > Patch included here again (although I think you were cc'd on my previous > email too, so you should already have it, and our emails just crossed) > > And if this doesn't fix it, I don't know what will.. With latest git and patches: http://lkml.org/lkml/diff/2006/12/24/56/1 http://lkml.org/lkml/diff/2006/12/24/61/1 Hash check on download completion found bad chunks, consider using "safe_sync". > > Linus > > --- > diff --git a/mm/memory.c b/mm/memory.c > index 563792f..cf429c4 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2247,21 +2249,23 @@ retry: > if (pte_none(*page_table)) { > flush_icache_page(vma, new_page); > entry = mk_pte(new_page, vma->vm_page_prot); > - if (write_access) > - entry = maybe_mkwrite(pte_mkdirty(entry), vma); > - set_pte_at(mm, address, page_table, entry); > if (anon) { > inc_mm_counter(mm, anon_rss); > lru_cache_add_active(new_page); > page_add_new_anon_rmap(new_page, vma, address); > + if (write_access) > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > } else { > inc_mm_counter(mm, file_rss); > page_add_file_rmap(new_page); > + entry = pte_wrprotect(entry); > if (write_access) { > dirty_page = new_page; > get_page(dirty_page); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > } > } > + set_pte_at(mm, address, page_table, entry); > } else { > /* One of our sibling threads was faster, back out. */ > page_cache_release(new_page); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Gordon Farquharson wrote: > > The apt cache files (/var/cache/apt/*.bin) still get corrupted with > this patch and 2.6.19. Yeah, if my guess about do_no_page() is right, _none_ of the previous patches should have ANY effect what-so-ever. In fact, I'd say that even the "ext3 works in writeback mode" thing that Andrei reports is probably a total fluke brought on by timing changes rather than anything else. So please try the latest patch instead (on top of anything that shows corruption reliably - the patch should be _totally_ independent of all the other issues, and I think it will apply cleanly on top of 2.6.18.3 and 2.6.19 too, so anything that shows corruption is a fine target - but try to choose something that has been the "best" at corrupting things for you, to make the testing as good as possible). Patch included here again (although I think you were cc'd on my previous email too, so you should already have it, and our emails just crossed) And if this doesn't fix it, I don't know what will.. Linus --- diff --git a/mm/memory.c b/mm/memory.c index 563792f..cf429c4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2247,21 +2249,23 @@ retry: if (pte_none(*page_table)) { flush_icache_page(vma, new_page); entry = mk_pte(new_page, vma->vm_page_prot); - if (write_access) - entry = maybe_mkwrite(pte_mkdirty(entry), vma); - set_pte_at(mm, address, page_table, entry); if (anon) { inc_mm_counter(mm, anon_rss); lru_cache_add_active(new_page); page_add_new_anon_rmap(new_page, vma, address); + if (write_access) + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } else { inc_mm_counter(mm, file_rss); page_add_file_rmap(new_page); + entry = pte_wrprotect(entry); if (write_access) { dirty_page = new_page; get_page(dirty_page); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } } + set_pte_at(mm, address, page_table, entry); } else { /* One of our sibling threads was faster, back out. */ page_cache_release(new_page); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/24/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: How about this particularly stupid diff? (please test with something that _would_ cause corruption normally). It is _entirely_ untested, but what it tries to do is to simply serialize any writeback in progress with any process that tries to re-map a shared page into its address space and dirty it. I haven't tested it, and maybe it misses some case, but it looks likea good way to try to avoid races with marking pages dirty and the writeback phase .. The apt cache files (/var/cache/apt/*.bin) still get corrupted with this patch and 2.6.19. Gordon diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c --- linux-2.6.19.orig/fs/buffer.c 2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/fs/buffer.c2006-12-21 01:16:31.0 -0700 @@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* -* If the filesystem writes its buffers by hand (eg ext3) -* then we can have clean buffers against a dirty page. We -* clean the page here; otherwise later reattachment of buffers -* could encounter a non-uptodate page, which is unresolvable. -* This only applies in the rare case where try_to_free_buffers -* succeeds but the page is not freed. -*/ - clear_page_dirty(page); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c --- linux-2.6.19.orig/fs/hugetlbfs/inode.c 2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/fs/hugetlbfs/inode.c 2006-12-21 01:15:21.0 -0700 @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? */0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h --- linux-2.6.19.orig/include/linux/page-flags.h2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/include/linux/page-flags.h 2006-12-21 01:15:21.0 -0700 @@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) -{ - test_clear_page_dirty(page); -} - static inline void set_page_writeback(struct page *page) { test_set_page_writeback(page); diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c --- linux-2.6.19.orig/mm/memory.c 2006-11-29 14:57:37.0 -0700 +++ linux-2.6.19/mm/memory.c2006-12-24 11:04:03.0 -0700 @@ -1534,6 +1534,7 @@ static int do_wp_page(struct mm_struct * if (!pte_same(*page_table, orig_pte)) goto unlock; } + wait_on_page_writeback(old_page); dirty_page = old_page; get_page(dirty_page); reuse = 1; @@ -1832,6 +1833,33 @@ void unmap_mapping_range(struct address_ } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1865,6 +1893,7 @@ do_expand: goto ou
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Linus Torvalds wrote: > > Peter, tell me I'm crazy, but with the new rules, the following condition > is a bug: > > - shared mapping > - writable > - not already marked dirty in the PTE Ok, so how about this diff. I'm actually feeling good about this one. It really looks like "do_no_page()" was simply buggy, and that this explains everything. Please please please test. Throw all the other patches away (with the possible exception of the "update_mmu_cache()" sanity checker, which is still interesting in case some _other_ place does this too). Don't do the "wait_on_page_writeback()" thing, because it changes timings and might hide thngs for the wrong reasons. Just apply this on top of a known failing kernel, and test. Linus --- diff --git a/mm/memory.c b/mm/memory.c index 563792f..cf429c4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2247,21 +2249,23 @@ retry: if (pte_none(*page_table)) { flush_icache_page(vma, new_page); entry = mk_pte(new_page, vma->vm_page_prot); - if (write_access) - entry = maybe_mkwrite(pte_mkdirty(entry), vma); - set_pte_at(mm, address, page_table, entry); if (anon) { inc_mm_counter(mm, anon_rss); lru_cache_add_active(new_page); page_add_new_anon_rmap(new_page, vma, address); + if (write_access) + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } else { inc_mm_counter(mm, file_rss); page_add_file_rmap(new_page); + entry = pte_wrprotect(entry); if (write_access) { dirty_page = new_page; get_page(dirty_page); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } } + set_pte_at(mm, address, page_table, entry); } else { /* One of our sibling threads was faster, back out. */ page_cache_release(new_page); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Linus Torvalds wrote: > > How about this particularly stupid diff? (please test with something that > _would_ cause corruption normally). Actually, here's an even more stupid diff, which actually to some degree seems to capture the real problem better. Peter, tell me I'm crazy, but with the new rules, the following condition is a bug: - shared mapping - writable - not already marked dirty in the PTE because that combination means that the hardware can mark the PTE dirty without us even realizing (and thus not marking the "struct page *" dirty). (The above is actually a valid situation for IO mappings, but not for "real" mappings. And IO mappings should never take page faults, I think). So, with that in mind, I wrote this stupid patch (for 32-bit x86, since I used my Mac Mini for testing ratehr than my main machine - but the x86-64 version should be pretty much identcal).. And you know what, Peter? It triggers for me. I get WARNING at mm/memory.c:2274 do_no_page() [] show_trace_log_lvl+0x1a/0x2f [] show_trace+0x12/0x14 [] dump_stack+0x16/0x18 [] __handle_mm_fault+0x38d/0x919 [] do_page_fault+0x1ff/0x507 [] error_code+0x7c/0x84 which seems to say that do_no_page() can be used to insert shared and non-dirty, but still writable, pages. But maybe my patch is just bogus, and I didn't think it through. Peter, I realize it's Christmas Eve, but let's face it, Santa appreciates good boys and girls, and we all want tons of loot. So please be good, and waste some time looking at this and tell me why I'm either wrong, or there's a real smoking gun here.. ;) Linus --- diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h index e6a4723..1389bb7 100644 --- a/include/asm-i386/pgtable.h +++ b/include/asm-i386/pgtable.h @@ -494,7 +494,13 @@ do { \ * The i386 doesn't have any external MMU info: the kernel page * tables contain all the necessary information. */ -#define update_mmu_cache(vma,address,pte) do { } while (0) +#define bad_shared_pte(pte) (pte_write(pte) && !pte_dirty(pte)) +#define update_mmu_cache(vma,address,pte) do { \ + static int __cnt; \ + WARN_ON(((vma)->vm_flags & VM_SHARED) \ +&& bad_shared_pte(pte) \ +&& ++__cnt < 5); \ +} while (0) #endif /* !__ASSEMBLY__ */ #ifdef CONFIG_FLATMEM - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006 09:16:06 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Sun, 24 Dec 2006, Andrei Popa wrote: > > > On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote: > > > Andrei Popa <[EMAIL PROTECTED]> wrote: > > > > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > > > > > > > I don't have corruption. I tested twice. > > > > > > This is a surprising result. Can you pleas retest ext3 > > > data=writeback,nobh? > > > > Yes, no corruption. Also tested only with data=writeback and had no > > corruption. > > Ok, so it would seem to be writeback related _somehow_. However, most of > the differences (I _thought_) in ext3 actually show up only if you have > *both* "nobh" and "data=writeback", and as far as I can tell, just a > simple "data=writeback" should still use the bog-standard > "block_write_full_page()". > > Andrew? > > Although as far as I can see, then ext2 should work as-is too (since it > too also just uses "block_write_full_page()" without anything fancy). ext2 uses the multipage-bio assembly code for writeback whereas ext3 doesn't. But ext3 doesn't use that code in data=ordered mode, of course. Still, this: --- a/fs/ext2/inode.c~a +++ a/fs/ext2/inode.c @@ -693,7 +693,7 @@ const struct address_space_operations ex .commit_write = generic_commit_write, .bmap = ext2_bmap, .direct_IO = ext2_direct_IO, - .writepages = ext2_writepages, +// .writepages = ext2_writepages, .migratepage= buffer_migrate_page, }; @@ -711,7 +711,7 @@ const struct address_space_operations ex .commit_write = nobh_commit_write, .bmap = ext2_bmap, .direct_IO = ext2_direct_IO, - .writepages = ext2_writepages, +// .writepages = ext2_writepages, .migratepage= buffer_migrate_page, }; _ will switch it off for ext2. > Strange. > > How about this particularly stupid diff? (please test with something that > _would_ cause corruption normally). > > It is _entirely_ untested, but what it tries to do is to simply serialize > any writeback in progress with any process that tries to re-map a shared > page into its address space and dirty it. I haven't tested it, and maybe > it misses some case, but it looks likea good way to try to avoid races > with marking pages dirty and the writeback phase .. > > Linus > --- > diff --git a/mm/memory.c b/mm/memory.c > index 563792f..64ed10b 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct > vm_area_struct *vma, > if (!pte_same(*page_table, orig_pte)) > goto unlock; > } > + wait_on_page_writeback(old_page); > dirty_page = old_page; > get_page(dirty_page); > reuse = 1; > @@ -2215,6 +2216,7 @@ retry: > page_cache_release(new_page); > return VM_FAULT_SIGBUS; > } > + wait_on_page_writeback(new_page); > } > } yup. Also, we could perhaps lock the target page during pagefaults.. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006, Andrei Popa wrote: > On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote: > > Andrei Popa <[EMAIL PROTECTED]> wrote: > > > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > > > > > I don't have corruption. I tested twice. > > > > This is a surprising result. Can you pleas retest ext3 data=writeback,nobh? > > Yes, no corruption. Also tested only with data=writeback and had no > corruption. Ok, so it would seem to be writeback related _somehow_. However, most of the differences (I _thought_) in ext3 actually show up only if you have *both* "nobh" and "data=writeback", and as far as I can tell, just a simple "data=writeback" should still use the bog-standard "block_write_full_page()". Andrew? Although as far as I can see, then ext2 should work as-is too (since it too also just uses "block_write_full_page()" without anything fancy). Strange. How about this particularly stupid diff? (please test with something that _would_ cause corruption normally). It is _entirely_ untested, but what it tries to do is to simply serialize any writeback in progress with any process that tries to re-map a shared page into its address space and dirty it. I haven't tested it, and maybe it misses some case, but it looks likea good way to try to avoid races with marking pages dirty and the writeback phase .. Linus --- diff --git a/mm/memory.c b/mm/memory.c index 563792f..64ed10b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, if (!pte_same(*page_table, orig_pte)) goto unlock; } + wait_on_page_writeback(old_page); dirty_page = old_page; get_page(dirty_page); reuse = 1; @@ -2215,6 +2216,7 @@ retry: page_cache_release(new_page); return VM_FAULT_SIGBUS; } + wait_on_page_writeback(new_page); } } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote: > On Sun, 24 Dec 2006 14:14:38 +0200 > Andrei Popa <[EMAIL PROTECTED]> wrote: > > > > - mount the fs with ext2 with the no-buffer-head option. That means > > > either: > > > > > > grub.conf: rootfstype=ext2 rootflags=nobh > > > /etc/fstab: ext2 nobh > > > > ierdnac ~ # mount > > /dev/sda7 on / type ext2 (rw,noatime,nobh) > > > > I have corruption. > > > > > > > > - mount the fs with ext3 data=writeback, nobh > > > > > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this > > > works) > > > /etc/fstab: ext2 data=writeback,nobh > > > > ierdnac ~ # mount > > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > > > ierdnac ~ # dmesg|grep EXT3 > > EXT3-fs: mounted filesystem with writeback data mode. > > EXT3 FS on sda7, internal journal > > > > I don't have corruption. I tested twice. > > This is a surprising result. Can you pleas retest ext3 data=writeback,nobh? Yes, no corruption. Also tested only with data=writeback and had no corruption. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
* Andrew Morton <[EMAIL PROTECTED]> [2006-12-24 00:57]: > /etc/fstab: ext2 nobh > /etc/fstab: ext3 data=writeback,nobh It seems that busybox mount ignores the nobh option but both ext2 and ext3 data=writeback work for me. This is with plain 2.6.19 which normally always fails. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006 14:14:38 +0200 Andrei Popa <[EMAIL PROTECTED]> wrote: > > - mount the fs with ext2 with the no-buffer-head option. That means either: > > > > grub.conf: rootfstype=ext2 rootflags=nobh > > /etc/fstab: ext2 nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext2 (rw,noatime,nobh) > > I have corruption. > > > > > - mount the fs with ext3 data=writeback, nobh > > > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this > > works) > > /etc/fstab: ext2 data=writeback,nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > ierdnac ~ # dmesg|grep EXT3 > EXT3-fs: mounted filesystem with writeback data mode. > EXT3 FS on sda7, internal journal > > I don't have corruption. I tested twice. This is a surprising result. Can you pleas retest ext3 data=writeback,nobh? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 24 Dec 2006 14:26:01 +0200 Andrei Popa <[EMAIL PROTECTED]> wrote: > I also tested with ext3 ordered, nobh and I have file corruption... ordered+nobh isn't a possible combination. The filesystem probably ignored nobh. nobh mode only makes sense with data=writeback. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 14:14 +0200, Andrei Popa wrote: > On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: > > On Sun, 24 Dec 2006 00:43:54 -0800 (PST) > > Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > > > I now _suspect_ that we're talking about something like > > > > > > - we started a writeout. The IO is still pending, and the page was > > >marked clean and is now in the "writeback" phase. > > > - a write happens to the page, and the page gets marked dirty again. > > >Marking the page dirty also marks all the _buffers_ in the page dirty, > > >but they were actually already dirty, because the IO hasn't completed > > >yet. > > > - the IO from the _previous_ write completes, and marks the buffers > > > clean > > >again. > > > > Some things for the testers to try, please: > > > > - mount the fs with ext2 with the no-buffer-head option. That means either: > > > > grub.conf: rootfstype=ext2 rootflags=nobh > > /etc/fstab: ext2 nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext2 (rw,noatime,nobh) > > I have corruption. > > > > > - mount the fs with ext3 data=writeback, nobh > > > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this > > works) > > /etc/fstab: ext2 data=writeback,nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > ierdnac ~ # dmesg|grep EXT3 > EXT3-fs: mounted filesystem with writeback data mode. > EXT3 FS on sda7, internal journal > > I don't have corruption. I tested twice. > I also tested with ext3 ordered, nobh and I have file corruption... > > > > if that still fails we can rule out buffer_head funnies. > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: > On Sun, 24 Dec 2006 00:43:54 -0800 (PST) > Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > I now _suspect_ that we're talking about something like > > > > - we started a writeout. The IO is still pending, and the page was > >marked clean and is now in the "writeback" phase. > > - a write happens to the page, and the page gets marked dirty again. > >Marking the page dirty also marks all the _buffers_ in the page dirty, > >but they were actually already dirty, because the IO hasn't completed > >yet. > > - the IO from the _previous_ write completes, and marks the buffers clean > >again. > > Some things for the testers to try, please: > > - mount the fs with ext2 with the no-buffer-head option. That means either: > > grub.conf: rootfstype=ext2 rootflags=nobh > /etc/fstab: ext2 nobh ierdnac ~ # mount /dev/sda7 on / type ext2 (rw,noatime,nobh) I have corruption. > > - mount the fs with ext3 data=writeback, nobh > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this > works) > /etc/fstab: ext2 data=writeback,nobh ierdnac ~ # mount /dev/sda7 on / type ext3 (rw,noatime,nobh) ierdnac ~ # dmesg|grep EXT3 EXT3-fs: mounted filesystem with writeback data mode. EXT3 FS on sda7, internal journal I don't have corruption. I tested twice. > > if that still fails we can rule out buffer_head funnies. > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/