Re: [PATCH] mm: fix page_mkclean_one
On Fri, Feb 02, 2007 at 07:42:52PM +1100, Nick Piggin ([EMAIL PROTECTED]) wrote: > Anyway, I had a look at your bugzilla test-case and managed to slim it > down to something that easily shows what the problem is (available on > request) -- the problem is that recipient of the sendfile is seeing > modifications that occur to the source file _after_ the sender has > completed the sendfile, because the file pages are not copied but > queued. > > I think the usual approach to what you are trying to do is to set TCP_CORK, > then write(2) the header into the socket, then sendfile directly from the > file you want. > > Another approach I guess is to implement an ack in your userland protocol > so you do not modify the sendfile source file until the client acks that > it has all the data. Mark, don't you use e1000 or other scatter-gather capable nic with checksum offload? Likely yes. Actual data sucking in that case happens when packet is supposed to be transmitted by the NIC, not when sendfile() is returned. The same applies to the case, when you have fancy egress filtering. It is not allowed to modify pages until they are really transmitted, if you want data integrity. There are _no_ bugs in network or VFS cache in this test case. > I'm not sure if there are any other usual ways to do this (ie. a barrier > for sendfile, to ensure it will not pick up "future" modifications to the > file). netdev cc'ed, someone there might have additional comments. > > Please close this bug if/when you are satisfied it is not a kernel problem. > > Thanks, > Nick > > -- > SUSE Labs, Novell Inc. > Send instant messages to your online friends http://au.messenger.yahoo.com -- Evgeniy Polyakov - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
Mark Groves wrote: Hi, I have been been seeing a problem when using sendfile repeatedly on an SMP server, which I believe is related to the problem that was discovered recently with marking dirty pages. The bug, as well as a test script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650. Currently, we're experiencing errors where part of a previous packet is being sent out rather than the current packet. I have applied the patch Linus posted to a 2.6.19 kernel but am still getting the problem. So I am wondering if there are any other places in the kernel which mark pages as dirty which might require a similar patch? Your issue is not related, firstly because the page_mkclean bug did not exist before 2.6.19 kernels. Anyway, I had a look at your bugzilla test-case and managed to slim it down to something that easily shows what the problem is (available on request) -- the problem is that recipient of the sendfile is seeing modifications that occur to the source file _after_ the sender has completed the sendfile, because the file pages are not copied but queued. I think the usual approach to what you are trying to do is to set TCP_CORK, then write(2) the header into the socket, then sendfile directly from the file you want. Another approach I guess is to implement an ack in your userland protocol so you do not modify the sendfile source file until the client acks that it has all the data. I'm not sure if there are any other usual ways to do this (ie. a barrier for sendfile, to ensure it will not pick up "future" modifications to the file). netdev cc'ed, someone there might have additional comments. Please close this bug if/when you are satisfied it is not a kernel problem. Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
Mark Groves wrote: Hi, I have been been seeing a problem when using sendfile repeatedly on an SMP server, which I believe is related to the problem that was discovered recently with marking dirty pages. The bug, as well as a test script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650. Currently, we're experiencing errors where part of a previous packet is being sent out rather than the current packet. I have applied the patch Linus posted to a 2.6.19 kernel but am still getting the problem. So I am wondering if there are any other places in the kernel which mark pages as dirty which might require a similar patch? Your issue is not related, firstly because the page_mkclean bug did not exist before 2.6.19 kernels. Anyway, I had a look at your bugzilla test-case and managed to slim it down to something that easily shows what the problem is (available on request) -- the problem is that recipient of the sendfile is seeing modifications that occur to the source file _after_ the sender has completed the sendfile, because the file pages are not copied but queued. I think the usual approach to what you are trying to do is to set TCP_CORK, then write(2) the header into the socket, then sendfile directly from the file you want. Another approach I guess is to implement an ack in your userland protocol so you do not modify the sendfile source file until the client acks that it has all the data. I'm not sure if there are any other usual ways to do this (ie. a barrier for sendfile, to ensure it will not pick up future modifications to the file). netdev cc'ed, someone there might have additional comments. Please close this bug if/when you are satisfied it is not a kernel problem. Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Fri, Feb 02, 2007 at 07:42:52PM +1100, Nick Piggin ([EMAIL PROTECTED]) wrote: Anyway, I had a look at your bugzilla test-case and managed to slim it down to something that easily shows what the problem is (available on request) -- the problem is that recipient of the sendfile is seeing modifications that occur to the source file _after_ the sender has completed the sendfile, because the file pages are not copied but queued. I think the usual approach to what you are trying to do is to set TCP_CORK, then write(2) the header into the socket, then sendfile directly from the file you want. Another approach I guess is to implement an ack in your userland protocol so you do not modify the sendfile source file until the client acks that it has all the data. Mark, don't you use e1000 or other scatter-gather capable nic with checksum offload? Likely yes. Actual data sucking in that case happens when packet is supposed to be transmitted by the NIC, not when sendfile() is returned. The same applies to the case, when you have fancy egress filtering. It is not allowed to modify pages until they are really transmitted, if you want data integrity. There are _no_ bugs in network or VFS cache in this test case. I'm not sure if there are any other usual ways to do this (ie. a barrier for sendfile, to ensure it will not pick up future modifications to the file). netdev cc'ed, someone there might have additional comments. Please close this bug if/when you are satisfied it is not a kernel problem. Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
Hi, I have been been seeing a problem when using sendfile repeatedly on an SMP server, which I believe is related to the problem that was discovered recently with marking dirty pages. The bug, as well as a test script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650. Currently, we're experiencing errors where part of a previous packet is being sent out rather than the current packet. I have applied the patch Linus posted to a 2.6.19 kernel but am still getting the problem. So I am wondering if there are any other places in the kernel which mark pages as dirty which might require a similar patch? Regards, Mark Groves Researcher University of Waterloo Waterloo, Ontario, Canada - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
Hi, I have been been seeing a problem when using sendfile repeatedly on an SMP server, which I believe is related to the problem that was discovered recently with marking dirty pages. The bug, as well as a test script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650. Currently, we're experiencing errors where part of a previous packet is being sent out rather than the current packet. I have applied the patch Linus posted to a 2.6.19 kernel but am still getting the problem. So I am wondering if there are any other places in the kernel which mark pages as dirty which might require a similar patch? Regards, Mark Groves Researcher University of Waterloo Waterloo, Ontario, Canada - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 7 Jan 2007 12:36:18 +1030 "Tom Lanyon" <[EMAIL PROTECTED]> wrote: > On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: > > What would also actually be interesting is whether somebody can reproduce > > this on Reiserfs, for example. I _think_ all the reports I've seen are on > > ext2 or ext3, and if this is somehow writeback-related, it could be some > > bug that is just shared between the two by virtue of them still having a > > lot of stuff in common. > > > > Linus > > I've been following this thread for a while now as I started > experiencing file corruption in rtorrent when I upgraded to 2.6.19. I > am using reiserfs. reiserfs defaults to data=ordered, so it's quite possibly the same bug. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 1/7/07, Tom Lanyon <[EMAIL PROTECTED]> wrote: I've been following this thread for a while now as I started experiencing file corruption in rtorrent when I upgraded to 2.6.19. I am using reiserfs. However, moving to 2.6.20-rc3 does indeed seem to fix the issue thus far... -- Tom Lanyon - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote: What would also actually be interesting is whether somebody can reproduce this on Reiserfs, for example. I _think_ all the reports I've seen are on ext2 or ext3, and if this is somehow writeback-related, it could be some bug that is just shared between the two by virtue of them still having a lot of stuff in common. Linus I've been following this thread for a while now as I started experiencing file corruption in rtorrent when I upgraded to 2.6.19. I am using reiserfs. -- Tom Lanyon - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 12/27/06, Linus Torvalds [EMAIL PROTECTED] wrote: What would also actually be interesting is whether somebody can reproduce this on Reiserfs, for example. I _think_ all the reports I've seen are on ext2 or ext3, and if this is somehow writeback-related, it could be some bug that is just shared between the two by virtue of them still having a lot of stuff in common. Linus I've been following this thread for a while now as I started experiencing file corruption in rtorrent when I upgraded to 2.6.19. I am using reiserfs. -- Tom Lanyon - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On 1/7/07, Tom Lanyon [EMAIL PROTECTED] wrote: I've been following this thread for a while now as I started experiencing file corruption in rtorrent when I upgraded to 2.6.19. I am using reiserfs. However, moving to 2.6.20-rc3 does indeed seem to fix the issue thus far... -- Tom Lanyon - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
On Sun, 7 Jan 2007 12:36:18 +1030 Tom Lanyon [EMAIL PROTECTED] wrote: On 12/27/06, Linus Torvalds [EMAIL PROTECTED] wrote: What would also actually be interesting is whether somebody can reproduce this on Reiserfs, for example. I _think_ all the reports I've seen are on ext2 or ext3, and if this is somehow writeback-related, it could be some bug that is just shared between the two by virtue of them still having a lot of stuff in common. Linus I've been following this thread for a while now as I started experiencing file corruption in rtorrent when I upgraded to 2.6.19. I am using reiserfs. reiserfs defaults to data=ordered, so it's quite possibly the same bug. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 16:58:41 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Fri, 29 Dec 2006, Andrew Morton wrote: > > > > > > Somewhat nastily, but as ext3 directories are metadata it is appropriate > > > > that modifications to them be done in terms of buffer_heads (ie: > > > > blocks). > > > > > > No. There is nothing "appropriate" about using buffer_heads for metadata. > > > > I said "modification". > > You said "metadata". > > Why do you think directories are any different from files? Yes, they are > metadata. So what? What does that have to do with anything? We journal the contents of directories. Fully. So we handle their dirty data at the block (ie: buffer_head) level. When someone tries to dirty part of a directory we need to cheat and not mark that part of the page as dirty and we need to then write the block to the journal and then mark the block as really dirty for checkpointing (but still attached to the journal) and all that goop. The regular page-based writeback doesn't apply until the block has been written to the journal. At that stage the block is considered dirty against its real position on disk. It will then be written back by pdflush via the blockdev inode -> blkdev_writepage(). Unless kjournald needs to do an early flush to reclaim the journal space, in which case kjournald will write the block itself. > > So I really don't understand why you make excuses for ext3 and talk about > "modifications" and "metadata". It was a fine design ten years ago. It's > not really very good any longer. > As I said in another apparently-neglected email: : We could possibly move ext3/4 directories out of the blockdev pagecache and : into per-directory pagecache, but that wouldn't change anything - the : journalling would still be block-based. We already have all the code in place to journal blocks which are cached in an address_space other than the blockdev inode's: ext3_journalled_aops. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: > > > > Somewhat nastily, but as ext3 directories are metadata it is appropriate > > > that modifications to them be done in terms of buffer_heads (ie: blocks). > > > > No. There is nothing "appropriate" about using buffer_heads for metadata. > > I said "modification". You said "metadata". Why do you think directories are any different from files? Yes, they are metadata. So what? What does that have to do with anything? They should still use virtual indexes, the way files do. That doesn't preclude them from using buffer-heads to mark their (partial-page) modifications and for keeping the virtual->physical translations cached. I mean, really. Look at ext2. It does exactly that. It keeps the directories in the page cache - virtually indexed. And it even modifies them there. Exactly the same way it modifies regular file data. It all works exactly the same way it works for regular files. It uses page->mapping->a_ops->prepare_write(NULL, page, from, to); ... do modification ... ext2_commit_chunk(page, from, to); exactly the way regular file data works. That's why I'm saying there is absolutely _zero_ thing about "metadata" here, or even about "modifications". It all works better in a virtual cache, because you get all the support that we give to page caches. So I really don't understand why you make excuses for ext3 and talk about "modifications" and "metadata". It was a fine design ten years ago. It's not really very good any longer. I suspect we're stuck with the design, but that doesn't make it any _better_. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: > > Adam Richter spent considerable time a few years ago trying to make the > mpage code go direct-to-BIO in all cases and we eventually gave up. The > conceptual layering of page<->blocks<->bio is pretty clean, and it is hard > and ugly to fully optimise away the "block" bit in the middle. Using the buffer cache as a translation layer to the physical address is fine. That's what _any_ block device will do. I'm not at all sayign that "buffer heads must go away". They work fine. What I'm saying is that - if you index by buffer heads, you're screwed. - if you do IO by starting at buffer heads, you're screwed. Both indexing and writeback decisions should be done at the page cache layer. Then, when you actually need to do IO, you look at the buffers. But you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance. So by all means keep the buffer heads as a way to keep the "virtual->physical" translation. It's what they were designed for. But they were _originally_ also designed for "lookup" and "driving the start of IO", and that is wrong, and has been wrong for a long time now, because - lookup based on physical address is fundamentally slow and inefficient. You have to look up the virtual->physical translation somewhere else, so it's by design an unnecessary indirection _and_ that "somewere else" is also by definition filesystem-specific, so you can't do any of these things at the VFS layer. Ergo: anything that needs to look up the physical address in order to find the buffer head is BROKEN in this day and age. We look up the _virtual_ page cache page, and then we can trivially find the buffer heads within that page thanks to page->buffers. Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't. - starting IO based on the physical entity is insane. It's insane exactly _because_ the VM doesn't actually think in physical addresses, or in buffer-sized blocks. The VM only really knows about whole pages, and all the VM decisions fundamentally have to be page-based. We don't ever "free a buffer". We free a whole page, and as such, doing writeback based on buffers is pointless, because it doesn't actually say anything about the "page state" which is what the VM tracks. But neither of these means that "buffer_head" itself has to go away. They both really boil down to the same thing: you should never KEY things by the buffer head. All actions should be based on virtual indexes as far as at all humanly possible. Once you do lookup and locking and writeback _starting_ from the page, it's then easy to look up the actual buffer head within the page, and use that as a way to do the actual _IO_ on the physical address. So the buffer heads still exist in ext2, for example, but they don't drive the show quite as much. (They still do in some areas: the allocation bitmaps, the xattr code etc. But as long as none of those have big VM footprints, and as long as no _common_ operations really care deeply, and as long as those data structures never need to be touched by the VM or VFS layer, nobody will ever really care). The directory case comes up just because "readdir()" actually is very common, and sometimes very slow. And it can have a big VM working set footprint ("find"), so trying to be page-based actually really helps, because it all drives things like writeback on the _right_ issues, and we can do things like LRU's and writeback decisions on the level that really matters. I actually suspect that the inode tables could benefit from being in the page cache too (although I think that the inode buffer address is actually "physical", so there's no indirection for inode tables, which means that the virtual vs physical addressing doesn't matter). For directories, there definitely is a big cost to continually doing the virtual->physical translation all the time. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 16:11:44 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > > JBD implements physical block-based journalling, so it is 100% appropriate > > that JBD deal with these disk blocks using their buffer_head > > representation. > > And as long as it does that, you just have to face the fact that it's > going to perform like crap, including what you call "extra" writes, and > what I call "deal with it". It is quite tiresome to delete things which your interlocutor said and to then restate them as if it were some sort of relevation. > > Somewhat nastily, but as ext3 directories are metadata it is appropriate > > that modifications to them be done in terms of buffer_heads (ie: blocks). > > No. There is nothing "appropriate" about using buffer_heads for metadata. I said "modification". > [stuff about directory reads elided] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: > > They're extra. As in "can be optimised away". Sure. Don't use buffer heads. > The buffer_head is not an IO container. It is the kernel's core > representation of a disk block. Please come back from the 90's. The buffer heads are nothing but a mapping of where the hardware block is. If you use it for anything else, you're basically screwed. > JBD implements physical block-based journalling, so it is 100% appropriate > that JBD deal with these disk blocks using their buffer_head > representation. And as long as it does that, you just have to face the fact that it's going to perform like crap, including what you call "extra" writes, and what I call "deal with it". Btw, you can make pages be physically indexed too, but they obviously (a) won't be coherent with any virtual mapping laid on top of it (b) will be _physical_, so any readahead etc will be based on physical addresses too. > I thought I fixed the performance problem? No, you papered over it, for the reasonably common case where things were physically contiguous - exactly by using a physical page cache, so now it can do read-ahead based on that. Then, because the pages contain buffer heads, the directory accesses can look up buffers, and if it was all physically contiguous, it all works fine. But if you actually want virtualluy indexed caching (and all _users_ want it), it really doesn't work. > Somewhat nastily, but as ext3 directories are metadata it is appropriate > that modifications to them be done in terms of buffer_heads (ie: blocks). No. There is nothing "appropriate" about using buffer_heads for metadata. It's quite proper - and a hell of a lot more efficient - to use virtual page-caching for metadata too. Look at the ext2 readdir() implementation, and compare it to the crapola horror that is ext3. Guess what? ext2 uses virtually indexed metadata, and as a result it is both simpler, smaller and a LOT faster than ext3 in accessing that metadata. Face it, Andrew, you're wrong on this one. Really. Just take a look at ext2_readdir(). [ I'm not saying that ext2_readdir() is _beautiful_. If it had been written with the page cache in mind, it would probably have been done very differently. And it doesn't do any readahead, probably because nobody cared enough, but it should be trivial to add, and it would automatically "do the right thing" just because it's much easier at the page cache level. But I _am_ saying that compared to ext3, the ext2 readdir is a work of art. ] "metadata" has _zero_ to do with "physically indexed". There is no correlation what-so-ever. If you think there is a correlation, it's all in your mind. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 18:32:07 -0500 Theodore Tso <[EMAIL PROTECTED]> wrote: > On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote: > > I think ext3 is terminally crap by now. It still uses buffer heads in > > places where it really really shouldn't, and as a result, things like > > directory accesses are simply slower than they should be. Sadly, I don't > > think ext4 is going to fix any of this, either. > > Not just ext3; ocfs2 is using the jbd layer as well. I think we're > going to have to put this (a rework of jbd2 to use the page cache) on > the ext4 todo list, and work with the ocfs2 folks to try to come up > with something that suits their needs as well. Fortunately we have > this filesystem/storage summit thing coming up in the next few months, > and we can try to get some discussion going on the linux-ext4 mailing > list in the meantime. Unfortunately, I don't think this is going to > be trivial. I suspect it would be insane to move any part of JBD (apart from the ordered-data flush) to use pagecache. The whole thing is fundamentally block-based. But only for metadata - there's no strong reason why ext3/4 needs to manipulate file data via buffer_heads if data=journal and chattr +j aren't in use. We could possibly move ext3/4 directories out of the blockdev pagecache and into per-directory pagecache, but that wouldn't change anything - the journalling would still be block-based. Adam Richter spent considerable time a few years ago trying to make the mpage code go direct-to-BIO in all cases and we eventually gave up. The conceptual layering of page<->blocks<->bio is pretty clean, and it is hard and ugly to fully optimise away the "block" bit in the middle. buffer_heads become more important with large PAGE_CACHE_SIZE. I'd expect nobh mode to be quite inefficient with some workloads on 64k pages. We need that representation of the state (and location) of the block-sized hunks which make up the page. > If we do get this fixed for ext4, one interesting question is whether > people would accept a patch to backport the fixes to ext3, given the > the grief this is causing the page I/O and VM routines. OTOH, reiser3 > probably has the same problems, and I suspect the changes to ext3 to > cause it to avoid buffer heads, especially in order to support for > filesystem blocksizes < pagesize, are going to be sufficiently risky > in terms of introducing regressions to ext3 that they would probably > be rejected on those grounds. So unfortunately, we probably are going > to have to support flushes via buffer heads for the foreseeable > future. We'll see. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Theodore Tso wrote: > > If we do get this fixed for ext4, one interesting question is whether > people would accept a patch to backport the fixes to ext3, given the > the grief this is causing the page I/O and VM routines. I don't think backporting is the smartest option (unless it's done _way_ later), but the real problem with it isn't actually the VM behaviour, but simply the fact that cached performance absoluely _sucks_ with the buffer cache. With the physically indexed buffer cache thing, you end up always having to do these complicated translations into block numbers for every single access, and at some point when I benchmarked it, it was a huge overhead for doing simple things like readdir. It's also a major pain for read-ahead, exactly partly due to the high cost of translation - because you can't cheaply check whether the next block is there, the cost of even asking the question "should I try to read ahead?" is much much higher. As a result, read-ahead is seriously limited, because it's so expensive for the cached case (which is still hopefully the _common_ case). So because read-ahead is limited, the non-cached case then _really_ sucks. It was somewhat fixed in a really god-awful fashion by having ext3_readdir() actually do _readahead_ though the page cache, even though it does everything else through the buffer cache. And that just happens to work because we hopefully have physically contiguous blocks, but when that isn't true, the readahead doesn't do squat. It's really quite fundamentally broken. But none of that causes any problems for the VM, since directories cannot be mmap'ed anyway. But it's really pitiful, and it really doesn't work very well. Of course, other filesystems _also_ suck at this, and other operating systems haev even MORE problems, so people don't always seem to realize how horribly horribly broken this all is. I really wish somebody would write a filesystem that did large cold-cache directories well. Open some horrible file manager on /usr/bin with cold caches, and weep. The biggest problem is the inode indirection, but at some point when I looked at why it sucked, it was doing basically synchronous single-buffer reads on the directory too, because readahead didn't work properly. I was hoping that something like SpadFS would actually take off, because it seemed to do a lot of good design choices (having inodes in-line in the directory for when there are no hardlinks is probably a requirement for a good filesystem these days. The separate inode table had its uses, but indirection in a filesystem really does suck, and stat information is too important to be indirect unless it absolutely has to). But I suspect it needs more than somebody who just wants to get his thesis written ;) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 14:42:51 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Fri, 29 Dec 2006, Andrew Morton wrote: > > > > - The above change means that we do extra writeout. If a page is dirtied > > once, kjournald will write it and then pdflush will come along and > > needlessly write it again. > > There's zero extra writeout for any flushing that flushes BY PAGES. > > Only broken flushers that flush by buffer heads (which really really > really shouldn't be done any more: welcome to the 21st century) will cause > extra writeouts. And those extra writeouts are obviously required for all > the dirty state to actually hit the disk - which is the point of the > patch. > > So they're not "extra" - they are "required for correct working". They're extra. As in "can be optimised away". > But I can't stress the fact enough that people SHOULD NOT do writeback by > buffer heads. The buffer head has been purely an "IO entity" for the last > several years now, and it's not a cache entity. The buffer_head is not an IO container. It is the kernel's core representation of a disk block. Usually (but not always) it is backed by some memory which is in pagecache. We can feed buffer_heads into IO containers via submit_bh(), but that's far from the only thing we use buffer_heads for. We should have done s/buffer_head/block/g years ago. JBD implements physical block-based journalling, so it is 100% appropriate that JBD deal with these disk blocks using their buffer_head representation. That being said, ordered-data mode isn't really part of the JBD journalling system at all (the data doesn't get journalled!) - ordered-mode is an add-on to the JBD journal to make the metadata which we're about to journal point at more-likely-to-be-correct data. JBD's ordered-mode writeback is just a sync and I see no conceptual problems with killing its old buffer_head based sync and moving it into the 21st century. > Anybody who does writeback > by buffer heads is basically bypassing the real cache (the page cache), > and that's why all the problems happen. > > I think ext3 is terminally crap by now. It still uses buffer heads in > places where it really really shouldn't, The ordered-data mode flush: sure. The rest of JBD's use of buffer_heads is quite appropriate. > and as a result, things like > directory accesses are simply slower than they should be. Sadly, I don't > think ext4 is going to fix any of this, either. I thought I fixed the performance problem? Somewhat nastily, but as ext3 directories are metadata it is appropriate that modifications to them be done in terms of buffer_heads (ie: blocks). > It's all just too inherently wrongly designed around the buffer head > (which was correct in 1995, but hasn't been correct for a long time in the > kernel any more). > > > - Poor old IO accounting broke again. > > No. That's why I used "set_page_dirty()" and did it that strange ugly way > ("set page dirty, even though it's already dirty, and even though the very > next thing we will do is TestClearPageDirty???"). nfs_set_page_dirty() and reiserfs_set_page_dirty() should now bail if PageDirty() to avoid needless work. > > - For a long time I've wanted to nuke the current ext3/jbd ordered-data > > implementation altogether, and just make kjournald call into the > > standard writeback code to do a standard suberblock->inodes->pages walk. > > I really would like to see less of the buffer-head-based stuff, and yes, > more of the normal inode page walking. I don't think you can "order" > accesses within a page anyway, exactly because of memory mapping issues, > so any page ordering is not about buffer heads on the page itself, it > should be purely about metadata. In this context ext3's "ordered" mode means "sync the file contents before journalling the metadata which points at it". > > - It's pretty obnoxious that the VM now sets a clean page "dirty" and > > then proceeds to modify its contents. It would be nice to stop doing > > that. > > No. I think this really the fundamental confusion people had. People > thought that setting the page dirty meant that it was no longer being > modified. No. Setting a page (or bh, or inode) dirty means "this is known to have been modified". ie: this cached entity is now out of sync with backing store. Ho hum. I don't care much, really. But then, I understand how all this stuff works. Try explaining to someone the relationship between pte-dirtiness, page-dirtiness, radix-tree-dirtiness and buffer_head-dirtiness. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote: > I think ext3 is terminally crap by now. It still uses buffer heads in > places where it really really shouldn't, and as a result, things like > directory accesses are simply slower than they should be. Sadly, I don't > think ext4 is going to fix any of this, either. Not just ext3; ocfs2 is using the jbd layer as well. I think we're going to have to put this (a rework of jbd2 to use the page cache) on the ext4 todo list, and work with the ocfs2 folks to try to come up with something that suits their needs as well. Fortunately we have this filesystem/storage summit thing coming up in the next few months, and we can try to get some discussion going on the linux-ext4 mailing list in the meantime. Unfortunately, I don't think this is going to be trivial. If we do get this fixed for ext4, one interesting question is whether people would accept a patch to backport the fixes to ext3, given the the grief this is causing the page I/O and VM routines. OTOH, reiser3 probably has the same problems, and I suspect the changes to ext3 to cause it to avoid buffer heads, especially in order to support for filesystem blocksizes < pagesize, are going to be sufficiently risky in terms of introducing regressions to ext3 that they would probably be rejected on those grounds. So unfortunately, we probably are going to have to support flushes via buffer heads for the foreseeable future. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: > > - The above change means that we do extra writeout. If a page is dirtied > once, kjournald will write it and then pdflush will come along and > needlessly write it again. There's zero extra writeout for any flushing that flushes BY PAGES. Only broken flushers that flush by buffer heads (which really really really shouldn't be done any more: welcome to the 21st century) will cause extra writeouts. And those extra writeouts are obviously required for all the dirty state to actually hit the disk - which is the point of the patch. So they're not "extra" - they are "required for correct working". But I can't stress the fact enough that people SHOULD NOT do writeback by buffer heads. The buffer head has been purely an "IO entity" for the last several years now, and it's not a cache entity. Anybody who does writeback by buffer heads is basically bypassing the real cache (the page cache), and that's why all the problems happen. I think ext3 is terminally crap by now. It still uses buffer heads in places where it really really shouldn't, and as a result, things like directory accesses are simply slower than they should be. Sadly, I don't think ext4 is going to fix any of this, either. It's all just too inherently wrongly designed around the buffer head (which was correct in 1995, but hasn't been correct for a long time in the kernel any more). > - Poor old IO accounting broke again. No. That's why I used "set_page_dirty()" and did it that strange ugly way ("set page dirty, even though it's already dirty, and even though the very next thing we will do is TestClearPageDirty???"). That code looks strange as a result, which is why it now has more comments on it than actual code ;) > - People were saying that ext2 and ext3,data=writeback were also showing > corruption. What's up with that? I thought the "ext3,data=writeback" case was reported to be fine by several people? I'm not sure about ext2. I didn't look at what it did based on buffer heads. I would have expected it to be ok. That said, at least one report was later shown to be bogus (errors due to out of disk, not due to actual errors ;). > - For a long time I've wanted to nuke the current ext3/jbd ordered-data > implementation altogether, and just make kjournald call into the > standard writeback code to do a standard suberblock->inodes->pages walk. I really would like to see less of the buffer-head-based stuff, and yes, more of the normal inode page walking. I don't think you can "order" accesses within a page anyway, exactly because of memory mapping issues, so any page ordering is not about buffer heads on the page itself, it should be purely about metadata. > - It's pretty obnoxious that the VM now sets a clean page "dirty" and > then proceeds to modify its contents. It would be nice to stop doing > that. No. I think this really the fundamental confusion people had. People thought that setting the page dirty meant that it was no longer being modified. It hasn't meant that in a LONG time - ever since the whole DIRTY_TAG thing, the most important part of the PG_dirty thing has really been that it's now efficiently findable by the writeout logic. And that is very much what the whole page accounting _depends_ on. When we mmap a page, we need to mark it "findable" as dirty _before_ people actually start writing to it, because it's too late afterwards. > We could stop marking the page dirty in do_wp_page() and create a new > VM counter "NR_PTE_DIRTY", which means > > "number of mapping_cap_account_dirty() pages which have a dirty pte > pointing at them". Well, then you need to change what PAGE_MAPPING_TAG_DIRTY means too. That's very fundamental. That DIRTY _tag_ is now even more important than the PG_dirty bit itself, since that's what we actually use to _access_ those things. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 14:16:32 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: > - Poor old IO accounting broke again. No it didn't - we're relying upon the behaviour of __set_page_dirty_buffers() against an already-dirty page. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 02:48:35 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > + if (mapping && mapping_cap_account_dirty(mapping)) { > + /* > + * Yes, Virginia, this is indeed insane. > + * > + * We use this sequence to make sure that > + * (a) we account for dirty stats properly > + * (b) we tell the low-level filesystem to > + * mark the whole page dirty if it was > + * dirty in a pagetable. Only to then > + * (c) clean the page again and return 1 to > + * cause the writeback. > + * > + * This way we avoid all nasty races with the > + * dirty bit in multiple places and clearing > + * them concurrently from different threads. > + * > + * Note! Normally the "set_page_dirty(page)" > + * has no effect on the actual dirty bit - since > + * that will already usually be set. But we > + * need the side effects, and it can help us > + * avoid races. > + * > + * We basically use the page "master dirty bit" > + * as a serialization point for all the different > + * threds doing their things. > + * > + * FIXME! We still have a race here: if somebody > + * adds the page back to the page tables in > + * between the "page_mkclean()" and the "TestClearPageDirty()", > + * we might have it mapped without the dirty bit set. > + */ > + if (page_mkclean(page)) > + set_page_dirty(page); > + if (TestClearPageDirty(page)) { > dec_zone_page_state(page, NR_FILE_DIRTY); > + return 1; > } - Presumably reiser3's ordered-data mode has the same problem. And ext4, of course. Dunno about other filesytems. - The above change means that we do extra writeout. If a page is dirtied once, kjournald will write it and then pdflush will come along and needlessly write it again. But otoh, if a mapping is being repeatedly dirtied, kjournald will write the page once per 30 seconds (dirty_expire_centisecs) and pdflush will write the page once per 30 seconds as well. But we _should_ be writing it once per five seconds (kjournald commit interval). So we're still ahead ;) - Poor old IO accounting broke again. - People were saying that ext2 and ext3,data=writeback were also showing corruption. What's up with that? - For a long time I've wanted to nuke the current ext3/jbd ordered-data implementation altogether, and just make kjournald call into the standard writeback code to do a standard suberblock->inodes->pages walk. I think it'd be fairly straightforward to do. We'd need to teach the writeback code to be able to skip dirty pages which don't have a disk mapping, so that kjournald doesn't end up waiting for kjournald to free up journal space.. Would need to avoid possible deadlocks where someone calls ext3_force_commit() or otherwise does a synchronous commit while holding VFS locks. reiser3 and ext4 could be converted too. Not a short-term project, but this would avoid the problem. - It's pretty obnoxious that the VM now sets a clean page "dirty" and then proceeds to modify its contents. It would be nice to stop doing that. We could stop marking the page dirty in do_wp_page() and create a new VM counter "NR_PTE_DIRTY", which means "number of mapping_cap_account_dirty() pages which have a dirty pte pointing at them". Or, perhaps "number of dirty ptes which point at mapping_cap_account_dirty() pages". Which can be larger, but the writeout code will probably cope. Then we take NR_PTE_DIRTY into account in the dirty-page balancing act. So - do_wp_page() will still run balance_dirty_pages() - but it would no longer run set_page_dirty(). - But it needs to run mark_inode_dirty() so the fs-writeback code notices the file. - And mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) becomes insufficient. The tricky part here is "how do we do the writeback"? The pte-dirty,!PageDirty pages aren't tagged as dirty in the radix-tree and writeback needs to find them so that it can effectively do an msync() on them. Walking all the mm's and vma's would be insane. Visiting all the pages in the file would also probably be insane. Perhaps this can be solved by adding a new radix-tree tag which means "this page might have dirty ptes pointing at it". For each file writeback would do a radix-tree walk of these pages, cleaning-and-write-protecting ptes, marking the corresponding pages dirty and clearing their PAGECACHE_TAG_PTE_DIRTY tags. Then we can fix the mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) problem by doing
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Theodore Tso wrote: > > I'm confused. Does this mean that if "fs blocksize"=="VM pagesize" > this bug can't trigger? No. Even if there is just a single buffer-head, if the filesystem ever writes out that _single_ buffer-head out of turn (ie before the VM actually asks it to, with "->writepage()"), then the same issue will happen. In fact, a bigger fs blocksize will likely just make this easier to trigger (although I doubt it makes a big difference), since any out-of-order buffer flushback will happen for the whole page, rather than just a part of the page. So the "problem" really ends up being that the filesystem does flushing that the VM isn't aware of, so when the VM did "set_page_dirty()" at an earlier time, the VM _expected_ the "->writepages()" call that happened much later to write the whole page - but because the FS had flushed things behind it backs even _before_ the "->writepage" happens, by the time the VM actually asks for the page to be written out, the FS layer won't actually write it all out any more. Blocksize doesn't matter, the only thing that matters is whether something writes out data on a buffer-cache level, not on a "page cache" level. Ext3 apparently does this in "ordered" data more at least (and hey, I suspect that the code that tries to release buffer head data might try to do it on its own too). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Nick Piggin wrote: > > > It still has a tiny tiny race (see the comment), but I bet nobody can really > > hit it in real life anyway, and I know several ways to fix it, so I'm not > > really _that_ worried about it. > > Well the race isn't a data loss one, is it? Just a case where the > pte may be dirty but the page dirty state not accounted for. Right. We should be picking it up eventually, since it's still in the page tables, but if we've lost sight of the page dirtyness we won't react correctly to msync() and/or fdatasync(). So we don't _lose_ the data, we just might not write it out in a timely manner if we ever hit the race. > Can we fix it by just putting the page_mkclean back inside the > TestClearPageDirty check, and re-clearing PG_dirty after redoing > the set_page_dirty? I considered it, but quite frankly, if we did it that way, I'd really like to just fix the whole insane "set_page_dirty()" instead. I think set_page_dirty() should be split up. One thing that confused me mentally was that almost all of the dirty handling was actualyl done only if PG_dirty wasn't already set, so the _bulk_ of set_page_dirty() really ends up being if (!TestSetPageDirty(page)) { .. we just marked the page dirty, it was clean before, so we need to add it to the queues etc .. } and that's the part that I (and probably others) always really thought about. But then we have the _one_ thing that runs outside of that "do only once per dirty bit" logic, and it's the buffer dirtying. If we had had two separate operations for this all: "set_dirty_every_time()" and the regular "set_dirty()", I don't think this would have been nearly as confusing. (And then the difference between "__set_page_dirty_nobuffers()" and "__set_page_dirty_buffers()" really boils down to one doing the "everytime" _and_ the "once per dirty" checks and the other one doing just the "once per dirty bit" act - and we could rename the damn things to something saner too). If we split it up that way, then the whole clear_page_dirty_for_io() logic would boil down to if (TestClearPageDirty(page)) { if (page_mkclean(page)) set_dirty_every_time(); return 1; } return 0; and we wouldn't even need to do any of the "clear dirty again" kind of idiocy, because the "set_dirty_every_time()" stuff is the one that doesn't even care about the state of the PG_dirty bit - it's done regardless, and doesn't really touch it. That's what I wanted to do, but with the current "set_page_dirty()" setup, I think my patch makes reasonable sense. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
* Stephen Clark <[EMAIL PROTECTED]> [2006-12-29 10:17]: > >It works for me now, both your testcase as well as an installation of > >Debian on this ARM device. I manually applied the patch to 2.6.19. > > Can you post a diff against 2.6.19? --- a/mm/page-writeback.c 2006-11-29 21:57:37.0 + +++ b/mm/page-writeback.c 2006-12-29 11:02:55.555147896 + @@ -893,16 +893,45 @@ { struct address_space *mapping = page_mapping(page); - if (mapping) { + if (mapping && mapping_cap_account_dirty(mapping)) { + /* +* Yes, Virginia, this is indeed insane. +* +* We use this sequence to make sure that +* (a) we account for dirty stats properly +* (b) we tell the low-level filesystem to +* mark the whole page dirty if it was +* dirty in a pagetable. Only to then +* (c) clean the page again and return 1 to +* cause the writeback. +* +* This way we avoid all nasty races with the +* dirty bit in multiple places and clearing +* them concurrently from different threads. +* +* Note! Normally the "set_page_dirty(page)" +* has no effect on the actual dirty bit - since +* that will already usually be set. But we +* need the side effects, and it can help us +* avoid races. +* +* We basically use the page "master dirty bit" +* as a serialization point for all the different +* threds doing their things. +* +* FIXME! We still have a race here: if somebody +* adds the page back to the page tables in +* between the "page_mkclean()" and the "TestClearPageDirty()", +* we might have it mapped without the dirty bit set. +*/ + if (page_mkclean(page)) + set_page_dirty(page); if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } + dec_zone_page_state(page, NR_FILE_DIRTY); return 1; } return 0; - } + } return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, Dec 29, 2006 at 12:58:12AM -0800, Linus Torvalds wrote: > Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE > "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as > dirty. That may _sound_ like what we want, but it really isn't. Because by > the time "writepage()" is actually called (which can be MUCH MUCH later), > some internal filesystem activity may actually have cleaned one or more of > those buffers in the meantime, and now we call "writepage()" (which really > wants to write them _all_), and it will write only part of them, or none > at all. I'm confused. Does this mean that if "fs blocksize"=="VM pagesize" this bug can't trigger? But I thought at least one of people reporting corruption was using a filesystem with a 4k block size on an i386? - Ted - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Martin Michlmayr wrote: * Linus Torvalds <[EMAIL PROTECTED]> [2006-12-29 02:48]: Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. It works for me now, both your testcase as well as an installation of Debian on this ARM device. I manually applied the patch to 2.6.19. Thanks. Hi Martin, Can you post a diff against 2.6.19? Thanks, Steve -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-29 02:48]: > Can anybody get corruption with this thing applied? It goes on top > of plain v2.6.20-rc2. It works for me now, both your testcase as well as an installation of Debian on this ARM device. I manually applied the patch to 2.6.19. Thanks. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Linus Torvalds wrote: [...] The patch is mostly a comment. The "real" meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. No corruption with the testcase here. Will check with rtorrent too later today but I suppose it will work just fine. Nice work! It has been interesting (and educating) to follow this bug-hunt :) /Martin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > > Hmm? I'd love it if somebody else wrote the patch and tested it, > > because I'm getting sick and tired of this bug ;) > > Who the hell am I kidding? I haven't been able to sleep right for the > last few days over this bug. It was really getting to me. > > And putting on the thinking cap, there's actually a fairly simple an > nonintrusive patch. [...] ok, your patch seems to fix the testcase here too on -rc2-rt. [ Damn, i should have slept a bit more, that would have saved me a ~4 hour debug and tracing session today to analyze your testcase, just to find your patch and your explanation on lkml, right after i sent my analysis and workaround patch ;-) At least now we know it from two independent tracing results that the suspect code is the same. ] Ingo - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Hey nice work Linus! Linus Torvalds wrote: On Fri, 29 Dec 2006, Linus Torvalds wrote: Hmm? I'd love it if somebody else wrote the patch and tested it, because I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. Yeah *this* makes more sense. And in retrospect it was simple, we can't just throw out pte dirtiness information if the page doesn't have all buffers dirtied. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. Well the race isn't a data loss one, is it? Just a case where the pte may be dirty but the page dirty state not accounted for. Can we fix it by just putting the page_mkclean back inside the TestClearPageDirty check, and re-clearing PG_dirty after redoing the set_page_dirty? The patch is mostly a comment. The "real" meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. Linus diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b3a198c..ec01da1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); - if (!mapping) - return TestClearPageDirty(page); - - if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (mapping && mapping_cap_account_dirty(mapping)) { + /* +* Yes, Virginia, this is indeed insane. +* +* We use this sequence to make sure that +* (a) we account for dirty stats properly +* (b) we tell the low-level filesystem to +* mark the whole page dirty if it was +* dirty in a pagetable. Only to then +* (c) clean the page again and return 1 to +* cause the writeback. +* +* This way we avoid all nasty races with the +* dirty bit in multiple places and clearing +* them concurrently from different threads. +* +* Note! Normally the "set_page_dirty(page)" +* has no effect on the actual dirty bit - since +* that will already usually be set. But we +* need the side effects, and it can help us +* avoid races. +* +* We basically use the page "master dirty bit" +* as a serialization point for all the different +* threds doing their things. +* +* FIXME! We still have a race here: if somebody +* adds the page back to the page tables in +* between the "page_mkclean()" and the "TestClearPageDirty()", +* we might have it mapped without the dirty bit set. +*/ + if (page_mkclean(page)) + set_page_dirty(page); + if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + return 1; } - return 1; + return 0; } - return 0; + return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 2006-12-29 at 02:48 -0800, Linus Torvalds wrote: > > On Fri, 29 Dec 2006, Linus Torvalds wrote: > > > > Hmm? I'd love it if somebody else wrote the patch and tested it, because > > I'm getting sick and tired of this bug ;) > > Who the hell am I kidding? I haven't been able to sleep right for the last > few days over this bug. It was really getting to me. > > And putting on the thinking cap, there's actually a fairly simple an > nonintrusive patch. It still has a tiny tiny race (see the comment), but I > bet nobody can really hit it in real life anyway, and I know several ways > to fix it, so I'm not really _that_ worried about it. > > The patch is mostly a comment. The "real" meat of it is actually just a > few lines. > > Can anybody get corruption with this thing applied? It goes on top of > plain v2.6.20-rc2. Tested with rtorrent and there is no corruption. > > Linus > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index b3a198c..ec01da1 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) > { > struct address_space *mapping = page_mapping(page); > > - if (!mapping) > - return TestClearPageDirty(page); > - > - if (TestClearPageDirty(page)) { > - if (mapping_cap_account_dirty(mapping)) { > - page_mkclean(page); > + if (mapping && mapping_cap_account_dirty(mapping)) { > + /* > + * Yes, Virginia, this is indeed insane. > + * > + * We use this sequence to make sure that > + * (a) we account for dirty stats properly > + * (b) we tell the low-level filesystem to > + * mark the whole page dirty if it was > + * dirty in a pagetable. Only to then > + * (c) clean the page again and return 1 to > + * cause the writeback. > + * > + * This way we avoid all nasty races with the > + * dirty bit in multiple places and clearing > + * them concurrently from different threads. > + * > + * Note! Normally the "set_page_dirty(page)" > + * has no effect on the actual dirty bit - since > + * that will already usually be set. But we > + * need the side effects, and it can help us > + * avoid races. > + * > + * We basically use the page "master dirty bit" > + * as a serialization point for all the different > + * threds doing their things. > + * > + * FIXME! We still have a race here: if somebody > + * adds the page back to the page tables in > + * between the "page_mkclean()" and the "TestClearPageDirty()", > + * we might have it mapped without the dirty bit set. > + */ > + if (page_mkclean(page)) > + set_page_dirty(page); > + if (TestClearPageDirty(page)) { > dec_zone_page_state(page, NR_FILE_DIRTY); > + return 1; > } > - return 1; > + return 0; > } > - return 0; > + return TestClearPageDirty(page); > } > EXPORT_SYMBOL(clear_page_dirty_for_io); > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Linus Torvalds wrote: > > Hmm? I'd love it if somebody else wrote the patch and tested it, because > I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. The patch is mostly a comment. The "real" meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. Linus diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b3a198c..ec01da1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); - if (!mapping) - return TestClearPageDirty(page); - - if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (mapping && mapping_cap_account_dirty(mapping)) { + /* +* Yes, Virginia, this is indeed insane. +* +* We use this sequence to make sure that +* (a) we account for dirty stats properly +* (b) we tell the low-level filesystem to +* mark the whole page dirty if it was +* dirty in a pagetable. Only to then +* (c) clean the page again and return 1 to +* cause the writeback. +* +* This way we avoid all nasty races with the +* dirty bit in multiple places and clearing +* them concurrently from different threads. +* +* Note! Normally the "set_page_dirty(page)" +* has no effect on the actual dirty bit - since +* that will already usually be set. But we +* need the side effects, and it can help us +* avoid races. +* +* We basically use the page "master dirty bit" +* as a serialization point for all the different +* threds doing their things. +* +* FIXME! We still have a race here: if somebody +* adds the page back to the page tables in +* between the "page_mkclean()" and the "TestClearPageDirty()", +* we might have it mapped without the dirty bit set. +*/ + if (page_mkclean(page)) + set_page_dirty(page); + if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + return 1; } - return 1; + return 0; } - return 0; + return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Thu, 28 Dec 2006, Linus Torvalds wrote: > > So everything I have ever seen says that the VM layer is actually doing > everything right. That was true, but at the same time, it's not. Let me explain. > That to me says: "somebody didn't actually write out out". The VM layer > asked the filesystem to do the write, but the filesystem just didn't do > it. I personally think it's because some buffer-head BH_dirty bit got > scrogged Ok, I have proof of this now. Here's a trace (with cycle counts), and with a new trace event added: this is for another corrupted page. I have: 49105 PG 15d8 (14800): mm/page-writeback.c:872 clean_for_io 49106 PG 15d8 (6900): mm/rmap.c:451 cleaning PTE b7fa6000 49107 PG 15d8 (9900): mm/page-writeback.c:914 set writeback 49108 PG 15d8 (6480): mm/page-writeback.c:916 setting TAG_WRITEBACK 49109 PG 15d8 (7110): mm/page-writeback.c:922 clearing TAG_DIRTY 49110 PG 15d8 (7190): fs/buffer.c:1713 no IO underway 49111 PG 15d8 (6180): mm/page-writeback.c:891 end writeback 49112 PG 15d8 (6460): mm/page-writeback.c:893 clearing TAG_WRITEBACK where that first column is the trace event number again, and the "PG 15d8" is that corrupted page. The thing in the parenthesis is "CPU cycles since last event), and the important part to note is that this is indeed all one single thing with no actual IO anywhere (~7000 CPU cycles may sound like a lot, but (a) it's not that many cache misses and (b) a lot of it is the logging overhead - back-to-back log events will take about 3500 cycles) just because it does the actual ASCII printk() etc. Also, the new event is: fs/buffer.c:1713 no IO underway which is just the if (nr_underway == 0) case in fs/buffer.c And I now finally really believe that I fully understand the corruption, but I don't have a simple solution, much less a patch. What the problem basically boils down to is that "set_page_dirty()" is supposed to be a mark for dirtying THE WHOLE PAGE, but it really is not "the whole page when the 'set_page_dirty()' itself happens", but more of a "the next writepage() needs to write back the whole page" thing. And that's not what "__set_page_dirty_buffers()" really does. Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as dirty. That may _sound_ like what we want, but it really isn't. Because by the time "writepage()" is actually called (which can be MUCH MUCH later), some internal filesystem activity may actually have cleaned one or more of those buffers in the meantime, and now we call "writepage()" (which really wants to write them _all_), and it will write only part of them, or none at all. So the VM thought that since it did a "writepage()", all the dirty state at that point got written back. But it didn't - the filesystem could have written back part or all of the page much earlier, and the writepage() actually does nothing at all. Both filesystem and VM actually _think_ they do the right thing, because they simply have totally different expectations. The filesystem thinks that it should care about dirty buffers (that got marked dirty _after_ they were dirtied), while the filesystem thinks that it cares about dirty _pages_ (that got dirted at any time _before_ "writepage()" was called). Neither is really "wrong", per se, it's just that the two parts have different expectations, and the _combination_ just doesn't work. "set_page_dirty()" at some point meant "the writes have been done", but these days it really means something else. Now, the reason there is no trivial patch is not that this is conceptually really hard to fix. I can see several different approaches to fixing it, but they all really boil down to two alternatives: (a) splitting the one "PG_dirty" bit up into two bits: the "PG_writescheduled" bit and the "PG_alldirty" bit. The "PG_write_scheduled" bit would be the bit that the filesystem would set when it has pending dirty data that it wrote itself (and that may not cover the whole page), and is the part of PG_dirty that sets the PAGECACHE_TAG_DIRTY. It's also what forces "writepage()" to be called. The "PG_alldirty" bit is just an additional "somebody else dirtied random parts of this page, and we don't know what" flag, which is set by "set_page_dirty()" in addition to doing the PG_write_scheduled stuff. We would test-and-clear it at "writepage()" time, and pass it in to "writepages()" to tell the writepage() function that it can't just write out its own small limited notion of what is dirty. (There are various variations on this whole theme: instead of having a flag to "writepage()", we could split the "whole page" case out as a separate callback or similar) (b) making sure that all "set_page_dirty()" calls are _after_ the page
Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Thu, 28 Dec 2006, Linus Torvalds wrote: So everything I have ever seen says that the VM layer is actually doing everything right. That was true, but at the same time, it's not. Let me explain. That to me says: somebody didn't actually write out out. The VM layer asked the filesystem to do the write, but the filesystem just didn't do it. I personally think it's because some buffer-head BH_dirty bit got scrogged Ok, I have proof of this now. Here's a trace (with cycle counts), and with a new trace event added: this is for another corrupted page. I have: 49105 PG 15d8 (14800): mm/page-writeback.c:872 clean_for_io 49106 PG 15d8 (6900): mm/rmap.c:451 cleaning PTE b7fa6000 49107 PG 15d8 (9900): mm/page-writeback.c:914 set writeback 49108 PG 15d8 (6480): mm/page-writeback.c:916 setting TAG_WRITEBACK 49109 PG 15d8 (7110): mm/page-writeback.c:922 clearing TAG_DIRTY 49110 PG 15d8 (7190): fs/buffer.c:1713 no IO underway 49111 PG 15d8 (6180): mm/page-writeback.c:891 end writeback 49112 PG 15d8 (6460): mm/page-writeback.c:893 clearing TAG_WRITEBACK where that first column is the trace event number again, and the PG 15d8 is that corrupted page. The thing in the parenthesis is CPU cycles since last event), and the important part to note is that this is indeed all one single thing with no actual IO anywhere (~7000 CPU cycles may sound like a lot, but (a) it's not that many cache misses and (b) a lot of it is the logging overhead - back-to-back log events will take about 3500 cycles) just because it does the actual ASCII printk() etc. Also, the new event is: fs/buffer.c:1713 no IO underway which is just the if (nr_underway == 0) case in fs/buffer.c And I now finally really believe that I fully understand the corruption, but I don't have a simple solution, much less a patch. What the problem basically boils down to is that set_page_dirty() is supposed to be a mark for dirtying THE WHOLE PAGE, but it really is not the whole page when the 'set_page_dirty()' itself happens, but more of a the next writepage() needs to write back the whole page thing. And that's not what __set_page_dirty_buffers() really does. Because what __set_page_dirty_buffers() does is that AT THE TIME THE set_page_dirty() IS CALLED, it will mark all the buffers on that page as dirty. That may _sound_ like what we want, but it really isn't. Because by the time writepage() is actually called (which can be MUCH MUCH later), some internal filesystem activity may actually have cleaned one or more of those buffers in the meantime, and now we call writepage() (which really wants to write them _all_), and it will write only part of them, or none at all. So the VM thought that since it did a writepage(), all the dirty state at that point got written back. But it didn't - the filesystem could have written back part or all of the page much earlier, and the writepage() actually does nothing at all. Both filesystem and VM actually _think_ they do the right thing, because they simply have totally different expectations. The filesystem thinks that it should care about dirty buffers (that got marked dirty _after_ they were dirtied), while the filesystem thinks that it cares about dirty _pages_ (that got dirted at any time _before_ writepage() was called). Neither is really wrong, per se, it's just that the two parts have different expectations, and the _combination_ just doesn't work. set_page_dirty() at some point meant the writes have been done, but these days it really means something else. Now, the reason there is no trivial patch is not that this is conceptually really hard to fix. I can see several different approaches to fixing it, but they all really boil down to two alternatives: (a) splitting the one PG_dirty bit up into two bits: the PG_writescheduled bit and the PG_alldirty bit. The PG_write_scheduled bit would be the bit that the filesystem would set when it has pending dirty data that it wrote itself (and that may not cover the whole page), and is the part of PG_dirty that sets the PAGECACHE_TAG_DIRTY. It's also what forces writepage() to be called. The PG_alldirty bit is just an additional somebody else dirtied random parts of this page, and we don't know what flag, which is set by set_page_dirty() in addition to doing the PG_write_scheduled stuff. We would test-and-clear it at writepage() time, and pass it in to writepages() to tell the writepage() function that it can't just write out its own small limited notion of what is dirty. (There are various variations on this whole theme: instead of having a flag to writepage(), we could split the whole page case out as a separate callback or similar) (b) making sure that all set_page_dirty() calls are _after_ the page has been marked dirty (which in the case of memory mapped pages
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Linus Torvalds wrote: Hmm? I'd love it if somebody else wrote the patch and tested it, because I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. The patch is mostly a comment. The real meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. Linus diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b3a198c..ec01da1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); - if (!mapping) - return TestClearPageDirty(page); - - if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (mapping mapping_cap_account_dirty(mapping)) { + /* +* Yes, Virginia, this is indeed insane. +* +* We use this sequence to make sure that +* (a) we account for dirty stats properly +* (b) we tell the low-level filesystem to +* mark the whole page dirty if it was +* dirty in a pagetable. Only to then +* (c) clean the page again and return 1 to +* cause the writeback. +* +* This way we avoid all nasty races with the +* dirty bit in multiple places and clearing +* them concurrently from different threads. +* +* Note! Normally the set_page_dirty(page) +* has no effect on the actual dirty bit - since +* that will already usually be set. But we +* need the side effects, and it can help us +* avoid races. +* +* We basically use the page master dirty bit +* as a serialization point for all the different +* threds doing their things. +* +* FIXME! We still have a race here: if somebody +* adds the page back to the page tables in +* between the page_mkclean() and the TestClearPageDirty(), +* we might have it mapped without the dirty bit set. +*/ + if (page_mkclean(page)) + set_page_dirty(page); + if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + return 1; } - return 1; + return 0; } - return 0; + return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 2006-12-29 at 02:48 -0800, Linus Torvalds wrote: On Fri, 29 Dec 2006, Linus Torvalds wrote: Hmm? I'd love it if somebody else wrote the patch and tested it, because I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. The patch is mostly a comment. The real meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. Tested with rtorrent and there is no corruption. Linus diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b3a198c..ec01da1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); - if (!mapping) - return TestClearPageDirty(page); - - if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (mapping mapping_cap_account_dirty(mapping)) { + /* + * Yes, Virginia, this is indeed insane. + * + * We use this sequence to make sure that + * (a) we account for dirty stats properly + * (b) we tell the low-level filesystem to + * mark the whole page dirty if it was + * dirty in a pagetable. Only to then + * (c) clean the page again and return 1 to + * cause the writeback. + * + * This way we avoid all nasty races with the + * dirty bit in multiple places and clearing + * them concurrently from different threads. + * + * Note! Normally the set_page_dirty(page) + * has no effect on the actual dirty bit - since + * that will already usually be set. But we + * need the side effects, and it can help us + * avoid races. + * + * We basically use the page master dirty bit + * as a serialization point for all the different + * threds doing their things. + * + * FIXME! We still have a race here: if somebody + * adds the page back to the page tables in + * between the page_mkclean() and the TestClearPageDirty(), + * we might have it mapped without the dirty bit set. + */ + if (page_mkclean(page)) + set_page_dirty(page); + if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + return 1; } - return 1; + return 0; } - return 0; + return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Hey nice work Linus! Linus Torvalds wrote: On Fri, 29 Dec 2006, Linus Torvalds wrote: Hmm? I'd love it if somebody else wrote the patch and tested it, because I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. Yeah *this* makes more sense. And in retrospect it was simple, we can't just throw out pte dirtiness information if the page doesn't have all buffers dirtied. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. Well the race isn't a data loss one, is it? Just a case where the pte may be dirty but the page dirty state not accounted for. Can we fix it by just putting the page_mkclean back inside the TestClearPageDirty check, and re-clearing PG_dirty after redoing the set_page_dirty? The patch is mostly a comment. The real meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. Linus diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b3a198c..ec01da1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); - if (!mapping) - return TestClearPageDirty(page); - - if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (mapping mapping_cap_account_dirty(mapping)) { + /* +* Yes, Virginia, this is indeed insane. +* +* We use this sequence to make sure that +* (a) we account for dirty stats properly +* (b) we tell the low-level filesystem to +* mark the whole page dirty if it was +* dirty in a pagetable. Only to then +* (c) clean the page again and return 1 to +* cause the writeback. +* +* This way we avoid all nasty races with the +* dirty bit in multiple places and clearing +* them concurrently from different threads. +* +* Note! Normally the set_page_dirty(page) +* has no effect on the actual dirty bit - since +* that will already usually be set. But we +* need the side effects, and it can help us +* avoid races. +* +* We basically use the page master dirty bit +* as a serialization point for all the different +* threds doing their things. +* +* FIXME! We still have a race here: if somebody +* adds the page back to the page tables in +* between the page_mkclean() and the TestClearPageDirty(), +* we might have it mapped without the dirty bit set. +*/ + if (page_mkclean(page)) + set_page_dirty(page); + if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + return 1; } - return 1; + return 0; } - return 0; + return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
* Linus Torvalds [EMAIL PROTECTED] wrote: Hmm? I'd love it if somebody else wrote the patch and tested it, because I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple an nonintrusive patch. [...] ok, your patch seems to fix the testcase here too on -rc2-rt. [ Damn, i should have slept a bit more, that would have saved me a ~4 hour debug and tracing session today to analyze your testcase, just to find your patch and your explanation on lkml, right after i sent my analysis and workaround patch ;-) At least now we know it from two independent tracing results that the suspect code is the same. ] Ingo - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Linus Torvalds wrote: [...] The patch is mostly a comment. The real meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. No corruption with the testcase here. Will check with rtorrent too later today but I suppose it will work just fine. Nice work! It has been interesting (and educating) to follow this bug-hunt :) /Martin - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
* Linus Torvalds [EMAIL PROTECTED] [2006-12-29 02:48]: Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. It works for me now, both your testcase as well as an installation of Debian on this ARM device. I manually applied the patch to 2.6.19. Thanks. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Martin Michlmayr wrote: * Linus Torvalds [EMAIL PROTECTED] [2006-12-29 02:48]: Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. It works for me now, both your testcase as well as an installation of Debian on this ARM device. I manually applied the patch to 2.6.19. Thanks. Hi Martin, Can you post a diff against 2.6.19? Thanks, Steve -- They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety. (Ben Franklin) The course of history shows that as a government grows, liberty decreases. (Thomas Jefferson) - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, Dec 29, 2006 at 12:58:12AM -0800, Linus Torvalds wrote: Because what __set_page_dirty_buffers() does is that AT THE TIME THE set_page_dirty() IS CALLED, it will mark all the buffers on that page as dirty. That may _sound_ like what we want, but it really isn't. Because by the time writepage() is actually called (which can be MUCH MUCH later), some internal filesystem activity may actually have cleaned one or more of those buffers in the meantime, and now we call writepage() (which really wants to write them _all_), and it will write only part of them, or none at all. I'm confused. Does this mean that if fs blocksize==VM pagesize this bug can't trigger? But I thought at least one of people reporting corruption was using a filesystem with a 4k block size on an i386? - Ted - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
* Stephen Clark [EMAIL PROTECTED] [2006-12-29 10:17]: It works for me now, both your testcase as well as an installation of Debian on this ARM device. I manually applied the patch to 2.6.19. Can you post a diff against 2.6.19? --- a/mm/page-writeback.c 2006-11-29 21:57:37.0 + +++ b/mm/page-writeback.c 2006-12-29 11:02:55.555147896 + @@ -893,16 +893,45 @@ { struct address_space *mapping = page_mapping(page); - if (mapping) { + if (mapping mapping_cap_account_dirty(mapping)) { + /* +* Yes, Virginia, this is indeed insane. +* +* We use this sequence to make sure that +* (a) we account for dirty stats properly +* (b) we tell the low-level filesystem to +* mark the whole page dirty if it was +* dirty in a pagetable. Only to then +* (c) clean the page again and return 1 to +* cause the writeback. +* +* This way we avoid all nasty races with the +* dirty bit in multiple places and clearing +* them concurrently from different threads. +* +* Note! Normally the set_page_dirty(page) +* has no effect on the actual dirty bit - since +* that will already usually be set. But we +* need the side effects, and it can help us +* avoid races. +* +* We basically use the page master dirty bit +* as a serialization point for all the different +* threds doing their things. +* +* FIXME! We still have a race here: if somebody +* adds the page back to the page tables in +* between the page_mkclean() and the TestClearPageDirty(), +* we might have it mapped without the dirty bit set. +*/ + if (page_mkclean(page)) + set_page_dirty(page); if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } + dec_zone_page_state(page, NR_FILE_DIRTY); return 1; } return 0; - } + } return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Nick Piggin wrote: It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. Well the race isn't a data loss one, is it? Just a case where the pte may be dirty but the page dirty state not accounted for. Right. We should be picking it up eventually, since it's still in the page tables, but if we've lost sight of the page dirtyness we won't react correctly to msync() and/or fdatasync(). So we don't _lose_ the data, we just might not write it out in a timely manner if we ever hit the race. Can we fix it by just putting the page_mkclean back inside the TestClearPageDirty check, and re-clearing PG_dirty after redoing the set_page_dirty? I considered it, but quite frankly, if we did it that way, I'd really like to just fix the whole insane set_page_dirty() instead. I think set_page_dirty() should be split up. One thing that confused me mentally was that almost all of the dirty handling was actualyl done only if PG_dirty wasn't already set, so the _bulk_ of set_page_dirty() really ends up being if (!TestSetPageDirty(page)) { .. we just marked the page dirty, it was clean before, so we need to add it to the queues etc .. } and that's the part that I (and probably others) always really thought about. But then we have the _one_ thing that runs outside of that do only once per dirty bit logic, and it's the buffer dirtying. If we had had two separate operations for this all: set_dirty_every_time() and the regular set_dirty(), I don't think this would have been nearly as confusing. (And then the difference between __set_page_dirty_nobuffers() and __set_page_dirty_buffers() really boils down to one doing the everytime _and_ the once per dirty checks and the other one doing just the once per dirty bit act - and we could rename the damn things to something saner too). If we split it up that way, then the whole clear_page_dirty_for_io() logic would boil down to if (TestClearPageDirty(page)) { if (page_mkclean(page)) set_dirty_every_time(); return 1; } return 0; and we wouldn't even need to do any of the clear dirty again kind of idiocy, because the set_dirty_every_time() stuff is the one that doesn't even care about the state of the PG_dirty bit - it's done regardless, and doesn't really touch it. That's what I wanted to do, but with the current set_page_dirty() setup, I think my patch makes reasonable sense. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Theodore Tso wrote: I'm confused. Does this mean that if fs blocksize==VM pagesize this bug can't trigger? No. Even if there is just a single buffer-head, if the filesystem ever writes out that _single_ buffer-head out of turn (ie before the VM actually asks it to, with -writepage()), then the same issue will happen. In fact, a bigger fs blocksize will likely just make this easier to trigger (although I doubt it makes a big difference), since any out-of-order buffer flushback will happen for the whole page, rather than just a part of the page. So the problem really ends up being that the filesystem does flushing that the VM isn't aware of, so when the VM did set_page_dirty() at an earlier time, the VM _expected_ the -writepages() call that happened much later to write the whole page - but because the FS had flushed things behind it backs even _before_ the -writepage happens, by the time the VM actually asks for the page to be written out, the FS layer won't actually write it all out any more. Blocksize doesn't matter, the only thing that matters is whether something writes out data on a buffer-cache level, not on a page cache level. Ext3 apparently does this in ordered data more at least (and hey, I suspect that the code that tries to release buffer head data might try to do it on its own too). Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 02:48:35 -0800 (PST) Linus Torvalds [EMAIL PROTECTED] wrote: + if (mapping mapping_cap_account_dirty(mapping)) { + /* + * Yes, Virginia, this is indeed insane. + * + * We use this sequence to make sure that + * (a) we account for dirty stats properly + * (b) we tell the low-level filesystem to + * mark the whole page dirty if it was + * dirty in a pagetable. Only to then + * (c) clean the page again and return 1 to + * cause the writeback. + * + * This way we avoid all nasty races with the + * dirty bit in multiple places and clearing + * them concurrently from different threads. + * + * Note! Normally the set_page_dirty(page) + * has no effect on the actual dirty bit - since + * that will already usually be set. But we + * need the side effects, and it can help us + * avoid races. + * + * We basically use the page master dirty bit + * as a serialization point for all the different + * threds doing their things. + * + * FIXME! We still have a race here: if somebody + * adds the page back to the page tables in + * between the page_mkclean() and the TestClearPageDirty(), + * we might have it mapped without the dirty bit set. + */ + if (page_mkclean(page)) + set_page_dirty(page); + if (TestClearPageDirty(page)) { dec_zone_page_state(page, NR_FILE_DIRTY); + return 1; } - Presumably reiser3's ordered-data mode has the same problem. And ext4, of course. Dunno about other filesytems. - The above change means that we do extra writeout. If a page is dirtied once, kjournald will write it and then pdflush will come along and needlessly write it again. But otoh, if a mapping is being repeatedly dirtied, kjournald will write the page once per 30 seconds (dirty_expire_centisecs) and pdflush will write the page once per 30 seconds as well. But we _should_ be writing it once per five seconds (kjournald commit interval). So we're still ahead ;) - Poor old IO accounting broke again. - People were saying that ext2 and ext3,data=writeback were also showing corruption. What's up with that? - For a long time I've wanted to nuke the current ext3/jbd ordered-data implementation altogether, and just make kjournald call into the standard writeback code to do a standard suberblock-inodes-pages walk. I think it'd be fairly straightforward to do. We'd need to teach the writeback code to be able to skip dirty pages which don't have a disk mapping, so that kjournald doesn't end up waiting for kjournald to free up journal space.. Would need to avoid possible deadlocks where someone calls ext3_force_commit() or otherwise does a synchronous commit while holding VFS locks. reiser3 and ext4 could be converted too. Not a short-term project, but this would avoid the problem. - It's pretty obnoxious that the VM now sets a clean page dirty and then proceeds to modify its contents. It would be nice to stop doing that. We could stop marking the page dirty in do_wp_page() and create a new VM counter NR_PTE_DIRTY, which means number of mapping_cap_account_dirty() pages which have a dirty pte pointing at them. Or, perhaps number of dirty ptes which point at mapping_cap_account_dirty() pages. Which can be larger, but the writeout code will probably cope. Then we take NR_PTE_DIRTY into account in the dirty-page balancing act. So - do_wp_page() will still run balance_dirty_pages() - but it would no longer run set_page_dirty(). - But it needs to run mark_inode_dirty() so the fs-writeback code notices the file. - And mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) becomes insufficient. The tricky part here is how do we do the writeback? The pte-dirty,!PageDirty pages aren't tagged as dirty in the radix-tree and writeback needs to find them so that it can effectively do an msync() on them. Walking all the mm's and vma's would be insane. Visiting all the pages in the file would also probably be insane. Perhaps this can be solved by adding a new radix-tree tag which means this page might have dirty ptes pointing at it. For each file writeback would do a radix-tree walk of these pages, cleaning-and-write-protecting ptes, marking the corresponding pages dirty and clearing their PAGECACHE_TAG_PTE_DIRTY tags. Then we can fix the mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) problem by doing mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 14:16:32 -0800 Andrew Morton [EMAIL PROTECTED] wrote: - Poor old IO accounting broke again. No it didn't - we're relying upon the behaviour of __set_page_dirty_buffers() against an already-dirty page. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: - The above change means that we do extra writeout. If a page is dirtied once, kjournald will write it and then pdflush will come along and needlessly write it again. There's zero extra writeout for any flushing that flushes BY PAGES. Only broken flushers that flush by buffer heads (which really really really shouldn't be done any more: welcome to the 21st century) will cause extra writeouts. And those extra writeouts are obviously required for all the dirty state to actually hit the disk - which is the point of the patch. So they're not extra - they are required for correct working. But I can't stress the fact enough that people SHOULD NOT do writeback by buffer heads. The buffer head has been purely an IO entity for the last several years now, and it's not a cache entity. Anybody who does writeback by buffer heads is basically bypassing the real cache (the page cache), and that's why all the problems happen. I think ext3 is terminally crap by now. It still uses buffer heads in places where it really really shouldn't, and as a result, things like directory accesses are simply slower than they should be. Sadly, I don't think ext4 is going to fix any of this, either. It's all just too inherently wrongly designed around the buffer head (which was correct in 1995, but hasn't been correct for a long time in the kernel any more). - Poor old IO accounting broke again. No. That's why I used set_page_dirty() and did it that strange ugly way (set page dirty, even though it's already dirty, and even though the very next thing we will do is TestClearPageDirty???). That code looks strange as a result, which is why it now has more comments on it than actual code ;) - People were saying that ext2 and ext3,data=writeback were also showing corruption. What's up with that? I thought the ext3,data=writeback case was reported to be fine by several people? I'm not sure about ext2. I didn't look at what it did based on buffer heads. I would have expected it to be ok. That said, at least one report was later shown to be bogus (errors due to out of disk, not due to actual errors ;). - For a long time I've wanted to nuke the current ext3/jbd ordered-data implementation altogether, and just make kjournald call into the standard writeback code to do a standard suberblock-inodes-pages walk. I really would like to see less of the buffer-head-based stuff, and yes, more of the normal inode page walking. I don't think you can order accesses within a page anyway, exactly because of memory mapping issues, so any page ordering is not about buffer heads on the page itself, it should be purely about metadata. - It's pretty obnoxious that the VM now sets a clean page dirty and then proceeds to modify its contents. It would be nice to stop doing that. No. I think this really the fundamental confusion people had. People thought that setting the page dirty meant that it was no longer being modified. It hasn't meant that in a LONG time - ever since the whole DIRTY_TAG thing, the most important part of the PG_dirty thing has really been that it's now efficiently findable by the writeout logic. And that is very much what the whole page accounting _depends_ on. When we mmap a page, we need to mark it findable as dirty _before_ people actually start writing to it, because it's too late afterwards. We could stop marking the page dirty in do_wp_page() and create a new VM counter NR_PTE_DIRTY, which means number of mapping_cap_account_dirty() pages which have a dirty pte pointing at them. Well, then you need to change what PAGE_MAPPING_TAG_DIRTY means too. That's very fundamental. That DIRTY _tag_ is now even more important than the PG_dirty bit itself, since that's what we actually use to _access_ those things. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 14:42:51 -0800 (PST) Linus Torvalds [EMAIL PROTECTED] wrote: On Fri, 29 Dec 2006, Andrew Morton wrote: - The above change means that we do extra writeout. If a page is dirtied once, kjournald will write it and then pdflush will come along and needlessly write it again. There's zero extra writeout for any flushing that flushes BY PAGES. Only broken flushers that flush by buffer heads (which really really really shouldn't be done any more: welcome to the 21st century) will cause extra writeouts. And those extra writeouts are obviously required for all the dirty state to actually hit the disk - which is the point of the patch. So they're not extra - they are required for correct working. They're extra. As in can be optimised away. But I can't stress the fact enough that people SHOULD NOT do writeback by buffer heads. The buffer head has been purely an IO entity for the last several years now, and it's not a cache entity. The buffer_head is not an IO container. It is the kernel's core representation of a disk block. Usually (but not always) it is backed by some memory which is in pagecache. We can feed buffer_heads into IO containers via submit_bh(), but that's far from the only thing we use buffer_heads for. We should have done s/buffer_head/block/g years ago. JBD implements physical block-based journalling, so it is 100% appropriate that JBD deal with these disk blocks using their buffer_head representation. That being said, ordered-data mode isn't really part of the JBD journalling system at all (the data doesn't get journalled!) - ordered-mode is an add-on to the JBD journal to make the metadata which we're about to journal point at more-likely-to-be-correct data. JBD's ordered-mode writeback is just a sync and I see no conceptual problems with killing its old buffer_head based sync and moving it into the 21st century. Anybody who does writeback by buffer heads is basically bypassing the real cache (the page cache), and that's why all the problems happen. I think ext3 is terminally crap by now. It still uses buffer heads in places where it really really shouldn't, The ordered-data mode flush: sure. The rest of JBD's use of buffer_heads is quite appropriate. and as a result, things like directory accesses are simply slower than they should be. Sadly, I don't think ext4 is going to fix any of this, either. I thought I fixed the performance problem? Somewhat nastily, but as ext3 directories are metadata it is appropriate that modifications to them be done in terms of buffer_heads (ie: blocks). It's all just too inherently wrongly designed around the buffer head (which was correct in 1995, but hasn't been correct for a long time in the kernel any more). - Poor old IO accounting broke again. No. That's why I used set_page_dirty() and did it that strange ugly way (set page dirty, even though it's already dirty, and even though the very next thing we will do is TestClearPageDirty???). nfs_set_page_dirty() and reiserfs_set_page_dirty() should now bail if PageDirty() to avoid needless work. - For a long time I've wanted to nuke the current ext3/jbd ordered-data implementation altogether, and just make kjournald call into the standard writeback code to do a standard suberblock-inodes-pages walk. I really would like to see less of the buffer-head-based stuff, and yes, more of the normal inode page walking. I don't think you can order accesses within a page anyway, exactly because of memory mapping issues, so any page ordering is not about buffer heads on the page itself, it should be purely about metadata. In this context ext3's ordered mode means sync the file contents before journalling the metadata which points at it. - It's pretty obnoxious that the VM now sets a clean page dirty and then proceeds to modify its contents. It would be nice to stop doing that. No. I think this really the fundamental confusion people had. People thought that setting the page dirty meant that it was no longer being modified. No. Setting a page (or bh, or inode) dirty means this is known to have been modified. ie: this cached entity is now out of sync with backing store. Ho hum. I don't care much, really. But then, I understand how all this stuff works. Try explaining to someone the relationship between pte-dirtiness, page-dirtiness, radix-tree-dirtiness and buffer_head-dirtiness. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: They're extra. As in can be optimised away. Sure. Don't use buffer heads. The buffer_head is not an IO container. It is the kernel's core representation of a disk block. Please come back from the 90's. The buffer heads are nothing but a mapping of where the hardware block is. If you use it for anything else, you're basically screwed. JBD implements physical block-based journalling, so it is 100% appropriate that JBD deal with these disk blocks using their buffer_head representation. And as long as it does that, you just have to face the fact that it's going to perform like crap, including what you call extra writes, and what I call deal with it. Btw, you can make pages be physically indexed too, but they obviously (a) won't be coherent with any virtual mapping laid on top of it (b) will be _physical_, so any readahead etc will be based on physical addresses too. I thought I fixed the performance problem? No, you papered over it, for the reasonably common case where things were physically contiguous - exactly by using a physical page cache, so now it can do read-ahead based on that. Then, because the pages contain buffer heads, the directory accesses can look up buffers, and if it was all physically contiguous, it all works fine. But if you actually want virtualluy indexed caching (and all _users_ want it), it really doesn't work. Somewhat nastily, but as ext3 directories are metadata it is appropriate that modifications to them be done in terms of buffer_heads (ie: blocks). No. There is nothing appropriate about using buffer_heads for metadata. It's quite proper - and a hell of a lot more efficient - to use virtual page-caching for metadata too. Look at the ext2 readdir() implementation, and compare it to the crapola horror that is ext3. Guess what? ext2 uses virtually indexed metadata, and as a result it is both simpler, smaller and a LOT faster than ext3 in accessing that metadata. Face it, Andrew, you're wrong on this one. Really. Just take a look at ext2_readdir(). [ I'm not saying that ext2_readdir() is _beautiful_. If it had been written with the page cache in mind, it would probably have been done very differently. And it doesn't do any readahead, probably because nobody cared enough, but it should be trivial to add, and it would automatically do the right thing just because it's much easier at the page cache level. But I _am_ saying that compared to ext3, the ext2 readdir is a work of art. ] metadata has _zero_ to do with physically indexed. There is no correlation what-so-ever. If you think there is a correlation, it's all in your mind. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote: I think ext3 is terminally crap by now. It still uses buffer heads in places where it really really shouldn't, and as a result, things like directory accesses are simply slower than they should be. Sadly, I don't think ext4 is going to fix any of this, either. Not just ext3; ocfs2 is using the jbd layer as well. I think we're going to have to put this (a rework of jbd2 to use the page cache) on the ext4 todo list, and work with the ocfs2 folks to try to come up with something that suits their needs as well. Fortunately we have this filesystem/storage summit thing coming up in the next few months, and we can try to get some discussion going on the linux-ext4 mailing list in the meantime. Unfortunately, I don't think this is going to be trivial. If we do get this fixed for ext4, one interesting question is whether people would accept a patch to backport the fixes to ext3, given the the grief this is causing the page I/O and VM routines. OTOH, reiser3 probably has the same problems, and I suspect the changes to ext3 to cause it to avoid buffer heads, especially in order to support for filesystem blocksizes pagesize, are going to be sufficiently risky in terms of introducing regressions to ext3 that they would probably be rejected on those grounds. So unfortunately, we probably are going to have to support flushes via buffer heads for the foreseeable future. - Ted - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Theodore Tso wrote: If we do get this fixed for ext4, one interesting question is whether people would accept a patch to backport the fixes to ext3, given the the grief this is causing the page I/O and VM routines. I don't think backporting is the smartest option (unless it's done _way_ later), but the real problem with it isn't actually the VM behaviour, but simply the fact that cached performance absoluely _sucks_ with the buffer cache. With the physically indexed buffer cache thing, you end up always having to do these complicated translations into block numbers for every single access, and at some point when I benchmarked it, it was a huge overhead for doing simple things like readdir. It's also a major pain for read-ahead, exactly partly due to the high cost of translation - because you can't cheaply check whether the next block is there, the cost of even asking the question should I try to read ahead? is much much higher. As a result, read-ahead is seriously limited, because it's so expensive for the cached case (which is still hopefully the _common_ case). So because read-ahead is limited, the non-cached case then _really_ sucks. It was somewhat fixed in a really god-awful fashion by having ext3_readdir() actually do _readahead_ though the page cache, even though it does everything else through the buffer cache. And that just happens to work because we hopefully have physically contiguous blocks, but when that isn't true, the readahead doesn't do squat. It's really quite fundamentally broken. But none of that causes any problems for the VM, since directories cannot be mmap'ed anyway. But it's really pitiful, and it really doesn't work very well. Of course, other filesystems _also_ suck at this, and other operating systems haev even MORE problems, so people don't always seem to realize how horribly horribly broken this all is. I really wish somebody would write a filesystem that did large cold-cache directories well. Open some horrible file manager on /usr/bin with cold caches, and weep. The biggest problem is the inode indirection, but at some point when I looked at why it sucked, it was doing basically synchronous single-buffer reads on the directory too, because readahead didn't work properly. I was hoping that something like SpadFS would actually take off, because it seemed to do a lot of good design choices (having inodes in-line in the directory for when there are no hardlinks is probably a requirement for a good filesystem these days. The separate inode table had its uses, but indirection in a filesystem really does suck, and stat information is too important to be indirect unless it absolutely has to). But I suspect it needs more than somebody who just wants to get his thesis written ;) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 18:32:07 -0500 Theodore Tso [EMAIL PROTECTED] wrote: On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote: I think ext3 is terminally crap by now. It still uses buffer heads in places where it really really shouldn't, and as a result, things like directory accesses are simply slower than they should be. Sadly, I don't think ext4 is going to fix any of this, either. Not just ext3; ocfs2 is using the jbd layer as well. I think we're going to have to put this (a rework of jbd2 to use the page cache) on the ext4 todo list, and work with the ocfs2 folks to try to come up with something that suits their needs as well. Fortunately we have this filesystem/storage summit thing coming up in the next few months, and we can try to get some discussion going on the linux-ext4 mailing list in the meantime. Unfortunately, I don't think this is going to be trivial. I suspect it would be insane to move any part of JBD (apart from the ordered-data flush) to use pagecache. The whole thing is fundamentally block-based. But only for metadata - there's no strong reason why ext3/4 needs to manipulate file data via buffer_heads if data=journal and chattr +j aren't in use. We could possibly move ext3/4 directories out of the blockdev pagecache and into per-directory pagecache, but that wouldn't change anything - the journalling would still be block-based. Adam Richter spent considerable time a few years ago trying to make the mpage code go direct-to-BIO in all cases and we eventually gave up. The conceptual layering of page-blocks-bio is pretty clean, and it is hard and ugly to fully optimise away the block bit in the middle. buffer_heads become more important with large PAGE_CACHE_SIZE. I'd expect nobh mode to be quite inefficient with some workloads on 64k pages. We need that representation of the state (and location) of the block-sized hunks which make up the page. If we do get this fixed for ext4, one interesting question is whether people would accept a patch to backport the fixes to ext3, given the the grief this is causing the page I/O and VM routines. OTOH, reiser3 probably has the same problems, and I suspect the changes to ext3 to cause it to avoid buffer heads, especially in order to support for filesystem blocksizes pagesize, are going to be sufficiently risky in terms of introducing regressions to ext3 that they would probably be rejected on those grounds. So unfortunately, we probably are going to have to support flushes via buffer heads for the foreseeable future. We'll see. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 16:11:44 -0800 (PST) Linus Torvalds [EMAIL PROTECTED] wrote: JBD implements physical block-based journalling, so it is 100% appropriate that JBD deal with these disk blocks using their buffer_head representation. And as long as it does that, you just have to face the fact that it's going to perform like crap, including what you call extra writes, and what I call deal with it. It is quite tiresome to delete things which your interlocutor said and to then restate them as if it were some sort of relevation. Somewhat nastily, but as ext3 directories are metadata it is appropriate that modifications to them be done in terms of buffer_heads (ie: blocks). No. There is nothing appropriate about using buffer_heads for metadata. I said modification. [stuff about directory reads elided] - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: Somewhat nastily, but as ext3 directories are metadata it is appropriate that modifications to them be done in terms of buffer_heads (ie: blocks). No. There is nothing appropriate about using buffer_heads for metadata. I said modification. You said metadata. Why do you think directories are any different from files? Yes, they are metadata. So what? What does that have to do with anything? They should still use virtual indexes, the way files do. That doesn't preclude them from using buffer-heads to mark their (partial-page) modifications and for keeping the virtual-physical translations cached. I mean, really. Look at ext2. It does exactly that. It keeps the directories in the page cache - virtually indexed. And it even modifies them there. Exactly the same way it modifies regular file data. It all works exactly the same way it works for regular files. It uses page-mapping-a_ops-prepare_write(NULL, page, from, to); ... do modification ... ext2_commit_chunk(page, from, to); exactly the way regular file data works. That's why I'm saying there is absolutely _zero_ thing about metadata here, or even about modifications. It all works better in a virtual cache, because you get all the support that we give to page caches. So I really don't understand why you make excuses for ext3 and talk about modifications and metadata. It was a fine design ten years ago. It's not really very good any longer. I suspect we're stuck with the design, but that doesn't make it any _better_. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006 16:58:41 -0800 (PST) Linus Torvalds [EMAIL PROTECTED] wrote: On Fri, 29 Dec 2006, Andrew Morton wrote: Somewhat nastily, but as ext3 directories are metadata it is appropriate that modifications to them be done in terms of buffer_heads (ie: blocks). No. There is nothing appropriate about using buffer_heads for metadata. I said modification. You said metadata. Why do you think directories are any different from files? Yes, they are metadata. So what? What does that have to do with anything? We journal the contents of directories. Fully. So we handle their dirty data at the block (ie: buffer_head) level. When someone tries to dirty part of a directory we need to cheat and not mark that part of the page as dirty and we need to then write the block to the journal and then mark the block as really dirty for checkpointing (but still attached to the journal) and all that goop. The regular page-based writeback doesn't apply until the block has been written to the journal. At that stage the block is considered dirty against its real position on disk. It will then be written back by pdflush via the blockdev inode - blkdev_writepage(). Unless kjournald needs to do an early flush to reclaim the journal space, in which case kjournald will write the block itself. So I really don't understand why you make excuses for ext3 and talk about modifications and metadata. It was a fine design ten years ago. It's not really very good any longer. As I said in another apparently-neglected email: : We could possibly move ext3/4 directories out of the blockdev pagecache and : into per-directory pagecache, but that wouldn't change anything - the : journalling would still be block-based. We already have all the code in place to journal blocks which are cached in an address_space other than the blockdev inode's: ext3_journalled_aops. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
On Fri, 29 Dec 2006, Andrew Morton wrote: Adam Richter spent considerable time a few years ago trying to make the mpage code go direct-to-BIO in all cases and we eventually gave up. The conceptual layering of page-blocks-bio is pretty clean, and it is hard and ugly to fully optimise away the block bit in the middle. Using the buffer cache as a translation layer to the physical address is fine. That's what _any_ block device will do. I'm not at all sayign that buffer heads must go away. They work fine. What I'm saying is that - if you index by buffer heads, you're screwed. - if you do IO by starting at buffer heads, you're screwed. Both indexing and writeback decisions should be done at the page cache layer. Then, when you actually need to do IO, you look at the buffers. But you start from the page. YOU SHOULD NEVER LOOK UP a buffer on its own merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance. So by all means keep the buffer heads as a way to keep the virtual-physical translation. It's what they were designed for. But they were _originally_ also designed for lookup and driving the start of IO, and that is wrong, and has been wrong for a long time now, because - lookup based on physical address is fundamentally slow and inefficient. You have to look up the virtual-physical translation somewhere else, so it's by design an unnecessary indirection _and_ that somewere else is also by definition filesystem-specific, so you can't do any of these things at the VFS layer. Ergo: anything that needs to look up the physical address in order to find the buffer head is BROKEN in this day and age. We look up the _virtual_ page cache page, and then we can trivially find the buffer heads within that page thanks to page-buffers. Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't. - starting IO based on the physical entity is insane. It's insane exactly _because_ the VM doesn't actually think in physical addresses, or in buffer-sized blocks. The VM only really knows about whole pages, and all the VM decisions fundamentally have to be page-based. We don't ever free a buffer. We free a whole page, and as such, doing writeback based on buffers is pointless, because it doesn't actually say anything about the page state which is what the VM tracks. But neither of these means that buffer_head itself has to go away. They both really boil down to the same thing: you should never KEY things by the buffer head. All actions should be based on virtual indexes as far as at all humanly possible. Once you do lookup and locking and writeback _starting_ from the page, it's then easy to look up the actual buffer head within the page, and use that as a way to do the actual _IO_ on the physical address. So the buffer heads still exist in ext2, for example, but they don't drive the show quite as much. (They still do in some areas: the allocation bitmaps, the xattr code etc. But as long as none of those have big VM footprints, and as long as no _common_ operations really care deeply, and as long as those data structures never need to be touched by the VM or VFS layer, nobody will ever really care). The directory case comes up just because readdir() actually is very common, and sometimes very slow. And it can have a big VM working set footprint (find), so trying to be page-based actually really helps, because it all drives things like writeback on the _right_ issues, and we can do things like LRU's and writeback decisions on the level that really matters. I actually suspect that the inode tables could benefit from being in the page cache too (although I think that the inode buffer address is actually physical, so there's no indirection for inode tables, which means that the virtual vs physical addressing doesn't matter). For directories, there definitely is a big cost to continually doing the virtual-physical translation all the time. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Fri, 29 Dec 2006, Segher Boessenkool wrote: > > > I think what might be happening is that pdflush writes them out fine, > > however we don't trap writes by the application _during_ that writeout. > > Yeah. I believe that more exactly it happens if the very last > write to the page causes a writeback (due to dirty balancing) > while another writeback for the page is already happening. > > As usual in these cases, I have zero proof. I actually have proof to the contrary, ie I have traces that say "the write was started" after the last write. And the VM layer in this area is actually fairly sane and civilized. It has a bit that says "writeback in progress", and if that bit is set, it simply _will_not_ start a new write. It even has various BUG_ON()'s to that effect. So everything I have ever seen says that the VM layer is actually doing everything right. > It's the do_wp_page -> balance_dirty_pages -> generic_writepages > path for sure. Maybe it's enough to change > > if (wbc->sync_mode != WB_SYNC_NONE) > wait_on_page_writeback(page); > > if (PageWriteback(page) || > !clear_page_dirty_for_io(page)) { > unlock_page(page); > continue; > } Notive how this one basically says: - if it's under writeback, don't even clear the page dirty flag. Your suggested change: > if (wbc->sync_mode != WB_SYNC_NONE) > wait_on_page_writeback(page); > > if (PageWriteback(page)) { > redirty_page_for_writepage(wbc, page); makes no sense, because we simply never _did_ the "clear_page_dirty()" if the thing was under writeback in the first place. That's how C conditionals work. So there's no reason to "redirty" it, because it wasn't cleaned in the first place. I've double- and triple-checked the dirty bits, including having traces that actually say that the IO was started (from a VM perspective) _after_ the last write was done. The IO just didn't hit the disk. I'm personally fairly convinced that it's not a VM issue, but a "IO issue". Either in a low-level filesystem or in some of the fs/buffer.c helper routines. But I'd love to be proven wrong. I do have a few interesting details from the trace I haven't really analyzed yet. Here's the trace for events on one of the pages that was corrupted. Note how the events are numbered (there were 171640 events total), so the thing you see is just a small set of events from the whole big trace, but it's the ones that talk about _that_ particular page. I've grouped them so hat "consecutive" events group together. That just means that no events on any other pages happened in between those events, and it is usually a sign that it's really one single call-chain that causes all the events. For example, for the first group of three events (44366-44368), it's the page fault that brings in the page, and since it's a write-fault, it will not only map the page, it will mark the page itself dirty and then also set the TAG_DIRTY on the mapping. So the "group" is just really a result of one single event happening, which causes several things to happen to that page. That's exactly what you'd expect. Anyway, here is the list of events that page went through: 44366 PG 0f6d: mm/memory.c:2254 mapping at b789fc54 (write) 44367 PG 0f6d: mm/page-writeback.c:817 setting dirty 44368 PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY 64231 PG 0f6d: mm/page-writeback.c:872 clean_for_io 64232 PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000 64233 PG 0f6d: mm/page-writeback.c:914 set writeback 64234 PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK 64235 PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY 67570 PG 0f6d: mm/page-writeback.c:891 end writeback 67571 PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK 76705 PG 0f6d: mm/page-writeback.c:817 setting dirty 76706 PG 0f6d: fs/buffer.c:725 dirtied buffers 76707 PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY 105267 PG 0f6d: mm/page-writeback.c:872 clean_for_io 105268 PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000 105269 PG 0f6d: mm/page-writeback.c:914 set writeback 105270 PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK 105271 PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY 105272 PG 0f6d: mm/page-writeback.c:891 end writeback 105273 PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK 128032 PG 0f6d: mm/memory.c:670 unmapped at b789f000 132662 PG 0f6d: mm/filemap.c:119 Removing page cache 148278 PG 0f6d: mm/memory.c:2254 mapping at b789f000 (read) 166326 PG 0f6d: mm/memory.c:670 unmapped at b789f000 171958 PG
Re: [PATCH] mm: fix page_mkclean_one
I think what might be happening is that pdflush writes them out fine, however we don't trap writes by the application _during_ that writeout. Yeah. I believe that more exactly it happens if the very last write to the page causes a writeback (due to dirty balancing) while another writeback for the page is already happening. As usual in these cases, I have zero proof. It's something that will only occur with writeback and MAP_SHARED writable access to the file pages. It's the do_wp_page -> balance_dirty_pages -> generic_writepages path for sure. Maybe it's enough to change if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || !clear_page_dirty_for_io(page)) { unlock_page(page); continue; } to if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page)) { redirty_page_for_writepage(wbc, page); unlock_page(page); continue; } if (!clear_page_dirty_for_io(page)) { unlock_page(page); continue; } or something along those lines. Completely untested of course :-) Segher - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
From: Linus Torvalds <[EMAIL PROTECTED]> Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST) > I get corruption - but the whole point is that it's very much pdflush that > should be writing these pages out. I think what might be happening is that pdflush writes them out fine, however we don't trap writes by the application _during_ that writeout. These corruptions look exactly as if: 1) pdflush begins writeback of page X 2) page goes to disk 3) application writes a chunk to the page 4) pdflush et al. think the page is clean, so it gets tossed, losing the writes done in #3 So there's a missing PTE change in there, so that we never get proper re-dirtying of the page if the application tries to write to the page during the writeback. It's something that will only occur with writeback and MAP_SHARED writable access to the file pages. That's why we never see this with normal filesystem writes, since those explicitly manage the page dirty state. I think the dirty balancing logic etc. isn't where the problems are, to me it's a PTE state update issue for sure. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 2006-12-28 at 11:45 -0800, Andrew Morton wrote: > On Thu, 28 Dec 2006 11:28:52 -0800 (PST) > Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > > > > > On Thu, 28 Dec 2006, Guillaume Chazarain wrote: > > > > > > The attached patch fixes the corruption for me. > > > > Well, that's a good hint, but it's really just a symptom. You effectively > > just made the test-program not even try to flush the data to disk, so the > > page cache would stay in memory, and you'd not see the corruption as well. > > > > So you basically disabled the code that tried to trigger the bug more > > easily. > > > > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much > > implicated in mm/page-writeback.c, and if there is a bug triggered by > > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also > > fails.. > > > > It would be interesting to convert your app to do fsync() before > FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well > (apart from pdflush activity). I did fdatasync(), tried remapping before unmapping... nogo here. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Andrew Morton wrote: > > It would be interesting to convert your app to do fsync() before > FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well > (apart from pdflush activity). I get corruption - but the whole point is that it's very much pdflush that should be writing these pages out. Andrew - give my test-program a try. It can run in about 1 minute if you have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems to very consistently cause corruption. What I do is: # Make sure we write back aggressively echo 5 > /proc/sys/vm/dirty_ratio as root, and then just run the thing. Tons of corruption. But the corruption goes away if I just leave the default dirty ratio alone (but then I can increse the file size to trigger it, of course - but that also makes the test run a lot slower). Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as easily (at least with the "fsync()"), but that's less to do with anything new, and probably just because then you simply won't have any pdflushing going on - since the kernel won't even notice that you have tons of dirty pages ;) It might also depend on the speed of your disk drive - the machine I test this on has a slow 4200 rpm laptop drive in it, and that probably makes things go south more easily. That's _especially_ true if this is related to any "bdi_write_congested()" logic. Now, it could also be related to various code snippets like ... if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || !clear_page_dirty_for_io(page)) { unlock_page(page); continue; } ... where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do the writeback at all (but it also won't clear the dirty bit, so it's certainly not an *OBVIOUS* bug). We also have code like this ("pageout()"): if (clear_page_dirty_for_io(page)) { int res; struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, .. } ... res = mapping->a_ops->writepage(page, ); and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call won't do anything at all because of congestion, then that would be a _bad_ thing, and would certainly explain how something didn't get written out. But that particular path should only trigger for the "shrink_page_list()" case, and it's not the case I seem to be testing with my "low dirty_ratio" testing. Linus#include #include #include #include #include #include #include #define TARGETSIZE (22 << 20) #define CHUNKSIZE (1460) #define NRCHUNKS (TARGETSIZE / CHUNKSIZE) #define SIZE (NRCHUNKS * CHUNKSIZE) static void fillmem(void *start, int nr) { memset(start, nr, CHUNKSIZE); } #define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping)) static int chunkorder[NRCHUNKS]; static char *mapping; static int order(int nr) { int i; if (nr < 0 || nr >= NRCHUNKS) return -1; for (i = 0; i < NRCHUNKS; i++) if (chunkorder[i] == nr) return i; return -2; } static void checkmem(void *buf, int nr) { unsigned int start = ~0u, end = 0; unsigned char c = nr, *p = buf, differs = 0; int i; for (i = 0; i < CHUNKSIZE; i++) { unsigned char got = *p++; if (got != c) { if (i < start) start = i; if (i > end) end = i; differs = got; } } if (start < end) { printf("Chunk %d corrupted (%u-%u) (%x-%x)\n", nr, start, end, page_offset(buf, start), page_offset(buf, end)); printf("Expected %u, got %u\n", c, differs); printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1)); } } static char *remap(int fd, char *mapping) { if (mapping) { munmap(mapping, SIZE); // fsync(fd); posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED); } return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); } int main(int argc, char **argv) { int fd, i; /* * Make some random ordering of writing the chunks to the * memory map.. * * Start with fully ordered.. */ for (i = 0; i < NRCHUNKS; i++) chunkorder[i] = i; /* ..and then mix it up randomly */ srandom(time(NULL)); for (i = 0; i < NRCHUNKS; i++) { int index = (unsigned int) random() %
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006 11:28:52 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Thu, 28 Dec 2006, Guillaume Chazarain wrote: > > > > The attached patch fixes the corruption for me. > > Well, that's a good hint, but it's really just a symptom. You effectively > just made the test-program not even try to flush the data to disk, so the > page cache would stay in memory, and you'd not see the corruption as well. > > So you basically disabled the code that tried to trigger the bug more > easily. > > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much > implicated in mm/page-writeback.c, and if there is a bug triggered by > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also > fails.. > It would be interesting to convert your app to do fsync() before FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well (apart from pdflush activity). - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Guillaume Chazarain wrote: > > The attached patch fixes the corruption for me. Well, that's a good hint, but it's really just a symptom. You effectively just made the test-program not even try to flush the data to disk, so the page cache would stay in memory, and you'd not see the corruption as well. So you basically disabled the code that tried to trigger the bug more easily. But the reason I say it's interesting is that "WB_SYNC_NONE" is very much implicated in mm/page-writeback.c, and if there is a bug triggered by WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also fails.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
Guillaume Chazarain a écrit : I get this kind of corruption: http://guichaz.free.fr/linux-bug/corruption.png Actually in qemu, I get three different behaviours: - no corruption at all : with linux-2.4 - corruption only on the first chunks: before [PATCH] mm: balance dirty pages as identified by Kenneth - corruption of all chunks: after the balance dirty pages patch Bisecting in linux-2.5 land I found http://kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm3/broken-out/fadvise-flush-data.patch to cause the corruption for me. The attached patch fixes the corruption for me. -- Guillaume diff -r 3859b1144d3a mm/fadvise.c --- a/mm/fadvise.c Sun Dec 24 05:00:03 2006 + +++ b/mm/fadvise.c Thu Dec 28 19:53:40 2006 +0100 @@ -96,9 +96,6 @@ asmlinkage long sys_fadvise64_64(int fd, case POSIX_FADV_NOREUSE: break; case POSIX_FADV_DONTNEED: - if (!bdi_write_congested(mapping->backing_dev_info)) - filemap_flush(mapping); - /* First and last FULL page! */ start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT; end_index = (endbyte >> PAGE_CACHE_SHIFT);
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Russell King wrote: > > Yup, but I have nothing to do with glibc because I refuse to do that > silly copyright assignment FSF thing. Hopefully someone else can > resolve it, but... Yeah, me too. > _is_ a fix whether _you_ like it or not to work around the issue so > people can at least run your test program. I'm not saying it's a > proper fix though. My point was that it wasn't a "fix", it's a "workaround". The _fix_ would be in glibc. Nothing more.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, Dec 28, 2006 at 09:27:12AM -0800, Linus Torvalds wrote: > On Thu, 28 Dec 2006, Russell King wrote: > > and if you look at glibc's memset() function, you'll notice that's exactly > > what you expect if you pass a non-8bit value to it. Ergo, what you're > > seeing is utterly expected given glibc's memset() implementation on ARM. > > Guys, you _really_ should fix memset(). What you describe is a _bug_. Yup, but I have nothing to do with glibc because I refuse to do that silly copyright assignment FSF thing. Hopefully someone else can resolve it, but... > > Fixing Linus' test program to pass nr & 255 to memset > > No. I'm almost certain that that is not a "fix", it's a workaround for a > serious bug in your glibc crap. _is_ a fix whether _you_ like it or not to work around the issue so people can at least run your test program. I'm not saying it's a proper fix though. Of course, if you prefer to be mislead by incorrect bug reports... -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Russell King wrote: > > and if you look at glibc's memset() function, you'll notice that's exactly > what you expect if you pass a non-8bit value to it. Ergo, what you're > seeing is utterly expected given glibc's memset() implementation on ARM. Guys, you _really_ should fix memset(). What you describe is a _bug_. "memset()" takes an "int" as its argument (always has), and has to convert it to a byte _itself_. It may not be common, but it's perfectly normal, to pass it values outside 0-255 (negative values that still fit in a "signed char" in particular are very normal, but my usage of "let the thing truncate it itself" is also quite fine). > Fixing Linus' test program to pass nr & 255 to memset No. I'm almost certain that that is not a "fix", it's a workaround for a serious bug in your glibc crap. But it does explain all the unexpected strange behaviour (and the really small writeback size - now it doesn't need any /proc/sys/vm/dirty_ratio assumptions to be explicable. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, 28 Dec 2006, Zhang, Yanmin wrote: > > The test program is a process to write/read data. pdflush might write data > to disk asynchronously. After pdflush writes a page to disk, it will call > (either by > softirq) clear_page_dirty to clear the dirty bit after getting the interrupt > notification. That would indeed be a horrible bug. However, we don't do "clear_page_dirty()" _after_ the IO has completed, we do it _before_ the IO starts. If you can actually find a place that does clear_page_dirty _after_ IO, then yes, you've just found the bug. But I haven't found it. So the rule is _always_: - call "clear_page_dirty_for_io()" with the page lock held, and _before_ the IO starts. - do "set_page_writeback()" before unlocking the page again - do a "end_page_writeback()" when the IO actually finishes. and any code sequence that doesn't honor those rules would be buggy. A beer for anybody that finds it.. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] mm: fix page_mkclean_one
On Wed, 27 Dec 2006, Chen, Kenneth W wrote: > > > > Running the test code, git bisect points its finger at this commit. > > Reverting > > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. > > > > [PATCH] mm: balance dirty pages > > > > Now that we can detect writers of shared mappings, throttle them. > > Avoids OOM > > by surprise. > > Oh, never mind :-( I just didn't create enough write out pressure when > test this. I just saw bug got triggered on a kernel I previously thought > was OK. Btw, this is an important point - people have long felt that the new page balancing in 2.6.19 was to blame, but you've just confirmed the long-held suspicion (at least by me) that it's not actually a new bug at all, it's just that the dirty page balancing causes writeback to happen _earlier_, and thus is better able to _show_ a bug that we've likely had for a long long time. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, 27 Dec 2006, Gordon Farquharson wrote: > > 100kB and 200kB files always succeed on the ARM system. 400kB and > larger always seem to fail. Oh, wow. Yeah, I've just repressed how tiny 32MB is. And especially if you lowered the /proc/sys/vm/dirty_ratio to a smaller percentage, I guess 400kB should be enough to cause writeback. Ugh. I tested a 128MB machine a few weeks ago, and found it painful. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu Dec 28 15:09 , Guillaume Chazarain sent: >I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/ >I have corruption with every Fedora release kernel except the first, that is >2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit >some corruption. Confirm that I see corruption with Fedora kernel 2.6.18-1.2239.fc5: ... Chunk 142969 corrupted (0-1459) (2580-4039) Expected 121, got 0 Written as (89632)127652(124721) Chunk 142976 corrupted (0-1459) (512-1971) Expected 128, got 0 Written as (121734)128324(108589) Checking chunk 143639/143640 (99%) $ uname -a Linux 2.6.18-1.2239.fc5 #1 Fri Nov 10 13:04:06 EST 2006 i686 athlon i386 GNU/Linux /Martin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-28 07:15]: > Thanks for the fix, Russell. > > I can now trigger the (real) problem by using a 25 MB file (100 << 18) > and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM). Me too (using 100 << 18). Interestingly, I don't seem to get any corruption on a different ARM board, an IOP32x based machine with 128 MB RAM. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Re: [PATCH] mm: fix page_mkclean_one
I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/ I have corruption with every Fedora release kernel except the first, that is 2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit some corruption. Command line to test: qemu root_fs -snapshot -kernel FC-kernels/FC2-vmlinuz-2.6.5-1.358 -append 'rw root=/dev/hda' I get this kind of corruption: http://guichaz.free.fr/linux-bug/corruption.png -- Guillaume - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Russell King <[EMAIL PROTECTED]> [2006-12-28 10:49]: > > By the way, I just tried it with TARGETSIZE (100 << 12) on a different > > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of > > memory) and I see similar results to that from Gordon: > > Work around the glibc memset() problem by passing nr & 255, and re-run > the test. You're getting false positives. Yeah, I saw your message about this problem after I sent mine. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On 12/28/06, Russell King <[EMAIL PROTECTED]> wrote: Fixing Linus' test program to pass nr & 255 to memset results in clean passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM machines (as would be expected.) Thanks for the fix, Russell. I can now trigger the (real) problem by using a 25 MB file (100 << 18) and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM). $ ./linus-test Writing chunk 17954/17955 (99%) Chunk 514 corrupted (0-1459) (872-2331) Expected 2, got 0 Written as (8479)11160(10312) Chunk 516 corrupted (0-303) (3792-4095) Expected 4, got 0 Written as (10312)10569(4426) Chunk 959 corrupted (0-691) (3404-4095) Expected 191, got 0 Written as (687)4881(1522) Chunk 1895 corrupted (0-1459) (1900-3359) Expected 103, got 0 Written as (7746)8389(6231) Chunk 2702 corrupted (0-1459) (472-1931) Expected 142, got 0 Written as (4866)7103(2409) Chunk 3314 corrupted (0-1459) (1064-2523) Expected 242, got 0 Written as (4287)7064(1730) Chunk 4043 corrupted (0-1459) (444-1903) Expected 203, got 0 Written as (6495)8509(4464) Chunk 5180 corrupted (0-1459) (1584-3043) Expected 60, got 0 Written as (11056)12826(10797) Chunk 5672 corrupted (0-991) (3104-4095) Expected 40, got 0 Written as (9944)4872(41) Chunk 5793 corrupted (460-1459) (0-999) Expected 161, got 0 Written as (7059)5038(4377) Chunk 6089 corrupted (0-1459) (1620-3079) Expected 201, got 0 Written as (4672)5230(4403) Chunk 6545 corrupted (268-1459) (0-1191) Expected 145, got 0 Written as (3701)5969(4668) Chunk 7578 corrupted (0-1459) (584-2043) Expected 154, got 0 Written as (10015)5082(1648) Chunk 7880 corrupted (864-1459) (0-595) Expected 200, got 0 Written as (17869)5064(4745) Chunk 8086 corrupted (0-1459) (888-2347) Expected 150, got 0 Written as (10206)11050(10374) Chunk 8749 corrupted (0-1459) (2212-3671) Expected 45, got 0 Written as (15263)7132(4825) Chunk 9068 corrupted (0-1459) (1008-2467) Expected 108, got 0 Written as (5557)7571(6771) Chunk 9193 corrupted (812-1459) (0-647) Expected 233, got 0 Written as (9238)7277(4757) Chunk 10032 corrupted (576-1459) (0-883) Expected 48, got 0 Written as (15741)10012(1753) Chunk 10056 corrupted (0-1459) (1696-3155) Expected 72, got 0 Written as (5379)7431(262) Chunk 10395 corrupted (0-1459) (1020-2479) Expected 155, got 0 Written as (21)7442(5902) Chunk 10791 corrupted (0-1459) (1644-3103) Expected 39, got 0 Written as (4753)5925(5926) Chunk 10792 corrupted (0-991) (3104-4095) Expected 40, got 0 Written as (5925)5926(8555) Chunk 11036 corrupted (0-1103) (2992-4095) Expected 28, got 0 Written as (13755)14449(7458) Chunk 11387 corrupted (644-1459) (0-815) Expected 123, got 0 Written as (10853)11459(9445) Chunk 11586 corrupted (920-1459) (0-539) Expected 66, got 0 Written as (3769)11691(11123) Chunk 11882 corrupted (0-1459) (1160-2619) Expected 106, got 0 Written as (10736)11696(2788) Chunk 12397 corrupted (0-603) (3492-4095) Expected 109, got 0 Written as (2352)7515(2437) Chunk 12669 corrupted (0-795) (3300-4095) Expected 125, got 0 Written as (1191)7661(5266) Chunk 13162 corrupted (0-1459) (2184-3643) Expected 106, got 0 Written as (9383)13662(11544) Chunk 14653 corrupted (0-27) (4068-4095) Expected 61, got 0 Written as (8100)9456(1275) Chunk 17332 corrupted (0-367) (3728-4095) Expected 180, got 0 Written as (760)12247(1244) Chunk 17445 corrupted (0-1459) (772-2231) Expected 37, got 0 Written as (8007)16481(14439) Chunk 17556 corrupted (0-1007) (3088-4095) Expected 148, got 0 Written as (10113)10657(10477) Chunk 17859 corrupted (0-995) (3100-4095) Expected 195, got 0 Written as (14472)14767(11426) Checking chunk 17954/17955 (99%) Gordon -- Gordon Farquharson - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote: > [ Modified test-program that tells you where the corruption happens (and > when the missing parts were supposed to be written out) appended, in > case people care. ] Hi 2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots between them but they hung before shell (2.6.18-git11, 2.6.18-git16, 2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken. (UP, no preempt) -Petri - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, Dec 28, 2006 at 11:16:59AM +0100, Martin Michlmayr wrote: > * Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]: > > >> #define TARGETSIZE (100 << 12) > > > > > >That's just 400kB! > > > > > >There's no way you should see corruption with that kind of value. It > > >should all stay solidly in the cache. > > > > > >Is this perhaps with ARM nommu or something else strange? It may be that > > >the program just doesn't work at all if mmap() is faked out with a malloc > > >or similar. > > > > Definitely a question for the ARM gurus. I'm out of my depth. > > By the way, I just tried it with TARGETSIZE (100 << 12) on a different > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of > memory) and I see similar results to that from Gordon: Work around the glibc memset() problem by passing nr & 255, and re-run the test. You're getting false positives. -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]: > >> #define TARGETSIZE (100 << 12) > > > >That's just 400kB! > > > >There's no way you should see corruption with that kind of value. It > >should all stay solidly in the cache. > > > >Is this perhaps with ARM nommu or something else strange? It may be that > >the program just doesn't work at all if mmap() is faked out with a malloc > >or similar. > > Definitely a question for the ARM gurus. I'm out of my depth. By the way, I just tried it with TARGETSIZE (100 << 12) on a different ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of memory) and I see similar results to that from Gordon: Writing chunk 279/280 (99%) Chunk 256 corrupted (1-1455) (1025-2479) Expected 0, got 1 Written as (199)43(184) Chunk 258 corrupted (1-1455) (3945-1303) Expected 2, got 3 Written as (184)74(145) Chunk 260 corrupted (1-1455) (2769-127) Expected 4, got 5 Written as (145)89(237) Chunk 262 corrupted (1-1455) (1593-3047) Expected 6, got 7 Written as (237)168(174) Chunk 264 corrupted (1-1455) (417-1871) Expected 8, got 9 Written as (174)135(161) Chunk 266 corrupted (1-1455) (3337-695) Expected 10, got 11 Written as (161)123(180) Chunk 268 corrupted (1-1455) (2161-3615) Expected 12, got 13 Written as (180)13(19) Chunk 270 corrupted (1-1455) (985-2439) Expected 14, got 15 Written as (19)172(106) Chunk 272 corrupted (1-1455) (3905-1263) Expected 16, got 17 Written as (106)212(140) Chunk 274 corrupted (1-1455) (2729-87) Expected 18, got 19 Written as (140)124(73) Chunk 276 corrupted (1-1455) (1553-3007) Expected 20, got 21 Written as (73)151(52) Chunk 278 corrupted (1-1455) (377-1831) Expected 22, got 23 Written as (52)215(99) Checking chunk 279/280 (99%) -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote: > I have run the program a few times, and the output is pretty > consistent. However, when I increase the target size, the difference > between the expected and actual values is larger. > > Written as (749)935(738) > Chunk 1113 corrupted (1-1455) (2965-323) > Expected 89, got 93 This is not the corruption Linus is after. Note that the corruption starts at offset '1'. Also note that: 89 = 1113 & 255 93 = 1113 & 255 | (1113 >> 8) and if you look at glibc's memset() function, you'll notice that's exactly what you expect if you pass a non-8bit value to it. Ergo, what you're seeing is utterly expected given glibc's memset() implementation on ARM. Fixing Linus' test program to pass nr & 255 to memset results in clean passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM machines (as would be expected.) -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]: > >That's just 400kB! > > > >There's no way you should see corruption with that kind of value. It > >should all stay solidly in the cache. > > > >Is this perhaps with ARM nommu or something else strange? It may be that > >the program just doesn't work at all if mmap() is faked out with a malloc > >or similar. > > Definitely a question for the ARM gurus. I'm out of my depth. The CPU has a MMU. For reference, it's a IXP4xx based device with 32 MB of memory. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote: > > On Wed, 27 Dec 2006, David Miller wrote: > > > > > > I still don't see _why_, though. But maybe smarter people than me can see > > > it.. > > > > FWIW this program definitely triggers the bug for me. > > Ok, now that I have something simple to do repeatable stuff with, I can > say what the pattern is.. It's not all that surprising, but it's still > worth just stating for the record. > > What happens is that when I do the "packetized writes" in random order, > the _last_ write to a page occasionally just goes missing. It's not always > at the end of a page, as shown by for example: > > - A whole chunk got dropped: > > Chunk 2094 corrupted (0-1459) (1624-3083) > Expected 46, got 0 > Written as (30912)55414(1) > >That "Written as (x)y(z)" line means that the corrupted chunk was >written as chunk #y, and the preceding and following chunks (that were >_not_ corrupt) on the page was written as #x and #z respectively. > >In other words, the missing chunk (which is still zero) was written >much later than the ones that were ok, and never hit the disk. It's a >contiguous chunk in the middle of the page (chunks are 1460 bytes in >size) > >The first line means that all bytes of the chunk (0-1459) were >corrupted, and the values in parenthesis are the offsets within a page. >In other words, this was a chunk in the _middle_ of a page. > > - The missing data can also be at the beginning or ends of pages: > >Beginning of the chunk missing, it was at the end of a page (page >offsets 3288-4095) and the _next_ page got written out fine: > > Chunk 2126 corrupted (0-807) (3288-4095) > Expected 78, got 0 > Written as (32713)55573(14301) > >End of a chunk missing, it was the beginning of a page (and the >_previous_ page that contained the beginning of the chunk was written >out fine) > > Chunk 2179 corrupted (1252-1459) (0-207) > Expected 131, got 0 > Written as (45189)55489(15515) > > Now, the reason I say this isn't surprising is that this is entirely > consistent with the dirty bit being dropped on the floor somewhere, and > likely through some interaction with the previous changes being in the > process of being written out. > > Something (incorrectly) ends up deciding that it doesn't need to write the > page, since it's already written, or alternatively clears the dirty bit > too late (clears it because an _earlier_ write finished, never mind that > the new dirty data didn't make it). There might be a narrow race between set_page_dirty and clear_page_dirty. The test program is a process to write/read data. pdflush might write data to disk asynchronously. After pdflush writes a page to disk, it will call (either by softirq) clear_page_dirty to clear the dirty bit after getting the interrupt notification. But just after the page is written to disk and before the interrupt reports the result, the test process might change the data and unmap the area. When the area is unmapped, the page is marked as dirty again, but just after that, the interrupt arrives and the dirty bit is cleared, so the late data will not be written to disk. Function zap_pte_range checks pte to set page dirty if needed, but it doesn't hold page lock. If the page lock is held before set page dirty, the race might be avoided. Yanmin - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
I think what might be happening is that pdflush writes them out fine, however we don't trap writes by the application _during_ that writeout. Yeah. I believe that more exactly it happens if the very last write to the page causes a writeback (due to dirty balancing) while another writeback for the page is already happening. As usual in these cases, I have zero proof. It's something that will only occur with writeback and MAP_SHARED writable access to the file pages. It's the do_wp_page - balance_dirty_pages - generic_writepages path for sure. Maybe it's enough to change if (wbc-sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || !clear_page_dirty_for_io(page)) { unlock_page(page); continue; } to if (wbc-sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page)) { redirty_page_for_writepage(wbc, page); unlock_page(page); continue; } if (!clear_page_dirty_for_io(page)) { unlock_page(page); continue; } or something along those lines. Completely untested of course :-) Segher - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Fri, 29 Dec 2006, Segher Boessenkool wrote: I think what might be happening is that pdflush writes them out fine, however we don't trap writes by the application _during_ that writeout. Yeah. I believe that more exactly it happens if the very last write to the page causes a writeback (due to dirty balancing) while another writeback for the page is already happening. As usual in these cases, I have zero proof. I actually have proof to the contrary, ie I have traces that say the write was started after the last write. And the VM layer in this area is actually fairly sane and civilized. It has a bit that says writeback in progress, and if that bit is set, it simply _will_not_ start a new write. It even has various BUG_ON()'s to that effect. So everything I have ever seen says that the VM layer is actually doing everything right. It's the do_wp_page - balance_dirty_pages - generic_writepages path for sure. Maybe it's enough to change if (wbc-sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || !clear_page_dirty_for_io(page)) { unlock_page(page); continue; } Notive how this one basically says: - if it's under writeback, don't even clear the page dirty flag. Your suggested change: if (wbc-sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page)) { redirty_page_for_writepage(wbc, page); makes no sense, because we simply never _did_ the clear_page_dirty() if the thing was under writeback in the first place. That's how C conditionals work. So there's no reason to redirty it, because it wasn't cleaned in the first place. I've double- and triple-checked the dirty bits, including having traces that actually say that the IO was started (from a VM perspective) _after_ the last write was done. The IO just didn't hit the disk. I'm personally fairly convinced that it's not a VM issue, but a IO issue. Either in a low-level filesystem or in some of the fs/buffer.c helper routines. But I'd love to be proven wrong. I do have a few interesting details from the trace I haven't really analyzed yet. Here's the trace for events on one of the pages that was corrupted. Note how the events are numbered (there were 171640 events total), so the thing you see is just a small set of events from the whole big trace, but it's the ones that talk about _that_ particular page. I've grouped them so hat consecutive events group together. That just means that no events on any other pages happened in between those events, and it is usually a sign that it's really one single call-chain that causes all the events. For example, for the first group of three events (44366-44368), it's the page fault that brings in the page, and since it's a write-fault, it will not only map the page, it will mark the page itself dirty and then also set the TAG_DIRTY on the mapping. So the group is just really a result of one single event happening, which causes several things to happen to that page. That's exactly what you'd expect. Anyway, here is the list of events that page went through: 44366 PG 0f6d: mm/memory.c:2254 mapping at b789fc54 (write) 44367 PG 0f6d: mm/page-writeback.c:817 setting dirty 44368 PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY 64231 PG 0f6d: mm/page-writeback.c:872 clean_for_io 64232 PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000 64233 PG 0f6d: mm/page-writeback.c:914 set writeback 64234 PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK 64235 PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY 67570 PG 0f6d: mm/page-writeback.c:891 end writeback 67571 PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK 76705 PG 0f6d: mm/page-writeback.c:817 setting dirty 76706 PG 0f6d: fs/buffer.c:725 dirtied buffers 76707 PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY 105267 PG 0f6d: mm/page-writeback.c:872 clean_for_io 105268 PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000 105269 PG 0f6d: mm/page-writeback.c:914 set writeback 105270 PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK 105271 PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY 105272 PG 0f6d: mm/page-writeback.c:891 end writeback 105273 PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK 128032 PG 0f6d: mm/memory.c:670 unmapped at b789f000 132662 PG 0f6d: mm/filemap.c:119 Removing page cache 148278 PG 0f6d: mm/memory.c:2254 mapping at b789f000 (read) 166326 PG 0f6d: mm/memory.c:670 unmapped at b789f000 171958 PG 0f6d: mm/filemap.c:119 Removing page cache
Re: [PATCH] mm: fix page_mkclean_one
On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote: On Wed, 27 Dec 2006, David Miller wrote: I still don't see _why_, though. But maybe smarter people than me can see it.. FWIW this program definitely triggers the bug for me. Ok, now that I have something simple to do repeatable stuff with, I can say what the pattern is.. It's not all that surprising, but it's still worth just stating for the record. What happens is that when I do the packetized writes in random order, the _last_ write to a page occasionally just goes missing. It's not always at the end of a page, as shown by for example: - A whole chunk got dropped: Chunk 2094 corrupted (0-1459) (1624-3083) Expected 46, got 0 Written as (30912)55414(1) That Written as (x)y(z) line means that the corrupted chunk was written as chunk #y, and the preceding and following chunks (that were _not_ corrupt) on the page was written as #x and #z respectively. In other words, the missing chunk (which is still zero) was written much later than the ones that were ok, and never hit the disk. It's a contiguous chunk in the middle of the page (chunks are 1460 bytes in size) The first line means that all bytes of the chunk (0-1459) were corrupted, and the values in parenthesis are the offsets within a page. In other words, this was a chunk in the _middle_ of a page. - The missing data can also be at the beginning or ends of pages: Beginning of the chunk missing, it was at the end of a page (page offsets 3288-4095) and the _next_ page got written out fine: Chunk 2126 corrupted (0-807) (3288-4095) Expected 78, got 0 Written as (32713)55573(14301) End of a chunk missing, it was the beginning of a page (and the _previous_ page that contained the beginning of the chunk was written out fine) Chunk 2179 corrupted (1252-1459) (0-207) Expected 131, got 0 Written as (45189)55489(15515) Now, the reason I say this isn't surprising is that this is entirely consistent with the dirty bit being dropped on the floor somewhere, and likely through some interaction with the previous changes being in the process of being written out. Something (incorrectly) ends up deciding that it doesn't need to write the page, since it's already written, or alternatively clears the dirty bit too late (clears it because an _earlier_ write finished, never mind that the new dirty data didn't make it). There might be a narrow race between set_page_dirty and clear_page_dirty. The test program is a process to write/read data. pdflush might write data to disk asynchronously. After pdflush writes a page to disk, it will call (either by softirq) clear_page_dirty to clear the dirty bit after getting the interrupt notification. But just after the page is written to disk and before the interrupt reports the result, the test process might change the data and unmap the area. When the area is unmapped, the page is marked as dirty again, but just after that, the interrupt arrives and the dirty bit is cleared, so the late data will not be written to disk. Function zap_pte_range checks pte to set page dirty if needed, but it doesn't hold page lock. If the page lock is held before set page dirty, the race might be avoided. Yanmin - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Gordon Farquharson [EMAIL PROTECTED] [2006-12-27 22:38]: That's just 400kB! There's no way you should see corruption with that kind of value. It should all stay solidly in the cache. Is this perhaps with ARM nommu or something else strange? It may be that the program just doesn't work at all if mmap() is faked out with a malloc or similar. Definitely a question for the ARM gurus. I'm out of my depth. The CPU has a MMU. For reference, it's a IXP4xx based device with 32 MB of memory. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote: I have run the program a few times, and the output is pretty consistent. However, when I increase the target size, the difference between the expected and actual values is larger. Written as (749)935(738) Chunk 1113 corrupted (1-1455) (2965-323) Expected 89, got 93 This is not the corruption Linus is after. Note that the corruption starts at offset '1'. Also note that: 89 = 1113 255 93 = 1113 255 | (1113 8) and if you look at glibc's memset() function, you'll notice that's exactly what you expect if you pass a non-8bit value to it. Ergo, what you're seeing is utterly expected given glibc's memset() implementation on ARM. Fixing Linus' test program to pass nr 255 to memset results in clean passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM machines (as would be expected.) -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Gordon Farquharson [EMAIL PROTECTED] [2006-12-27 22:38]: #define TARGETSIZE (100 12) That's just 400kB! There's no way you should see corruption with that kind of value. It should all stay solidly in the cache. Is this perhaps with ARM nommu or something else strange? It may be that the program just doesn't work at all if mmap() is faked out with a malloc or similar. Definitely a question for the ARM gurus. I'm out of my depth. By the way, I just tried it with TARGETSIZE (100 12) on a different ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of memory) and I see similar results to that from Gordon: Writing chunk 279/280 (99%) Chunk 256 corrupted (1-1455) (1025-2479) Expected 0, got 1 Written as (199)43(184) Chunk 258 corrupted (1-1455) (3945-1303) Expected 2, got 3 Written as (184)74(145) Chunk 260 corrupted (1-1455) (2769-127) Expected 4, got 5 Written as (145)89(237) Chunk 262 corrupted (1-1455) (1593-3047) Expected 6, got 7 Written as (237)168(174) Chunk 264 corrupted (1-1455) (417-1871) Expected 8, got 9 Written as (174)135(161) Chunk 266 corrupted (1-1455) (3337-695) Expected 10, got 11 Written as (161)123(180) Chunk 268 corrupted (1-1455) (2161-3615) Expected 12, got 13 Written as (180)13(19) Chunk 270 corrupted (1-1455) (985-2439) Expected 14, got 15 Written as (19)172(106) Chunk 272 corrupted (1-1455) (3905-1263) Expected 16, got 17 Written as (106)212(140) Chunk 274 corrupted (1-1455) (2729-87) Expected 18, got 19 Written as (140)124(73) Chunk 276 corrupted (1-1455) (1553-3007) Expected 20, got 21 Written as (73)151(52) Chunk 278 corrupted (1-1455) (377-1831) Expected 22, got 23 Written as (52)215(99) Checking chunk 279/280 (99%) -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu, Dec 28, 2006 at 11:16:59AM +0100, Martin Michlmayr wrote: * Gordon Farquharson [EMAIL PROTECTED] [2006-12-27 22:38]: #define TARGETSIZE (100 12) That's just 400kB! There's no way you should see corruption with that kind of value. It should all stay solidly in the cache. Is this perhaps with ARM nommu or something else strange? It may be that the program just doesn't work at all if mmap() is faked out with a malloc or similar. Definitely a question for the ARM gurus. I'm out of my depth. By the way, I just tried it with TARGETSIZE (100 12) on a different ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of memory) and I see similar results to that from Gordon: Work around the glibc memset() problem by passing nr 255, and re-run the test. You're getting false positives. -- Russell King Linux kernel2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote: [ Modified test-program that tells you where the corruption happens (and when the missing parts were supposed to be written out) appended, in case people care. ] Hi 2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots between them but they hung before shell (2.6.18-git11, 2.6.18-git16, 2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken. (UP, no preempt) -Petri - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On 12/28/06, Russell King [EMAIL PROTECTED] wrote: Fixing Linus' test program to pass nr 255 to memset results in clean passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM machines (as would be expected.) Thanks for the fix, Russell. I can now trigger the (real) problem by using a 25 MB file (100 18) and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM). $ ./linus-test Writing chunk 17954/17955 (99%) Chunk 514 corrupted (0-1459) (872-2331) Expected 2, got 0 Written as (8479)11160(10312) Chunk 516 corrupted (0-303) (3792-4095) Expected 4, got 0 Written as (10312)10569(4426) Chunk 959 corrupted (0-691) (3404-4095) Expected 191, got 0 Written as (687)4881(1522) Chunk 1895 corrupted (0-1459) (1900-3359) Expected 103, got 0 Written as (7746)8389(6231) Chunk 2702 corrupted (0-1459) (472-1931) Expected 142, got 0 Written as (4866)7103(2409) Chunk 3314 corrupted (0-1459) (1064-2523) Expected 242, got 0 Written as (4287)7064(1730) Chunk 4043 corrupted (0-1459) (444-1903) Expected 203, got 0 Written as (6495)8509(4464) Chunk 5180 corrupted (0-1459) (1584-3043) Expected 60, got 0 Written as (11056)12826(10797) Chunk 5672 corrupted (0-991) (3104-4095) Expected 40, got 0 Written as (9944)4872(41) Chunk 5793 corrupted (460-1459) (0-999) Expected 161, got 0 Written as (7059)5038(4377) Chunk 6089 corrupted (0-1459) (1620-3079) Expected 201, got 0 Written as (4672)5230(4403) Chunk 6545 corrupted (268-1459) (0-1191) Expected 145, got 0 Written as (3701)5969(4668) Chunk 7578 corrupted (0-1459) (584-2043) Expected 154, got 0 Written as (10015)5082(1648) Chunk 7880 corrupted (864-1459) (0-595) Expected 200, got 0 Written as (17869)5064(4745) Chunk 8086 corrupted (0-1459) (888-2347) Expected 150, got 0 Written as (10206)11050(10374) Chunk 8749 corrupted (0-1459) (2212-3671) Expected 45, got 0 Written as (15263)7132(4825) Chunk 9068 corrupted (0-1459) (1008-2467) Expected 108, got 0 Written as (5557)7571(6771) Chunk 9193 corrupted (812-1459) (0-647) Expected 233, got 0 Written as (9238)7277(4757) Chunk 10032 corrupted (576-1459) (0-883) Expected 48, got 0 Written as (15741)10012(1753) Chunk 10056 corrupted (0-1459) (1696-3155) Expected 72, got 0 Written as (5379)7431(262) Chunk 10395 corrupted (0-1459) (1020-2479) Expected 155, got 0 Written as (21)7442(5902) Chunk 10791 corrupted (0-1459) (1644-3103) Expected 39, got 0 Written as (4753)5925(5926) Chunk 10792 corrupted (0-991) (3104-4095) Expected 40, got 0 Written as (5925)5926(8555) Chunk 11036 corrupted (0-1103) (2992-4095) Expected 28, got 0 Written as (13755)14449(7458) Chunk 11387 corrupted (644-1459) (0-815) Expected 123, got 0 Written as (10853)11459(9445) Chunk 11586 corrupted (920-1459) (0-539) Expected 66, got 0 Written as (3769)11691(11123) Chunk 11882 corrupted (0-1459) (1160-2619) Expected 106, got 0 Written as (10736)11696(2788) Chunk 12397 corrupted (0-603) (3492-4095) Expected 109, got 0 Written as (2352)7515(2437) Chunk 12669 corrupted (0-795) (3300-4095) Expected 125, got 0 Written as (1191)7661(5266) Chunk 13162 corrupted (0-1459) (2184-3643) Expected 106, got 0 Written as (9383)13662(11544) Chunk 14653 corrupted (0-27) (4068-4095) Expected 61, got 0 Written as (8100)9456(1275) Chunk 17332 corrupted (0-367) (3728-4095) Expected 180, got 0 Written as (760)12247(1244) Chunk 17445 corrupted (0-1459) (772-2231) Expected 37, got 0 Written as (8007)16481(14439) Chunk 17556 corrupted (0-1007) (3088-4095) Expected 148, got 0 Written as (10113)10657(10477) Chunk 17859 corrupted (0-995) (3100-4095) Expected 195, got 0 Written as (14472)14767(11426) Checking chunk 17954/17955 (99%) Gordon -- Gordon Farquharson - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Russell King [EMAIL PROTECTED] [2006-12-28 10:49]: By the way, I just tried it with TARGETSIZE (100 12) on a different ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of memory) and I see similar results to that from Gordon: Work around the glibc memset() problem by passing nr 255, and re-run the test. You're getting false positives. Yeah, I saw your message about this problem after I sent mine. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Re: [PATCH] mm: fix page_mkclean_one
I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/ I have corruption with every Fedora release kernel except the first, that is 2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit some corruption. Command line to test: qemu root_fs -snapshot -kernel FC-kernels/FC2-vmlinuz-2.6.5-1.358 -append 'rw root=/dev/hda' I get this kind of corruption: http://guichaz.free.fr/linux-bug/corruption.png -- Guillaume - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
* Gordon Farquharson [EMAIL PROTECTED] [2006-12-28 07:15]: Thanks for the fix, Russell. I can now trigger the (real) problem by using a 25 MB file (100 18) and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM). Me too (using 100 18). Interestingly, I don't seem to get any corruption on a different ARM board, an IOP32x based machine with 128 MB RAM. -- Martin Michlmayr http://www.cyrius.com/ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Thu Dec 28 15:09 , Guillaume Chazarain sent: I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/ I have corruption with every Fedora release kernel except the first, that is 2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit some corruption. Confirm that I see corruption with Fedora kernel 2.6.18-1.2239.fc5: ... Chunk 142969 corrupted (0-1459) (2580-4039) Expected 121, got 0 Written as (89632)127652(124721) Chunk 142976 corrupted (0-1459) (512-1971) Expected 128, got 0 Written as (121734)128324(108589) Checking chunk 143639/143640 (99%) $ uname -a Linux 2.6.18-1.2239.fc5 #1 Fri Nov 10 13:04:06 EST 2006 i686 athlon i386 GNU/Linux /Martin - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mm: fix page_mkclean_one
On Wed, 27 Dec 2006, Gordon Farquharson wrote: 100kB and 200kB files always succeed on the ARM system. 400kB and larger always seem to fail. Oh, wow. Yeah, I've just repressed how tiny 32MB is. And especially if you lowered the /proc/sys/vm/dirty_ratio to a smaller percentage, I guess 400kB should be enough to cause writeback. Ugh. I tested a 128MB machine a few weeks ago, and found it painful. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] mm: fix page_mkclean_one
On Wed, 27 Dec 2006, Chen, Kenneth W wrote: Running the test code, git bisect points its finger at this commit. Reverting this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. [PATCH] mm: balance dirty pages Now that we can detect writers of shared mappings, throttle them. Avoids OOM by surprise. Oh, never mind :-( I just didn't create enough write out pressure when test this. I just saw bug got triggered on a kernel I previously thought was OK. Btw, this is an important point - people have long felt that the new page balancing in 2.6.19 was to blame, but you've just confirmed the long-held suspicion (at least by me) that it's not actually a new bug at all, it's just that the dirty page balancing causes writeback to happen _earlier_, and thus is better able to _show_ a bug that we've likely had for a long long time. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/