Re: 2.6.19 file content corruption on ext3
On Fri, Dec 29, 2006 at 07:52:15PM +0100, maximilian attems wrote:
> > The only -mm stuff I recall being in the Fedora 2.6.18 is
> > the inode-diet stuff which ended up in 2.6.19, though the xmas
> > break has left my head somewhat empty so I may be forgetting something.
> > What patch in particular are you talking about?
>
> it's no longer visible in the FC6 cvs, due to rebase
> but its name was linux-2.6-mm-tracking-dirty-pages.patch
> it is an earlier amalgam of the merged patch series:
> - mm: tracking shared dirty pages
> - mm: balance dirty pages
> - mm: optimize the new mprotect() code a bit
> - mm: small cleanup of install_page()
> - mm: fixup do_wp_page()
> - mm: msync() cleanup (closes: #394392)

Ohh, that. Yes. I had forgotten all about that.
I've been hitting the nog a little too hard :)

	Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.19 file content corruption on ext3
On Fri, Dec 29, 2006 at 10:02:53AM -0500, Dave Jones wrote:
> On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
> > > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> > > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > > > (or older)?
> > > >
> > > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > > > have the page throttling patches in it, those were written this summer. So
> > > > it would either have to be Fedora carrying around another patch that just
> > > > happens to result in the same corruption for _years_, or it's the same
> > > > bug.
> > >
> > > The only notable VM patch in Fedora kernels of that vintage that I recall
> > > was Ingo's 4g/4g thing.
> >
> > no the fedora 2.6.18 kernel is affected.
>
> I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.
>
> > it carries the same -mm patches that Debian backported
> > for LSB 3.1 compliance.
>
> The only -mm stuff I recall being in the Fedora 2.6.18 is
> the inode-diet stuff which ended up in 2.6.19, though the xmas
> break has left my head somewhat empty so I may be forgetting something.
> What patch in particular are you talking about?

it's no longer visible in the FC6 cvs, due to rebase
but its name was linux-2.6-mm-tracking-dirty-pages.patch
it is an earlier amalgam of the merged patch series:
- mm: tracking shared dirty pages
- mm: balance dirty pages
- mm: optimize the new mprotect() code a bit
- mm: small cleanup of install_page()
- mm: fixup do_wp_page()
- mm: msync() cleanup (closes: #394392)

-- 
maks
Re: 2.6.19 file content corruption on ext3
Linus Torvalds wrote:
> going back to Linux-2.6.5 at least, according to one tester).

I apologize for the confusion, but it just occurred to me that I was
actually experiencing a totally different problem: I set a root
filesystem of 3 MiB for qemu, so the test program just didn't have
enough space for its file.

-- 
Guillaume
Re: 2.6.19 file content corruption on ext3
On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
> > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> > >
> > > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > > > me up), and that seems to show the corruption going way way back (ie going
> > > > > back to Linux-2.6.5 at least, according to one tester).
> > > >
> > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > > (or older)?
> > >
> > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > > have the page throttling patches in it, those were written this summer. So
> > > it would either have to be Fedora carrying around another patch that just
> > > happens to result in the same corruption for _years_, or it's the same
> > > bug.
> >
> > The only notable VM patch in Fedora kernels of that vintage that I recall
> > was Ingo's 4g/4g thing.
>
> no the fedora 2.6.18 kernel is affected.

I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.

> it carries the same -mm patches that Debian backported
> for LSB 3.1 compliance.

The only -mm stuff I recall being in the Fedora 2.6.18 is
the inode-diet stuff which ended up in 2.6.19, though the xmas
break has left my head somewhat empty so I may be forgetting something.
What patch in particular are you talking about?

	Dave

-- 
http://www.codemonkey.org.uk
Re: 2.6.19 file content corruption on ext3
> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> >
> > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > > me up), and that seems to show the corruption going way way back (ie going
> > > > back to Linux-2.6.5 at least, according to one tester).
> > >
> > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > (or older)?
> >
> > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > have the page throttling patches in it, those were written this summer. So
> > it would either have to be Fedora carrying around another patch that just
> > happens to result in the same corruption for _years_, or it's the same
> > bug.
>
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.
>
> Dave

no the fedora 2.6.18 kernel is affected.
it carries the same -mm patches that Debian backported
for LSB 3.1 compliance.

-- 
maks

ps sorry for stripping cc, only downloaded that message raw.
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006 17:38:38 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> in
> the hope that somebody else is working on this corruption issue and is
> interested..

What corruption issue? ;)

I'm finding that the corruption happens trivially with your test app, but
apparently doesn't happen at all with ext2 or ext3, data=writeback. Maybe
it will happen with increased rarity, but the difference is quite stark.

Removing the

	err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
				NULL, journal_dirty_data_fn);

from ext3_ordered_writepage() fixes things up.

The things which journal_submit_data_buffers() does after dropping all the
locks are ... disturbing - I don't think we have sufficient tests in there
to ensure that the buffer is still where we think it is after we retake
locks (they're slippery little buggers). But that wouldn't explain it
anyway.

It's inefficient that journal_dirty_data() will put these locked, clean
buffers onto BJ_SyncData instead of BJ_Locked, but
journal_submit_data_buffers() seems to dtrt with them.

So no theory yet. Maybe ext3 is just altering timing. But the difference
is really large..

Disabling all the WB_SYNC_NONE stuff and making everything go synchronous
everywhere has no effect. Disabling bdi_write_congested() has no effect.
Re: 2.6.19 file content corruption on ext3
Btw,
 much cleaned-up page tracing patch here, in case anybody cares (and
"test.c" attached, although I don't think it changed since last time).

The test.c output is a bit hard to read at times, since it will give
offsets in bytes as hex (ie "00a77664" means page frame 0a77, and byte
664h within that page), while the kernel output is obviously the page
indexes (but the page fault _addresses_ can contain information about the
exact byte in a page, so you can match them up when some kernel event is
related to a page fault).

So both forms are necessary/logical, but it means that to match things up,
you often need to ignore the last three hex digits of the address that
"test.c" outputs.

This one also adds traces for the tags and the writeback activity, but
since I'm going out for birthday dinner, I won't have time to try to
actually analyse the trace I have..

Which is why I'm sending it out, in the hope that somebody else is working
on this corruption issue and is interested..

		Linus

diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..f5e132a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -722,6 +722,7 @@ int __set_page_dirty_buffers(struct page *page)
 			set_buffer_dirty(bh);
 			bh = bh->b_this_page;
 		} while (bh != head);
+		PAGE_TRACE(page, "dirtied buffers");
 	}
 	spin_unlock(&mapping->private_lock);
 
@@ -734,6 +735,7 @@ int __set_page_dirty_buffers(struct page *page)
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
+			PAGE_TRACE(page, "setting TAG_DIRTY");
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..0cf3dce 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,14 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define SetPageInteresting(page)	set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)		test_bit(PG_arch_1, &(page)->flags)
+
+#define PAGE_TRACE(page, msg, arg...) do {	\
+	if (PageInteresting(page))	\
+		printk(KERN_DEBUG "PG %08lx: %s:%d " msg "\n",	\
+			(page)->index, __FILE__, __LINE__ ,##arg );	\
+} while (0)
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -183,32 +191,38 @@ static inline void SetPageUptodate(struct page *page)
 #define PageWriteback(page)	test_bit(PG_writeback, &(page)->flags)
 #define SetPageWriteback(page)	\
 	do {	\
-		if (!test_and_set_bit(PG_writeback,	\
-				&(page)->flags))	\
+		if (!test_and_set_bit(PG_writeback, &(page)->flags)) {	\
+			PAGE_TRACE(page, "set writeback");	\
 			inc_zone_page_state(page, NR_WRITEBACK);	\
+		}	\
 	} while (0)
 #define TestSetPageWriteback(page)	\
 	({	\
 		int ret;	\
 		ret = test_and_set_bit(PG_writeback,	\
 				&(page)->flags);	\
-		if (!ret)	\
+		if (!ret) {	\
+			PAGE_TRACE(page, "set writeback");	\
 			inc_zone_page_state(page, NR_WRITEBACK);	\
+		}	\
 		ret;	\
 	})
 #define ClearPageWriteback(page)	\
 	do {	\
-		if (test_and_clear_bit(PG_writeback,	\
-				&(page)->flags))	\
+		if (test_and_clear_bit(PG_writeback, &(page)->flags)) {	\
+			PAGE_TRACE(page, "end writeback");	\
 			dec_zone_page_state(page, NR_WRITEBACK);	\
+		}	\
 	} while (0)
 #define TestClearPageWriteback(page)	\
 	({
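Editorial sketch, not part of the posted patch: the rule in the message above for matching "test.c" byte offsets against kernel page indexes (drop the last three hex digits) is just a shift and a mask when pages are 4 kB. The helper names below are made up for illustration, and the 4 kB page size is an assumption taken from the x86 setup discussed in this thread.

```c
#include <stdint.h>

/* Assumption: 4 kB pages, so a page covers exactly three hex digits. */
#define TRACE_PAGE_SHIFT 12u
#define TRACE_PAGE_SIZE  (1u << TRACE_PAGE_SHIFT)

/* Hypothetical helper: page index as the kernel trace prints it,
 * i.e. the test.c offset with its low three hex digits dropped. */
static inline uint32_t trace_page_index(uint32_t file_offset)
{
	return file_offset >> TRACE_PAGE_SHIFT;
}

/* Hypothetical helper: byte position inside that page,
 * i.e. only the low three hex digits of the test.c offset. */
static inline uint32_t trace_page_byte(uint32_t file_offset)
{
	return file_offset & (TRACE_PAGE_SIZE - 1u);
}
```

With the example from the message, trace_page_index(0x00a77664) gives 0xa77 and trace_page_byte(0x00a77664) gives 0x664.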
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Linus Torvalds wrote:
> Ok,
>  with the ugly trace capture patch, I've actually captured this corruption
> in action, I think.
>
> I did a full trace of all pages involved in one run, and picked one
> corruption at random:
>
> 	Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
> 	Expected 129, got 0
> 	Written as (5126)9509(15017)
>
> That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
> on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
> within that page.
>
> There were four chunks written to that page:
>
> 	Writing chunk 14463/15800 (15%) (0142344c) (1)
> 	Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 1423)
> 	Writing chunk 14464/15800 (32%) (01423a00) (3)
> 	Writing chunk 14465/15800 (60%) (01423fb4) (4) <--- LOST!
>
> and the other three chunks checked out all right.
>
> And here's the annotated trace as it concerns that page:
>
>  - here we write the first chunk to the page:
> 	** (1) do_no_page: mapping index 1423 at b7d1f44c (write)
> 	** Setting page 1423 dirty
>
>  - something flushes it out to disk:
> 	** cpd_for_io: index 1423
> 	** cleaning index 1423 at b7d1f000
>
>  - here we write the second chunk (which was split over the previous page
>    and the interesting one):
> 	** (2) Setting page 1422 dirty
> 	** (2) Setting page 1423 dirty
>
>  - and here we do a cleaning event
> 	** cpd_for_io: index 1423
> 	** cleaning index 1423 at b7d1f000
>
>  - here we write the third chunk:
> 	** (3) Setting page 1423 dirty
>
>  - here we write the fourth chunk:
> 	** (4) NO DIRTY EVENT
>
>  - and a third flush to disk:
> 	** cpd_for_io: index 1423
> 	** cleaning index 1423 at b7d1f000
>
>  - here we unmap and flush:
> 	** Unmapped index 1423 at b7d1f000
> 	** Removing index 1423 from page cache
>
>  - here we remap to check:
> 	** do_no_page: mapping index 1423 at b7d1f000 (read)
> 	** Unmapped index 1423 at b7d1f000
>
>  - and finally, here I remove the file after the run:
> 	** Removing index 1423 from page cache
>
> Now, the important thing to see here is:
>
>  - the missing write did not have a "Setting page 1423 dirty" event
>    associated with it.
>
>  - but I can _see_ where the actual dirty event would be happening in the
>    logs, because I can see the dirty events of the other chunk writes
>    around it, so I know exactly where that fourth write happens. And
>    indeed, it _shouldn't_ get a dirty event, because the page is still
>    dirty from the write of chunk #3 to that page, which _did_ get a dirty
>    event.
>
>    I can see that, because the testing app writes the log of the pages it
>    writes, and this is the log around the fourth and final write:
>
> 	...
> 	Writing chunk 5338/15800 (60%) (0076eb48)	PFN: 76e/76f
> 	Writing chunk 960/15800 (60%) (00156300)	PFN: 156
> 	Writing chunk 14465/15800 (60%) (01423fb4)	<
> 	Writing chunk 8594/15800 (60%) (00bf74a8)	PFN: bf7
> 	Writing chunk 556/15800 (60%) (000c62f0)	PFN: c6
> 	Writing chunk 15190/15800 (60%) (01526678)	PFN: 1526
> 	...
>
>    and I can match this up with the full log from the kernel, which looks
>    like this:
>
> 	Setting page 076e dirty
> 	Setting page 076f dirty
> 	Setting page 0156 dirty
> 	Setting page 00c6 dirty
> 	Setting page 1526 dirty
>
>    so I know exactly where the missing writes (to our page at pfn 1423,
>    and the pfn-bf7 page) happened.
>
>  - and the thing is, I can see a "cpd_for_io()" happening AFTER that
>    fourth write. Quite a long while after, in fact. So all of this looks
>    very fine indeed. We are not losing any dirty bits.
>
>  - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
>    the SAME dirty bit as write 4 did (which didn't make it out to disk!).
>    The event that clears the dirty bit that write 3 did happens AFTER
>    write 4 has happened!
>
> So if we're not losing any dirty bits, what's going on?
>
> I think we have some nasty interaction with the buffer heads. In

But are chunks 3 and 4 in separate buffer heads? Sorry could not see it
immediately from the output you showed...

It is just that there may be a different cause rather than buffer dirty
state...

A shot in the dark I know but it could perhaps be that a "COW for
MAP_PRIVATE" like event happens when the page is dirty already thus the
second write never actually makes it to the shared page thus it never gets
written out. I am almost certainly totally barking up the wrong tree but
I thought it may be worth mentioning just in case there was a slip in the
COW logic or page
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Anton Altaparmakov wrote:
>
> But are chunks 3 and 4 in separate buffer heads? Sorry could not see it
> immediately from the output you showed...

No, this is a 4kB filesystem. A single bh per page.

> It is just that there may be a different cause rather than buffer dirty
> state...

Sure.

> A shot in the dark I know but it could perhaps be that a "COW for
> MAP_PRIVATE" like event happens when the page is dirty already thus the
> second write never actually makes it to the shared page thus it never gets
> written out.

There are no private mappings anywhere, and no forks. Just a single mmap
(well, we unmap and remap in order to force the page cache to be
invalidated properly with the posix_fadvise() thing, but that's literally
the only user).

		Linus
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, David Miller wrote:
>
> What happens when we writeback, to the PTEs?

Not a damn thing.

We clear the PTE's _before_ we even start the write. The writeback does
nothing to them. If the user dirties the page while writeback is in
progress, we'll take the page fault and re-dirty it _again_.

> page_mkclean_file() iterates the VMAs and when it finds a shared
> one it goes:
>
> 	entry = ptep_clear_flush(vma, address, pte);
> 	entry = pte_wrprotect(entry);
> 	entry = pte_mkclean(entry);
>
> and that's fine, but that PTE is still marked writable, and
> I think that's key.

No it's not. It's right there. "pte_wrprotect(entry)". You even copied it
yourself.

> What does the fault path do in this situation?
>
> 	if (write_access) {
> 		if (!pte_write(entry))
> 			return do_wp_page(mm, vma, address,
> 					pte, pmd, ptl, entry);

So we call "do_wp_page()", and that does everything right.

		Linus
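Editorial sketch, not kernel code: the sequence described in the reply above (write-protect and clean the PTE before writeback starts, then take a write fault that re-dirties the page) can be modelled with two flags per PTE and one per page. Every name here is a toy stand-in invented for illustration; the real logic lives in page_mkclean_file() and do_wp_page() and handles far more than this.

```c
#include <stdbool.h>

/* Toy stand-ins, NOT the kernel's struct page or pte_t. */
struct toy_page { bool dirty; };
struct toy_pte  { bool writable; bool dirty; };

/* Models the start of writeback as described above: the PTE is
 * write-protected and cleaned, and the page dirty state is cleared,
 * before any I/O is issued. */
static void toy_clear_dirty_for_io(struct toy_page *pg, struct toy_pte *pte)
{
	pte->writable = false;	/* role of pte_wrprotect() */
	pte->dirty = false;	/* role of pte_mkclean() */
	pg->dirty = false;
}

/* Models a user store through a shared mapping.
 * Returns true if the store had to take a write fault. */
static bool toy_user_write(struct toy_page *pg, struct toy_pte *pte)
{
	bool faulted = false;

	if (!pte->writable) {
		/* Role of do_wp_page() in this model: re-dirty the page
		 * and make the PTE writable again. */
		pg->dirty = true;
		pte->writable = true;
		faulted = true;
	}
	pte->dirty = true;	/* hardware sets the PTE dirty bit on a store */
	return faulted;
}
```

In this model the first store after toy_clear_dirty_for_io() always faults and leaves the page dirty again, while a second store hits a writable PTE and produces no new dirty event, which matches the "(4) NO DIRTY EVENT" line in the trace elsewhere in this thread.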
Re: 2.6.19 file content corruption on ext3
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST)

> So if we're not losing any dirty bits, what's going on?

What happens when we writeback, to the PTEs?

page_mkclean_file() iterates the VMAs and when it finds a shared
one it goes:

	entry = ptep_clear_flush(vma, address, pte);
	entry = pte_wrprotect(entry);
	entry = pte_mkclean(entry);

and that's fine, but that PTE is still marked writable, and
I think that's key.

What does the fault path do in this situation?

	if (write_access) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}

It does nothing to update the page dirty state, because it's
writable, it just sets the PTE dirty bit and that's it. Should
it be setting the page dirty here for SHARED cases?

So until vmscan actually unmaps the PTE completely, we have this
window in which the application can write to the PTE and the
page dirty state doesn't get updated.

Perhaps something later cleans up after this, f.e. by rechecking the
PTE dirty bit at the end of I/O or when vmscan unmaps the page.
I guess that should handle things, but the above logic definitely
stood out to me.
Re: 2.6.19 file content corruption on ext3
Ok,
 with the ugly trace capture patch, I've actually captured this corruption
in action, I think.

I did a full trace of all pages involved in one run, and picked one
corruption at random:

	Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
	Expected 129, got 0
	Written as (5126)9509(15017)

That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
within that page.

There were four chunks written to that page:

	Writing chunk 14463/15800 (15%) (0142344c) (1)
	Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 1423)
	Writing chunk 14464/15800 (32%) (01423a00) (3)
	Writing chunk 14465/15800 (60%) (01423fb4) (4) <--- LOST!

and the other three chunks checked out all right.

And here's the annotated trace as it concerns that page:

 - here we write the first chunk to the page:
	** (1) do_no_page: mapping index 1423 at b7d1f44c (write)
	** Setting page 1423 dirty

 - something flushes it out to disk:
	** cpd_for_io: index 1423
	** cleaning index 1423 at b7d1f000

 - here we write the second chunk (which was split over the previous page
   and the interesting one):
	** (2) Setting page 1422 dirty
	** (2) Setting page 1423 dirty

 - and here we do a cleaning event
	** cpd_for_io: index 1423
	** cleaning index 1423 at b7d1f000

 - here we write the third chunk:
	** (3) Setting page 1423 dirty

 - here we write the fourth chunk:
	** (4) NO DIRTY EVENT

 - and a third flush to disk:
	** cpd_for_io: index 1423
	** cleaning index 1423 at b7d1f000

 - here we unmap and flush:
	** Unmapped index 1423 at b7d1f000
	** Removing index 1423 from page cache

 - here we remap to check:
	** do_no_page: mapping index 1423 at b7d1f000 (read)
	** Unmapped index 1423 at b7d1f000

 - and finally, here I remove the file after the run:
	** Removing index 1423 from page cache

Now, the important thing to see here is:

 - the missing write did not have a "Setting page 1423 dirty" event
   associated with it.

 - but I can _see_ where the actual dirty event would be happening in the
   logs, because I can see the dirty events of the other chunk writes
   around it, so I know exactly where that fourth write happens. And
   indeed, it _shouldn't_ get a dirty event, because the page is still
   dirty from the write of chunk #3 to that page, which _did_ get a dirty
   event.

   I can see that, because the testing app writes the log of the pages it
   writes, and this is the log around the fourth and final write:

	...
	Writing chunk 5338/15800 (60%) (0076eb48)	PFN: 76e/76f
	Writing chunk 960/15800 (60%) (00156300)	PFN: 156
	Writing chunk 14465/15800 (60%) (01423fb4)	<
	Writing chunk 8594/15800 (60%) (00bf74a8)	PFN: bf7
	Writing chunk 556/15800 (60%) (000c62f0)	PFN: c6
	Writing chunk 15190/15800 (60%) (01526678)	PFN: 1526
	...

   and I can match this up with the full log from the kernel, which looks
   like this:

	Setting page 076e dirty
	Setting page 076f dirty
	Setting page 0156 dirty
	Setting page 00c6 dirty
	Setting page 1526 dirty

   so I know exactly where the missing writes (to our page at pfn 1423,
   and the pfn-bf7 page) happened.

 - and the thing is, I can see a "cpd_for_io()" happening AFTER that
   fourth write. Quite a long while after, in fact. So all of this looks
   very fine indeed. We are not losing any dirty bits.

 - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
   the SAME dirty bit as write 4 did (which didn't make it out to disk!).
   The event that clears the dirty bit that write 3 did happens AFTER
   write 4 has happened!

So if we're not losing any dirty bits, what's going on?

I think we have some nasty interaction with the buffer heads. In
particular, I don't think it's the dirty page bits that are broken (I
_see_ that the PageDirty bit was set after write 4 was done to memory in
the kernel traces).

So I think that a real writeback just doesn't happen, because somebody has
marked the buffer heads clean _after_ it started IO on them.

I think "__mpage_writepage()" is buggy in this regard, for example. It
even has a comment about its crapola behaviour:

	/*
	 * Must try to add the page before marking the buffer clean or
	 * the confused fail path above (OOM) will be very confused when
	 * it finds all bh marked clean (i.e. it will not write anything)
	 */

however, I don't think that particular thing explains it, because I don't
think we use that function for the cases I'm looking
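Editorial sketch: the "first 76 bytes of a chunk, last 76 bytes on a page" observation above can be checked with a little arithmetic. The chunk size of 1460 bytes is not stated anywhere in the thread; it is inferred from the logged offsets, which are spaced 0x5b4 apart (for example 14465 * 1460 = 0x1423fb4). The 4 kB page size and all names below are likewise assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ  4096u	/* assumption: 4 kB pages */
#define CHUNK_SZ 1460u	/* inferred from the logged offsets, not stated */

/* Hypothetical result type: how one chunk is laid out across pages. */
struct chunk_span {
	uint32_t first_page;		/* page index holding the chunk start */
	uint32_t bytes_in_first_page;	/* chunk bytes before the page boundary */
	uint32_t bytes_in_next_page;	/* chunk bytes spilling into the next page */
};

/* Compute where chunk number `chunk` lands, assuming chunks are packed
 * back to back from offset 0 of the mapped file. */
static struct chunk_span chunk_span_of(uint32_t chunk)
{
	uint32_t start = chunk * CHUNK_SZ;
	uint32_t room = PAGE_SZ - (start % PAGE_SZ);	/* bytes left in page */
	struct chunk_span s;

	assert(CHUNK_SZ <= PAGE_SZ);	/* a chunk touches at most two pages */
	s.first_page = start / PAGE_SZ;
	s.bytes_in_first_page = room < CHUNK_SZ ? room : CHUNK_SZ;
	s.bytes_in_next_page = CHUNK_SZ - s.bytes_in_first_page;
	return s;
}
```

Under these assumptions chunk_span_of(14465) starts on page 0x1423 with 76 bytes there and 1384 on the following page, which lines up with "Chunk 14465 corrupted (0-75)": only the part of the chunk that sits on page 1423 was lost.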
Re: 2.6.19 file content corruption on ext3
On Thu, Dec 28, 2006 at 01:24:30PM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Linus Torvalds wrote:
> >
> > What we need now is actually looking at the source code, and people who
> > understand the VM, I'm afraid. I'm gathering traces now that I have a good
> > test-case. I'll post my trace tools once I've tested that they work, in
> > case others want to help.
>
> Ok, I've got the traces, but quite frankly, I doubt anybody is crazy
> enough to want to trawl through them. It's a bit painful, since we're
> talking thousands of pages to trigger this problem.
>
> Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably
> ARM, but is used for other things on ia64, powerpc and sparc64. But here's
> the patch in case anybody cares.

PG_arch_1 is used on ARM to flag pages that need a dcache flush prior to
hitting userspace, in the same way that sparc64 uses it. So ARM systems
should not have this patch applied.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Linus Torvalds wrote:
>
> What we need now is actually looking at the source code, and people who
> understand the VM, I'm afraid. I'm gathering traces now that I have a good
> test-case. I'll post my trace tools once I've tested that they work, in
> case others want to help.

Ok, I've got the traces, but quite frankly, I doubt anybody is crazy
enough to want to trawl through them. It's a bit painful, since we're
talking thousands of pages to trigger this problem.

Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably
ARM, but is used for other things on ia64, powerpc and sparc64. But here's
the patch in case anybody cares.

It wants a _big_ kernel buffer to capture all the crud into (which is why
I made the thing accept a bigger log buffer), and quite frankly, I'm not
at all sure that all the locking is ok (ie I could imagine that the
dcache-locking thing there in "is_interesting()" could deadlock, what do I
know..)

But I've captured some real data with this, which I'll describe
separately.

		Linus

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..967dd80 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define SetPageInteresting(page)	set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)	test_bit(PG_arch_1, &(page)->flags)
 
 #if (BITS_PER_LONG > 32)
 /*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL
 
 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
-	range 12 21
+	range 12 24
 	default 17 if S390 || LOCKDEP
 	default 16 if X86_NUMAQ || IA64
 	default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..d6a0f56 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 
+if (PageInteresting(page)) printk("Removing index %08x from page cache\n", page->index);
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 	return err;
 }
 
+static noinline int is_interesting(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct dentry *dentry;
+	int retval = 0;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		if (strcmp(dentry->d_name.name, "mapfile"))
+			continue;
+		retval = 1;
+		break;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+}
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page:	page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
+	if (is_interesting(mapping))
+		SetPageInteresting(page);
+
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..14c9815 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+if (PageInteresting(page))
+	printk("Unmapped index %08x at %08x\n", page->index, addr);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
@@ -1605,6 +1607,7 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
+if (PageInteresting(new_page)) printk("do_wp_page: mapping index %08x at %08lx\n", new_page->index, address);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2252,7 @@ retry:
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+if (PageInteresting(new_page)) printk("do_no_page: mapping index %08x at %08lx (%s)\n", new_page->index, address, write_access ? "write" : "read");
 		set_pte_at(mm, address, page_table, entry);
Re: 2.6.19 file content corruption on ext3
On Thu, 2006-12-28 at 14:39 -0500, Dave Jones wrote:
> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> >
> > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > > me up), and that seems to show the corruption going way way back (ie going
> > > > back to Linux-2.6.5 at least, according to one tester).
> > >
> > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > (or older)?
> >
> > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > have the page throttling patches in it, those were written this summer. So
> > it would either have to be Fedora carrying around another patch that just
> > happens to result in the same corruption for _years_, or it's the same
> > bug.
>
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.

which does tlb flushes *all the time* so that even rules out (well
almost) a stale tlb somewhere...
Re: 2.6.19 file content corruption on ext3
On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
>
> On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > me up), and that seems to show the corruption going way way back (ie going
> > > back to Linux-2.6.5 at least, according to one tester).
> >
> > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > (or older)?
>
> Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> have the page throttling patches in it, those were written this summer. So
> it would either have to be Fedora carrying around another patch that just
> happens to result in the same corruption for _years_, or it's the same
> bug.

The only notable VM patch in Fedora kernels of that vintage that I recall
was Ingo's 4g/4g thing.

	Dave

-- 
http://www.codemonkey.org.uk
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > me up), and that seems to show the corruption going way way back (ie going
> > back to Linux-2.6.5 at least, according to one tester).
>
> That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> (or older)?

Well, that was a really _old_ fedora kernel. I guarantee you it didn't
have the page throttling patches in it, those were written this summer. So
it would either have to be Fedora carrying around another patch that just
happens to result in the same corruption for _years_, or it's the same
bug.

I bet it's the same bug, and it's been around for ages.

		Linus
Re: 2.6.19 file content corruption on ext3
On Thu, Dec 28, 2006 at 11:00:46AM -0800, Linus Torvalds wrote:
> And I have a test-program that shows the corruption _much_ easier (at
> least according to my own testing, and that of several reporters that back
> me up), and that seems to show the corruption going way way back (ie going
> back to Linux-2.6.5 at least, according to one tester).

That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
(or older)?
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Marc Haber wrote:
>
> After being up for ten days, I have now encountered the file
> corruption of pkgcache.bin for the first time again. The 256 MB i386
> box is like 26M in swap, is under very moderate load.
>
> I am running plain vanilla 2.6.19.1. Is there a patch that I should
> apply against 2.6.19.1 that would help in debugging?

Not right now. And I have a test-program that shows the corruption _much_
easier (at least according to my own testing, and that of several
reporters that back me up), and that seems to show the corruption going
way way back (ie going back to Linux-2.6.5 at least, according to one
tester).

So it just got a lot _easier_ to trigger in 2.6.19, but it's not a new
bug.

What we need now is actually looking at the source code, and people who
understand the VM, I'm afraid. I'm gathering traces now that I have a good
test-case. I'll post my trace tools once I've tested that they work, in
case others want to help.

(And hey, you don't have to be a VM expert to help: this could be a
learning experience. However, I'll warn you: this is _the_ most grotty
part of the whole kernel. It's not even ugly, it's just damn hard and
complex).

		Linus
Re: 2.6.19 file content corruption on ext3
On Tue, Dec 19, 2006 at 09:51:49AM +0100, Marc Haber wrote:
> On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote:
> > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > blocksize ext3, mainline. Zero failures. It's unlikely that this testing
> > would pass, yet people running normal workloads are able to easily trigger
> > failures. I suspect we're looking in the wrong place.
>
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.
>
> I have tidied my inbox in the mean time and mutt's memory requirement
> has been reduced to somewhat 30 MB, which might be the cause that I
> don't see the issue that often any more.

After being up for ten days, I have now encountered the file
corruption of pkgcache.bin for the first time again. The 256 MB i386
box is like 26M in swap, is under very moderate load.

I am running plain vanilla 2.6.19.1. Is there a patch that I should
apply against 2.6.19.1 that would help in debugging?

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835
Re: 2.6.19 file content corruption on ext3
Btw,
 much cleaned-up page tracing patch here, in case anybody cares (and
"test.c" attached, although I don't think it changed since last time).

The test.c output is a bit hard to read at times, since it will give
offsets in bytes as hex (ie "00a77664" means page frame 0a77, and byte
664h within that page), while the kernel output is obviously the page
indexes (but the page fault _addresses_ can contain information about the
exact byte in a page, so you can match them up when some kernel event is
related to a page fault).

So both forms are necessary/logical, but it means that to match things up,
you often need to ignore the last three hex digits of the address that
test.c outputs.

This one also adds traces for the tags and the writeback activity, but
since I'm going out for birthday dinner, I won't have time to try to
actually analyse the trace I have..

Which is why I'm sending it out, in the hope that somebody else is working
on this corruption issue and is interested..

		Linus

diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..f5e132a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -722,6 +722,7 @@ int __set_page_dirty_buffers(struct page *page)
 			set_buffer_dirty(bh);
 			bh = bh->b_this_page;
 		} while (bh != head);
+		PAGE_TRACE(page, "dirtied buffers");
 	}
 	spin_unlock(&mapping->private_lock);
 
@@ -734,6 +735,7 @@ int __set_page_dirty_buffers(struct page *page)
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
+			PAGE_TRACE(page, "setting TAG_DIRTY");
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..0cf3dce 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,14 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define SetPageInteresting(page)	set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)	test_bit(PG_arch_1, &(page)->flags)
+
+#define PAGE_TRACE(page, msg, arg...) do {				\
+	if (PageInteresting(page))					\
+		printk(KERN_DEBUG "PG %08lx: %s:%d " msg "\n",		\
+			(page)->index, __FILE__, __LINE__ ,##arg );	\
+} while (0)
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -183,32 +191,38 @@ static inline void SetPageUptodate(struct page *page)
 #define PageWriteback(page)	test_bit(PG_writeback, &(page)->flags)
 #define SetPageWriteback(page)						\
 	do {								\
-		if (!test_and_set_bit(PG_writeback,			\
-				&(page)->flags))			\
+		if (!test_and_set_bit(PG_writeback, &(page)->flags)) {	\
+			PAGE_TRACE(page, "set writeback");		\
 			inc_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 	} while (0)
 #define TestSetPageWriteback(page)					\
 	({								\
 		int ret;						\
 		ret = test_and_set_bit(PG_writeback,			\
 				&(page)->flags);			\
-		if (!ret)						\
+		if (!ret) {						\
+			PAGE_TRACE(page, "set writeback");		\
 			inc_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 		ret;							\
 	})
 #define ClearPageWriteback(page)					\
 	do {								\
-		if (test_and_clear_bit(PG_writeback,			\
-				&(page)->flags))			\
+		if (test_and_clear_bit(PG_writeback, &(page)->flags)) {	\
+			PAGE_TRACE(page, "end writeback");		\
 			dec_zone_page_state(page, NR_WRITEBACK);	\
+		}							\
 	} while (0)
 #define TestClearPageWriteback(page)					\
 	({
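
For anyone matching test.c output against the kernel trace, the "ignore the last three hex digits" rule above can be sketched in a few lines of user-space C. This is only an illustration, assuming 4 KiB pages (which the three-hex-digit rule implies); the sketch_* names are hypothetical and are not part of test.c or of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, so the low 12 bits (three hex digits) of a
 * byte offset give the position inside the page. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

/* Hypothetical helper: page index, as the kernel trace prints it. */
uint32_t sketch_page_index(uint32_t byte_offset)
{
	return byte_offset >> SKETCH_PAGE_SHIFT;
}

/* Hypothetical helper: byte position within that page. */
uint32_t sketch_page_offset(uint32_t byte_offset)
{
	return byte_offset & (SKETCH_PAGE_SIZE - 1);
}

/* Sanity check using the example offset quoted in the message. */
void sketch_check_example(void)
{
	assert(sketch_page_index(0x00a77664) == 0x0a77);
	assert(sketch_page_offset(0x00a77664) == 0x664);
}
```

Calling sketch_check_example() from a small main() is enough to confirm the split for the offset quoted above.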
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006 17:38:38 -0800 (PST) Linus Torvalds [EMAIL PROTECTED] wrote:

> in the hope that somebody else is working on this corruption issue and is
> interested..

What corruption issue? ;)

I'm finding that the corruption happens trivially with your test app, but
apparently doesn't happen at all with ext2 or ext3, data=writeback. Maybe
it will happen with increased rarity, but the difference is quite stark.

Removing the

	err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
				NULL, journal_dirty_data_fn);

from ext3_ordered_writepage() fixes things up.

The things which journal_submit_data_buffers() does after dropping all the
locks are ... disturbing - I don't think we have sufficient tests in there
to ensure that the buffer is still where we think it is after we retake
locks (they're slippery little buggers). But that wouldn't explain it
anyway.

It's inefficient that journal_dirty_data() will put these locked, clean
buffers onto BJ_SyncData instead of BJ_Locked, but
journal_submit_data_buffers() seems to dtrt with them.

So no theory yet. Maybe ext3 is just altering timing. But the difference
is really large..

Disabling all the WB_SYNC_NONE stuff and making everything go synchronous
everywhere has no effect. Disabling bdi_write_congested() has no effect.
Re: 2.6.19 file content corruption on ext3
Ok,
 with the ugly trace capture patch, I've actually captured this
corruption in action, I think.

I did a full trace of all pages involved in one run, and picked one
corruption at random:

	Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
	Expected 129, got 0
	Written as (5126)9509(15017)

That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
within that page.

There were four chunks written to that page:

	Writing chunk 14463/15800 (15%) (0142344c)	(1)
	Writing chunk 14462/15800 (30%) (01422e98)	(2) (overflows into 1423)
	Writing chunk 14464/15800 (32%) (01423a00)	(3)
	Writing chunk 14465/15800 (60%) (01423fb4)	(4) <--- LOST!

and the other three chunks checked out all right.

And here's the annotated trace as it concerns that page:

 - here we write the first chunk to the page:
	** (1) do_no_page: mapping index 1423 at b7d1f44c (write)
	** Setting page 1423 dirty

 - something flushes it out to disk:
	** cpd_for_io: index 1423
	** cleaning index 1423 at b7d1f000

 - here we write the second chunk (which was split over the previous page
   and the interesting one):
	** (2) Setting page 1422 dirty
	** (2) Setting page 1423 dirty

 - and here we do a cleaning event
	** cpd_for_io: index 1423
	** cleaning index 1423 at b7d1f000

 - here we write the third chunk:
	** (3) Setting page 1423 dirty

 - here we write the fourth chunk:
	** (4) NO DIRTY EVENT

 - and a third flush to disk:
	** cpd_for_io: index 1423
	** cleaning index 1423 at b7d1f000

 - here we unmap and flush:
	** Unmapped index 1423 at b7d1f000
	** Removing index 1423 from page cache

 - here we remap to check:
	** do_no_page: mapping index 1423 at b7d1f000 (read)
	** Unmapped index 1423 at b7d1f000

 - and finally, here I remove the file after the run:
	** Removing index 1423 from page cache

Now, the important thing to see here is:

 - the missing write did not have a "Setting page 1423 dirty" event
   associated with it.

 - but I can _see_ where the actual dirty event would be happening in the
   logs, because I can see the dirty events of the other chunk writes
   around it, so I know exactly where that fourth write happens. And
   indeed, it _shouldn't_ get a dirty event, because the page is still
   dirty from the write of chunk #3 to that page, which _did_ get a dirty
   event.

   I can see that, because the testing app writes the log of the pages it
   writes, and this is the log around the fourth and final write:

	...
	Writing chunk 5338/15800 (60%) (0076eb48)	PFN: 76e/76f
	Writing chunk 960/15800 (60%) (00156300)	PFN: 156
	Writing chunk 14465/15800 (60%) (01423fb4)
	Writing chunk 8594/15800 (60%) (00bf74a8)	PFN: bf7
	Writing chunk 556/15800 (60%) (000c62f0)	PFN: c6
	Writing chunk 15190/15800 (60%) (01526678)	PFN: 1526
	...

   and I can match this up with the full log from the kernel, which looks
   like this:

	Setting page 076e dirty
	Setting page 076f dirty
	Setting page 0156 dirty
	Setting page 00c6 dirty
	Setting page 1526 dirty

   so I know exactly where the missing writes (to our page at pfn 1423,
   and the pfn-bf7 page) happened.

 - and the thing is, I can see a "cpd_for_io()" happening AFTER that
   fourth write. Quite a long while after, in fact. So all of this looks
   very fine indeed. We are not losing any dirty bits.

 - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
   the SAME dirty bit as write 4 did (which didn't make it out to disk!).
   The event that clears the dirty bit that write 3 did happens AFTER
   write 4 has happened!

So if we're not losing any dirty bits, what's going on?

I think we have some nasty interaction with the buffer heads. In
particular, I don't think it's the dirty page bits that are broken (I
_see_ that the PageDirty bit was set after write 4 was done to memory in
the kernel traces).

So I think that a real writeback just doesn't happen, because somebody has
marked the buffer heads clean _after_ it started IO on them.

I think __mpage_writepage() is buggy in this regard, for example. It even
has a comment about its crapola behaviour:

	/*
	 * Must try to add the page before marking the buffer clean or
	 * the confused fail path above (OOM) will be very confused when
	 * it finds all bh marked clean (i.e. it will not write anything)
	 */

however, I don't think that particular thing explains it, because I don't
think we use that function for the cases I'm looking at.
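
The "first 76 bytes of the chunk, last 76 bytes of the page" arithmetic above can be checked from the quoted offsets alone. The following is a minimal user-space sketch, assuming 4 KiB pages and a chunk size of 1460 bytes. The chunk size is not stated in the message; it is inferred here from the spacing of the consecutive chunk offsets (0x0142344c - 0x01422e98 = 0x5b4 = 1460), so treat it as an assumption. The sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE  4096u	/* assumption: 4 KiB pages */
#define SKETCH_CHUNK_SIZE 1460u	/* assumption: inferred from offset spacing */

/* Byte offset at which chunk number n starts in the mapped file. */
uint32_t sketch_chunk_start(uint32_t n)
{
	return n * SKETCH_CHUNK_SIZE;
}

/* Number of bytes of chunk n that land in the page holding its first byte. */
uint32_t sketch_bytes_in_first_page(uint32_t n)
{
	uint32_t start = sketch_chunk_start(n);
	uint32_t room = SKETCH_PAGE_SIZE - (start % SKETCH_PAGE_SIZE);

	return room < SKETCH_CHUNK_SIZE ? room : SKETCH_CHUNK_SIZE;
}

/* Reproduces the numbers reported for the lost chunk. */
void sketch_check_chunk_14465(void)
{
	assert(sketch_chunk_start(14465) == 0x01423fb4);
	assert(sketch_chunk_start(14465) / SKETCH_PAGE_SIZE == 0x1423);
	assert(sketch_bytes_in_first_page(14465) == 76);
}
```

Under those assumptions, chunk 14465 starts at byte fb4 of page 1423 and only its first 76 bytes fit there, which agrees with the corrupted range (01423fb4-01423fff) reported by the test program.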
Re: 2.6.19 file content corruption on ext3
From: Linus Torvalds [EMAIL PROTECTED]
Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST)

> So if we're not losing any dirty bits, what's going on?

What happens when we writeback, to the PTEs?

page_mkclean_file() iterates the VMAs and when it finds a shared
one it goes:

        entry = ptep_clear_flush(vma, address, pte);
        entry = pte_wrprotect(entry);
        entry = pte_mkclean(entry);

and that's fine, but that PTE is still marked writable, and
I think that's key.

What does the fault path do in this situation?

        if (write_access) {
                if (!pte_write(entry))
                        return do_wp_page(mm, vma, address,
                                          pte, pmd, ptl, entry);
                entry = pte_mkdirty(entry);
        }

It does nothing to update the page dirty state, because it's
writable, it just sets the PTE dirty bit and that's it. Should
it be setting the page dirty here for SHARED cases?

So until vmscan actually unmaps the PTE completely, we have this
window in which the application can write to the PTE and the
page dirty state doesn't get updated.

Perhaps something later cleans up after this, f.e. by rechecking the
PTE dirty bit at the end of I/O or when vmscan unmaps the page.
I guess that should handle things, but the above logic definitely
stood out to me.
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, David Miller wrote:
>
> What happens when we writeback, to the PTEs?

Not a damn thing.

We clear the PTE's _before_ we even start the write. The writeback does
nothing to them. If the user dirties the page while writeback is in
progress, we'll take the page fault and re-dirty it _again_.

> page_mkclean_file() iterates the VMAs and when it finds a shared
> one it goes:
>
>         entry = ptep_clear_flush(vma, address, pte);
>         entry = pte_wrprotect(entry);
>         entry = pte_mkclean(entry);
>
> and that's fine, but that PTE is still marked writable, and
> I think that's key.

No it's not. It's right there. pte_wrprotect(entry). You even copied it
yourself.

> What does the fault path do in this situation?
>
>         if (write_access) {
>                 if (!pte_write(entry))
>                         return do_wp_page(mm, vma, address,
>                                           pte, pmd, ptl, entry);

So we call do_wp_page(), and that does everything right.

                Linus
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Anton Altaparmakov wrote:
>
> But are chunks 3 and 4 in separate buffer heads? Sorry could not see it
> immediately from the output you showed...

No, this is a 4kB filesystem. A single bh per page.

> It is just that there may be a different cause rather than buffer dirty
> state...

Sure.

> A shot in the dark I know but it could perhaps be that a COW for
> MAP_PRIVATE like event happens when the page is dirty already thus the
> second write never actually makes it to the shared page thus it never
> gets written out.

There are no private mappings anywhere, and no forks. Just a single mmap
(well, we unmap and remap in order to force the page cache to be
invalidated properly with the posix_fadvise() thing, but that's literally
the only user).

                Linus
Re: 2.6.19 file content corruption on ext3
On Thu, 28 Dec 2006, Linus Torvalds wrote:
> Ok,
>  with the ugly trace capture patch, I've actually captured this
> corruption in action, I think.
>
> I did a full trace of all pages involved in one run, and picked one
> corruption at random:
>
>         Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
>         Expected 129, got 0
>         Written as (5126)9509(15017)
>
> That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
> on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
> within that file.
>
> There were four chunks written to that page:
>
>         Writing chunk 14463/15800 (15%) (0142344c) (1)
>         Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 1423)
>         Writing chunk 14464/15800 (32%) (01423a00) (3)
>         Writing chunk 14465/15800 (60%) (01423fb4) (4) <--- LOST!
>
> and the other three chunks checked out all right.
>
> And here's the annotated trace as it concerns that page:
>
>  - here we write the first chunk to the page:
>         ** (1) do_no_page: mapping index 1423 at b7d1f44c (write)
>         ** Setting page 1423 dirty
>
>  - something flushes it out to disk:
>         ** cpd_for_io: index 1423
>         ** cleaning index 1423 at b7d1f000
>
>  - here we write the second chunk (which was split over the previous
>    page and the interesting one):
>         ** (2) Setting page 1422 dirty
>         ** (2) Setting page 1423 dirty
>
>  - and here we do a cleaning event
>         ** cpd_for_io: index 1423
>         ** cleaning index 1423 at b7d1f000
>
>  - here we write the third chunk:
>         ** (3) Setting page 1423 dirty
>
>  - here we write the fourth chunk:
>         ** (4) NO DIRTY EVENT
>
>  - and a third flush to disk:
>         ** cpd_for_io: index 1423
>         ** cleaning index 1423 at b7d1f000
>
>  - here we unmap and flush:
>         ** Unmapped index 1423 at b7d1f000
>         ** Removing index 1423 from page cache
>
>  - here we remap to check:
>         ** do_no_page: mapping index 1423 at b7d1f000 (read)
>         ** Unmapped index 1423 at b7d1f000
>
>  - and finally, here I remove the file after the run:
>         ** Removing index 1423 from page cache
>
> Now, the important thing to see here is:
>
>  - the missing write did not have a "Setting page 1423 dirty" event
>    associated with it.
>
>  - but I can _see_ where the actual dirty event would be happening in
>    the logs, because I can see the dirty events of the other chunk
>    writes around it, so I know exactly where that fourth write happens.
>    And indeed, it _shouldn't_ get a dirty event, because the page is
>    still dirty from the write of chunk #3 to that page, which _did_ get
>    a dirty event.
>
>    I can see that, because the testing app writes the log of the pages
>    it writes, and this is the log around the fourth and final write:
>
>         ...
>         Writing chunk 5338/15800 (60%) (0076eb48)    PFN: 76e/76f
>         Writing chunk 960/15800 (60%) (00156300)     PFN: 156
>         Writing chunk 14465/15800 (60%) (01423fb4)
>         Writing chunk 8594/15800 (60%) (00bf74a8)    PFN: bf7
>         Writing chunk 556/15800 (60%) (000c62f0)     PFN: c6
>         Writing chunk 15190/15800 (60%) (01526678)   PFN: 1526
>         ...
>
>    and I can match this up with the full log from the kernel, which
>    looks like this:
>
>         Setting page 076e dirty
>         Setting page 076f dirty
>         Setting page 0156 dirty
>         Setting page 00c6 dirty
>         Setting page 1526 dirty
>
>    so I know exactly where the missing writes (to our page at pfn 1423,
>    and the pfn-bf7 page) happened.
>
>  - and the thing is, I can see a cpd_for_io() happening AFTER that
>    fourth write. Quite a long while after, in fact. So all of this looks
>    very fine indeed. We are not losing any dirty bits.
>
>  - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
>    the SAME dirty bit as write 4 did (which didn't make it out to
>    disk!). The event that clears the dirty bit that write 3 did happens
>    AFTER write 4 has happened!
>
> So if we're not losing any dirty bits, what's going on?
>
> I think we have some nasty interaction with the buffer heads. In

But are chunks 3 and 4 in separate buffer heads? Sorry could not see it
immediately from the output you showed...

It is just that there may be a different cause rather than buffer dirty
state...

A shot in the dark I know but it could perhaps be that a COW for
MAP_PRIVATE like event happens when the page is dirty already thus the
second write never actually makes it to the shared page thus it never
gets written out.

I am almost certainly totally barking up the wrong tree but I thought it
may be worth mentioning just in case there was a slip in the COW logic or
page writable state maintenance somewhere...

Best regards,

        Anton

> particular, I don't think it's the dirty page
Re: 2.6.19 file content corruption on ext3
On Mon, 18 Dec 2006, Gene Heskett wrote:
>
> What about the mm/rmap.c one liner, in or out?

The one that just removes the "pte_mkclean()"?

That's definitely out, it was just a test-patch to verify that the pte
dirty bits seemed to matter at all (and they do).

                Linus
Re: 2.6.19 file content corruption on ext3
On Sat, Dec 16, 2006 at 06:43:10PM +, Martin Michlmayr wrote:
> * Marc Haber <[EMAIL PROTECTED]> [2006-12-09 10:26]:
> > Unfortunately, I am lacking the knowledge needed to do this in an
> > informed way. I am neither familiar enough with git nor do I possess
> > the necessary C powers.
>
> I wonder if what you're seeing is related to
> http://lkml.org/lkml/2006/12/16/73
>
> You said that you don't see any corruption with 2.6.18. Can you try
> to apply the patch from
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> to 2.6.18 to see if the corruption shows up?

Since I am no longer seeing the issue after easing the memory load, I
doubt that this would make sense.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835
Re: 2.6.19 file content corruption on ext3
On Fri, Dec 22, 2006 at 08:30:06AM -0500, Daniel Drake wrote:
> Marc Haber wrote:
> >After updating to 2.6.19, Debian's apt control file
> >/var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> >six hours. In that situation, "aptitude update" segfaults. When I
> >delete the file and have apt recreate it, things are fine again for a
> >few hours before the file is broken again and the segfaults start over.
> >In all cases, umounting the file system and doing an fsck does not
> >show issues with the file system.
>
> Are you using wireless networking of any kind?

Since the system in question is a colocated server box, I am pretty
sure that there is no wireless networking.

> Might be useful if you could post 'dmesg' output so that people can
> see the other hardware that you have.

I have attached what I could scrape from syslog.

Greetings
Marc

-- 
-----------------------------------------------------------------------------
Marc Haber         | "I don't trust Computers. They | Mail address in header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

Dec 18 15:45:01 torres syslogd 1.4.1#17: restart.
Dec 18 15:45:01 torres kernel: klogd 1.4.1#17, log source = /proc/kmsg started.
Dec 18 15:45:01 torres kernel: Inspecting /boot/System.map-2.6.19.1-zgsrv
Dec 18 15:45:01 torres kernel: Loaded 26500 symbols from /boot/System.map-2.6.19.1-zgsrv.
Dec 18 15:45:01 torres kernel: Symbols match kernel version 2.6.19.
Dec 18 15:45:01 torres kernel: No module symbols loaded - kernel modules not enabled.
Dec 18 15:45:01 torres kernel: Linux version 2.6.19.1-zgsrv ([EMAIL PROTECTED]) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 Sun Dec 17 12:44:56 UTC 2006
Dec 18 15:45:01 torres kernel: BIOS-provided physical RAM map:
Dec 18 15:45:01 torres kernel: BIOS-e820: - 000a (usable)
Dec 18 15:45:01 torres kernel: BIOS-e820: 000f - 0010 (reserved)
Dec 18 15:45:01 torres kernel: BIOS-e820: 0010 - 0f7f (usable)
Dec 18 15:45:01 torres kernel: BIOS-e820: 0f7f - 0f7f3000 (ACPI NVS)
Dec 18 15:45:01 torres kernel: BIOS-e820: 0f7f3000 - 0f80 (ACPI data)
Dec 18 15:45:01 torres kernel: BIOS-e820: - 0001 (reserved)
Dec 18 15:45:01 torres kernel: 0MB HIGHMEM available.
Dec 18 15:45:01 torres kernel: 247MB LOWMEM available.
Dec 18 15:45:01 torres kernel: Entering add_active_range(0, 0, 63472) 0 entries of 256 used
Dec 18 15:45:01 torres kernel: Zone PFN ranges:
Dec 18 15:45:01 torres kernel: DMA 0 -> 4096
Dec 18 15:45:01 torres kernel: Normal 4096 ->63472
Dec 18 15:45:01 torres kernel: HighMem 63472 ->63472
Dec 18 15:45:01 torres kernel: early_node_map[1] active PFN ranges
Dec 18 15:45:01 torres kernel: 0:0 ->63472
Dec 18 15:45:01 torres kernel: On node 0 totalpages: 63472
Dec 18 15:45:01 torres kernel: DMA zone: 32 pages used for memmap
Dec 18 15:45:01 torres kernel: DMA zone: 0 pages reserved
Dec 18 15:45:01 torres kernel: DMA zone: 4064 pages, LIFO batch:0
Dec 18 15:45:01 torres kernel: Normal zone: 463 pages used for memmap
Dec 18 15:45:01 torres kernel: Normal zone: 58913 pages, LIFO batch:15
Dec 18 15:45:01 torres kernel: HighMem zone: 0 pages used for memmap
Dec 18 15:45:01 torres kernel: DMI 2.2 present.
Dec 18 15:45:01 torres kernel: ACPI: RSDP (v000 VIA694 ) @ 0x000f8050
Dec 18 15:45:01 torres kernel: ACPI: RSDT (v001 VIA694 MSI ACPI 0x42302e31 AWRD 0x) @ 0x0f7f3000
Dec 18 15:45:01 torres kernel: ACPI: FADT (v001 VIA694 MSI ACPI 0x42302e31 AWRD 0x) @ 0x0f7f3040
Dec 18 15:45:01 torres kernel: ACPI: DSDT (v001 VIA694 AWRDACPI 0x1000 MSFT 0x010c) @ 0x
Dec 18 15:45:01 torres kernel: ACPI: PM-Timer IO Port: 0x4008
Dec 18 15:45:01 torres kernel: Allocating PCI resources starting at 1000 (gap: 0f80:f07f)
Dec 18 15:45:01 torres kernel: Detected 1466.361 MHz processor.
Dec 18 15:45:01 torres kernel: Built 1 zonelists. Total pages: 62977
Dec 18 15:45:01 torres kernel: Kernel command line: root=/dev/hda1 ro vga=normal
Dec 18 15:45:01 torres kernel: Enabling fast FPU save and restore... done.
Dec 18 15:45:01 torres kernel: Enabling unmasked SIMD FPU exception support... done.
Dec 18 15:45:01 torres kernel: Initializing CPU#0
Dec 18 15:45:01 torres kernel: PID hash table entries: 1024 (order: 10, 4096 bytes)
Dec 18 15:45:01 torres kernel: Console: colour VGA+ 80x25
Dec 18 15:45:01 torres kernel: Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Dec 18 15:45:01 torres kernel: Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Dec 18 15:45:01 torres kernel: Memory: 246964k/253888k available (2896k kernel code, 6368k reserved, 859k data, 204k init, 0k highmem)
Re: 2.6.19 file content corruption on ext3
Marc Haber wrote:
> After updating to 2.6.19, Debian's apt control file
> /var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> six hours. In that situation, "aptitude update" segfaults. When I
> delete the file and have apt recreate it, things are fine again for a
> few hours before the file is broken again and the segfaults start over.
> In all cases, umounting the file system and doing an fsck does not
> show issues with the file system.

Are you using wireless networking of any kind? If so which driver and
security key system?

Might be useful if you could post 'dmesg' output so that people can see
the other hardware that you have.

Daniel
Re: 2.6.19 file content corruption on ext3
On Thu, 21 Dec 2006 14:03:20 +0100
Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> >
> > Btw,
> >  here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
>
> depmod: BADNESS: written outside isize 22183

akpm:/usr/src/module-init-tools-3.3-pre1> grep -r mmap .
./zlibsupport.c:map = mmap(0, *size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

So presumably it's in a library.

akpm:/usr/src/25> ldd /sbin/depmod
        linux-gate.so.1 => (0xe000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x46afa000)
        /lib/ld-linux.so.2 (0x4631d000)

worrisome.
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
>
> Btw,
>  here's a totally new tangent on this: it's possible that user code is
> simply BUGGY.

depmod: BADNESS: written outside isize 22183

---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..5db9fd9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2393,6 +2393,17 @@ int nobh_commit_write(struct file *file, struct page *page,
 }
 EXPORT_SYMBOL(nobh_commit_write);
 
+static void __check_tail_zero(char *kaddr, unsigned int offset)
+{
+	unsigned int check = 0;
+	do {
+		check += kaddr[offset++];
+	} while (offset < PAGE_CACHE_SIZE);
+	if (check)
+		printk(KERN_ERR "%s: BADNESS: written outside isize %u\n",
+				current->comm, check);
+}
+
 /*
  * nobh_writepage() - based on block_full_write_page() except
  * that it tries to operate without attaching bufferheads to
@@ -2437,6 +2448,7 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
@@ -2604,6 +2616,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
Re: 2.6.19 file content corruption on ext3
Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > > here's a totally new tangent on this: it's possible that user code is
> > > simply BUGGY.
>
> I'm sad to say this doesn't trigger :-(

Hi all,

I ran it a number of times on 2.6.16-1.2115_FC4 and always got

        ./a.out | od -x 000 020 040

but running it on 2.6.19-rc5 I always get zeros in the middle.

Steve
-- 
"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)

"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)
Re: 2.6.19 file content corruption on ext3
On Wed, 2006-12-20 at 18:30 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> > On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > >
> > > > > OR:
> > > > >
> > > > >  - page_mkclean_one() is simply buggy.
> > > >
> > > > GOLD!
> > > >
> > > > it seems to work with all this (full diff against current git).
> > > >
> > > > /me rebuilds full kernel to make sure...
> > > > reboot...
> > > > test... pff the tension...
> > > > yay, still good!
> > > >
> > > > Andrei; would you please verify.
> > >
> > > I have corrupted files.
> >
> > drad; and with this patch:
> >   http://lkml.org/lkml/2006/12/20/112
>
> Hash check on download completion found bad chunks, consider using
> "safe_sync".

*sigh* back to square 1.
and I need to look at my reproduction case ;-(

Thanks for testing.
Re: 2.6.19 file content corruption on ext3
On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > >
> > > > OR:
> > > >
> > > >  - page_mkclean_one() is simply buggy.
> > >
> > > GOLD!
> > >
> > > it seems to work with all this (full diff against current git).
> > >
> > > /me rebuilds full kernel to make sure...
> > > reboot...
> > > test... pff the tension...
> > > yay, still good!
> > >
> > > Andrei; would you please verify.
> >
> > I have corrupted files.
>
> drad; and with this patch:
>   http://lkml.org/lkml/2006/12/20/112

Hash check on download completion found bad chunks, consider using
"safe_sync".

>
> /me goes rebuild his kernel and try more than 3 times
>
Re: 2.6.19 file content corruption on ext3
On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:
> Also, what is this page_test_and_clear_dirty() business, that seems to
> be exclusively s390 btw. However they do seem to need this.
>
> > But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
> > hopefully also work.
>
> Yeah, probably, not optimally so on some archs that don't actually need
> the flush though. And as above, I wonder about s390.

Simple, the s390 architecture does not keep the dirty bit in the pte but
in something called the storage key. For each physical page there is one
associated storage key. It is accessed with special instructions like
"iske", "sske" or "rrbe". To clear the dirty bit the storage key of a
page is read with iske, the bit is cleared and the storage key is stored
back with sske. That means that clearing the dirty bit is not an atomic
operation. rrbe is used to test and clear the referenced bit (young/old
information) and is atomic in regard to other storage key operations.

If you think about it, the storage keys are quite nice for the operating
system, page_referenced() can be implemented with a single test
"page_test_and_clear_young()". No need to read all the ptes pointing to
the page. The downside is that the storage keys have a cost on the
hardware side.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.
Re: 2.6.19 file content corruption on ext3
On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > OR:
> > >
> > >  - page_mkclean_one() is simply buggy.
> >
> > GOLD!
> >
> > it seems to work with all this (full diff against current git).
> >
> > /me rebuilds full kernel to make sure...
> > reboot...
> > test... pff the tension...
> > yay, still good!
> >
> > Andrei; would you please verify.
>
> I have corrupted files.

drad; and with this patch:
  http://lkml.org/lkml/2006/12/20/112

/me goes rebuild his kernel and try more than 3 times
Re: 2.6.19 file content corruption on ext3
On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > OR:
> >
> >  - page_mkclean_one() is simply buggy.
>
> GOLD!
>
> it seems to work with all this (full diff against current git).
>
> /me rebuilds full kernel to make sure...
> reboot...
> test... pff the tension...
> yay, still good!
>
> Andrei; would you please verify.

I have corrupted files.

> The magic seems to be in the extra tlb flush after clearing the dirty
> bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry.
>
> diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
> index 5e7cd45..2b8893b 100644
> --- a/drivers/connector/connector.c
> +++ b/drivers/connector/connector.c
> @@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
>  	spin_lock_bh(&dev->cbdev->queue_lock);
>  	list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) {
>  		if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
> -			if (likely(!test_bit(WORK_STRUCT_PENDING,
> -					&__cbq->work.work.management) &&
> +			if (likely(!delayed_work_pending(&__cbq->work) &&
>  					__cbq->data.ddata == NULL)) {
>  				__cbq->data.callback_priv = msg;
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;
> diff --git a/mm/memory.c b/mm/memory.c
> index c00bac6..60e0945 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);
>
> +static void check_last_page(struct address_space *mapping, loff_t size)
> +{
> +	pgoff_t index;
> +	unsigned int offset;
> +	struct page *page;
> +
> +	if (!mapping)
> +		return;
> +	offset = size & ~PAGE_MASK;
> +	if (!offset)
> +		return;
> +	index = size >> PAGE_SHIFT;
> +	page = find_lock_page(mapping, index);
> +	if (page) {
> +		unsigned int check = 0;
> +		unsigned char *kaddr = kmap_atomic(page, KM_USER0);
> +		do {
> +			check += kaddr[offset++];
> +		} while (offset < PAGE_SIZE);
> +		kunmap_atomic(kaddr, KM_USER0);
> +		unlock_page(page);
> +		page_cache_release(page);
> +		if (check)
> +			printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check);
> +	}
> +}
> +
>  /**
>   * vmtruncate - unmap mappings "freed" by truncate() syscall
>   * @inode: inode of the file used
> @@ -1875,6 +1902,7 @@ do_expand:
>  		goto out_sig;
>  	if (offset > inode->i_sb->s_maxbytes)
>  		goto out_big;
> +	check_last_page(mapping, inode->i_size);
>  	i_size_write(inode, offset);
>
>  out_truncate:
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 237107c..f561e72 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page)
>  EXPORT_SYMBOL(test_set_page_writeback);
>
>  /*
> - * Return true if any of the pages in the mapping are marged with the
> + * Return true if any of the pages in the mapping are marked with the
>   * passed tag.
>   */
>  int mapping_tagged(struct address_space *mapping, int tag)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..900229a 100644
>
Re: 2.6.19 file content corruption on ext3
> Hmm, should we not flush after clearing the dirty bit? That is, why does
> ptep_clear_flush_dirty() need a flush after clearing that bit? does it
> leak through in the tlb copy?

afaics you need to
1) clear
2) flush
3) check and go to 1) if needed

to be race free.
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:

> Pls test.

Is good. Only s390 remains a question.

Another point, change_protection() also does a cache flush, should we
too?

> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..eec8706 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
>  		goto unlock;
>
>  	entry = ptep_get_and_clear(mm, address, pte);

	flush_cache_page(vma, address, pte_pfn(entry));

> +	flush_tlb_page(vma, address);
>  	entry = pte_mkclean(entry);
>  	entry = pte_wrprotect(entry);
> -	ptep_establish(vma, address, pte, entry);
> +	set_pte_at(mm, address, pte, entry);
>  	lazy_mmu_prot_update(entry);
>  	ret = 1;
>
Re: 2.6.19 file content corruption on ext3
On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:

> I will try, but I had a look around the different architectures
> implementation of ptep_clear_flush_dirty() and saw that not all do the
> actual flush. So if we go down this road perhaps we should introduce
> another per arch function that does the potential flush. like
> flush_tlb_on_clear_dirty() or something like that.

never mind, we do need an unconditional flush for changing the
protection too.
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:
>
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > OR:
> > >
> > >  - page_mkclean_one() is simply buggy.
> >
> > GOLD!
>
> Ok. I was looking at that, and I wondered..
>
> However, if that works, then I _think_ the correct sequence is the
> following..
>
> The rule should be:
>  - we flush the tlb _after_ we have cleared it, but _before_ we insert the
>    new entry.
>
> But I dunno. These things are damn subtle. Does this patch fix it for you?

I will try, but I had a look around the different architectures'
implementations of ptep_clear_flush_dirty() and saw that not all do the
actual flush. So if we go down this road perhaps we should introduce
another per arch function that does the potential flush, like
flush_tlb_on_clear_dirty() or something like that.

Then we could write:

	entry = ptep_get_and_clear(mm, address, ptep);
	flush_tlb_on_clear_dirty(vma, address);
	entry = pte_mkclean(entry);
	entry = pte_wrprotect(entry);
	set_pte_at(mm, address, ptep, entry);

> I actually suspect we should do this as an arch-specific macro, and
> totally replace the current "ptep_clear_flush_dirty()" with one that does
> "ptep_clear_flush_dirty_and_set_wp()".
>
> Because what I'd _really_ prefer to do on x86 (and probably on most other
> sane architectures) is to do
>
>  - atomically replace the pte with the EXACT SAME ONE, but one that
>    has the writable bit clear.
>
> 	bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);
>
>  - flush the TLB, making sure that all CPU's will no longer write to it:
>
> 	flush_tlb_page(vma, address);
>
>  - finally, just fetch-and-clear the dirty bit (and since it's no longer
>    writable, nobody should be setting it any more)
>
> 	ret = bit_clear(__PAGE_BIT_DIRTY, &(ptep)->pte_low);
>
> and now we should be all done.

Hmm, should we not flush after clearing the dirty bit? That is, why does
ptep_clear_flush_dirty() need a flush after clearing that bit? does it
leak through in the tlb copy?

Also, what is this page_test_and_clear_dirty() business, that seems to
be exclusively s390 btw. However they do seem to need this.

> But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
> hopefully also work.

Yeah, probably, not optimally so on some archs that don't actually need
the flush though. And as above, I wonder about s390.

(added our s390 friends to the CC list)
Re: 2.6.19 file content corruption on ext3
On Wed, 2006-12-20 at 18:30 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> > On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > > > OR:
> > > > >
> > > > >  - page_mkclean_one() is simply buggy.
> > > >
> > > > GOLD!
> > > >
> > > > it seems to work with all this (full diff against current git).
> > > >
> > > > /me rebuilds full kernel to make sure...
> > > > reboot...
> > > > test... pff the tension...
> > > > yay, still good!
> > > >
> > > > Andrei; would you please verify.
> > >
> > > I have corrupted files.
> >
> > drad; and with this patch:
> >   http://lkml.org/lkml/2006/12/20/112
>
> Hash check on download completion found bad chunks, consider using
> "safe_sync".

*sigh* back to square 1.
and I need to look at my reproduction case ;-(

Thanks for testing.
Re: 2.6.19 file content corruption on ext3
Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > > here's a totally new tangent on this: it's possible that user code is
> > > simply BUGGY.
>
> I'm sad to say this doesn't trigger :-(

Hi all,

I ran it a number of times on 2.6.16-1.2115_FC4 and always got

    ./a.out | od -x 000 020 040

but running it on 2.6.19-rc5 I always get zeros in the middle.

Steve
--
They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety. (Ben Franklin)

The course of history shows that as a government grows, liberty
decreases. (Thomas Jefferson)
Re: 2.6.19 file content corruption on ext3
On 12/20/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Tue, 19 Dec 2006, Linus Torvalds wrote:
> >
> > here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
>
> Btw, here's a simpler test-program that actually shows the difference
> between 2.6.18 and 2.6.19 in action, and why it could explain why a
> program like rtorrent might show corruption behaviour that it didn't show
> before.

Kinda late to the discussion, but I guess I could summarize what
rtorrent actually does, or should be doing.

When downloading a new torrent, it will create the files and truncate
them to the final size. It will never call truncate after this and the
files will remain sparse until data is downloaded. A 'piece' is mapped
to memory using MAP_SHARED, which will be page aligned on single file
torrents but unlikely to be so on multi-file torrents.

So on multi-file torrents it'll often end up with two mappings
overlapping with one page, each of which only writes to its own part
of the page. These will then be sync'ed with MS_ASYNC, or MS_SYNC if low
on disk space. After that it might be unmapped, then mapped as
read-only.

I haven't thought of asking if single file torrents are ok.

Rakshasa
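The access pattern described above can be sketched in user-space C with plain POSIX calls. This is not rtorrent's code and is not expected to reproduce the corruption; the function name, the scratch-file template and the choice of a split point in the middle of the second page are all assumptions made for illustration. It creates a sparse file, maps two "pieces" with MAP_SHARED so that both mappings cover the same page, writes each piece's own part, calls msync() with MS_ASYNC, unmaps, and then reads the file back.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 0 if the file reads back exactly as written, 1 on a content
 * mismatch, and -1 if a system call failed.
 */
static int overlapping_piece_roundtrip(void)
{
	char path[] = "/tmp/piece-sketch-XXXXXX";	/* hypothetical scratch file */
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *a = MAP_FAILED, *b = MAP_FAILED, *buf = NULL;
	size_t size, split, i;
	int fd, ret = -1;

	if (page <= 0)
		return -1;
	size = 2 * (size_t)page;			/* final file size */
	split = (size_t)page + (size_t)page / 2;	/* piece boundary, mid-page */

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);					/* file vanishes on close */

	/* Create the file at its final size; it stays sparse until written. */
	if (ftruncate(fd, (off_t)size) != 0)
		goto out;

	/* Piece A is [0, split); piece B is [split, size). Both map page 1. */
	a = mmap(NULL, split, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	b = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		 (off_t)page);
	if (a == MAP_FAILED || b == MAP_FAILED)
		goto out;

	/* Each mapping writes only its own part of the shared page. */
	memset(a, 'A', split);
	memset(b + page / 2, 'B', size - split);

	if (msync(a, split, MS_ASYNC) != 0 ||
	    msync(b, (size_t)page, MS_ASYNC) != 0)
		goto out;

	munmap(a, split);
	a = MAP_FAILED;
	munmap(b, (size_t)page);
	b = MAP_FAILED;

	/* Read the file back through the ordinary read path and compare. */
	buf = malloc(size);
	if (!buf || pread(fd, buf, size, 0) != (ssize_t)size)
		goto out;
	ret = 0;
	for (i = 0; i < size; i++)
		if (buf[i] != (i < split ? 'A' : 'B'))
			ret = 1;
out:
	if (a != MAP_FAILED)
		munmap(a, split);
	if (b != MAP_FAILED)
		munmap(b, (size_t)page);
	free(buf);
	close(fd);
	return ret;
}
```

A return value of 1 would mean that the bytes read back differ from what was stored through the mappings, which is the kind of symptom the hash check in this thread reports.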
Re: 2.6.19 file content corruption on ext3
On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > OR:
> >
> >  - page_mkclean_one() is simply buggy.
>
> GOLD!

Ok. I was looking at that, and I wondered..

However, if that works, then I _think_ the correct sequence is the
following..

The rule should be:
 - we flush the tlb _after_ we have cleared it, but _before_ we insert the
   new entry.

But I dunno. These things are damn subtle. Does this patch fix it for you?

I actually suspect we should do this as an arch-specific macro, and
totally replace the current "ptep_clear_flush_dirty()" with one that does
"ptep_clear_flush_dirty_and_set_wp()".

Because what I'd _really_ prefer to do on x86 (and probably on most other
sane architectures) is to do

 - atomically replace the pte with the EXACT SAME ONE, but one that
   has the writable bit clear.

	bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);

 - flush the TLB, making sure that all CPU's will no longer write to it:

	flush_tlb_page(vma, address);

 - finally, just fetch-and-clear the dirty bit (and since it's no longer
   writable, nobody should be setting it any more)

	ret = bit_clear(__PAGE_BIT_DIRTY, &(ptep)->pte_low);

and now we should be all done.

But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
hopefully also work.

Pls test.

		Linus

diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..eec8706 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 		goto unlock;

 	entry = ptep_get_and_clear(mm, address, pte);
+	flush_tlb_page(vma, address);
 	entry = pte_mkclean(entry);
 	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
+	set_pte_at(mm, address, pte, entry);
 	lazy_mmu_prot_update(entry);
 	ret = 1;

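Why a stale TLB entry matters here can be shown with a toy user-space model. This is a sketch under loud assumptions: `struct model`, `cpu_store()` and `mkclean()` are hypothetical, the model has one page and one CPU, and it assumes x86-like behaviour where a CPU holding a cached writable and dirty translation does not touch the page table entry again on a store. The no-flush variant is a deliberately exaggerated case, not the pre-patch kernel code (which flushed, but only after installing the new entry).

```c
#include <assert.h>
#include <stdbool.h>

struct pte { bool writable, dirty; };

struct model {
	struct pte pte;		/* page table entry in memory */
	bool tlb_valid;		/* does the CPU hold a cached translation? */
	struct pte tlb;		/* the cached copy */
	int write_faults;	/* write-protect faults seen by the "kernel" */
};

/* A CPU store to the page. Returns true if it completed without a fault. */
static bool cpu_store(struct model *m)
{
	if (!m->tlb_valid) {		/* TLB miss: reload from the page table */
		m->tlb = m->pte;
		m->tlb_valid = true;
	}
	if (!m->tlb.writable) {
		m->write_faults++;	/* the kernel hears about this write */
		return false;
	}
	if (!m->tlb.dirty) {		/* first store through this entry */
		m->tlb.dirty = true;
		m->pte.dirty = true;
	}
	return true;			/* cached dirty: the PTE is not touched */
}

/* Clean and write-protect the PTE, optionally flushing in between. */
static void mkclean(struct model *m, bool flush)
{
	struct pte entry = m->pte;	/* models ptep_get_and_clear() */

	if (flush)
		m->tlb_valid = false;	/* models flush_tlb_page() */
	entry.dirty = false;		/* models pte_mkclean() */
	entry.writable = false;		/* models pte_wrprotect() */
	m->pte = entry;			/* models set_pte_at() */
}
```

In the model, a store after `mkclean(&m, false)` succeeds through the stale entry and leaves the PTE clean, so the modification goes unnoticed; after `mkclean(&m, true)` the same store takes a write-protect fault instead.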
Re: 2.6.19 file content corruption on ext3
On Tue, 19 Dec 2006 16:03:49 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> > On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> > > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > >
> > > Peter, have you been able to trigger the corruption?
> >
> > Yes; however the mail I sent describing that seems to be lost in space.
>
> Btw, can somebody actually explain the mess that is ext3 "dirtying".
>
> Ext3 does NOT use __set_page_dirty_buffers. It does
>
> 	static int ext3_journalled_set_page_dirty(struct page *page)
> 	{
> 		SetPageChecked(page);
> 		return __set_page_dirty_nobuffers(page);
> 	}
>
> and uses that "Checked" bit as a "whole page is dirty" bit (which it tests
> in "writepage()").

This is purely for data=journal, which is rarely used.

In journalled-data mode, write(), write-fault, etc are not allowed to dirty
the pages and buffers, because the data has to be written to the journal
first. Only after the data has been written to the journal do we mark
buffers (and hence pages) dirty as far as the VFS is concerned, for
checkpointing the data back to its real place on the disk.

For MAP_SHARED pages ext3 cheats madly and doesn't journal the data at all.
In all journalling modes, MAP_SHARED data follows the regular ext2-style
handling. Which is a bit of a wart.
Re: 2.6.19 file content corruption on ext3
On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> >
> > Peter, have you been able to trigger the corruption?
>
> Yes; however the mail I sent describing that seems to be lost in space.

Btw, can somebody actually explain the mess that is ext3 "dirtying".

Ext3 does NOT use __set_page_dirty_buffers. It does

	static int ext3_journalled_set_page_dirty(struct page *page)
	{
		SetPageChecked(page);
		return __set_page_dirty_nobuffers(page);
	}

and uses that "Checked" bit as a "whole page is dirty" bit (which it tests
in "writepage()").

You realize what this all means? It means that ANYTHING that actually
clears the _real_ dirty bit won't actually be doing anything at all for
ext3, because the Checked bit will still stay set, and any IO down the
line on that page would totally ignore the dirty bits on the buffer heads
and just write out everything.

That is "The Mess(tm)".

It also basically means that anything that clears the dirty bit without
just calling "writepage()" had _better_ call "invalidatepage()" for the
whole page, because otherwise the PageChecked bit will never be cleared
as far as I can see. Happily, at least ext3 seems to _test_ for that case
in the release_page() function, so it appears that we do do this.

But this seems to just strengthen my argument: you can NEVER clean a page,
unless you

 (a) do IO on it immediately afterwards (writeback)

or

 (b) invalidate it entirely (truncate).

I'd really like to see just those two functions exist. Preferably in a
form where you can see easily that we actually follow those rules. Rather
than having a confusing set of "clear_page_dirty()" and
"test_and_clear_page_dirty()" functions that are called from random places.

IOW, I think the "clear_page_dirty_for_io()" is fine (it's case (a)
above), and then we should probably have a "cancel_dirty_page()" function
that does all the current clear_page_dirty() but also makes sure that we
actually call the invalidate_page() function itself.

Hmm?

		Linus
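The two-entry-point rule argued for above can be illustrated with a toy user-space model. Everything here is hypothetical: `struct page_model`, the `fs_*` helpers and the `model_*` functions are not kernel interfaces, and the model keeps only a dirty bit, a filesystem-private "checked" bit and a write counter. It contrasts a bare dirty-bit clear, which leaves the private bit behind, with the two cases (a) and (b).

```c
#include <assert.h>
#include <stdbool.h>

struct page_model {
	bool dirty;	/* the VM's dirty bit */
	bool checked;	/* filesystem-private "whole page is dirty" bit */
	int writes;	/* how many times the page was written out */
};

/* Dirtying in the style quoted above: set the private bit, then dirty. */
static void fs_set_page_dirty(struct page_model *p)
{
	p->checked = true;
	p->dirty = true;
}

/* Filesystem write-out: writes the page and drops the private bit. */
static void fs_writepage(struct page_model *p)
{
	p->writes++;
	p->checked = false;
}

/* Filesystem invalidation: drops the private bit without any IO. */
static void fs_invalidatepage(struct page_model *p)
{
	p->checked = false;
}

/* The pattern being criticised: only the VM's bit is cleared. */
static void model_bare_clear_dirty(struct page_model *p)
{
	p->dirty = false;
}

/* Case (a): clear dirty only as the first step of writing the page. */
static void model_clear_dirty_for_io(struct page_model *p)
{
	bool was_dirty = p->dirty;

	p->dirty = false;
	if (was_dirty)
		fs_writepage(p);
}

/* Case (b): clear dirty and tell the filesystem to forget the page. */
static void model_cancel_dirty_page(struct page_model *p)
{
	p->dirty = false;
	fs_invalidatepage(p);
}
```

After `model_bare_clear_dirty()` the page looks clean to the VM while `checked` is still set; after either of the other two calls both bits agree.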
Re: 2.6.19 file content corruption on ext3
On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote: > OR: > > - page_mkclean_one() is simply buggy. GOLD! it seems to work with all this (full diff against current git). /me rebuilds full kernel to make sure... reboot... test... pff the tension... yay, still good! Andrei; would you please verify. The magic seems to be in the extra tlb flush after clearing the dirty bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry. diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index 5e7cd45..2b8893b 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v spin_lock_bh(>cbdev->queue_lock); list_for_each_entry(__cbq, >cbdev->queue_list, callback_entry) { if (cn_cb_equal(&__cbq->id.id, >id)) { - if (likely(!test_bit(WORK_STRUCT_PENDING, -&__cbq->work.work.management) && + if (likely(!delayed_work_pending(&__cbq->work) && __cbq->data.ddata == NULL)) { __cbq->data.callback_priv = msg; diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) spin_lock(>private_lock); ret = drop_buffers(page, _to_free); spin_unlock(>private_lock); - if (ret) { - /* -* If the filesystem writes its buffers by hand (eg ext3) -* then we can have clean buffers against a dirty page. We -* clean the page here; otherwise later reattachment of buffers -* could encounter a non-uptodate page, which is unresolvable. -* This only applies in the rare case where try_to_free_buffers -* succeeds but the page is not freed. 
-* -* Also, during truncate, discard_buffer will have marked all -* the page's buffers clean. We discover that here and clean -* the page also. -*/ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/mm/memory.c b/mm/memory.c index c00bac6..60e0945 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping, } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr, KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1875,6 +1902,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + check_last_page(mapping, inode->i_size); i_size_write(inode, offset); out_truncate: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..f561e72 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page) EXPORT_SYMBOL(test_set_page_writeback); /* - * Return true if any of the pages in the mapping are marged with the + * Return true if any of the pages in the mapping are marked with the * passed tag. 
*/ int mapping_tagged(struct address_space *mapping, int tag) diff --git a/mm/rmap.c b/mm/rmap.c index d8a842a..900229a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte,
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote: > Well... we'd need to see (corruption && this-not-triggering) to be sure. > > Peter, have you been able to trigger the corruption? Yes; however the mail I sent describing that seems to be lost in space. /me quotes from the sent folder: > The bad news is, that doesn't help either. The good news is I can > reproduce it. > > What I did to achieve that: > > - get a sizable torrent from legaltorrents.com / or create a torrent > yourself that is around ~600M and has multiple files. > > - start a tracker, and multiple seeds (I used three machines here) > > - pull the torrent on a fourth machine > > the seeding machines don't much matter of course. > > the fourth machine was a dual core x86-64 with an SMP kernel and > PREEMPT, mem=256M (so that the torrent is quite a bit larger and does > require writeout) and I used an ext3 partition with 1k blocks.
Re: 2.6.19 file content corruption on ext3
On Wed, 2006-12-20 at 00:06 +0100, Peter Zijlstra wrote: > On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote: > > > Well... we'd need to see (corruption && this-not-triggering) to be sure. > > > > Peter, have you been able to trigger the corruption? > > Yes; however the mail I sent describing that seems to be lost in space. > > /me quotes from the sent folder: > > > The bad news is, that doesn't help either. The good news is I can > > reproduce it. > > > > What I did to achieve that: > > > > - get a sizable torrent from legaltorrents.com / or create a torrent > > yourself that is around ~600M and has multiple files. > > > > - start a tracker, and multiple seeds (I used three machines here) > > > > - pull the torrent on a fourth machine > > > > the seeding machines don't much matter of course. > > > > the fourth machine was a dual core x86-64 with an SMP kernel and > > PREEMPT, mem=256M (so that the torrent is quite a bit larger and does > > require writeout) and I used an ext3 partition with 1k blocks. PS. this was a reply to: http://lkml.org/lkml/2006/12/19/121
Re: 2.6.19 file content corruption on ext3
On Tue, 19 Dec 2006 14:51:55 -0800 (PST) Linus Torvalds <[EMAIL PROTECTED]> wrote: > > > On Tue, 19 Dec 2006, Peter Zijlstra wrote: > > > On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote: > > > > > > On Tue, 19 Dec 2006, Linus Torvalds wrote: > > > > > > > > here's a totally new tangent on this: it's possible that user code is > > > > simply BUGGY. > > > > I'm sad to say this doesn't trigger :-( > > Oh, well. It was a theory. > Well... we'd need to see (corruption && this-not-triggering) to be sure. Peter, have you been able to trigger the corruption?
Re: 2.6.19 file content corruption on ext3
On Tue, 19 Dec 2006, Peter Zijlstra wrote: > On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote: > > > > On Tue, 19 Dec 2006, Linus Torvalds wrote: > > > > > > here's a totally new tangent on this: it's possible that user code is > > > simply BUGGY. > > I'm sad to say this doesn't trigger :-( Oh, well. It was a theory. Linus
Re: 2.6.19 file content corruption on ext3
* Linus Torvalds: > Now, this should _matter_ only for user processes that are buggy, > and that have written to the page _before_ extending it with > ftruncate(). APT seems to properly extend the file before mapping it, by writing a zero byte at the desired position (creating a hole). 24986 open("/var/cache/apt/pkgcache.bin", O_RDWR|O_CREAT|O_TRUNC, 0666) = 6 24986 lseek(6, 12582911, SEEK_SET) = 12582911 24986 write(6, "\0", 1) = 1 24986 mmap(NULL, 12582912, PROT_READ|PROT_WRITE, MAP_SHARED, 6, 0) = 0x2b6578636000 24986 msync(0x2b6578636000, 7464112, MS_SYNC) = 0 24986 msync(0x2b6578636000, 8656, MS_SYNC) = 0 24986 munmap(0x2b6578636000, 12582912) = 0 24986 ftruncate(6, 7464112) = 0 24986 fstat(6, {st_mode=S_IFREG|0644, st_size=7464112, ...}) = 0 24986 mmap(NULL, 7464112, PROT_READ, MAP_SHARED, 6, 0) = 0x2b6578636000 APT's code is pretty convoluted, though, and there might be some code path in it that gets it wrong. 8-P
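The order of operations visible in the strace output above (give the file its full size first, map it, write through the mapping, sync, unmap, then trim) can be sketched in user space. This is a minimal sketch, not APT's code: `map_sized_file` and `unmap_and_trim` are hypothetical helper names, the file is sized with `ftruncate()` instead of APT's lseek-plus-one-byte write, and error handling is reduced to return codes.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create the file, size it to len bytes, and only
 * then map it, so every byte of the mapping is backed by the file.
 */
static void *map_sized_file(const char *path, size_t len, int *fd_out)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
	void *map;

	if (fd < 0)
		return MAP_FAILED;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return MAP_FAILED;
	}
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return MAP_FAILED;
	}
	*fd_out = fd;
	return map;
}

/*
 * Hypothetical helper: flush the mapping, drop it, and shrink the file
 * to the number of bytes actually used.
 */
static int unmap_and_trim(void *map, size_t map_len, int fd, size_t used)
{
	if (msync(map, map_len, MS_SYNC) < 0)
		return -1;
	if (munmap(map, map_len) < 0)
		return -1;
	return ftruncate(fd, (off_t)used);
}
```

Because the file never shrinks below a byte that has been written through the mapping until after the data is synced, the writeout path never finds stray data past the end of the file in this sequence.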
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote: > > On Tue, 19 Dec 2006, Linus Torvalds wrote: > > > > here's a totally new tangent on this: it's possible that user code is > > simply BUGGY. I'm sad to say this doesn't trigger :-(
Re: 2.6.19 file content corruption on ext3
On Mon, 18 Dec 2006, Linus Torvalds wrote: > On Tue, 19 Dec 2006, Nick Piggin wrote: > > > > We never want to drop dirty data! (ignoring the truncate case, which is > > handled privately by truncate anyway) > > Bzzt. > > SURE we do. > > We absolutely do want to drop dirty data in the writeout path. > > How do you think dirty data ever _becomes_ clean data? > > In other words, yes, we _do_ want to test-and-clear all the pgtable bits > _and_ the PG_dirty bit. We want to do it for: > - writeout > - truncate > - possibly a "drop" event (which could be a case for a journal entry that >becomes stale due to being replaced or something - kind of "truncate" >on metadata) > > because both of those events _literally_ turn dirty state into clean > state. > > In no other circumstance do we ever want to clear a dirty bit, as far as I > can tell. i admit this may not be entirely relevant, but it seems like a good place to bring up an old problem: when a disk dies with lots of queued writes it can totally bring a system to its knees... even after the disk is removed. i wrote up something about this a while ago: http://lkml.org/lkml/2005/8/18/243 so there's another reason to "clear a dirty bit"... well, in fact -- drop the pages entirely. -dean
Re: 2.6.19 file content corruption on ext3
On Tue, 19 Dec 2006, Linus Torvalds wrote: > > here's a totally new tangent on this: it's possible that user code is > simply BUGGY. Btw, here's a simpler test-program that actually shows the difference between 2.6.18 and 2.6.19 in action, and why it could explain why a program like rtorrent might show corruption behaviour that it didn't show before.

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(int argc, char **argv)
{
	char *mapping;
	int fd;

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 10) < 0)
		return -1;
	mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (-1 == (int)(long)mapping)
		return -1;
	memset(mapping, 0xaa, 20);
	sync();
	if (ftruncate(fd, 40) < 0)
		return -1;
	memset(mapping + 20, 0x55, 20);
	write(1, mapping, 40);
	return 0;
}

Notice the "sync()" in between the "memset()" and the "ftruncate()". In 2.6.18, that would normally do absolutely _nothing_ to the shared memory mapping, because we simply couldn't track pages that were dirty in the page tables. So in 2.6.18, if you try this, with ./a.out | od -x you should see something like 000 020 040 050 which matches your memset() patterns: 20 bytes of 0xaa, and 20 bytes of 0x55. HOWEVER. In 2.6.19, because we actually track dirty data so much better, "sync()" will actually be smart enough to write out the dirty mmap'ed data too. But since the user program has only allocated ten bytes for it in the file, when it is written out, the rest of the page is cleared. When you then write the last 20 bytes (after _properly_ allocating memory for them), you should now see a pattern like 000 020 040 050 instead: with ten bytes of zero in between, because the data that couldn't be written out was cleared. So 2.6.19 is strictly _better_, but exactly because it's tracking dirty status much more precisely, you'll see certain user-level bugs much more easily. NOTE NOTE NOTE!
The code really _was_ buggy in 2.6.18 too, and you _can_ get the zeroes in the middle of the file with an older kernel. But in older kernels, you need to be really really unlucky, and have the page cleaned by strong memory pressure. In 2.6.19, any "sync()" activity (including from the outside) will clean the page, so a user program with this bug can just be made to trigger the bug much more easily. Linus
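The difference described above can be replayed without a kernel in a small model. This is a sketch only: `model_writeout` and `run_scenario` are hypothetical names, a single 4096-byte buffer stands in for the page-cache page, and the only kernel behaviour modelled is that writeout clears every byte of the page at or past i_size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/*
 * Model of writing out page 0 of a file whose size is i_size: the part
 * of the page past the end of the file is zeroed in the page itself.
 */
static void model_writeout(unsigned char *page, size_t i_size)
{
	if (i_size < MODEL_PAGE_SIZE)
		memset(page + i_size, 0, MODEL_PAGE_SIZE - i_size);
}

/*
 * Replays the steps of the test program against the model. With
 * sync_in_between == 0 the page is never cleaned before the file is
 * extended; with a non-zero value a writeout happens while the file is
 * still only 10 bytes long.
 */
static void run_scenario(unsigned char *page, int sync_in_between)
{
	size_t i_size;

	memset(page, 0, MODEL_PAGE_SIZE);
	i_size = 10;				/* ftruncate(fd, 10) */
	memset(page, 0xaa, 20);			/* 10 bytes land past i_size */
	if (sync_in_between)
		model_writeout(page, i_size);	/* sync() cleans the page */
	i_size = 40;				/* ftruncate(fd, 40) */
	memset(page + 20, 0x55, 20);
	(void)i_size;				/* final size, unused by the model */
}
```

Without the intermediate writeout the first 40 bytes are 20 bytes of 0xaa followed by 20 bytes of 0x55; with it, bytes 10 through 19 come back as zero, which is the pattern the mail attributes to 2.6.19.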
Re: 2.6.19 file content corruption on ext3
Btw, here's a totally new tangent on this: it's possible that user code is simply BUGGY. There is one case where the kernel actually forcibly writes zeroes into a file: when we're writing a page that straddles the "inode->i_size" boundary. See the various writepages in fs/buffer.c, they all contain variations on that theme (although most of them aren't as well commented as this snippet): /* * The page straddles i_size. It must be zeroed out on each and every * writepage invocation because it may be mmapped. "A file is mapped * in multiples of the page size. For a file that is not a multiple of * the page size, the remaining memory is zeroed when mapped, and * writes to that region are not written out to the file." */ kaddr = kmap_atomic(page, KM_USER0); memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); flush_dcache_page(page); kunmap_atomic(kaddr, KM_USER0); Now, this should _matter_ only for user processes that are buggy, and that have written to the page _before_ extending it with ftruncate(). That's definitely a serious bug, but it's one that can go totally undetected depending on when the actual write-out happens. So what I'm saying is that if we end up writing things earlier thanks to the more aggressive dirty-page-management thing in 2.6.19, we might actually just expose a long-time userspace bug that was just a LOT harder to trigger before.. I'm not saying this is the cause of all this, but we've been tearing our hair out, and it might be worthwhile trying this really really really stupid patch that will notice when that happens at truncate() time, and tell the user that he's a total idiot. Or something to that effect. Maybe the reason this is so easy to trigger with rtorrent is not because rtorrent does some magic pattern that triggers a kernel bug, but simply because rtorrent itself might have a bug. Ok, so it's a long shot, but it's still worth testing, I suspect.
The patch is very simple: whenever we do an _expanding_ truncate, we check the last page of the _old_ size, and if there were non-zero contents past the old size, we complain. As an attachement is a test-program that _should_ trigger a kernel message like a.out: BADNESS: truncate check 17000 for good measure, just so that you can verify that the patch works and actually catches this case. (The 17000 number is just the one-hundred _invalid_ 0xaa bytes - out of the 200 we wrote - that were summed up: 100*0xaa == 17000. Anything non-zero is always a bug). I doubt this is really it, but it's worth trying. If you fill out a page, and only do "ftruncate()" in response to SIGBUS messages (and don't truncate to whole pages), you could potentially see zeroes at the end of the page exactly because _writeout_ cleared the page for you! So it _could_ explain the symptoms, but only if user-space was horribly horribly broken. Linus diff --git a/mm/memory.c b/mm/memory.c index c00bac6..79cecab 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping, } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1875,6 +1902,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + check_last_page(mapping, 
inode->i_size); i_size_write(inode, offset); out_truncate:#include #include #include #include int main(int argc, char **argv) { char *mapping; int fd; fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666); if (fd < 0) return -1; if (ftruncate(fd, 10) < 0) return -1; mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (-1 == (int)(long)mapping) return -1; memset(mapping, 0x55, 10); if (ftruncate(fd, 100) < 0) return -1;
Re: 2.6.19 file content corruption on ext3
On Tue, 19 Dec 2006, Nick Piggin wrote: > > Counterexample? Well AFAIKS, the clearing of PG_dirty in ttfb() in > response to finding all buffers clean is perfectly valid. What makes > you think otherwise? If the page really is clean, then why the heck can't we just clean the page table bits too? Either it's clean or it isn't. If all the buffers being clean means that the page is clean, then it's clean. WE SHOULD NOT THINK THAT PTE'S ARE ANY DIFFERENT. I really don't see your point. Is it clean? If it is, then clear the damn dirty bits from the page tables too. Don't go pussyfooting around the issue and confuse yourself and everybody but me by saying "but if it's dirty in the page tables, it's magically dirty". NO. It really is that simple. Is it clean or not? If it's clean, you can remove ALL the dirty bits. Not just some. Linus
Re: 2.6.19 file content corruption on ext3
On Tue, 19 Dec 2006, Nick Piggin wrote: > > Now I'm not exactly sure how ext3 (or any other) filesystems make use > of this particular feature of try_to_free_buffers(), but it is clear > from the comments what it is for. So your patch isn't really a minimal > fix (ie. it would require an OK from all filesystems, wouldn't it?) > > Or did I miss a mail where you reasoned that it is safe to make this > change (/me goes to reread the thread)... I'm saying it had _better_ be safe, and no, low-level filesystems don't actually matter. The page has to be cleanable _some_ way. So if we test for "page_dirty()" at the top, and just refuse to do it in try_to_free_pages(), we still know that the _proper_ page cleaning had better clean it. Because ttfp() is never going to clean the page in the general case _anyway_. So I'm really saying: - the page WILL be cleaned by the real page cleaning action (ie memory pressure or sync or something else causing us to go through the bog-standard page-based writeout). Does anybody dispute this? - the "ttfp()" hack was a HACK. It was an ugly and nasty hack even when it was first introduced. It gets doubly worse now that we know we have something wrong with page cleaning, and it has distracted from the real problem. - I removed that ugly and disgusting hack entirely at first, but Andrew points out that he really wants to keep the buffers there, because the buffers being clean actually say something. That, together with the fact that as long as the page is dirty, the buffers really do end up having a job to do, made me add a much smaller hack to replace the big ugly one ("don't even try, if the page is marked dirty"). - so with that thing in place, there isn't even any change in behaviour wrt the buffers and low-level filesystems. It's just that we make them a bit harder to get rid of.
But arguably that shouldn't actually ever really _happen_ anyway (because I think it's a BUG if the page is marked dirty but none of the buffers are), so I think that part is a non-issue. In other words, ttfp() _never_ had anything to do with "page cleaning". Not originally, not with the horrible hack, and not with my patch. Trying to mix it in just caused a bug that _everybody_ agrees is a bug. It's not the bug we're chasing, but we've got three different patches to fix it (Andrew's, mine and yours), and mine is the simplest one by far especially in the long run, because it just REMOVES the ugly dependency. And yes, I probably care more about "in the long run" than most. To me, a bug is a bug even if it's _just_ a maintenance headache. Andrew's patch made things _worse_ ("magic insane flag"), and while yours didn't make the code worse, it still introduced the notion of a totally insane "clean the page but if the PTE's are dirty, do something else" notion. IF THE PAGE TRULY IS CLEAN (and both you and Andrew claim it is, if all buffers are clean - since you mark it clean in the non-mapped case) THEN YOU SHOULD BE ABLE TO CLEAN THE PAGE TABLE BITS TOO. And by claiming that the page table bits are different from PG_dirty, you're just making the issues worse. They shouldn't be. That's what the whole point of Peter's patch was: PG_dirty fundamentally _means_ that the page tables might be dirty too. That was the whole _point_ of doing all this in 2.6.19 in the first place. So if you cannot accept that page table bits should be on "equal footing" with PG_dirty, then you should just say "Let's remove Peter's patch entirely". Linus
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 21:58 +1100, Nick Piggin wrote: > Peter Zijlstra wrote: > > On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote: > > >>Well it used to be. After 2.6.19 it can do the wrong thing for mapped > >>pages. But it turns out that we don't feed it mapped pages, apart from > >>pagevec_strip() and possibly races against pagefaults. > > > > > > So how about this: > > Well that's still racy. Anyway several earlier patches (including > the one I posted) closed this race. Some were still reported to > trigger corruption IIRC. I can't remember a patch that removes mapped pages from this code path, however I could have missed it. All out removing the mapping branch in ttfb() did also fix the problem - which is a superset of page_mapped(). I'm now building a kernel with this patch, and will submit that to rtorrent with mem=256M on a 1k ext3 filesystem on x86_64 smp preempt. --- fs/buffer.c | 32 +++- 1 file changed, 31 insertions(+), 1 deletion(-) Index: linux-2.6/fs/buffer.c === --- linux-2.6.orig/fs/buffer.c +++ linux-2.6/fs/buffer.c @@ -2798,11 +2798,38 @@ static inline int buffer_busy(struct buf (bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock))); } +/* + * AKPM sayeth: + * + * - a process does a one-byte-write to a file on a 64k pagesize, 4k + * blocksize ext3 filesystem. The page is now PageDirty, !PgeUptodate and + * has one dirty buffer and 15 not uptodate buffers. + * + * - kjournald writes the dirty buffer. The page is now PageDirty, + * !PageUptodate and has a mix of clean and not uptodate buffers. + * + * - try_to_free_buffers() removes the page's buffers. It MUST now clear + * PageDirty. If we were to leave the page dirty then we'd have a dirty, not + * uptodate page with no buffer_heads. + * + * We're screwed: we cannot write the page because we don't know which + * sections of it contain garbage. We cannot read the page because we don't + * know which sections of it contain modified data. We cannot free the page + * because it is dirty. 
+ * + * However for mapped pages this is not true; mapped pages will be fully + * loaded and thus cannot have not uptodate buffers. + * + * Hence allow the PG_dirty bit to stay for pages that had no not uptodate + * buffers (and assert that mapped pages never have those). + */ + static int drop_buffers(struct page *page, struct buffer_head **buffers_to_free) { struct buffer_head *head = page_buffers(page); struct buffer_head *bh; + int uptodate = 1; bh = head; do { @@ -2818,11 +2845,14 @@ drop_buffers(struct page *page, struct b if (!list_empty(&bh->b_assoc_buffers)) __remove_assoc_queue(bh); + if (!buffer_uptodate(bh)) + uptodate = 0; bh = next; } while (bh != head); *buffers_to_free = head; __clear_page_buffers(page); - return 1; + VM_BUG_ON(page_mapped(page) && !uptodate); + return !uptodate; failed: return 0; }
Re: 2.6.19 file content corruption on ext3
Peter Zijlstra wrote: On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote: Well it used to be. After 2.6.19 it can do the wrong thing for mapped pages. But it turns out that we don't feed it mapped pages, apart from pagevec_strip() and possibly races against pagefaults. So how about this: Well that's still racy. Anyway several earlier patches (including the one I posted) closed this race. Some were still reported to trigger corruption IIRC. Index: linux-2.6-git/mm/page-writeback.c === --- linux-2.6-git.orig/mm/page-writeback.c 2006-12-19 08:24:48.0 +0100 +++ linux-2.6-git/mm/page-writeback.c 2006-12-19 11:43:31.0 +0100 @@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p struct address_space *mapping = page_mapping(page); unsigned long flags; + if (page_mapped(page)) + return 0; + if (!mapping) return TestClearPageDirty(page); - -- SUSE Labs, Novell Inc.
Re: 2.6.19 file content corruption on ext3
Andrew Morton wrote: On Tue, 19 Dec 2006 20:56:50 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote: I think it could be very likely that indeed the bug is a latent one in a clear_page_dirty caller, rather than dirty-tracking itself. The only callers are try_to_free_buffers(), truncate and a few scruffy possibly-wrong-for-fsync filesystems which aren't being used here. Well truncate/invalidate will not operate on mapped pages (barring the very-unlikely truncate/invalidate vs fault races). We can ignore those filesystems as they don't include ext3. Which brings us back to try_to_free_buffers(). Maybe it is something else entirely, but did try_to_free_buffers ever get completely cleared? Or was some of Andrei's corruption possibly leftover on-disk corruption from a previous kernel? -- SUSE Labs, Novell Inc.
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote: > On Tue, 19 Dec 2006 20:56:50 +1100 > Nick Piggin <[EMAIL PROTECTED]> wrote: > > > Linus Torvalds wrote: > > > > > NOTICE? First you make a BIG DEAL about how dirty bits should never get > > > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop > > > the dirty bit for when it's not in the page tables. > > > > try_to_free_buffers is quite a special case, where we're transferring > > the page dirty metadata from the buffers to the page. I think Andrew > > would have a better grasp of it so he could correct me, but what it > > does is legitimate. > > Well it used to be. After 2.6.19 it can do the wrong thing for mapped > pages. But it turns out that we don't feed it mapped pages, apart from > pagevec_strip() and possibly races against pagefaults. So how about this: Index: linux-2.6-git/mm/page-writeback.c === --- linux-2.6-git.orig/mm/page-writeback.c 2006-12-19 08:24:48.0 +0100 +++ linux-2.6-git/mm/page-writeback.c 2006-12-19 11:43:31.0 +0100 @@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p struct address_space *mapping = page_mapping(page); unsigned long flags; + if (page_mapped(page)) + return 0; + if (!mapping) return TestClearPageDirty(page);
Re: 2.6.19 file content corruption on ext3
On Tue, 19 Dec 2006 02:32:55 -0800 Andrew Morton <[EMAIL PROTECTED]> wrote: > > > If a write-fault races with a read-fault and the write-fault loses, we forget > to mark the page dirty. No that isn't right, is it. The writer just retakes the fault and all the right things happen. Ho hum.
Re: 2.6.19 file content corruption on ext3
Andrew Morton wrote: On Tue, 19 Dec 2006 20:56:50 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote: Linus Torvalds wrote: NOTICE? First you make a BIG DEAL about how dirty bits should never get lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop the dirty bit for when it's not in the page tables. try_to_free_buffers is quite a special case, where we're transferring the page dirty metadata from the buffers to the page. I think Andrew would have a better grasp of it so he could correct me, but what it does is legitimate. Well it used to be. After 2.6.19 it can do the wrong thing for mapped pages. Yes, that is what I was trying to get at. But it turns out that we don't feed it mapped pages, apart from pagevec_strip() and possibly races against pagefaults. True, and I think we have pretty well established that this isn't the cause of Andrei's problem, but I think we all agree it is *a* bug? And surely Andrei's data corruption will be of the same flavour in that test_clear_page_dirty somewhere is now stripping pte dirty bits where it shouldn't? (because it went away after Peter nooped that behaviour) I think it could be very likely that indeed the bug is a latent one in a clear_page_dirty caller, rather than dirty-tracking itself. The only callers are try_to_free_buffers(), truncate and a few scruffy possibly-wrong-for-fsync filesystems which aren't being used here. If a write-fault races with a read-fault and the write-fault loses, we forget to mark the page dirty. Hmm.. in that case will the pte still be readonly, and thus the write faulter will have to try again I think?
Something like this, but it's probably wrong - I didn't try very hard (am feeling ill, and vaguely grumpy) From: Andrew Morton <[EMAIL PROTECTED]> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> --- mm/memory.c | 12 1 file changed, 12 insertions(+) diff -puN mm/memory.c~a mm/memory.c --- a/mm/memory.c~a +++ a/mm/memory.c @@ -2264,10 +2264,22 @@ retry: } } else { /* One of our sibling threads was faster, back out. */ + if (write_access) { + /* +* We might have raced against a read-fault. We still +* need to dirty the page. +*/ + dirty_page = vm_normal_page(vma, address, *page_table); + if (dirty_page) { + get_page(dirty_page); + goto dirty_it; + } + } page_cache_release(new_page); goto unlock; } +dirty_it: /* no need to invalidate: a not-present page shouldn't be cached */ update_mmu_cache(vma, address, entry); lazy_mmu_prot_update(entry); _ -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.19 file content corruption on ext3
On Tue, 19 Dec 2006 20:56:50 +1100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> Linus Torvalds wrote:
>
> > NOTICE? First you make a BIG DEAL about how dirty bits should never get
> > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop
> > the dirty bit for when it's not in the page tables.
>
> try_to_free_buffers is quite a special case, where we're transferring
> the page dirty metadata from the buffers to the page. I think Andrew
> would have a better grasp of it so he could correct me, but what it
> does is legitimate.

Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
pages.  But it turns out that we don't feed it mapped pages, apart from
pagevec_strip() and possibly races against pagefaults.

> I think it could be very likely that indeed the bug is a latent one in
> a clear_page_dirty caller, rather than dirty-tracking itself.

The only callers are try_to_free_buffers(), truncate and a few scruffy
possibly-wrong-for-fsync filesystems which aren't being used here.

If a write-fault races with a read-fault and the write-fault loses, we forget
to mark the page dirty.

Something like this, but it's probably wrong - I didn't try very hard (am
feeling ill, and vaguely grumpy)

From: Andrew Morton <[EMAIL PROTECTED]>

Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 mm/memory.c |   12
 1 file changed, 12 insertions(+)

diff -puN mm/memory.c~a mm/memory.c
--- a/mm/memory.c~a
+++ a/mm/memory.c
@@ -2264,10 +2264,22 @@ retry:
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
+		if (write_access) {
+			/*
+			 * We might have raced against a read-fault.  We still
+			 * need to dirty the page.
+			 */
+			dirty_page = vm_normal_page(vma, address, *page_table);
+			if (dirty_page) {
+				get_page(dirty_page);
+				goto dirty_it;
+			}
+		}
 		page_cache_release(new_page);
 		goto unlock;
 	}
 
+dirty_it:
 	/* no need to invalidate: a not-present page shouldn't be cached */
 	update_mmu_cache(vma, address, entry);
 	lazy_mmu_prot_update(entry);
_
Re: 2.6.19 file content corruption on ext3
Linus Torvalds wrote:
> On Tue, 19 Dec 2006, Nick Piggin wrote:
>
> > > Anyway it has the same issues as the others. See what happens when you
> > > run two test_clear_page_dirty_sync_ptes() consecutively, you still lose
> > > PG_dirty even though the page might actually be dirty.
> >
> > How can this happen? We'll only test_clear_page_dirty_sync_ptes again
> > after buffers have been reattached, and subsequently cleaned. And in
> > that case if the ptes are still clean at this point then the page really
> > is clean.
>
> Why do you talk about buffers being reattached? Are you still in some
> world where "try_to_free_buffers()" matters? Have you not followed the

I'm talking about fixing just the race Andrew noticed via inspection.
No it doesn't appear to fix Andrei's problem, unfortunately. But it
needs to be fixed all the same, doesn't it?

> discussion? Why do you ignore my MUCH SIMPLER patch that just removed all
> this crap ENTIRELY from "try_to_free_buffers()", and the exact same
> corruption happened?
>
> Forget about "try_to_free_buffers()". Please apply this patch to your tree
> first. That gets rid of _one_ copy of totally insane code that did all the
> wrong things.
>
> Only after you have applied this patch should you look at the code again.
> Realizing that the corruption still happens. So forget about buffers
> already. That piece of code was crap.

Now I'm not exactly sure how ext3 (or any other) filesystems make use
of this particular feature of try_to_free_buffers(), but it is clear
from the comments what it is for. So your patch isn't really a minimal
fix (ie. it would require an OK from all filesystems, wouldn't it?)

Or did I miss a mail where you reasoned that it is safe to make this
change (/me goes to reread the thread)...

> 		Linus
> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;

-- 
SUSE Labs, Novell Inc.
Re: 2.6.19 file content corruption on ext3
* Marc Haber <[EMAIL PROTECTED]> [2006-12-19 09:51]:
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.

FWIW, the ARM box I see this on has only 32 MB memory (and a 133 or
266 MHz CPU).  I don't see it on another ARM box (different ARM
sub-arch) with 128 MB memory and a 600 MHz CPU.
-- 
Martin Michlmayr
http://www.cyrius.com/
Re: 2.6.19 file content corruption on ext3
On Tue, Dec 19, 2006 at 12:24:16AM -0800, Andrew Morton wrote:
> Wow.  I didn't expect that, because Marc Haber reported that ext3's
> data=writeback fixed it.  Maybe he didn't run it for long enough?

My test case is Debian's "aptitude update" running once an hour, and
it was always the same file getting corrupted. With 2.6.19, I had this
corruption about every third hour (but -only- if run from cron; running
from a shell was always fine). data=writeback made the issue disappear
for about two days before I booted into 2.6.19.1 without
data=writeback (defaults chosen then), after which the issue only
shows up about every other day.

So I feel somewhat out of the loop, since rtorrent seems much better at
reproducing this.

I notice, though, that both aptitude and rtorrent do downloads from
the net, so there might be a relation to TCP/IP and/or the network
driver. My box has a Linksys NC100 network card running with the tulip
driver.

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mail address in header
Mannheim, Germany  |  lose things."    Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 10:00 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:
>
> > Nobody has actually ever explained why "test_clear_page_dirty()" is good
> > at all.
> >
> >  - Why is it ever used instead of "clear_page_dirty_for_io()"?
> >
> >  - What is the difference?
> >
> >  - Why would you EVER want to clear bits just in the "struct page *" or
> >    just in the PTE's?
> >
> >  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
> >
> > In other words, I have a theory:
> >
> >  "A lot of this is actually historical cruft. Some of it may even be code
> >   that was never supposed to work, but because we maintained _other_ dirty
> >   bits in the PTE's, and never touched them before, we never even realized
> >   that the code that played with PG_dirty was totally insane"
> >
> > Now, that's just a theory. And yeah, it may be stated a bit provocatively.
> > It may not be entirely correct. I'm just saying.. maybe it is?
>
> On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
>
> > try_to_free_buffers() clears the page's dirty state if it successfully
> > removed the page's buffers.
> >
> > Background for this:
> >
> > - a process does a one-byte-write to a file on a 64k pagesize, 4k
> >   blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
> >   has one dirty buffer and 15 not uptodate buffers.
> >
> > - kjournald writes the dirty buffer.  The page is now PageDirty,
> >   !PageUptodate and has a mix of clean and not uptodate buffers.
> >
> > - try_to_free_buffers() removes the page's buffers.  It MUST now clear
> >   PageDirty.  If we were to leave the page dirty then we'd have a dirty,
> >   not uptodate page with no buffer_heads.
> >
> >   We're screwed: we cannot write the page because we don't know which
> >   sections of it contain garbage.  We cannot read the page because we
> >   don't know which sections of it contain modified data.  We cannot free
> >   the page because it is dirty.
>
> However!! this is not true for mapped pages because mapped pages must
> have the whole (16k in akpm's example) page loaded. Hence I suspect that
> what Andrei did by accident - remove the if (mapping) case in
> test_clear_page_dirty() - is actually totally correct.

Obviously I need my morning shot, 64k of course.
Re: 2.6.19 file content corruption on ext3
On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:

> Nobody has actually ever explained why "test_clear_page_dirty()" is good
> at all.
>
>  - Why is it ever used instead of "clear_page_dirty_for_io()"?
>
>  - What is the difference?
>
>  - Why would you EVER want to clear bits just in the "struct page *" or
>    just in the PTE's?
>
>  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
>
> In other words, I have a theory:
>
>  "A lot of this is actually historical cruft. Some of it may even be code
>   that was never supposed to work, but because we maintained _other_ dirty
>   bits in the PTE's, and never touched them before, we never even realized
>   that the code that played with PG_dirty was totally insane"
>
> Now, that's just a theory. And yeah, it may be stated a bit provocatively.
> It may not be entirely correct. I'm just saying.. maybe it is?

On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:

> try_to_free_buffers() clears the page's dirty state if it successfully removed
> the page's buffers.
>
> Background for this:
>
> - a process does a one-byte-write to a file on a 64k pagesize, 4k
>   blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
>   has one dirty buffer and 15 not uptodate buffers.
>
> - kjournald writes the dirty buffer.  The page is now PageDirty,
>   !PageUptodate and has a mix of clean and not uptodate buffers.
>
> - try_to_free_buffers() removes the page's buffers.  It MUST now clear
>   PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
>   uptodate page with no buffer_heads.
>
>   We're screwed: we cannot write the page because we don't know which
>   sections of it contain garbage.  We cannot read the page because we don't
>   know which sections of it contain modified data.  We cannot free the page
>   because it is dirty.

However!! this is not true for mapped pages because mapped pages must
have the whole (16k in akpm's example) page loaded. Hence I suspect that
what Andrei did by accident - remove the if (mapping) case in
test_clear_page_dirty() - is actually totally correct.
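The three-step background quoted above, and the objection about mapped pages, can be walked through with a small state model. This is a user-space C sketch only: `struct demo_page`, `struct demo_buffer` and the `demo_*` functions are hypothetical names, the flags are plain booleans rather than the kernel's page and buffer_head flags, and the `clear_page_dirty` argument merely selects whether the model clears the page's dirty flag after dropping the buffers.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PAGE_SIZE   (64 * 1024)   /* page size used in the example  */
#define DEMO_BLOCK_SIZE  (4 * 1024)    /* block size used in the example */
#define DEMO_NR_BUFFERS  (DEMO_PAGE_SIZE / DEMO_BLOCK_SIZE)   /* 16 */

struct demo_buffer { bool uptodate; bool dirty; };

struct demo_page {
	bool dirty;
	bool uptodate;
	bool has_buffers;
	struct demo_buffer bh[DEMO_NR_BUFFERS];
};

/* Step 1: a one-byte write() into a page that was never read from disk. */
static void demo_write_byte(struct demo_page *page, unsigned int offset)
{
	struct demo_buffer *bh;

	assert(offset < DEMO_PAGE_SIZE);
	bh = &page->bh[offset / DEMO_BLOCK_SIZE];
	page->has_buffers = true;
	bh->uptodate = true;   /* only the touched block becomes valid */
	bh->dirty = true;
	page->dirty = true;    /* page->uptodate deliberately stays false */
}

/* Step 2: the journal writes dirty buffers by hand; the page flag is untouched. */
static void demo_journal_write(struct demo_page *page)
{
	for (int i = 0; i < DEMO_NR_BUFFERS; i++)
		page->bh[i].dirty = false;
}

/* Step 3: drop the buffers if none is dirty, optionally cleaning the page. */
static bool demo_free_buffers(struct demo_page *page, bool clear_page_dirty)
{
	for (int i = 0; i < DEMO_NR_BUFFERS; i++)
		if (page->bh[i].dirty)
			return false;
	page->has_buffers = false;
	if (clear_page_dirty)
		page->dirty = false;
	return true;
}

/* The unresolvable state: dirty, not uptodate, and no buffer_heads left. */
static bool demo_page_is_stuck(const struct demo_page *page)
{
	return page->dirty && !page->uptodate && !page->has_buffers;
}
```

Running the three steps with `clear_page_dirty` set to false leaves the model page stuck; with it set to true the page is clean again. A page that starts out fully uptodate, as a mapped page must be, never reaches the stuck state either way, so in this model clearing its dirty flag can only discard the record of a modification.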