
Re: 2.6.19 file content corruption on ext3

2006-12-29 Thread Dave Jones

On Fri, Dec 29, 2006 at 07:52:15PM +0100, maximilian attems wrote:
 
 > > The only -mm stuff I recall being in the Fedora 2.6.18 is
 > > the inode-diet stuff which ended up in 2.6.19, though the xmas
 > > break has left my head somewhat empty so I may be forgetting something.
 > > What patch in particular are you talking about?
 > 
 > it's no longer visible in the FC6 cvs, due to rebase
 > but its name was linux-2.6-mm-tracking-dirty-pages.patch
 > it is an earlier amalgam of the merged patch series:
 >    - mm: tracking shared dirty pages
 >    - mm: balance dirty pages
 >    - mm: optimize the new mprotect() code a bit
 >    - mm: small cleanup of install_page()
 >    - mm: fixup do_wp_page()
 >    - mm: msync() cleanup (closes: #394392)

Ohh, that. Yes. I had forgotten all about that.
I've been hitting the nog a little too hard :)

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.19 file content corruption on ext3

2006-12-29 Thread maximilian attems

On Fri, Dec 29, 2006 at 10:02:53AM -0500, Dave Jones wrote:
> On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
>  > > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:

>  > >  > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
>  > >  > > (or older)?
>  > >  > 
>  > >  > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
>  > >  > have the page throttling patches in it, those were written this summer. So
>  > >  > it would either have to be Fedora carrying around another patch that just
>  > >  > happens to result in the same corruption for _years_, or it's the same
>  > >  > bug.
>  > > 
>  > > The only notable VM patch in Fedora kernels of that vintage that I recall
>  > > was Ingo's 4g/4g thing.
>  > 
>  > no the fedora 2.6.18 kernel is affected.
> 
> I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.
> 
>  > it carries the same -mm patches that Debian backported
>  > for LSB 3.1 compliance.
> 
> The only -mm stuff I recall being in the Fedora 2.6.18 is
> the inode-diet stuff which ended up in 2.6.19, though the xmas
> break has left my head somewhat empty so I may be forgetting something.
> What patch in particular are you talking about?

it's no longer visible in the FC6 cvs, due to rebase
 but its name was linux-2.6-mm-tracking-dirty-pages.patch
it is an earlier amalgam of the merged patch series:
   - mm: tracking shared dirty pages
   - mm: balance dirty pages
   - mm: optimize the new mprotect() code a bit
   - mm: small cleanup of install_page()
   - mm: fixup do_wp_page()
   - mm: msync() cleanup (closes: #394392)

--
maks

Re: 2.6.19 file content corruption on ext3

2006-12-29 Thread Guillaume Chazarain


Linus Torvalds wrote:
> going back to Linux-2.6.5 at least, according to one tester).

I apologize for the confusion, but it just occurred to me that I was actually
experiencing a totally different problem: I set a root filesystem of 3MiB for
qemu, so the test program just didn't have enough space for its file.

--
Guillaume


Re: 2.6.19 file content corruption on ext3

2006-12-29 Thread Dave Jones

On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
 > > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
 > >  > 
 > >  > 
 > >  > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
 > >  > > > me up), and that seems to show the corruption going way way back (ie going
 > >  > > > back to Linux-2.6.5 at least, according to one tester).
 > >  > > 
 > >  > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
 > >  > > (or older)?
 > >  > 
 > >  > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
 > >  > have the page throttling patches in it, those were written this summer. So
 > >  > it would either have to be Fedora carrying around another patch that just
 > >  > happens to result in the same corruption for _years_, or it's the same
 > >  > bug.
 > > 
 > > The only notable VM patch in Fedora kernels of that vintage that I recall
 > > was Ingo's 4g/4g thing.
 > 
 > no the fedora 2.6.18 kernel is affected.

I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.

 > it carries the same -mm patches that Debian backported
 > for LSB 3.1 compliance.

The only -mm stuff I recall being in the Fedora 2.6.18 is
the inode-diet stuff which ended up in 2.6.19, though the xmas
break has left my head somewhat empty so I may be forgetting something.
What patch in particular are you talking about?

Dave

-- 
http://www.codemonkey.org.uk

Re: 2.6.19 file content corruption on ext3

2006-12-29 Thread maximilian attems

> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
>  > 
>  > 
>  > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
>  > > > me up), and that seems to show the corruption going way way back (ie going
>  > > > back to Linux-2.6.5 at least, according to one tester).
>  > > 
>  > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
>  > > (or older)?
>  > 
>  > Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
>  > have the page throttling patches in it, those were written this summer. So 
>  > it would either have to be Fedora carrying around another patch that just 
>  > happens to result in the same corruption for _years_, or it's the same 
>  > bug.
> 
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.
> 
>   Dave

no the fedora 2.6.18 kernel is affected.
it carries the same -mm patches that Debian backported
for LSB 3.1 compliance.

-- 
maks

ps sorry for stripping cc, only downloaded that message raw.

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Andrew Morton

On Thu, 28 Dec 2006 17:38:38 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> in 
> the hope that somebody else is working on this corruption issue and is 
> interested..

What corruption issue? ;)

I'm finding that the corruption happens trivially with your test app, but
apparently doesn't happen at all with ext2 or ext3, data=writeback.  Maybe
it will happen with increased rarity, but the difference is quite stark.

Removing the

err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
NULL, journal_dirty_data_fn);

from ext3_ordered_writepage() fixes things up.

The things which journal_submit_data_buffers() does after dropping all the
locks are ...  disturbing - I don't think we have sufficient tests in there
to ensure that the buffer is still where we think it is after we retake
locks (they're slippery little buggers).  But that wouldn't explain it
anyway.

It's inefficient that journal_dirty_data() will put these locked, clean
buffers onto BJ_SyncData instead of BJ_Locked, but
journal_submit_data_buffers() seems to dtrt with them.

So no theory yet.  Maybe ext3 is just altering timing.  But the difference
is really large..

Disabling all the WB_SYNC_NONE stuff and making everything go synchronous
everywhere has no effect.  Disabling bdi_write_congested() has no effect.


Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds


Btw, 
 much cleaned-up page tracing patch here, in case anybody cares (and 
"test.c" attached, although I don't think it changed since last time). 

The test.c output is a bit hard to read at times, since it will give 
offsets in bytes as hex (ie "00a77664" means page frame 0a77, and byte 
664h within that page), while the kernel output is obviously the page 
indexes (but the page fault _addresses_ can contain information about the 
exact byte in a page, so you can match them up when some kernel event is 
related to a page fault).

So both forms are necessary/logical, but it means that to match things up, 
you often need to ignore the last three hex digits of the address that 
"test.c" outputs.
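
The matching rule above can be sketched in a few lines of C. This is only an illustration, assuming 4kB pages (a page shift of 12, as on x86); the names split_offset(), print_split() and struct page_pos are made up for this sketch and are not taken from test.c or from the patch below.

```c
#include <assert.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* assumption: 4kB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Hypothetical helper type: a byte offset split into page index + offset. */
struct page_pos {
	unsigned long index;   /* page index within the file */
	unsigned long offset;  /* byte offset within that page */
};

/* Drop the low three hex digits to get the page index; keep them as offset. */
struct page_pos split_offset(unsigned long byte_offset)
{
	struct page_pos pos;

	pos.index = byte_offset >> PAGE_SHIFT;
	pos.offset = byte_offset & (PAGE_SIZE - 1);
	return pos;
}

/* Print one test-program offset in a form comparable to the kernel log. */
void print_split(unsigned long byte_offset)
{
	struct page_pos pos = split_offset(byte_offset);

	printf("%08lx -> page %04lx, byte %03lxh\n",
	       byte_offset, pos.index, pos.offset);
}
```

With these definitions, print_split(0x00a77664) reports page 0a77 and byte 664h, which is the example given in the text.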

This one also adds traces for the tags and the writeback activity, but 
since I'm going out for birthday dinner, I won't have time to try to 
actually analyse the trace I have.. Which is why I'm sending it out, in 
the hope that somebody else is working on this corruption issue and is 
interested..

Linus


diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..f5e132a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -722,6 +722,7 @@ int __set_page_dirty_buffers(struct page *page)
 			set_buffer_dirty(bh);
 			bh = bh->b_this_page;
 		} while (bh != head);
+		PAGE_TRACE(page, "dirtied buffers");
 	}
 	spin_unlock(&mapping->private_lock);
 
@@ -734,6 +735,7 @@ int __set_page_dirty_buffers(struct page *page)
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
+			PAGE_TRACE(page, "setting TAG_DIRTY");
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..0cf3dce 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,14 @@
 #define PG_nosave_free 18  /* Used for system suspend/resume */
 #define PG_buddy   19  /* Page is free, on buddy lists */
 
+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)  test_bit(PG_arch_1, &(page)->flags)
+
+#define PAGE_TRACE(page, msg, arg...) do {				\
+	if (PageInteresting(page))					\
+		printk(KERN_DEBUG "PG %08lx: %s:%d " msg "\n",		\
+			(page)->index, __FILE__, __LINE__ ,##arg );	\
+} while (0)
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -183,32 +191,38 @@ static inline void SetPageUptodate(struct page *page)
 #define PageWriteback(page)test_bit(PG_writeback, &(page)->flags)
 #define SetPageWriteback(page) \
do {\
-   if (!test_and_set_bit(PG_writeback, \
-   &(page)->flags))\
+   if (!test_and_set_bit(PG_writeback, &(page)->flags)) {  \
+   PAGE_TRACE(page, "set writeback");  \
inc_zone_page_state(page, NR_WRITEBACK);\
+   }   \
} while (0)
 #define TestSetPageWriteback(page) \
({  \
int ret;\
ret = test_and_set_bit(PG_writeback,\
&(page)->flags);\
-   if (!ret)   \
+   if (!ret) { \
+   PAGE_TRACE(page, "set writeback");  \
inc_zone_page_state(page, NR_WRITEBACK);\
+   }   \
ret;\
})
 #define ClearPageWriteback(page)   \
do {\
-   if (test_and_clear_bit(PG_writeback,\
-   &(page)->flags))\
+   if (test_and_clear_bit(PG_writeback, &(page)->flags)) { \
+   PAGE_TRACE(page, "end writeback");  \
dec_zone_page_state(page, NR_WRITEBACK);\
+   }   \
} while (0)
 #define TestClearPageWriteback(page)   \
({

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Anton Altaparmakov

On Thu, 28 Dec 2006, Linus Torvalds wrote:
> Ok,
>  with the ugly trace capture patch, I've actually captured this corruption 
> in action, I think.
> 
> I did a full trace of all pages involved in one run, and picked one 
> corruption at random:
> 
>   Chunk 14465 corrupted (0-75)  (01423fb4-01423fff)
>   Expected 129, got 0
>   Written as (5126)9509(15017)
> 
> That's the first 76 bytes of a chunk missing, and it's the last 76 bytes 
> on a page. It's page index 01423 in the mapped file, and bytes fb4-fff 
> within that page.
> 
> There were four chunks written to that page:
> 
>   Writing chunk 14463/15800 (15%) (0142344c) (1)
>   Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 1423)
>   Writing chunk 14464/15800 (32%) (01423a00) (3)
>   Writing chunk 14465/15800 (60%) (01423fb4) (4)  <--- LOST!
> 
> and the other three chunks checked out all right.
> 
> And here's the annotated trace as it concerns that page:
> 
>  - here we write the first chunk to the page:
>   ** (1)  do_no_page: mapping index 1423 at b7d1f44c (write)
>   **  Setting page 1423 dirty
> 
>  - something flushes it out to disk:
>   **  cpd_for_io: index 1423
>   **  cleaning index 1423 at b7d1f000
> 
>  - here we write the second chunk (which was split over the previous page 
>and the interesting one):
>   ** (2)  Setting page 1422 dirty
>   ** (2)  Setting page 1423 dirty
> 
>  - and here we do a cleaning event
>   **  cpd_for_io: index 1423
>   **  cleaning index 1423 at b7d1f000
> 
>  - here we write the third chunk:
>   ** (3)  Setting page 1423 dirty
> 
>  - here we write the fourth chunk:
>   ** (4) NO DIRTY EVENT
> 
>  - and a third flush to disk: 
>   **  cpd_for_io: index 1423
>   **  cleaning index 1423 at b7d1f000
> 
>  - here we unmap and flush:
>   **  Unmapped index 1423 at b7d1f000
>   **  Removing index 1423 from page cache
> 
>  - here we remap to check:
>   **  do_no_page: mapping index 1423 at b7d1f000 (read)
>   **  Unmapped index 1423 at b7d1f000
> 
>  - and finally, here I remove the file after the run:
>   **  Removing index 1423 from page cache
> 
> Now, the important thing to see here is:
> 
>  - the missing write did not have a "Setting page 1423 dirty" event 
>associated with it.
> 
>  - but I can _see_ where the actual dirty event would be happening in the 
>logs, because I can see the dirty events of the other chunk writes 
>around it, so I know exactly where that fourth write happens. And 
>indeed, it _shouldn't_ get a dirty event, because the page is still 
>dirty from the write of chunk #3 to that page, which _did_ get a dirty 
>event.
> 
>I can see that, because the testing app writes the log of the pages it 
>writes, and this is the log around the fourth and final write:
> 
>   ...
> Writing chunk 5338/15800 (60%) (0076eb48)   PFN: 76e/76f
> Writing chunk 960/15800 (60%) (00156300)PFN: 156
> Writing chunk 14465/15800 (60%) (01423fb4)  <
> Writing chunk 8594/15800 (60%) (00bf74a8)   PFN: bf7
> Writing chunk 556/15800 (60%) (000c62f0)PFN: c6
>   Writing chunk 15190/15800 (60%) (01526678)  PFN: 1526
>   ...
> 
>and I can match this up with the full log from the kernel, which looks 
>like this:
> 
> Setting page 076e dirty
> Setting page 076f dirty
> Setting page 0156 dirty
> Setting page 00c6 dirty
>   Setting page 1526 dirty
> 
>so I know exactly where the missing writes (to our page at pfn 1423, 
>and the pfn-bf7 page) happened.
> 
>  - and the thing is, I can see a "cpd_for_io()" happening AFTER that 
>fourth write. Quite a long while after, in fact. So all of this looks 
>very fine indeed. We are not losing any dirty bits.
> 
>  - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses 
>the SAME dirty bit as write 4 did (which didn't make it out to disk!). 
>The event that clears the dirty bit that write 3 did happens AFTER 
>write 4 has happened!
> 
> So if we're not losing any dirty bits, what's going on?
> 
> I think we have some nasty interaction with the buffer heads. In 

But are chunks 3 and 4 in separate buffer heads?  Sorry could not see it 
immediately from the output you showed...

It is just that there may be a different cause rather than buffer dirty 
state...

A shot in the dark I know but it could perhaps be that a "COW for 
MAP_PRIVATE" like event happens when the page is dirty already thus the 
second write never actually makes it to the shared page thus it never gets 
written out.

I am almost certainly totally barking up the wrong tree but I thought it 
may be worth mentioning just in case there was a slip in the COW logic or 
page

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds



On Thu, 28 Dec 2006, Anton Altaparmakov wrote:
> 
> But are chunks 3 and 4 in separate buffer heads?  Sorry could not see it 
> immediately from the output you showed...

No, this is a 4kB filesystem. A single bh per page.

> It is just that there may be a different cause rather than buffer dirty 
> state...

Sure.

> A shot in the dark I know but it could perhaps be that a "COW for 
> MAP_PRIVATE" like event happens when the page is dirty already thus the 
> second write never actually makes it to the shared page thus it never gets 
> written out.

There are no private mappings anywhere, and no forks. Just a single mmap 
(well, we unmap and remap in order to force the page cache to be 
invalidated properly with the posix_fadvise() thing, but that's literally 
the only user).
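
For readers who want to see that pattern in isolation, here is a minimal sketch of it: write through a single MAP_SHARED mapping, unmap, ask the kernel to drop the cached pages with posix_fadvise(), then remap read-only and compare. This is not the test.c from this thread; the function name shared_map_roundtrip(), the fsync() call before the fadvise and the single fill byte are assumptions made for the example.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if the data written through the shared
 * mapping is read back intact, 1 on a mismatch, -1 on a setup error.
 */
int shared_map_roundtrip(const char *path, size_t len, unsigned char fill)
{
	int ret = -1;
	size_t i;
	unsigned char *map;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out;

	/* Dirty the pages through a shared, writable mapping. */
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	memset(map, fill, len);
	if (munmap(map, len) < 0)
		goto out;

	/* Write the dirty pages back, then hint that the cache can be dropped. */
	if (fsync(fd) < 0)
		goto out;
	(void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); /* advisory only */

	/* Remap read-only and verify every byte. */
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	ret = 0;
	for (i = 0; i < len; i++) {
		if (map[i] != fill) {
			ret = 1;
			break;
		}
	}
	munmap(map, len);
out:
	close(fd);
	unlink(path);
	return ret;
}
```

A return value of 1 from such a check would correspond to the kind of lost write being chased here, although a single sequential fill like this is unlikely to trigger it.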

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds

On Thu, 28 Dec 2006, David Miller wrote:
> 
> What happens when we writeback, to the PTEs?

Not a damn thing.

We clear the PTE's _before_ we even start the write. The writeback does 
nothing to them. If the user dirties the page while writeback is in 
progress, we'll take the page fault and re-dirty it _again_.

> page_mkclean_file() iterates the VMAs and when it finds a shared
> one it goes:
> 
>   entry = ptep_clear_flush(vma, address, pte);
>   entry = pte_wrprotect(entry);
>   entry = pte_mkclean(entry);
> 
> and that's fine, but that PTE is still marked writable, and
> I think that's key.

No it's not. It's right there. "pte_wrprotect(entry)". You even copied it 
yourself.

> What does the fault path do in this situation?
> 
>   if (write_access) {
>   if (!pte_write(entry))
>   return do_wp_page(mm, vma, address,
>   pte, pmd, ptl, entry);

So we call "do_wp_page()", and that does everything right.
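
The sequence argued over above can be modelled with a toy state machine. This is a deliberately simplified sketch, not kernel code: struct toy_pte, struct toy_page and the toy_*() functions are invented for illustration, and the real do_wp_page() does far more than flip two flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for a PTE and a page; not the kernel's types. */
struct toy_pte  { bool write; bool dirty; };
struct toy_page { bool dirty; };

/* Models the quoted page_mkclean_file() steps: write-protect, then clean. */
void toy_mkclean(struct toy_pte *pte)
{
	pte->write = false;   /* stands in for pte_wrprotect() */
	pte->dirty = false;   /* stands in for pte_mkclean() */
}

/* Models the start of writeback: PTE and page are cleaned before the IO. */
void toy_clear_dirty_for_io(struct toy_pte *pte, struct toy_page *page)
{
	toy_mkclean(pte);
	page->dirty = false;
}

/* Models a user store; returns true if it took a write-protect fault. */
bool toy_store(struct toy_pte *pte, struct toy_page *page)
{
	bool faulted = !pte->write;

	if (faulted) {
		/* Fault path: the page is re-dirtied and the PTE made writable. */
		page->dirty = true;
		pte->write = true;
	}
	pte->dirty = true;    /* a store always sets the PTE dirty bit */
	return faulted;
}
```

In this model the first store after toy_clear_dirty_for_io() always faults and re-dirties the page, and only later stores go through without a fault, which is the behaviour described in the reply above.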

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread David Miller

From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST)

> So if we're not losing any dirty bits, what's going on?

What happens when we writeback, to the PTEs?

page_mkclean_file() iterates the VMAs and when it finds a shared
one it goes:

entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);

and that's fine, but that PTE is still marked writable, and
I think that's key.

What does the fault path do in this situation?

if (write_access) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address,
pte, pmd, ptl, entry);
entry = pte_mkdirty(entry);
}

It does nothing to update the page dirty state, because it's
writable, it just sets the PTE dirty bit and that's it.  Should
it be setting the page dirty here for SHARED cases?

So until vmscan actually unmaps the PTE completely, we have this
window in which the application can write to the PTE and the
page dirty state doesn't get updated.

Perhaps something later cleans up after this, f.e. by rechecking the
PTE dirty bit at the end of I/O or when vmscan unmaps the page.
I guess that should handle things, but the above logic definitely
stood out to me.

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds


Ok,
 with the ugly trace capture patch, I've actually captured this corruption 
in action, I think.

I did a full trace of all pages involved in one run, and picked one 
corruption at random:

Chunk 14465 corrupted (0-75)  (01423fb4-01423fff)
Expected 129, got 0
Written as (5126)9509(15017)

That's the first 76 bytes of a chunk missing, and it's the last 76 bytes 
on a page. It's page index 01423 in the mapped file, and bytes fb4-fff 
within that page.
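
The 76-byte figure follows directly from the offsets. As a quick check, assuming 4kB pages, the arithmetic can be written out as below; bytes_left_on_page() and crosses_page() are names invented for this sketch.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL   /* assumption: 4kB pages */

/* Bytes from byte_offset up to and including the last byte of its page. */
unsigned long bytes_left_on_page(unsigned long byte_offset)
{
	return PAGE_SIZE - (byte_offset & (PAGE_SIZE - 1));
}

/* Nonzero if a write of len bytes at byte_offset spills into the next page. */
int crosses_page(unsigned long byte_offset, unsigned long len)
{
	return len > bytes_left_on_page(byte_offset);
}
```

For the corrupted chunk, bytes_left_on_page(0x01423fb4) is 0x1000 - 0xfb4 = 0x4c = 76, so the missing bytes 0-75 of the chunk are exactly the part that lands on page 1423.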

There were four chunks written to that page:

Writing chunk 14463/15800 (15%) (0142344c) (1)
Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 1423)
Writing chunk 14464/15800 (32%) (01423a00) (3)
Writing chunk 14465/15800 (60%) (01423fb4) (4)  <--- LOST!

and the other three chunks checked out all right.

And here's the annotated trace as it concerns that page:

 - here we write the first chunk to the page:
** (1)  do_no_page: mapping index 1423 at b7d1f44c (write)
**  Setting page 1423 dirty

 - something flushes it out to disk:
**  cpd_for_io: index 1423
**  cleaning index 1423 at b7d1f000

 - here we write the second chunk (which was split over the previous page 
   and the interesting one):
** (2)  Setting page 1422 dirty
** (2)  Setting page 1423 dirty

 - and here we do a cleaning event
**  cpd_for_io: index 1423
**  cleaning index 1423 at b7d1f000

 - here we write the third chunk:
** (3)  Setting page 1423 dirty

 - here we write the fourth chunk:
** (4) NO DIRTY EVENT

 - and a third flush to disk: 
**  cpd_for_io: index 1423
**  cleaning index 1423 at b7d1f000

 - here we unmap and flush:
**  Unmapped index 1423 at b7d1f000
**  Removing index 1423 from page cache

 - here we remap to check:
**  do_no_page: mapping index 1423 at b7d1f000 (read)
**  Unmapped index 1423 at b7d1f000

 - and finally, here I remove the file after the run:
**  Removing index 1423 from page cache

Now, the important thing to see here is:

 - the missing write did not have a "Setting page 1423 dirty" event 
   associated with it.

 - but I can _see_ where the actual dirty event would be happening in the 
   logs, because I can see the dirty events of the other chunk writes 
   around it, so I know exactly where that fourth write happens. And 
   indeed, it _shouldn't_ get a dirty event, because the page is still 
   dirty from the write of chunk #3 to that page, which _did_ get a dirty 
   event.

   I can see that, because the testing app writes the log of the pages it 
   writes, and this is the log around the fourth and final write:

...
Writing chunk 5338/15800 (60%) (0076eb48)   PFN: 76e/76f
Writing chunk 960/15800 (60%) (00156300)PFN: 156
Writing chunk 14465/15800 (60%) (01423fb4)  <
Writing chunk 8594/15800 (60%) (00bf74a8)   PFN: bf7
Writing chunk 556/15800 (60%) (000c62f0)PFN: c6
Writing chunk 15190/15800 (60%) (01526678)  PFN: 1526
...

   and I can match this up with the full log from the kernel, which looks 
   like this:

Setting page 076e dirty
Setting page 076f dirty
Setting page 0156 dirty
Setting page 00c6 dirty
Setting page 1526 dirty

   so I know exactly where the missing writes (to our page at pfn 1423, 
   and the pfn-bf7 page) happened.

 - and the thing is, I can see a "cpd_for_io()" happening AFTER that 
   fourth write. Quite a long while after, in fact. So all of this looks 
   very fine indeed. We are not losing any dirty bits.

 - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses 
   the SAME dirty bit as write 4 did (which didn't make it out to disk!). 
   The event that clears the dirty bit that write 3 did happens AFTER 
   write 4 has happened!

So if we're not losing any dirty bits, what's going on?

I think we have some nasty interaction with the buffer heads. In 
particular, I don't think it's the dirty page bits that are broken (I 
_see_ that the PageDirty bit was set after write 4 was done to memory in 
the kernel traces). So I think that a real writeback just doesn't happen, 
because somebody has marked the buffer heads clean _after_ it started IO 
on them.

I think "__mpage_writepage()" is buggy in this regard, for example. It 
even has a comment about its crapola behaviour:

/*
 * Must try to add the page before marking the buffer clean or
 * the confused fail path above (OOM) will be very confused when
 * it finds all bh marked clean (i.e. it will not write anything)
 */

however, I don't think that particular thing explains it, because I don't 
think we use that function for the cases I'm looking

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Russell King

On Thu, Dec 28, 2006 at 01:24:30PM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Linus Torvalds wrote:
> > 
> > What we need now is actually looking at the source code, and people who 
> > understand the VM, I'm afraid. I'm gathering traces now that I have a good 
> > test-case. I'll post my trace tools once I've tested that they work, in 
> > case others want to help.
> 
> Ok, I've got the traces, but quite frankly, I doubt anybody is crazy 
> enough to want to trawl through them. It's a bit painful, since we're 
> talking thousands of pages to trigger this problem.
> 
> Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably 
> ARM, but is used for other things on ia64, powerpc and sparc64. But here's 
> the patch in case anybody cares.

PG_arch_1 is used on ARM to flag pages that need a dcache flush prior to
hitting userspace, in the same way that sparc64 uses it.  So ARM systems
should not have this patch applied.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds



On Thu, 28 Dec 2006, Linus Torvalds wrote:
> 
> What we need now is actually looking at the source code, and people who 
> understand the VM, I'm afraid. I'm gathering traces now that I have a good 
> test-case. I'll post my trace tools once I've tested that they work, in 
> case others want to help.

Ok, I've got the traces, but quite frankly, I doubt anybody is crazy 
enough to want to trawl through them. It's a bit painful, since we're 
talking thousands of pages to trigger this problem.

Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably 
ARM, but is used for other things on ia64, powerpc and sparc64. But here's 
the patch in case anybody cares.

It wants a _big_ kernel buffer to capture all the crud into (which is why 
I made the thing accept a bigger log buffer), and quite frankly, I'm not 
at all sure that all the locking is ok (ie I could imagine that the 
dcache-locking thing there in "is_interesting()" could deadlock, what do I 
know..)
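
For a sense of scale on the log buffer change: the buffer is 2^LOG_BUF_SHIFT bytes, so raising the upper bound of the range from 21 to 24 raises the maximum from 2MB to 16MB. A minimal sketch of that arithmetic follows; the helper name log_buf_bytes() is only illustrative and is not part of the patch.

```c
#include <stddef.h>

/*
 * Illustrative helper (not from the patch): size in bytes of a kernel
 * log buffer configured with a given LOG_BUF_SHIFT, i.e. 2^shift.
 */
static inline size_t log_buf_bytes(unsigned int shift)
{
    return (size_t)1 << shift;
}
```

With this, log_buf_bytes(16) is 64KB and log_buf_bytes(17) is 128KB, matching the Kconfig help text, while the old and new maxima come out as log_buf_bytes(21) = 2MB and log_buf_bytes(24) = 16MB.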

But I've captured some real data with this, which I'll describe 
separately.

Linus


diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..967dd80 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,8 @@
 #define PG_nosave_free 18  /* Used for system suspend/resume */
 #define PG_buddy   19  /* Page is free, on buddy lists */
 
+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)  test_bit(PG_arch_1, &(page)->flags)
 
 #if (BITS_PER_LONG > 32)
 /*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL
 
 config LOG_BUF_SHIFT
int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
-   range 12 21
+   range 12 24
default 17 if S390 || LOCKDEP
default 16 if X86_NUMAQ || IA64
default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..d6a0f56 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
 {
struct address_space *mapping = page->mapping;
 
+if (PageInteresting(page)) printk("Removing index %08x from page cache\n", page->index);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
return err;
 }
 
+static noinline int is_interesting(struct address_space *mapping)
+{
+   struct inode *inode = mapping->host;
+   struct dentry *dentry;
+   int retval = 0;
+
+   spin_lock(&dcache_lock);
+   list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+   if (strcmp(dentry->d_name.name, "mapfile"))
+   continue;
+   retval = 1;
+   break;
+   }
+   spin_unlock(&dcache_lock);
+   return retval;
+}
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page:  page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 {
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
+   if (is_interesting(mapping))
+   SetPageInteresting(page);
+
if (error == 0) {
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..14c9815 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
tlb_remove_tlb_entry(tlb, pte, addr);
if (unlikely(!page))
continue;
+if (PageInteresting(page))
+   printk("Unmapped index %08x at %08x\n", page->index, addr);
if (unlikely(details) && details->nonlinear_vma
&& linear_page_index(details->nonlinear_vma,
addr) != page->index)
@@ -1605,6 +1607,7 @@ gotten:
 */
ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
+if (PageInteresting(new_page)) printk("do_wp_page: mapping index %08x at %08lx\n", new_page->index, address);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2252,7 @@ retry:
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+if (PageInteresting(new_page)) printk("do_no_page: mapping index %08x at %08lx (%s)\n", new_page->index, address, write_access ? "write" : "read");
set_pte_at(mm, address, page_table, entry);

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Arjan van de Ven

On Thu, 2006-12-28 at 14:39 -0500, Dave Jones wrote:
> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
>  > 
>  > 
>  > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
>  > > > me up), and that seems to show the corruption going way way back (ie going 
>  > > > back to Linux-2.6.5 at least, according to one tester).
>  > > 
>  > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
>  > > (or older)?
>  > 
>  > Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
>  > have the page throttling patches in it, those were written this summer. So 
>  > it would either have to be Fedora carrying around another patch that just 
>  > happens to result in the same corruption for _years_, or it's the same 
>  > bug.
> 
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.

which does tlb flushes *all the time* so that even rules out (well
almost) a stale tlb somewhere...



Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Dave Jones

On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
 > 
 > 
 > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
 > > > me up), and that seems to show the corruption going way way back (ie going 
 > > > back to Linux-2.6.5 at least, according to one tester).
 > > 
 > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
 > > (or older)?
 > 
 > Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
 > have the page throttling patches in it, those were written this summer. So 
 > it would either have to be Fedora carrying around another patch that just 
 > happens to result in the same corruption for _years_, or it's the same 
 > bug.

The only notable VM patch in Fedora kernels of that vintage that I recall
was Ingo's 4g/4g thing.

Dave

-- 
http://www.codemonkey.org.uk

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds

On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > me up), and that seems to show the corruption going way way back (ie going 
> > back to Linux-2.6.5 at least, according to one tester).
> 
> That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> (or older)?

Well, that was a really _old_ fedora kernel. I guarantee you it didn't 
have the page throttling patches in it, those were written this summer. So 
it would either have to be Fedora carrying around another patch that just 
happens to result in the same corruption for _years_, or it's the same 
bug.

I bet it's the same bug, and it's been around for ages.

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Petri Kaukasoina

On Thu, Dec 28, 2006 at 11:00:46AM -0800, Linus Torvalds wrote:
> And I have a test-program that shows the corruption _much_ easier (at 
> least according to my own testing, and that of several reporters that back 
> me up), and that seems to show the corruption going way way back (ie going 
> back to Linux-2.6.5 at least, according to one tester).

That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
(or older)?

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds

On Thu, 28 Dec 2006, Marc Haber wrote:
> 
> After being up for ten days, I have now encountered the file
> corruption of pkgcache.bin for the first time again. The 256 MB i386
> box is like 26M in swap, is under very moderate load.
> 
> I am running plain vanilla 2.6.19.1. Is there a patch that I should
> apply against 2.6.19.1 that would help in debugging?

Not right now. 

And I have a test-program that shows the corruption _much_ easier (at 
least according to my own testing, and that of several reporters that back 
me up), and that seems to show the corruption going way way back (ie going 
back to Linux-2.6.5 at least, according to one tester).

So it just got a lot _easier_ to trigger in 2.6.19, but it's not a new 
bug.

What we need now is actually looking at the source code, and people who 
understand the VM, I'm afraid. I'm gathering traces now that I have a good 
test-case. I'll post my trace tools once I've tested that they work, in 
case others want to help.

(And hey, you don't have to be a VM expert to help: this could be a 
learning experience. However, I'll warn you: this is _the_ most grotty 
part of the whole kernel. It's not even ugly, it's just damn hard and 
complex).

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Marc Haber

On Tue, Dec 19, 2006 at 09:51:49AM +0100, Marc Haber wrote:
> On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote:
> > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > blocksize ext3, mainline.  Zero failures.  It's unlikely that this testing
> > would pass, yet people running normal workloads are able to easily trigger
> > failures.  I suspect we're looking in the wrong place.
> 
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.
> 
> I have tidied my inbox in the mean time and mutt's memory requirement
> has been reduced to somewhat 30 MB, which might be the cause that I
> don't see the issue that often any more.

After being up for ten days, I have now encountered the file
corruption of pkgcache.bin for the first time again. The 256 MB i386
box is like 26M in swap, is under very moderate load.

I am running plain vanilla 2.6.19.1. Is there a patch that I should
apply against 2.6.19.1 that would help in debugging?

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds


Btw, 
 much cleaned-up page tracing patch here, in case anybody cares (and 
test.c attached, although I don't think it changed since last time). 

The test.c output is a bit hard to read at times, since it will give 
offsets in bytes as hex (ie 00a77664 means page frame 0a77, and byte 
664h within that page), while the kernel output is obviously the page 
indexes (but the page fault _addresses_ can contain information about the 
exact byte in a page, so you can match them up when some kernel event is 
related to a page fault).

So both forms are necessary/logical, but it means that to match things up, 
you often need to ignore the last three hex digits of the address that 
test.c outputs.
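
To make the matching rule concrete, here is a minimal sketch of the split, assuming 4 KiB pages (so the low 12 bits, i.e. the last three hex digits, are the byte within the page). The helper names are hypothetical and do not come from test.c or the kernel patch.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, as on the x86 boxes discussed in this thread. */
#define TRACE_PAGE_SHIFT 12u
#define TRACE_PAGE_MASK  ((1u << TRACE_PAGE_SHIFT) - 1u)

/* Page index as the kernel trace prints it: drop the low 12 bits. */
static inline uint32_t offset_to_page_index(uint32_t byte_offset)
{
    return byte_offset >> TRACE_PAGE_SHIFT;
}

/* Byte position inside that page: keep only the low 12 bits. */
static inline uint32_t offset_within_page(uint32_t byte_offset)
{
    return byte_offset & TRACE_PAGE_MASK;
}
```

For the example above, offset_to_page_index(0x00a77664) gives 0xa77 and offset_within_page(0x00a77664) gives 0x664.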

This one also adds traces for the tags and the writeback activity, but 
since I'm going out for birthday dinner, I won't have time to try to 
actually analyse the trace I have.. Which is why I'm sending it out, in 
the hope that somebody else is working on this corruption issue and is 
interested..

Linus


diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..f5e132a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -722,6 +722,7 @@ int __set_page_dirty_buffers(struct page *page)
set_buffer_dirty(bh);
bh = bh->b_this_page;
} while (bh != head);
+   PAGE_TRACE(page, "dirtied buffers");
}
spin_unlock(&mapping->private_lock);
 
@@ -734,6 +735,7 @@ int __set_page_dirty_buffers(struct page *page)
__inc_zone_page_state(page, NR_FILE_DIRTY);
task_io_account_write(PAGE_CACHE_SIZE);
}
+   PAGE_TRACE(page, "setting TAG_DIRTY");
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..0cf3dce 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,14 @@
 #define PG_nosave_free 18  /* Used for system suspend/resume */
 #define PG_buddy   19  /* Page is free, on buddy lists */
 
+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page)  test_bit(PG_arch_1, &(page)->flags)
+
+#define PAGE_TRACE(page, msg, arg...) do { \
+   if (PageInteresting(page))  \
+   printk(KERN_DEBUG "PG %08lx: %s:%d " msg "\n",  \
+   (page)->index, __FILE__, __LINE__ ,##arg ); \
+} while (0)
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -183,32 +191,38 @@ static inline void SetPageUptodate(struct page *page)
 #define PageWriteback(page) test_bit(PG_writeback, &(page)->flags)
 #define SetPageWriteback(page) \
do {\
-   if (!test_and_set_bit(PG_writeback, \
-   &(page)->flags))\
+   if (!test_and_set_bit(PG_writeback, &(page)->flags)) {  \
+   PAGE_TRACE(page, "set writeback");  \
inc_zone_page_state(page, NR_WRITEBACK);\
+   }   \
} while (0)
 #define TestSetPageWriteback(page) \
({  \
int ret;\
ret = test_and_set_bit(PG_writeback,\
&(page)->flags);\
-   if (!ret)   \
+   if (!ret) { \
+   PAGE_TRACE(page, "set writeback");  \
inc_zone_page_state(page, NR_WRITEBACK);\
+   }   \
ret;\
})
 #define ClearPageWriteback(page)   \
do {\
-   if (test_and_clear_bit(PG_writeback,\
-   &(page)->flags))\
+   if (test_and_clear_bit(PG_writeback, &(page)->flags)) { \
+   PAGE_TRACE(page, "end writeback");  \
dec_zone_page_state(page, NR_WRITEBACK);\
+   }   \
} while (0)
 #define TestClearPageWriteback(page)   \
({

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Andrew Morton

On Thu, 28 Dec 2006 17:38:38 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> in 
> the hope that somebody else is working on this corruption issue and is 
> interested..

What corruption issue? ;)


I'm finding that the corruption happens trivially with your test app, but
apparently doesn't happen at all with ext2 or ext3, data=writeback.  Maybe
it will happen with increased rarity, but the difference is quite stark.

Removing the

err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
NULL, journal_dirty_data_fn);

from ext3_ordered_writepage() fixes things up.

The things which journal_submit_data_buffers() does after dropping all the
locks are ...  disturbing - I don't think we have sufficient tests in there
to ensure that the buffer is still where we think it is after we retake
locks (they're slippery little buggers).  But that wouldn't explain it
anyway.

It's inefficient that journal_dirty_data() will put these locked, clean
buffers onto BJ_SyncData instead of BJ_Locked, but
journal_submit_data_buffers() seems to dtrt with them.

So no theory yet.  Maybe ext3 is just altering timing.  But the difference
is really large..



Disabling all the WB_SYNC_NONE stuff and making everything go synchronous
everywhere has no effect.  Disabling bdi_write_congested() has no effect.




Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds


Ok,
 with the ugly trace capture patch, I've actually captured this corruption 
in action, I think.

I did a full trace of all pages involved in one run, and picked one 
corruption at random:

Chunk 14465 corrupted (0-75)  (01423fb4-01423fff)
Expected 129, got 0
Written as (5126)9509(15017)

That's the first 76 bytes of a chunk missing, and it's the last 76 bytes 
on a page. It's page index 01423 in the mapped file, and bytes fb4-fff 
within that file.
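
The "last 76 bytes on a page" figure can be cross-checked from the chunk's starting offset alone. A minimal sketch, again assuming 4 KiB pages; the helper name is hypothetical and is not taken from test.c.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages. */
#define TRACE_PAGE_SIZE 4096u

/*
 * Number of bytes from byte_offset up to and including the last byte
 * of the page that contains it.
 */
static inline uint32_t bytes_left_on_page(uint32_t byte_offset)
{
    return TRACE_PAGE_SIZE - (byte_offset & (TRACE_PAGE_SIZE - 1u));
}
```

For the chunk starting at 01423fb4 this gives 0x1000 - 0xfb4 = 0x4c = 76 bytes, so the part of the chunk that falls on page 1423 ends at offset 01423fff, which is exactly the corrupted range reported above.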

There were four chunks written to that page:

Writing chunk 14463/15800 (15%) (0142344c) (1)
Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 1423)
Writing chunk 14464/15800 (32%) (01423a00) (3)
Writing chunk 14465/15800 (60%) (01423fb4) (4)  <--- LOST!

and the other three chunks checked out all right.

And here's the annotated trace as it concerns that page:

 - here we write the first chunk to the page:
** (1)  do_no_page: mapping index 1423 at b7d1f44c (write)
**  Setting page 1423 dirty

 - something flushes it out to disk:
**  cpd_for_io: index 1423
**  cleaning index 1423 at b7d1f000

 - here we write the second chunk (which was split over the previous page 
   and the interesting one):
** (2)  Setting page 1422 dirty
** (2)  Setting page 1423 dirty

 - and here we do a cleaning event
**  cpd_for_io: index 1423
**  cleaning index 1423 at b7d1f000

 - here we write the third chunk:
** (3)  Setting page 1423 dirty

 - here we write the fourth chunk:
** (4) NO DIRTY EVENT

 - and a third flush to disk: 
**  cpd_for_io: index 1423
**  cleaning index 1423 at b7d1f000

 - here we unmap and flush:
**  Unmapped index 1423 at b7d1f000
**  Removing index 1423 from page cache

 - here we remap to check:
**  do_no_page: mapping index 1423 at b7d1f000 (read)
**  Unmapped index 1423 at b7d1f000

 - and finally, here I remove the file after the run:
**  Removing index 1423 from page cache

Now, the important thing to see here is:

 - the missing write did not have a Setting page 1423 dirty event 
   associated with it.

 - but I can _see_ where the actual dirty event would be happening in the 
   logs, because I can see the dirty events of the other chunk writes 
   around it, so I know exactly where that fourth write happens. And 
   indeed, it _shouldn't_ get a dirty event, because the page is still 
   dirty from the write of chunk #3 to that page, which _did_ get a dirty 
   event.

   I can see that, because the testing app writes the log of the pages it 
   writes, and this is the log around the fourth and final write:

...
Writing chunk 5338/15800 (60%) (0076eb48)   PFN: 76e/76f
Writing chunk 960/15800 (60%) (00156300)PFN: 156
Writing chunk 14465/15800 (60%) (01423fb4)  
Writing chunk 8594/15800 (60%) (00bf74a8)   PFN: bf7
Writing chunk 556/15800 (60%) (000c62f0)PFN: c6
Writing chunk 15190/15800 (60%) (01526678)  PFN: 1526
...

   and I can match this up with the full log from the kernel, which looks 
   like this:

Setting page 076e dirty
Setting page 076f dirty
Setting page 0156 dirty
Setting page 00c6 dirty
Setting page 1526 dirty

   so I know exactly where the missing writes (to our page at pfn 1423, 
   and the pfn-bf7 page) happened.

 - and the thing is, I can see a cpd_for_io() happening AFTER that 
   fourth write. Quite a long while after, in fact. So all of this looks 
   very fine indeed. We are not losing any dirty bits.

 - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses 
   the SAME dirty bit as write 4 did (which didn't make it out to disk!). 
   The event that clears the dirty bit that write 3 did happens AFTER 
   write 4 has happened!

So if we're not losing any dirty bits, what's going on?

I think we have some nasty interaction with the buffer heads. In 
particular, I don't think it's the dirty page bits that are broken (I 
_see_ that the PageDirty bit was set after write 4 was done to memory in 
the kernel traces). So I think that a real writeback just doesn't happen, 
because somebody has marked the buffer heads clean _after_ it started IO 
on them.

I think __mpage_writepage() is buggy in this regard, for example. It 
even has a comment about its crapola behaviour:

/*
 * Must try to add the page before marking the buffer clean or
 * the confused fail path above (OOM) will be very confused when
 * it finds all bh marked clean (i.e. it will not write anything)
 */

however, I don't think that particular thing explains it, because I don't 
think we use that function for the cases I'm looking at.
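
To make the suspicion about buffer heads concrete, here is a toy user-space model of the ordering problem described above: a buffer marked clean after IO has already been started on it. It is a sketch under stated assumptions, not kernel code and not a claim about where the real bug is. struct toy_page and all toy_* functions are invented for the illustration, and the page has a single buffer, as on a 4kB filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_BYTES 4

/* Hypothetical stand-in for one page with a single buffer head. */
struct toy_page {
	char mem[TOY_PAGE_BYTES];      /* page cache contents */
	char inflight[TOY_PAGE_BYTES]; /* data captured when IO was started */
	char disk[TOY_PAGE_BYTES];     /* what has reached the disk */
	bool page_dirty;
	bool bh_dirty;
};

/* A store to the page dirties both the page and its buffer. */
static void toy_store(struct toy_page *p, int off, char c)
{
	p->mem[off] = c;
	p->page_dirty = true;
	p->bh_dirty = true;
}

/* Start writeback; IO is only issued for a buffer that is still dirty. */
static bool toy_start_io(struct toy_page *p, bool clear_bh_now)
{
	if (!p->page_dirty)
		return false;
	p->page_dirty = false;
	if (!p->bh_dirty)
		return false;           /* page looked dirty, buffer did not */
	if (clear_bh_now)
		p->bh_dirty = false;    /* clean the buffer before the IO */
	memcpy(p->inflight, p->mem, TOY_PAGE_BYTES);
	return true;
}

/* Finish writeback; optionally clean the buffer late (the suspect order). */
static void toy_end_io(struct toy_page *p, bool clear_bh_late)
{
	memcpy(p->disk, p->inflight, TOY_PAGE_BYTES);
	if (clear_bh_late)
		p->bh_dirty = false;
}

/* Returns true if a store made while IO was in flight never hits the disk. */
static bool toy_lost_write(bool late_clear)
{
	struct toy_page p;

	memset(&p, 0, sizeof(p));
	toy_store(&p, 0, '3');
	toy_start_io(&p, !late_clear);
	toy_store(&p, 1, '4');          /* lands while the IO is in flight */
	toy_end_io(&p, late_clear);

	/* A later writeback pass gets one more chance to write the page. */
	if (toy_start_io(&p, !late_clear))
		toy_end_io(&p, late_clear);
	return p.disk[1] != '4';
}
```

In this model toy_lost_write(true) reports a lost write, because the late clean wipes out the dirty state set by the second store, while toy_lost_write(false) does not. Whether anything in the real writeback path behaves like the late-clear case is exactly the open question.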

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread David Miller

From: Linus Torvalds [EMAIL PROTECTED]
Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST)

> So if we're not losing any dirty bits, what's going on?

What happens when we writeback, to the PTEs?

page_mkclean_file() iterates the VMAs and when it finds a shared
one it goes:

	entry = ptep_clear_flush(vma, address, pte);
	entry = pte_wrprotect(entry);
	entry = pte_mkclean(entry);

and that's fine, but that PTE is still marked writable, and
I think that's key.

What does the fault path do in this situation?

	if (write_access) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}

It does nothing to update the page dirty state, because it's
writable, it just sets the PTE dirty bit and that's it.  Should
it be setting the page dirty here for SHARED cases?

So until vmscan actually unmaps the PTE completely, we have this
window in which the application can write to the PTE and the
page dirty state doesn't get updated.

Perhaps something later cleans up after this, f.e. by rechecking the
PTE dirty bit at the end of I/O or when vmscan unmaps the page.
I guess that should handle things, but the above logic definitely
stood out to me.

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds



On Thu, 28 Dec 2006, David Miller wrote:
 
> What happens when we writeback, to the PTEs?

Not a damn thing.

We clear the PTE's _before_ we even start the write. The writeback does 
nothing to them. If the user dirties the page while writeback is in 
progress, we'll take the page fault and re-dirty it _again_.

> page_mkclean_file() iterates the VMAs and when it finds a shared
> one it goes:
>
> 	entry = ptep_clear_flush(vma, address, pte);
> 	entry = pte_wrprotect(entry);
> 	entry = pte_mkclean(entry);
>
> and that's fine, but that PTE is still marked writable, and
> I think that's key.

No it's not. It's right there. pte_wrprotect(entry). You even copied it 
yourself.

> What does the fault path do in this situation?
>
> 	if (write_access) {
> 		if (!pte_write(entry))
> 			return do_wp_page(mm, vma, address,
> 					pte, pmd, ptl, entry);

So we call do_wp_page(), and that does everything right.
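
The point can be shown with a toy model of the two quoted snippets. This is a sketch only: the bit layout is invented (real PTE formats are per-architecture) and the toy_* names are hypothetical stand-ins for the helpers named in the quoted code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit layout, for illustration only. */
#define TOY_PTE_WRITE 0x1u
#define TOY_PTE_DIRTY 0x2u

typedef unsigned int toy_pte_t;

static toy_pte_t toy_pte_wrprotect(toy_pte_t e) { return e & ~TOY_PTE_WRITE; }
static toy_pte_t toy_pte_mkclean(toy_pte_t e)   { return e & ~TOY_PTE_DIRTY; }
static bool toy_pte_write(toy_pte_t e)          { return (e & TOY_PTE_WRITE) != 0; }

/* Mirrors the quoted page_mkclean_file() lines: write-protect, then clean. */
static toy_pte_t toy_mkclean(toy_pte_t e)
{
	e = toy_pte_wrprotect(e);
	e = toy_pte_mkclean(e);
	return e;
}

enum toy_fault { TOY_WP_FAULT, TOY_JUST_DIRTY };

/* Mirrors the quoted fault-path test for a write access. */
static enum toy_fault toy_write_fault(toy_pte_t e)
{
	if (!toy_pte_write(e))
		return TOY_WP_FAULT;    /* the do_wp_page() branch */
	return TOY_JUST_DIRTY;          /* the pte_mkdirty() branch */
}
```

A PTE that starts out writable and dirty comes out of toy_mkclean() with both bits cleared, so the next write takes the TOY_WP_FAULT branch instead of silently setting the dirty bit.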

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Linus Torvalds



On Thu, 28 Dec 2006, Anton Altaparmakov wrote:
 
> But are chunks 3 and 4 in separate buffer heads?  Sorry could not see it 
> immediately from the output you showed...

No, this is a 4kB filesystem. A single bh per page.

> It is just that there may be a different cause rather than buffer dirty 
> state...

Sure.

> A shot in the dark I know but it could perhaps be that a COW for 
> MAP_PRIVATE like event happens when the page is dirty already thus the 
> second write never actually makes it to the shared page thus it never gets 
> written out.

There are no private mappings anywhere, and no forks. Just a single mmap 
(well, we unmap and remap in order to force the page cache to be 
invalidated properly with the posix_fadvise() thing, but that's literally 
the only user).
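
For readers who have not seen the unmap/remap trick, here is a minimal sketch of what such a step could look like from user space. It is not the actual test program: remap_roundtrip() is a made-up name, the scratch file under /tmp and the 4kB length are assumptions, and posix_fadvise() with POSIX_FADV_DONTNEED is only a hint to the kernel.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Writes one byte through a shared mapping, unmaps, asks the kernel to
 * drop cached pages, then remaps read-only and returns the byte that is
 * read back (or -1 on any error).
 */
static int remap_roundtrip(char value)
{
	char path[] = "/tmp/remap-sketch-XXXXXX"; /* hypothetical scratch file */
	const size_t len = 4096;
	int result = -1;
	char *map;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);                   /* file goes away when fd is closed */
	if (ftruncate(fd, (off_t)len) != 0)
		goto out;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	map[100] = value;
	munmap(map, len);

	/* Advisory only: the kernel may keep pages, e.g. while dirty. */
	(void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;
	result = (unsigned char)map[100];
	munmap(map, len);
out:
	close(fd);
	return result;
}
```

On a kernel without the corruption, remap_roundtrip('x') should hand back 'x' whether or not the cached page was really dropped in between.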

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-28 Thread Anton Altaparmakov

On Thu, 28 Dec 2006, Linus Torvalds wrote:
> Ok,
>  with the ugly trace capture patch, I've actually captured this corruption 
> in action, I think.
>
> I did a full trace of all pages involved in one run, and picked one 
> corruption at random:
>
>   Chunk 14465 corrupted (0-75)  (01423fb4-01423fff)
>   Expected 129, got 0
>   Written as (5126)9509(15017)
>
> That's the first 76 bytes of a chunk missing, and it's the last 76 bytes 
> on a page. It's page index 01423 in the mapped file, and bytes fb4-fff 
> within that page.
>
> There were four chunks written to that page:
>
>   Writing chunk 14463/15800 (15%) (0142344c) (1)
>   Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 1423)
>   Writing chunk 14464/15800 (32%) (01423a00) (3)
>   Writing chunk 14465/15800 (60%) (01423fb4) (4)  --- LOST!
>
> and the other three chunks checked out all right.
>
> And here's the annotated trace as it concerns that page:
>
>  - here we write the first chunk to the page:
>   ** (1)  do_no_page: mapping index 1423 at b7d1f44c (write)
>   **  Setting page 1423 dirty
>
>  - something flushes it out to disk:
>   **  cpd_for_io: index 1423
>   **  cleaning index 1423 at b7d1f000
>
>  - here we write the second chunk (which was split over the previous page 
>    and the interesting one):
>   ** (2)  Setting page 1422 dirty
>   ** (2)  Setting page 1423 dirty
>
>  - and here we do a cleaning event
>   **  cpd_for_io: index 1423
>   **  cleaning index 1423 at b7d1f000
>
>  - here we write the third chunk:
>   ** (3)  Setting page 1423 dirty
>
>  - here we write the fourth chunk:
>   ** (4) NO DIRTY EVENT
>
>  - and a third flush to disk: 
>   **  cpd_for_io: index 1423
>   **  cleaning index 1423 at b7d1f000
>
>  - here we unmap and flush:
>   **  Unmapped index 1423 at b7d1f000
>   **  Removing index 1423 from page cache
>
>  - here we remap to check:
>   **  do_no_page: mapping index 1423 at b7d1f000 (read)
>   **  Unmapped index 1423 at b7d1f000
>
>  - and finally, here I remove the file after the run:
>   **  Removing index 1423 from page cache
>
> Now, the important thing to see here is:
>
>  - the missing write did not have a Setting page 1423 dirty event 
>    associated with it.
>
>  - but I can _see_ where the actual dirty event would be happening in the 
>    logs, because I can see the dirty events of the other chunk writes 
>    around it, so I know exactly where that fourth write happens. And 
>    indeed, it _shouldn't_ get a dirty event, because the page is still 
>    dirty from the write of chunk #3 to that page, which _did_ get a dirty 
>    event.
>
>    I can see that, because the testing app writes the log of the pages it 
>    writes, and this is the log around the fourth and final write:
>
>   ...
>   Writing chunk 5338/15800 (60%) (0076eb48)   PFN: 76e/76f
>   Writing chunk 960/15800 (60%) (00156300)PFN: 156
>   Writing chunk 14465/15800 (60%) (01423fb4)  
>   Writing chunk 8594/15800 (60%) (00bf74a8)   PFN: bf7
>   Writing chunk 556/15800 (60%) (000c62f0)PFN: c6
>   Writing chunk 15190/15800 (60%) (01526678)  PFN: 1526
>   ...
>
>    and I can match this up with the full log from the kernel, which looks 
>    like this:
>
>   Setting page 076e dirty
>   Setting page 076f dirty
>   Setting page 0156 dirty
>   Setting page 00c6 dirty
>   Setting page 1526 dirty
>
>    so I know exactly where the missing writes (to our page at pfn 1423, 
>    and the pfn-bf7 page) happened.
>
>  - and the thing is, I can see a cpd_for_io() happening AFTER that 
>    fourth write. Quite a long while after, in fact. So all of this looks 
>    very fine indeed. We are not losing any dirty bits.
>
>  - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses 
>    the SAME dirty bit as write 4 did (which didn't make it out to disk!). 
>    The event that clears the dirty bit that write 3 did happens AFTER 
>    write 4 has happened!
>
> So if we're not losing any dirty bits, what's going on?
>
> I think we have some nasty interaction with the buffer heads. In 

But are chunks 3 and 4 in separate buffer heads?  Sorry could not see it 
immediately from the output you showed...

It is just that there may be a different cause rather than buffer dirty 
state...

A shot in the dark I know but it could perhaps be that a COW for 
MAP_PRIVATE like event happens when the page is dirty already thus the 
second write never actually makes it to the shared page thus it never gets 
written out.

I am almost certainly totally barking up the wrong tree but I thought it 
may be worth mentioning just in case there was a slip in the COW logic or 
page writable state maintenance somewhere...

Best regards,

Anton

> particular, I don't think it's the dirty page

Re: 2.6.19 file content corruption on ext3

2006-12-22 Thread Linus Torvalds

On Mon, 18 Dec 2006, Gene Heskett wrote:
>
> What about the mm/rmap.c one liner, in or out?

The one that just removes the "pte_mkclean()"? That's definitely out, it 
was just a test-patch to verify that the pte dirty bits seemed to matter 
at all (and they do).

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-22 Thread Marc Haber

On Sat, Dec 16, 2006 at 06:43:10PM +, Martin Michlmayr wrote:
> * Marc Haber <[EMAIL PROTECTED]> [2006-12-09 10:26]:
> > Unfortunately, I am lacking the knowledge needed to do this in an
> > informed way. I am neither familiar enough with git nor do I possess
> > the necessary C powers.
> 
> I wonder if what you're seeing is related to
> http://lkml.org/lkml/2006/12/16/73
> 
> You said that you don't see any corruption with 2.6.18.  Can you try
> to apply the patch from
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> to 2.6.18 to see if the corruption shows up?

Since I am no longer seeing the issue after easing the memory load, I
doubt that this would make sense.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

Re: 2.6.19 file content corruption on ext3

2006-12-22 Thread Marc Haber

On Fri, Dec 22, 2006 at 08:30:06AM -0500, Daniel Drake wrote:
> Marc Haber wrote:
> >After updating to 2.6.19, Debian's apt control file
> >/var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> >six hours. In that situation, "aptitude update" segfaults. When I
> >delete the file and have apt recreate it, things are fine again for a
> >few hours before the file is broken again and the segfaults start over.
> >In all cases, umounting the file system and doing an fsck does not
> >show issues with the file system.
> 
> Are you using wireless networking of any kind?

Since the system in question is a colocated server box, I am pretty
sure that there is no wireless networking.

>  Might be useful if you could post 'dmesg' output so that people can
>  see the other hardware that you have.

I have attached what I could scrape from syslog.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835
Dec 18 15:45:01 torres syslogd 1.4.1#17: restart.
Dec 18 15:45:01 torres kernel: klogd 1.4.1#17, log source = /proc/kmsg started.
Dec 18 15:45:01 torres kernel: Inspecting /boot/System.map-2.6.19.1-zgsrv
Dec 18 15:45:01 torres kernel: Loaded 26500 symbols from 
/boot/System.map-2.6.19.1-zgsrv.
Dec 18 15:45:01 torres kernel: Symbols match kernel version 2.6.19.
Dec 18 15:45:01 torres kernel: No module symbols loaded - kernel modules not 
enabled. 
Dec 18 15:45:01 torres kernel: Linux version 2.6.19.1-zgsrv ([EMAIL PROTECTED]) 
(gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 Sun Dec 17 
12:44:56 UTC 2006
Dec 18 15:45:01 torres kernel: BIOS-provided physical RAM map:
Dec 18 15:45:01 torres kernel:  BIOS-e820:  - 000a 
(usable)
Dec 18 15:45:01 torres kernel:  BIOS-e820: 000f - 0010 
(reserved)
Dec 18 15:45:01 torres kernel:  BIOS-e820: 0010 - 0f7f 
(usable)
Dec 18 15:45:01 torres kernel:  BIOS-e820: 0f7f - 0f7f3000 
(ACPI NVS)
Dec 18 15:45:01 torres kernel:  BIOS-e820: 0f7f3000 - 0f80 
(ACPI data)
Dec 18 15:45:01 torres kernel:  BIOS-e820:  - 0001 
(reserved)
Dec 18 15:45:01 torres kernel: 0MB HIGHMEM available.
Dec 18 15:45:01 torres kernel: 247MB LOWMEM available.
Dec 18 15:45:01 torres kernel: Entering add_active_range(0, 0, 63472) 0 entries 
of 256 used
Dec 18 15:45:01 torres kernel: Zone PFN ranges:
Dec 18 15:45:01 torres kernel:   DMA 0 -> 4096
Dec 18 15:45:01 torres kernel:   Normal   4096 ->63472
Dec 18 15:45:01 torres kernel:   HighMem 63472 ->63472
Dec 18 15:45:01 torres kernel: early_node_map[1] active PFN ranges
Dec 18 15:45:01 torres kernel: 0:0 ->63472
Dec 18 15:45:01 torres kernel: On node 0 totalpages: 63472
Dec 18 15:45:01 torres kernel:   DMA zone: 32 pages used for memmap
Dec 18 15:45:01 torres kernel:   DMA zone: 0 pages reserved
Dec 18 15:45:01 torres kernel:   DMA zone: 4064 pages, LIFO batch:0
Dec 18 15:45:01 torres kernel:   Normal zone: 463 pages used for memmap
Dec 18 15:45:01 torres kernel:   Normal zone: 58913 pages, LIFO batch:15
Dec 18 15:45:01 torres kernel:   HighMem zone: 0 pages used for memmap
Dec 18 15:45:01 torres kernel: DMI 2.2 present.
Dec 18 15:45:01 torres kernel: ACPI: RSDP (v000 VIA694  
  ) @ 0x000f8050
Dec 18 15:45:01 torres kernel: ACPI: RSDT (v001 VIA694 MSI ACPI 0x42302e31 AWRD 
0x) @ 0x0f7f3000
Dec 18 15:45:01 torres kernel: ACPI: FADT (v001 VIA694 MSI ACPI 0x42302e31 AWRD 
0x) @ 0x0f7f3040
Dec 18 15:45:01 torres kernel: ACPI: DSDT (v001 VIA694 AWRDACPI 0x1000 MSFT 
0x010c) @ 0x
Dec 18 15:45:01 torres kernel: ACPI: PM-Timer IO Port: 0x4008
Dec 18 15:45:01 torres kernel: Allocating PCI resources starting at 1000 
(gap: 0f80:f07f)
Dec 18 15:45:01 torres kernel: Detected 1466.361 MHz processor.
Dec 18 15:45:01 torres kernel: Built 1 zonelists.  Total pages: 62977
Dec 18 15:45:01 torres kernel: Kernel command line: root=/dev/hda1 ro 
vga=normal 
Dec 18 15:45:01 torres kernel: Enabling fast FPU save and restore... done.
Dec 18 15:45:01 torres kernel: Enabling unmasked SIMD FPU exception support... 
done.
Dec 18 15:45:01 torres kernel: Initializing CPU#0
Dec 18 15:45:01 torres kernel: PID hash table entries: 1024 (order: 10, 4096 
bytes)
Dec 18 15:45:01 torres kernel: Console: colour VGA+ 80x25
Dec 18 15:45:01 torres kernel: Dentry cache hash table entries: 32768 (order: 
5, 131072 bytes)
Dec 18 15:45:01 torres kernel: Inode-cache hash table entries: 16384 (order: 4, 
65536 bytes)
Dec 18 15:45:01 torres kernel: Memory: 246964k/253888k available (2896k kernel 
code, 6368k reserved, 859k data, 204k init, 0k highmem)
Dec 18

Re: 2.6.19 file content corruption on ext3

2006-12-22 Thread Daniel Drake


Marc Haber wrote:

After updating to 2.6.19, Debian's apt control file
/var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
six hours. In that situation, "aptitude update" segfaults. When I
delete the file and have apt recreate it, things are fine again for a
few hours before the file is broken again and the segfaults start over.
In all cases, umounting the file system and doing an fsck does not
show issues with the file system.


Are you using wireless networking of any kind? If so which driver and 
security key system? Might be useful if you could post 'dmesg' output so 
that people can see the other hardware that you have.


Daniel


Re: 2.6.19 file content corruption on ext3

2006-12-21 Thread Andrew Morton

On Thu, 21 Dec 2006 14:03:20 +0100
Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> > 
> > Btw,
> >  here's a totally new tangent on this: it's possible that user code is 
> > simply BUGGY. 
> 
> depmod: BADNESS: written outside isize 22183

akpm:/usr/src/module-init-tools-3.3-pre1> grep -r mmap .
./zlibsupport.c:map = mmap(0, *size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

So presumably it's in a library.

akpm:/usr/src/25> ldd /sbin/depmod
linux-gate.so.1 =>  (0xe000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x46afa000)
/lib/ld-linux.so.2 (0x4631d000)

worrisome.

Re: 2.6.19 file content corruption on ext3

2006-12-21 Thread Peter Zijlstra

On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> 
> Btw,
>  here's a totally new tangent on this: it's possible that user code is 
> simply BUGGY. 

depmod: BADNESS: written outside isize 22183

---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..5db9fd9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2393,6 +2393,17 @@ int nobh_commit_write(struct file *file, struct page *page,
 }
 EXPORT_SYMBOL(nobh_commit_write);
 
+static void __check_tail_zero(char *kaddr, unsigned int offset)
+{
+	unsigned int check = 0;
+	do {
+		check += kaddr[offset++];
+	} while (offset < PAGE_CACHE_SIZE);
+	if (check)
+		printk(KERN_ERR "%s: BADNESS: written outside isize %u\n",
+				current->comm, check);
+}
+
 /*
  * nobh_writepage() - based on block_full_write_page() except
  * that it tries to operate without attaching bufferheads to
@@ -2437,6 +2448,7 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
@@ -2604,6 +2616,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
 	 * writes to that region are not written out to the file."
 	 */
 	kaddr = kmap_atomic(page, KM_USER0);
+	__check_tail_zero(kaddr, offset);
 	memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 	flush_dcache_page(page);
 	kunmap_atomic(kaddr, KM_USER0);
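
A note on reading the output of the debug patch above: the number it prints is the sum of the bytes past i_size, not a count or an offset, and a sum of signed bytes can in principle cancel to zero. The user-space sketch below contrasts the two ways of checking a page tail. It is an illustration only: the names are made up, 4kB pages are assumed, and it uses signed char explicitly where the patch uses plain char (whose signedness depends on the platform).

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u  /* assumption: 4kB pages */

/* Sum-based check in the style of the patch; the result is a byte sum. */
static unsigned int tail_sum(const signed char *kaddr, unsigned int offset)
{
	unsigned int check = 0;

	while (offset < TOY_PAGE_SIZE)
		check += kaddr[offset++];
	return check;
}

/* Count-based check: values cannot cancel each other out. */
static unsigned int tail_nonzero(const signed char *kaddr, unsigned int offset)
{
	unsigned int n = 0;

	for (; offset < TOY_PAGE_SIZE; offset++)
		if (kaddr[offset] != 0)
			n++;
	return n;
}

/* Example page: all zero except a +1 and a -1 near the end. */
static const signed char *demo_page(void)
{
	static signed char page[TOY_PAGE_SIZE];

	page[4000] = 1;
	page[4001] = -1;
	return page;
}
```

With the example page, tail_sum(demo_page(), 3990) comes out as 0 although two bytes are set, while tail_nonzero(demo_page(), 3990) reports 2. For a quick debugging aid the sum is likely good enough, since real data rarely cancels that neatly.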



Re: 2.6.19 file content corruption on ext3

2006-12-21 Thread Andrew Morton

On Thu, 21 Dec 2006 14:03:20 +0100
Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> >
> > Btw,
> >  here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
>
> depmod: BADNESS: written outside isize 22183

akpm:/usr/src/module-init-tools-3.3-pre1> grep -r mmap .
./zlibsupport.c:	map = mmap(0, *size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);

So presumably it's in a library.

akpm:/usr/src/25> ldd /sbin/depmod
	linux-gate.so.1 =>  (0xe000)
	libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x46afa000)
	/lib/ld-linux.so.2 (0x4631d000)

worrisome.

Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Stephen Clark


Peter Zijlstra wrote:

> On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > > here's a totally new tangent on this: it's possible that user code is
> > > simply BUGGY.
>
> I'm sad to say this doesn't trigger :-(

Hi all,

I ran it a number of times on 2.6.16-1.2115_FC4 and always got
./a.out | od -x
000        
020        
040    

but running it on 2.6.19-rc5 I always get zeros in the middle.

Steve

--

"They that give up essential liberty to obtain temporary safety, 
deserve neither liberty nor safety."  (Ben Franklin)


"The course of history shows that as a government grows, liberty 
decreases."  (Thomas Jefferson)





Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Peter Zijlstra

On Wed, 2006-12-20 at 18:30 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> > On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > > 
> > > > > OR:
> > > > > 
> > > > >  - page_mkclean_one() is simply buggy.
> > > > 
> > > > GOLD!
> > > > 
> > > > it seems to work with all this (full diff against current git).
> > > > 
> > > > /me rebuilds full kernel to make sure...
> > > > reboot...
> > > > test...  pff the tension...
> > > > yay, still good!
> > > > 
> > > > Andrei; would you please verify.
> > > 
> > > I have corrupted files.
> > 
> > drad; and with this patch:
> >   http://lkml.org/lkml/2006/12/20/112
> 
> Hash check on download completion found bad chunks, consider using
> "safe_sync".

*sigh* back to square 1.

and I need to look at my reproduction case ;-(

Thanks for testing.


Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Andrei Popa

On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > 
> > > > OR:
> > > > 
> > > >  - page_mkclean_one() is simply buggy.
> > > 
> > > GOLD!
> > > 
> > > it seems to work with all this (full diff against current git).
> > > 
> > > /me rebuilds full kernel to make sure...
> > > reboot...
> > > test...  pff the tension...
> > > yay, still good!
> > > 
> > > Andrei; would you please verify.
> > 
> > I have corrupted files.
> 
> drad; and with this patch:
>   http://lkml.org/lkml/2006/12/20/112

Hash check on download completion found bad chunks, consider using
"safe_sync".

> 
> /me goes rebuild his kernel and try more than 3 times
> 


Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Martin Schwidefsky

On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:
> Also, what is this page_test_and_clear_dirty() business, that seems to
> be exclusively s390 btw. However they do seem to need this.
> 
> > But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
> > hopefully also work.
> 
> Yeah, probably, not optimally so on some archs that don't actually need
> the flush though. And as above, I wonder about s390.

Simple, the s390 architecture does not keep the dirty bit in the pte but
in something called the storage key. For each physical page there is one
associated storage key. It is accessed with special instructions like
"iske", "sske" or "rrbe". To clear the dirty bit the storage key of a
page is read with iske, the bit is cleared and the storage key is stored
back with sske. That means that clearing the dirty bit is not an atomic
operation. rrbe is used to test and clear the referenced bit (young/old
information) and is atomic in regard to other storage key operations. If
you think about it, the storage keys are quite nice for the operating
system, page_referenced() can be implemented with a single test
"page_test_and_clear_young()". No need to read all the ptes pointing to
the page. The downside is that the storage keys have a cost on the
hardware side.
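
As a rough illustration of the storage key idea described above, here is a
small user-space model. It is not s390 code: the storage_key[] array, the
SK_* bit values and the function names are made up for this sketch, and C11
atomics merely stand in for the iske/sske/rrbe instructions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for this sketch only. */
#define SK_REFERENCED 0x04u
#define SK_CHANGED    0x02u

#define NR_FRAMES 16

/* One key per physical page frame, independent of how many ptes map it. */
static _Atomic uint8_t storage_key[NR_FRAMES];

/* Model of a hardware access: a read sets referenced, a write also sets changed. */
static void frame_touch(unsigned int frame, bool write)
{
	assert(frame < NR_FRAMES);
	atomic_fetch_or(&storage_key[frame],
			(uint8_t)(write ? (SK_REFERENCED | SK_CHANGED) : SK_REFERENCED));
}

/*
 * Dirty handling in the iske/sske style: read the whole key, clear the
 * changed bit, store the key back. Two separate steps, so an update that
 * lands in between can be lost; this is the non-atomic part.
 */
static bool frame_test_and_clear_dirty(unsigned int frame)
{
	uint8_t key;

	assert(frame < NR_FRAMES);
	key = atomic_load(&storage_key[frame]);			/* "iske" */
	if (!(key & SK_CHANGED))
		return false;
	atomic_store(&storage_key[frame], (uint8_t)(key & ~SK_CHANGED));	/* "sske" */
	return true;
}

/* Referenced handling in the rrbe style: a single atomic test-and-clear. */
static bool frame_test_and_clear_young(unsigned int frame)
{
	uint8_t old;

	assert(frame < NR_FRAMES);
	old = atomic_fetch_and(&storage_key[frame], (uint8_t)~SK_REFERENCED);
	return (old & SK_REFERENCED) != 0;
}
```

The point of the model is only that the young/dirty state hangs off the
frame, so a page_referenced()-style query is one lookup instead of a walk
over every pte that maps the page.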

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.


Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Peter Zijlstra

On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > 
> > > OR:
> > > 
> > >  - page_mkclean_one() is simply buggy.
> > 
> > GOLD!
> > 
> > it seems to work with all this (full diff against current git).
> > 
> > /me rebuilds full kernel to make sure...
> > reboot...
> > test...  pff the tension...
> > yay, still good!
> > 
> > Andrei; would you please verify.
> 
> I have corrupted files.

drad; and with this patch:
  http://lkml.org/lkml/2006/12/20/112

/me goes rebuild his kernel and try more than 3 times


Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Andrei Popa

On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> 
> > OR:
> > 
> >  - page_mkclean_one() is simply buggy.
> 
> GOLD!
> 
> it seems to work with all this (full diff against current git).
> 
> /me rebuilds full kernel to make sure...
> reboot...
> test...  pff the tension...
> yay, still good!
> 
> Andrei; would you please verify.

I have corrupted files.

> The magic seems to be in the extra tlb flush after clearing the dirty
> bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry.
> 
> diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
> index 5e7cd45..2b8893b 100644
> --- a/drivers/connector/connector.c
> +++ b/drivers/connector/connector.c
> @@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void 
> (*destruct_data)(void *), v
>   spin_lock_bh(&dev->cbdev->queue_lock);
>   list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) {
>   if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
> - if (likely(!test_bit(WORK_STRUCT_PENDING,
> -  &__cbq->work.work.management) &&
> + if (likely(!delayed_work_pending(&__cbq->work) &&
>   __cbq->data.ddata == NULL)) {
>   __cbq->data.callback_priv = msg;
>  
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>   int ret = 0;
>  
>   BUG_ON(!PageLocked(page));
> - if (PageWriteback(page))
> + if (PageDirty(page) || PageWriteback(page))
>   return 0;
>  
>   if (mapping == NULL) {  /* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>   spin_lock(&mapping->private_lock);
>   ret = drop_buffers(page, &buffers_to_free);
>   spin_unlock(&mapping->private_lock);
> - if (ret) {
> - /*
> -  * If the filesystem writes its buffers by hand (eg ext3)
> -  * then we can have clean buffers against a dirty page.  We
> -  * clean the page here; otherwise later reattachment of buffers
> -  * could encounter a non-uptodate page, which is unresolvable.
> -  * This only applies in the rare case where try_to_free_buffers
> -  * succeeds but the page is not freed.
> -  *
> -  * Also, during truncate, discard_buffer will have marked all
> -  * the page's buffers clean.  We discover that here and clean
> -  * the page also.
> -  */
> - if (test_clear_page_dirty(page))
> - task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> - }
>  out:
>   if (buffers_to_free) {
>   struct buffer_head *bh = buffers_to_free;
> diff --git a/mm/memory.c b/mm/memory.c
> index c00bac6..60e0945 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL(unmap_mapping_range);
>  
> +static void check_last_page(struct address_space *mapping, loff_t size)
> +{
> + pgoff_t index;
> + unsigned int offset;
> + struct page *page;
> +
> + if (!mapping)
> + return;
> + offset = size & ~PAGE_MASK;
> + if (!offset)
> + return;
> + index = size >> PAGE_SHIFT;
> + page = find_lock_page(mapping, index);
> + if (page) {
> + unsigned int check = 0;
> + unsigned char *kaddr = kmap_atomic(page, KM_USER0);
> + do {
> + check += kaddr[offset++];
> + } while (offset < PAGE_SIZE);
> + kunmap_atomic(kaddr, KM_USER0);
> + unlock_page(page);
> + page_cache_release(page);
> + if (check)
> + printk(KERN_ERR "%s: BADNESS: truncate check %u\n", 
> current->comm, check);
> + }
> +}
> +
>  /**
>   * vmtruncate - unmap mappings "freed" by truncate() syscall
>   * @inode: inode of the file used
> @@ -1875,6 +1902,7 @@ do_expand:
>   goto out_sig;
>   if (offset > inode->i_sb->s_maxbytes)
>   goto out_big;
> + check_last_page(mapping, inode->i_size);
>   i_size_write(inode, offset);
>  
>  out_truncate:
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 237107c..f561e72 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page)
>  EXPORT_SYMBOL(test_set_page_writeback);
>  
>  /*
> - * Return true if any of the pages in the mapping are marged with the
> + * Return true if any of the pages in the mapping are marked with the
>   * passed tag.
>   */
>  int mapping_tagged(struct address_space *mapping, int tag)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..900229a 100644
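> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct 
> vm_area_struct *vma)
>  {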

Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Arjan van de Ven


> Hmm, should we not flush after clearing the dirty bit? That is, why does
> ptep_clear_flush_dirty() need a flush after clearing that bit? does it
> leak through in the tlb copy?

afaics you need to 
1) clear
2) flush 
3) check and go to 1) if needed

to be race free. 
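
One possible reading of the three steps above, as a user-space sketch. The
struct fake_pte type, the racing_writes knob and both function names are
invented for illustration; the "flush" here only simulates a writer that
manages to dirty the entry again before the flush takes effect.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_pte {
	atomic_bool dirty;	/* stands in for the hardware dirty bit */
	int racing_writes;	/* test knob: writes that sneak in before a flush */
	int flushes;		/* number of flushes issued so far */
};

/* Simulated TLB flush; a racing writer may re-dirty the entry first. */
static void fake_flush_tlb(struct fake_pte *pte)
{
	if (pte->racing_writes > 0) {
		pte->racing_writes--;
		atomic_store(&pte->dirty, true);
	}
	pte->flushes++;
}

/* Returns true if the entry was dirty at any point while we were cleaning it. */
static bool clear_dirty_until_stable(struct fake_pte *pte)
{
	bool was_dirty = false;

	assert(pte != NULL);
	do {
		if (atomic_exchange(&pte->dirty, false))	/* 1) clear */
			was_dirty = true;
		fake_flush_tlb(pte);				/* 2) flush */
	} while (atomic_load(&pte->dirty));			/* 3) check, go to 1) */
	return was_dirty;
}
```

The loop ends only when a check after a flush finds the bit still clear, so
a dirtying that races with the clear is folded into the return value rather
than dropped.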




Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Peter Zijlstra

On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:

> Pls test.

Is good. Only s390 remains a question.

Another point, change_protection() also does a cache flush, should we
too?

> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..eec8706 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct 
> vm_area_struct *vma)
>   goto unlock;
>  
>   entry = ptep_get_and_clear(mm, address, pte);
  flush_cache_page(vma, address, pte_pfn(entry));
> + flush_tlb_page(vma, address);
>   entry = pte_mkclean(entry);
>   entry = pte_wrprotect(entry);
> - ptep_establish(vma, address, pte, entry);
> + set_pte_at(mm, address, pte, entry);
>   lazy_mmu_prot_update(entry);
>   ret = 1;
>  
> 


Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Peter Zijlstra

On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:

> I will try, but I had a look around the different architectures
> implementation of ptep_clear_flush_dirty() and saw that not all do the
> actual flush. So if we go down this road perhaps we should introduce
> another per arch function that does the potential flush. like
> flush_tlb_on_clear_dirty() or something like that.

never mind, we do need an unconditional flush for changing the
protection too.



Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Peter Zijlstra

On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:
> 
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > OR:
> > > 
> > >  - page_mkclean_one() is simply buggy.
> > 
> > GOLD!
> 
> Ok. I was looking at that, and I wondered..
> 
> However, if that works, then I _think_ the correct sequence is the 
> following..
> 
> The rule should be:
>  - we flush the tlb _after_ we have cleared it, but _before_ we insert the 
>new entry.
> 
> But I dunno. These things are damn subtle. Does this patch fix it for you?

I will try, but I had a look around the different architectures
implementation of ptep_clear_flush_dirty() and saw that not all do the
actual flush. So if we go down this road perhaps we should introduce
another per arch function that does the potential flush. like
flush_tlb_on_clear_dirty() or something like that.

Then we could write:

  entry = ptep_get_and_clear(mm, address, ptep);
  flush_tlb_on_clear_dirty(vma, address);
  entry = pte_mkclean(entry);
  entry = pte_wrprotect(entry);
  set_pte_at(mm, address, ptep, entry);

> I actually suspect we should do this as an arch-specific macro, and 
> totally replace the current "ptep_clear_flush_dirty()" with one that does 
> "ptep_clear_flush_dirty_and_set_wp()".
> 
> Because what I'd _really_ prefer to do on x86 (and probably on most other 
> sane architectures) is to do
> 
>  - atomically replace the pte with the EXACT SAME ONE, but one that 
>has the writable bit clear.
> 
>   bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);
> 
>  - flush the TLB, making sure that all CPU's will no longer write to it:
> 
>   flush_tlb_page(vma, address);
> 
>  - finally, just fetch-and-clear the dirty bit (and since it's no longer 
>writable, nobody should be setting it any more)
> 
>   ret = bit_clear(__PAGE_BIT_DIRTY, &(ptep)->pte_low);
> 
> and now we should be all done.

Hmm, should we not flush after clearing the dirty bit? That is, why does
ptep_clear_flush_dirty() need a flush after clearing that bit? does it
leak through in the tlb copy?

Also, what is this page_test_and_clear_dirty() business, that seems to
be exclusively s390 btw. However they do seem to need this.

> But the "ptep_get_and_clear() + flush_tlb_page()" sequence should 
> hopefully also work.

Yeah, probably, not optimally so on some archs that don't actually need
the flush though. And as above, I wonder about s390.

(added our s390 friends to the CC list)
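
For reference, the three-step x86 ordering quoted above (write-protect,
flush, then fetch-and-clear dirty) can be modelled in user space roughly as
follows. This is a sketch, not kernel code: the FAKE_PAGE_* masks, the flush
counter and wrprotect_flush_clean() are assumptions made for the example, and
C11 atomics stand in for the locked bit operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative masks; the positions follow the usual x86 layout. */
#define FAKE_PAGE_PRESENT (1UL << 0)
#define FAKE_PAGE_RW      (1UL << 1)
#define FAKE_PAGE_DIRTY   (1UL << 6)

static int fake_flush_count;

/* Stand-in for flush_tlb_page(); it only counts calls. */
static void fake_flush_tlb_page(void)
{
	fake_flush_count++;
}

/* Returns true if the entry was dirty; leaves it present, read-only and clean. */
static bool wrprotect_flush_clean(_Atomic unsigned long *pte)
{
	unsigned long old;

	assert(pte != NULL);
	/* Step 1: same entry, writable bit cleared atomically. */
	atomic_fetch_and(pte, ~FAKE_PAGE_RW);
	/* Step 2: after the flush no CPU should write through a stale entry. */
	fake_flush_tlb_page();
	/* Step 3: nobody can set the dirty bit any more, so fetch and clear it. */
	old = atomic_fetch_and(pte, ~FAKE_PAGE_DIRTY);
	return (old & FAKE_PAGE_DIRTY) != 0;
}
```

Unlike the ptep_get_and_clear() variant, the entry never becomes non-present
in this ordering; only the writable and dirty bits change.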


Re: 2.6.19 file content corruption on ext3

2006-12-20 Thread Stephen Clark


Peter Zijlstra wrote:


On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
 


On Tue, 19 Dec 2006, Linus Torvalds wrote:
   

here's a totally new tangent on this: it's possible that user code is 
simply BUGGY. 
 



I'm sad to say this doesn't trigger :-(


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 


Hi all,

I ran it a number of times on 2.6.16-1.2115_FC4 and always got
./a.out | od -x
000        
020        
040    

but running it on 2.6.19-rc5 I always get zeros in the middle.

Steve

--

They that give up essential liberty to obtain temporary safety, 
deserve neither liberty nor safety.  (Ben Franklin)


The course of history shows that as a government grows, liberty 
decreases.  (Thomas Jefferson)





Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Jari Sundell

On 12/20/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Tue, 19 Dec 2006, Linus Torvalds wrote:
> >
> >  here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
>
> Btw, here's a simpler test-program that actually shows the difference
> between 2.6.18 and 2.6.19 in action, and why it could explain why a
> program like rtorrent might show corruption behaviour that it didn't show
> before.

Kinda late to the discussion, but I guess I could summarize what
rtorrent actually does, or should be doing.

When downloading a new torrent, it will create the files and truncate
them to the final size. It will never call truncate after this and the
files will remain sparse until data is downloaded. A 'piece' is mapped
to memory using MAP_SHARED, which will be page aligned on single file
torrents but unlikely to be so on multi-file torrents.

So on multi-file torrents it'll often end up with two mappings
overlapping on one page, each of which only writes to its own part of
the page. These will then be sync'ed with MS_ASYNC, or MS_SYNC if low
on disk space. After that the piece might be unmapped, then mapped as
read-only.
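
A minimal sketch of that access pattern, assuming one 4096-byte file and
two MAP_SHARED mappings of the same page. The function name, the fill
bytes and MODEL_FILE_LEN are made up for illustration; this is not
rtorrent's actual code, only the sequence described above:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MODEL_FILE_LEN 4096	/* hypothetical size, one small page */

/*
 * Truncate the file to its final size first, then map the same page twice
 * and let each mapping write only its own part: [0, split) through the
 * first mapping, [split, MODEL_FILE_LEN) through the second.
 * The file contents are read back into out[]. Returns 0 on success.
 */
static int two_mappings_one_page(const char *path, size_t split,
				 unsigned char out[MODEL_FILE_LEN])
{
	unsigned char *a, *b;
	int fd, ret = -1;

	assert(path != NULL && out != NULL && split <= MODEL_FILE_LEN);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, MODEL_FILE_LEN) < 0)
		goto out_close;
	a = mmap(NULL, MODEL_FILE_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (a == MAP_FAILED)
		goto out_close;
	b = mmap(NULL, MODEL_FILE_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (b == MAP_FAILED)
		goto out_unmap_a;

	memset(a, 0x11, split);				/* first piece */
	memset(b + split, 0x22, MODEL_FILE_LEN - split);	/* second piece */

	/* MS_ASYNC as in the description; MS_SYNC would be the low-space case */
	if (msync(a, MODEL_FILE_LEN, MS_ASYNC) < 0 ||
	    msync(b, MODEL_FILE_LEN, MS_ASYNC) < 0)
		goto out_unmap_b;
	if (pread(fd, out, MODEL_FILE_LEN, 0) == MODEL_FILE_LEN)
		ret = 0;
out_unmap_b:
	munmap(b, MODEL_FILE_LEN);
out_unmap_a:
	munmap(a, MODEL_FILE_LEN);
out_close:
	close(fd);
	return ret;
}
```

On a correct kernel both halves must show up in the file, since every
store lands inside the already truncated size.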

I haven't thought of asking if single file torrents are ok.

Rakshasa

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Linus Torvalds



On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > OR:
> > 
> >  - page_mkclean_one() is simply buggy.
> 
> GOLD!

Ok. I was looking at that, and I wondered..

However, if that works, then I _think_ the correct sequence is the 
following..

The rule should be:
 - we flush the tlb _after_ we have cleared it, but _before_ we insert the 
   new entry.

But I dunno. These things are damn subtle. Does this patch fix it for you?

I actually suspect we should do this as an arch-specific macro, and 
totally replace the current "ptep_clear_flush_dirty()" with one that does 
"ptep_clear_flush_dirty_and_set_wp()".

Because what I'd _really_ prefer to do on x86 (and probably on most other 
sane architectures) is to do

 - atomically replace the pte with the EXACT SAME ONE, but one that 
   has the writable bit clear.

bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);

 - flush the TLB, making sure that all CPU's will no longer write to it:

flush_tlb_page(vma, address);

 - finally, just fetch-and-clear the dirty bit (and since it's no longer 
   writable, nobody should be setting it any more)

ret = bit_clear(_PAGE_BIT_DIRTY, &(ptep)->pte_low);

and now we should be all done.
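
A userspace model of those three steps, as a sketch only. The pte is
modelled as a plain atomic word and the TLB flush is a counting stub;
the MODEL_* names and the helper functions are hypothetical, and only
the bit positions (bit 1 writable, bit 6 dirty) follow the x86 layout:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* x86 page-table bits: bit 1 is "writable", bit 6 is "dirty" */
#define MODEL_PAGE_RW    (1UL << 1)
#define MODEL_PAGE_DIRTY (1UL << 6)

/* Hypothetical stand-in for flush_tlb_page(); it only counts calls */
static int model_tlb_flushes;

static void model_flush_tlb_page(void)
{
	model_tlb_flushes++;
}

/*
 * 1. atomically clear the writable bit, leaving all other bits alone
 * 2. flush the TLB so no CPU keeps a stale writable translation
 * 3. atomically fetch-and-clear the dirty bit
 * Returns true if the entry was dirty.
 */
static bool model_clear_dirty_and_wrprotect(_Atomic unsigned long *pte)
{
	unsigned long old;

	assert(pte != NULL);
	atomic_fetch_and(pte, ~MODEL_PAGE_RW);
	model_flush_tlb_page();
	old = atomic_fetch_and(pte, ~MODEL_PAGE_DIRTY);
	return (old & MODEL_PAGE_DIRTY) != 0;
}
```

The point of the ordering is that the entry is never cleared to zero, so
there is no window in which a concurrent fault sees a missing pte; once
step 2 is done nothing can set the dirty bit behind step 3.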

But the "ptep_get_and_clear() + flush_tlb_page()" sequence should 
hopefully also work.

Pls test.

Linus


diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..eec8706 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
+	flush_tlb_page(vma, address);
 	entry = pte_mkclean(entry);
 	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
+	set_pte_at(mm, address, pte, entry);
 	lazy_mmu_prot_update(entry);
 	ret = 1;
 


Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Andrew Morton

On Tue, 19 Dec 2006 16:03:49 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> 
> > On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> > 
> > > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > > 
> > > Peter, have you been able to trigger the corruption?
> > 
> > > Yes; however the mail I sent describing that seems to be lost in space.
> 
> Btw, can somebody actually explain the mess that is ext3 "dirtying".
> 
> Ext3 does NOT use __set_page_dirty_buffers. It does
> 
>   static int ext3_journalled_set_page_dirty(struct page *page)
>   {
>   SetPageChecked(page);
>   return __set_page_dirty_nobuffers(page);
>   }
> 
> and uses that "Checked" bit as a "whole page is dirty" bit (which it tests 
> in "writepage()").

This is purely for data=journal, which is rarely used.

In journalled-data mode, write(), write-fault, etc. are not allowed to dirty
the pages and buffers, because the data has to be written to the journal
first.  Only after the data has been written to the journal do we mark
buffers (and hence pages) dirty as far as the VFS is concerned, so that the
data can be checkpointed back to its real place on the disk.

For MAP_SHARED pages ext3 cheats madly and doesn't journal the data at all.
In all journalling modes, MAP_SHARED data follows the regular ext2-style
handling.  Which is a bit of a wart.


Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Linus Torvalds

On Wed, 20 Dec 2006, Peter Zijlstra wrote:

> On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> 
> > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > 
> > Peter, have you been able to trigger the corruption?
> 
> Yes; however the mail I sent describing that seems to be lost in space.

Btw, can somebody actually explain the mess that is ext3 "dirtying".

Ext3 does NOT use __set_page_dirty_buffers. It does

	static int ext3_journalled_set_page_dirty(struct page *page)
	{
		SetPageChecked(page);
		return __set_page_dirty_nobuffers(page);
	}

and uses that "Checked" bit as a "whole page is dirty" bit (which it tests 
in "writepage()").

You realize what this all means? It means that ANYTHING that actually 
clears the _real_ dirty bit won't actually be doing anything at all for 
ext3, because the Checked bit will still stay set, and any IO down the 
line on that page would totally ignore the dirty bits on the buffer heads 
and just write out everything.

That is "The Mess(tm)".

It also basically means that anything that clears the dirty bit without 
just calling "writepage()" had _better_ call "invalidatepage()" for the 
whole page, because otherwise the PageChecked bit will never be cleared as 
far as I can see. Happily, at least ext3 seems to _test_ for that case in 
the release_page() function, so it appears that we do do this.

But this seems to just strengthen my argument: you can NEVER clean a page, 
unless you (a) do IO on it immediately afterwards (writeback) or (b) 
invalidate it entirely (truncate).

I'd really like to see just those two functions exist. Preferably in a 
form where you can see easily that we actually follow those rules. Rather 
than having a confusing set of "clear_page_dirty()" and
"test_and_clear_page_dirty()" functions that are called from random 
places.

IOW, I think the "clear_page_dirty_for_io()" is fine (it's case (a) 
above), and then we should probably have a "cancel_dirty_page()" function 
that does all the current clear_page_dirty() but also makes sure that we 
actually call the invalidate_page() function itself. 

Hmm?

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Peter Zijlstra

On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:

> OR:
> 
>  - page_mkclean_one() is simply buggy.

GOLD!

it seems to work with all this (full diff against current git).

/me rebuilds full kernel to make sure...
reboot...
test...  pff the tension...
yay, still good!

Andrei; would you please verify.

The magic seems to be in the extra tlb flush after clearing the dirty
bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry.

diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 5e7cd45..2b8893b 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
spin_lock_bh(&dev->cbdev->queue_lock);
list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) {
if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
-   if (likely(!test_bit(WORK_STRUCT_PENDING,
-&__cbq->work.work.management) &&
+   if (likely(!delayed_work_pending(&__cbq->work) &&
__cbq->data.ddata == NULL)) {
__cbq->data.callback_priv = msg;
 
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
int ret = 0;
 
BUG_ON(!PageLocked(page));
-   if (PageWriteback(page))
+   if (PageDirty(page) || PageWriteback(page))
return 0;
 
if (mapping == NULL) {  /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
-   if (ret) {
-   /*
-* If the filesystem writes its buffers by hand (eg ext3)
-* then we can have clean buffers against a dirty page.  We
-* clean the page here; otherwise later reattachment of buffers
-* could encounter a non-uptodate page, which is unresolvable.
-* This only applies in the rare case where try_to_free_buffers
-* succeeds but the page is not freed.
-*
-* Also, during truncate, discard_buffer will have marked all
-* the page's buffers clean.  We discover that here and clean
-* the page also.
-*/
-   if (test_clear_page_dirty(page))
-   task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-   }
 out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..60e0945 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+   pgoff_t index;
+   unsigned int offset;
+   struct page *page;
+
+   if (!mapping)
+   return;
+   offset = size & ~PAGE_MASK;
+   if (!offset)
+   return;
+   index = size >> PAGE_SHIFT;
+   page = find_lock_page(mapping, index);
+   if (page) {
+   unsigned int check = 0;
+   unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+   do {
+   check += kaddr[offset++];
+   } while (offset < PAGE_SIZE);
+   kunmap_atomic(kaddr, KM_USER0);
+   unlock_page(page);
+   page_cache_release(page);
+   if (check)
+   printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check);
+   }
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+   check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);
 
 out_truncate:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f561e72 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page)
 EXPORT_SYMBOL(test_set_page_writeback);
 
 /*
- * Return true if any of the pages in the mapping are marged with the
+ * Return true if any of the pages in the mapping are marked with the
  * passed tag.
  */
 int mapping_tagged(struct address_space *mapping, int tag)
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..900229a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 {
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
-   pte_t *pte,

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Peter Zijlstra

On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:

> Well... we'd need to see (corruption && this-not-triggering) to be sure.
> 
> Peter, have you been able to trigger the corruption?

Yes; however the mail I sent describing that seems to be lost in space.

/me quotes from the sent folder:

> The bad news is, that doesn't help either. The good news is I can
> reproduce it.
> 
> What I did to achieve that:
>  
>  - get a sizable torrent from legaltorrents.com / or create a torrent
> yourself that is around ~600M and has multiple files.
> 
>  - start a tracker, and multiple seeds (I used three machines here)
> 
>  - pull the torrent on a fourth machine
> 
> the seeding machines don't much matter of course.
> 
> the fourth machine was a dual core x86-64 with an SMP kernel and
> PREEMPT, mem=256M (so that the torrent is quite a bit larger and does
> require writeout) and I used an ext3 partition with 1k blocks.


Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Peter Zijlstra

On Wed, 2006-12-20 at 00:06 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> 
> > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > 
> > Peter, have you been able to trigger the corruption?
> 
> Yes; however the mail I sent describing that seems to be lost in space.
> 
> /me quotes from the sent folder:
> 
> > The bad news is, that doesn't help either. The good news is I can
> > reproduce it.
> > 
> > What I did to achieve that:
> >  
> >  - get a sizable torrent from legaltorrents.com / or create a torrent
> > yourself that is around ~600M and has multiple files.
> > 
> >  - start a tracker, and multiple seeds (I used three machines here)
> > 
> >  - pull the torrent on a fourth machine
> > 
> > the seeding machines don't much matter of course.
> > 
> > the fourth machine was a dual core x86-64 with an SMP kernel and
> > PREEMPT, mem=256M (so that the torrent is quite a bit larger and does
> > require writeout) and I used an ext3 partition with 1k blocks.

PS. this was a reply to:
 http://lkml.org/lkml/2006/12/19/121


Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Andrew Morton

On Tue, 19 Dec 2006 14:51:55 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Tue, 19 Dec 2006, Peter Zijlstra wrote:
> 
> > On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > > 
> > > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > > >
> > > >  here's a totally new tangent on this: it's possible that user code is 
> > > > simply BUGGY. 
> > 
> > I'm sad to say this doesn't trigger :-(
> 
> Oh, well. It was a theory. 
> 

Well... we'd need to see (corruption && this-not-triggering) to be sure.

Peter, have you been able to trigger the corruption?

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Linus Torvalds



On Tue, 19 Dec 2006, Peter Zijlstra wrote:

> On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > 
> > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > >
> > >  here's a totally new tangent on this: it's possible that user code is 
> > > simply BUGGY. 
> 
> I'm sad to say this doesn't trigger :-(

Oh, well. It was a theory. 

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Florian Weimer

* Linus Torvalds:

> Now, this should _matter_ only for user processes that are buggy,
> and that have written to the page _before_ extending it with
> ftruncate().

APT seems to properly extend the file before mapping it, by writing a
zero byte at the desired position (creating a hole).

24986 open("/var/cache/apt/pkgcache.bin", O_RDWR|O_CREAT|O_TRUNC, 0666) = 6

24986 lseek(6, 12582911, SEEK_SET)  = 12582911
24986 write(6, "\0", 1) = 1

24986 mmap(NULL, 12582912, PROT_READ|PROT_WRITE, MAP_SHARED, 6, 0) = 0x2b6578636000

24986 msync(0x2b6578636000, 7464112, MS_SYNC) = 0
24986 msync(0x2b6578636000, 8656, MS_SYNC) = 0
24986 munmap(0x2b6578636000, 12582912)  = 0
24986 ftruncate(6, 7464112) = 0
24986 fstat(6, {st_mode=S_IFREG|0644, st_size=7464112, ...}) = 0
24986 mmap(NULL, 7464112, PROT_READ, MAP_SHARED, 6, 0) = 0x2b6578636000

APT's code is pretty convoluted, though, and there might be some code
path in it that gets it wrong. 8-P
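
The sequence in the strace can be sketched in C roughly as below. This
is an illustration of the pattern only, not APT's code; the function
name, the fill byte and the parameters are assumptions:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Extend the file to map_size by writing one zero byte at the last offset
 * (this leaves a hole), map it shared, fill the first `used` bytes, sync,
 * unmap, and only then shrink the file to `used`.
 * Returns the final file size, or -1 on failure.
 */
static long extend_then_map(const char *path, size_t map_size, size_t used)
{
	struct stat st;
	char *map;
	long result = -1;
	int fd;

	assert(path != NULL && map_size > 0 && used <= map_size);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
	if (fd < 0)
		return -1;
	/* Extend first, so that every later store is inside i_size */
	if (lseek(fd, (off_t)map_size - 1, SEEK_SET) < 0 ||
	    write(fd, "\0", 1) != 1)
		goto out_close;
	map = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;
	memset(map, 0x5a, used);
	if (msync(map, map_size, MS_SYNC) < 0) {
		munmap(map, map_size);
		goto out_close;
	}
	if (munmap(map, map_size) < 0)
		goto out_close;
	/* Shrinking after the data is synced and unmapped is safe */
	if (ftruncate(fd, (off_t)used) == 0 && fstat(fd, &st) == 0)
		result = (long)st.st_size;
out_close:
	close(fd);
	return result;
}
```

The order is what matters here: the file reaches its full size before
the mapping is written, so writeout never has to zero data that the
program still expects to keep.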

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Peter Zijlstra

On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> 
> On Tue, 19 Dec 2006, Linus Torvalds wrote:
> >
> >  here's a totally new tangent on this: it's possible that user code is 
> > simply BUGGY. 

I'm sad to say this doesn't trigger :-(



Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread dean gaudet

On Mon, 18 Dec 2006, Linus Torvalds wrote:

> On Tue, 19 Dec 2006, Nick Piggin wrote:
> > 
> > We never want to drop dirty data! (ignoring the truncate case, which is
> > handled privately by truncate anyway)
> 
> Bzzt.
> 
> SURE we do.
> 
> We absolutely do want to drop dirty data in the writeout path.
> 
> How do you think dirty data ever _becomes_ clean data?
> 
> In other words, yes, we _do_ want to test-and-clear all the pgtable bits 
> _and_ the PG_dirty bit. We want to do it for:
>  - writeout
>  - truncate
>  - possibly a "drop" event (which could be a case for a journal entry that 
>becomes stale due to being replaced or something - kind of "truncate" 
>on metadata)
> 
> because both of those events _literally_ turn dirty state into clean 
> state.
> 
> In no other circumstance do we ever want to clear a dirty bit, as far as I 
> can tell. 

i admit this may not be entirely relevant, but it seems like a good place 
to bring up an old problem:  when a disk dies with lots of queued writes 
it can totally bring a system to its knees... even after the disk is 
removed.  i wrote up something about this a while ago:

http://lkml.org/lkml/2005/8/18/243

so there's another reason to "clear a dirty bit"... well, in fact -- drop 
the pages entirely.

-dean

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Linus Torvalds

On Tue, 19 Dec 2006, Linus Torvalds wrote:
>
>  here's a totally new tangent on this: it's possible that user code is 
> simply BUGGY. 

Btw, here's a simpler test-program that actually shows the difference 
between 2.6.18 and 2.6.19 in action, and why it could explain why a 
program like rtorrent might show corruption behaviour that it didn't show 
before.

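#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char *mapping;
	int fd;

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	/* The file is only 10 bytes long at this point... */
	if (ftruncate(fd, 10) < 0)
		return -1;
	mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mapping == MAP_FAILED)
		return -1;
	/* ...but 20 bytes are written: the last 10 lie beyond i_size (the bug) */
	memset(mapping, 0xaa, 20);
	sync();
	/* Only now is the file extended to cover the bytes already written */
	if (ftruncate(fd, 40) < 0)
		return -1;
	memset(mapping + 20, 0x55, 20);
	write(1, mapping, 40);
	return 0;
}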

Notice the "sync()" in between the "memset()" and the "ftruncate()". In 
2.6.18, that would normally do absolutely _nothing_ to the shared memory 
mapping, because we simply couldn't track pages that were dirty in the 
page tables. 

So in 2.6.18, if you try this, with

./a.out | od -x

you should see something like

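0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050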

which matches your memset() patterns: 20 bytes of 0xaa, and 20 bytes of 
0x55.

HOWEVER. 

In 2.6.19, because we actually track dirty data so much better, "sync()" 
will actually be smart enough to write out the dirty mmap'ed data too. But 
since the user program has only allocated ten bytes for it in the file, 
when it is written out, the rest of the page is cleared. When you then 
write the last 20 bytes (after _properly_ allocating memory for them), you 
should now see a pattern like

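0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000
0000020 0000 0000 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050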

instead: with ten bytes of zero in between, because the data that couldn't 
be written out was cleared.

So 2.6.19 is strictly _better_, but exactly because it's tracking dirty 
status much more precisely, you'll see certain user-level bugs much more 
easily.

NOTE NOTE NOTE! The code really _was_ buggy in 2.6.18 too, and you _can_ 
get the zeroes in the middle of the file with an older kernel. But in 
older kernels, you need to be really really unlucky, and have the page 
cleaned by strong memory pressure. In 2.6.19, any "sync()" activity 
(including from the outside) will clean the page, so a user program with 
this bug can just be made to trigger the bug much more easily.
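
For contrast, a sketch of the same access pattern with the user-level bug
removed: the file is extended to its final size before anything past the
old end is written through the mapping. The function name and the
read-back through pread() are assumptions added for illustration:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Writes 20 bytes of 0xaa and 20 bytes of 0x55 through a shared mapping,
 * with a writeout in between, and reads the file back into out[0..39].
 * Returns 0 on success, -1 on any failure.
 */
static int write_with_correct_order(const char *path, unsigned char out[40])
{
	char *mapping;
	int fd, ret = -1;

	assert(path != NULL && out != NULL);
	fd = open(path, O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 40) < 0)	/* allocate all 40 bytes first */
		goto out_close;
	mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mapping == MAP_FAILED)
		goto out_close;
	memset(mapping, 0xaa, 20);
	/* A writeout here is harmless: all 20 bytes are inside i_size */
	if (msync(mapping, 4096, MS_SYNC) < 0)
		goto out_unmap;
	memset(mapping + 20, 0x55, 20);
	if (msync(mapping, 4096, MS_SYNC) < 0)
		goto out_unmap;
	if (pread(fd, out, 40, 0) == 40)
		ret = 0;
out_unmap:
	munmap(mapping, 4096);
out_close:
	close(fd);
	return ret;
}
```

With this ordering the file should hold 20 bytes of 0xaa followed by 20
bytes of 0x55 no matter when writeout happens.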

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Linus Torvalds



Btw,
 here's a totally new tangent on this: it's possible that user code is 
simply BUGGY. 

There is one case where the kernel actually forcibly writes zeroes into a 
file: when we're writing a page that straddles the "inode->i_size" 
boundary. See the various writepages in fs/buffer.c, they all contain 
variations on that theme (although most of them aren't as well commented 
as this snippet):

/*
 * The page straddles i_size.  It must be zeroed out on each and every
 * writepage invocation because it may be mmapped.  "A file is mapped
 * in multiples of the page size.  For a file that is not a multiple of
 * the  page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);

Now, this should _matter_ only for user processes that are buggy, and that 
have written to the page _before_ extending it with ftruncate(). That's 
definitely a serious bug, but it's one that can go totally undetected 
depending on when the actual write-out happens.
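
The zeroing rule in the snippet above can be modelled in userspace on a
plain buffer. This is a sketch, not the kernel code; the function name is
made up, and it only models the one page that contains i_size:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * If i_size ends inside this page, everything from that offset to the end
 * of the page is zeroed before the page is written out.
 * page_size must be a power of two. Returns the number of bytes zeroed.
 */
static size_t model_zero_past_i_size(unsigned char *page, size_t page_size,
				     size_t i_size)
{
	size_t offset;

	assert(page != NULL);
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	offset = i_size & (page_size - 1);
	if (offset == 0)
		return 0;	/* i_size is page aligned: nothing straddles */
	memset(page + offset, 0, page_size - offset);
	return page_size - offset;
}
```

For a 10-byte file and a 4096-byte page this wipes bytes 10..4095, which
is exactly what eats data a program stored there before calling
ftruncate().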

So what I'm saying is that if we end up writing things earlier thanks to 
the more aggressive dirty-page-management thing in 2.6.19, we might 
actually just expose a long-time userspace bug that was just a LOT harder 
to trigger before..

I'm not saying this is the cause of all this, but we've been tearing our 
hair out, and it might be worthwhile trying this really really really 
stupid patch that will notice when that happens at truncate() time, and 
tell the user that he's a total idiot. Or something to that effect.

Maybe the reason this is so easy to trigger with rtorrent is not because 
rtorrent does some magic pattern that triggers a kernel bug, but simply 
because rtorrent itself might have a bug.

Ok, so it's a long shot, but it's still worth testing, I suspect. The 
patch is very simple: whenever we do an _expanding_ truncate, we check the 
last page of the _old_ size, and if there were non-zero contents past the 
old size, we complain.

Attached is a test-program that _should_ trigger a 
kernel message like

a.out: BADNESS: truncate check 17000

for good measure, just so that you can verify that the patch works and 
actually catches this case.

(The 17000 number is just the one-hundred _invalid_ 0xaa bytes - out of 
the 200 we wrote - that were summed up: 100*0xaa == 17000. Anything 
non-zero is always a bug).
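
The arithmetic can be checked with a userspace model of the same
summation. This is only a sketch: the function name is hypothetical and
the page is an ordinary buffer rather than a page-cache page:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sum every byte of the page that lies past old_size. A non-zero result
 * means something was stored beyond the old end of file before the file
 * was extended. page_size must be a power of two.
 */
static unsigned int model_truncate_check(const unsigned char *page,
					 size_t page_size, size_t old_size)
{
	unsigned int check = 0;
	size_t offset;

	assert(page != NULL);
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	offset = old_size & (page_size - 1);
	if (offset == 0)
		return 0;	/* old size was page aligned, nothing to check */
	while (offset < page_size)
		check += page[offset++];
	return check;
}
```

With 200 bytes of 0xaa in the page and an old size of 100, the bytes at
offsets 100..199 are summed: 100 * 170 = 17000.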

I doubt this is really it, but it's worth trying. If you fill out a page, 
and only do "ftruncate()" in response to SIGBUS messages (and don't 
truncate to whole pages), you could potentially see zeroes at the end of 
the page exactly because _writeout_ cleared the page for you! So it 
_could_ explain the symptoms, but only if user-space was horribly horribly 
broken.

Linus


diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+   pgoff_t index;
+   unsigned int offset;
+   struct page *page;
+
+   if (!mapping)
+   return;
+   offset = size & ~PAGE_MASK;
+   if (!offset)
+   return;
+   index = size >> PAGE_SHIFT;
+   page = find_lock_page(mapping, index);
+   if (page) {
+   unsigned int check = 0;
+   unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+   do {
+   check += kaddr[offset++];
+   } while (offset < PAGE_SIZE);
+   kunmap_atomic(kaddr,KM_USER0);
+   unlock_page(page);
+   page_cache_release(page);
+   if (check)
+   printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+   }
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+   check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);
 
 out_truncate:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char *mapping;
	int fd;

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 10) < 0)
		return -1;
	mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mapping == MAP_FAILED)
		return -1;
	memset(mapping, 0x55, 10);
	if (ftruncate(fd, 100) < 0)
		return -1;

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Linus Torvalds

On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> Counterexample? Well AFAIKS, the clearing of PG_dirty in ttfb() in
> response to finding all buffers clean is perfectly valid. What makes
> you think otherwise?

If the page really is clean, then why the heck can't we just clean the 
page table bits too?

Either it's clean or it isn't. If all the buffers being clean means that 
the page is clean, then it's clean. WE SHOULD NOT THINK THAT PTE'S ARE ANY 
DIFFERENT.

I really don't see your point. Is it clean? If it is, then clear the damn 
dirty bits from the page tables too. Don't go pussyfooting around the 
issue and confuse yourself and everybody but me by saying "but if it's 
dirty in the page tables, it's magically dirty". NO.

It really is that simple. Is it clean or not?

If it's clean, you can remove ALL the dirty bits. Not just some.

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Linus Torvalds

On Tue, 19 Dec 2006, Nick Piggin wrote:
> 
> Now I'm not exactly sure how ext3 (or any other) filesystems make use
> of this particular feature of try_to_free_buffers(), but it is clear
> from the comments what it is for. So your patch isn't really a minimal
> fix (ie. it would require an OK from all filesystems, wouldn't it?)
> 
> Or did I miss a mail where you reasoned that it is safe to make this
> change (/me goes to reread the thread)...

I'm saying it had _better_ be safe, and no, low-level filesystems don't 
actually matter.

The page has to be cleanable _some_ way. So if we test for "page_dirty()" 
at the top, and just refuse to do it in try_to_free_pages(), we still know 
that the _proper_ page cleaning had better clean it. Because ttfp() is 
never going to clean the page in the general case _anyway_.

So I'm really saying:

 - the page WILL be cleaned by the real page cleaning action (ie memory 
   pressure or sync or something else causing us to go through the 
   bog-standard page-based writeout).

   Does anybody dispute this?

 - the "ttfp()" hack was a HACK. It was an ugly and nasty hack even when 
   it was first introduced. It gets doubly worse now that we know we have 
   something wrong with page cleaning, and it has distracted from the real 
   problem.

 - I removed that ugly and disgusting hack entirely at first, but Andrew 
   points out that he really wants to keep the buffers there, because the 
   buffers being clean actually say something. That, together with the 
   fact that as long as the page is dirty, the buffers really do end up 
   having a job to do, made me add a much smaller hack to replace the big 
   ugly one ("don't even try, if the page is marked dirty").

 - so with that thing in place, there isn't even any change in behaviour 
   wrt the buffers and low-level filesystems. It's just that we make them 
   a bit harder to get rid of. But arguably that shouldn't actually ever 
   really _happen_ anyway (because I think it's a BUG if the page is 
   marked dirty but none of the buffers are), so I think that part is a 
   non-issue.

In other words, ttfp() _never_ had anything to do with "page cleaning". 
Not originally, not with the horrible hack, and not with my patch. 

Trying to mix it in just caused a bug that _everybody_ agrees is a bug. 
It's not the bug we're chasing, but we've got three different patches to 
fix it (Andrew's, mine and yours), and mine is the simplest one by far 
especially in the long run, because it just REMOVES the ugly dependency.

And yes, I probably care more about "in the long run" than most. To me, a 
bug is a bug even if it's _just_ a maintenance headache. Andrew's patch 
made things _worse_ ("magic insane flag"), and while yours didn't make the 
code worse, it still introduced the notion of a totally insane "clean the 
page but if the PTE's are dirty, do something else" notion.

IF THE PAGE TRULY IS CLEAN (and both you and Andrew claim it is, if all 
buffers are clean - since you mark it clean in the non-mapped case) THEN 
YOU SHOULD BE ABLE TO CLEAN THE PAGE TABLE BITS TOO.

And by claiming that the page table bits are different from PG_dirty, 
you're just making the issues worse. They shouldn't be. That's what the 
whole point of Peter's patch was: PG_dirty fundamentally _means_ that the 
page tables might be dirty too. That was the whole _point_ of doing all 
this in 2.6.19 in the first place.

So if you cannot accept that page table bits should be on "equal footing" 
with PG_dirty, then you should just say "Let's remove Peter's patch 
entirely".
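To make the "equal footing" argument concrete, here is a toy userspace model (not kernel code; `toy_page`, `toy_pte` and all helper names are invented for this sketch). It assumes the 2.6.19 dirty-tracking rule that a clean page is only mapped by clean, write-protected PTEs, so cleaning a page has to clean its page table bits as well:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PTES 4

/* One page table entry mapping the page (hypothetical, heavily simplified). */
struct toy_pte {
	bool writable;
	bool dirty;
};

/* A page together with every PTE that maps it. */
struct toy_page {
	bool pg_dirty;			/* stand-in for PG_dirty */
	size_t nr_ptes;
	struct toy_pte ptes[TOY_MAX_PTES];
};

/* A store through PTE i: a write-protected PTE "faults" and dirties the page. */
static void toy_store(struct toy_page *p, size_t i)
{
	assert(i < p->nr_ptes);
	if (!p->ptes[i].writable) {
		p->ptes[i].writable = true;
		p->pg_dirty = true;
	}
	p->ptes[i].dirty = true;
}

/* Clean the page and its PTEs together; true if there was anything to write. */
static bool toy_clear_dirty_with_ptes(struct toy_page *p)
{
	bool was_dirty = p->pg_dirty;

	for (size_t i = 0; i < p->nr_ptes; i++) {
		was_dirty |= p->ptes[i].dirty;
		p->ptes[i].dirty = false;
		p->ptes[i].writable = false;	/* next store must fault again */
	}
	p->pg_dirty = false;
	return was_dirty;
}

/* The contested variant: clear only the page flag and leave the PTEs alone. */
static bool toy_clear_dirty_flag_only(struct toy_page *p)
{
	bool was_dirty = p->pg_dirty;

	p->pg_dirty = false;
	return was_dirty;
}

/* True if a later store could go unnoticed: page looks clean, PTE still writable. */
static bool toy_can_lose_write(const struct toy_page *p)
{
	if (p->pg_dirty)
		return false;
	for (size_t i = 0; i < p->nr_ptes; i++)
		if (p->ptes[i].writable)
			return true;
	return false;
}
```

In this model, a store after toy_clear_dirty_flag_only() does not set pg_dirty again because the PTE is still writable, whereas after toy_clear_dirty_with_ptes() the next store faults and re-dirties the page.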

Linus

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Peter Zijlstra

On Tue, 2006-12-19 at 21:58 +1100, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
> 
> >>Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
> >>pages.  But it turns out that we don't feed it mapped pages, apart from
> >>pagevec_strip() and possibly races against pagefaults.
> > 
> > 
> > So how about this:
> 
> Well that's still racy. Anyway several earlier patches (including
> the one I posted) closed this race. Some were still reported to
> trigger corruption IIRC.

I can't remember a patch that removes mapped pages from this code path,
though I could have missed it. Removing the mapping branch in ttfb()
outright also fixed the problem - and that is a superset of page_mapped().

I'm now building a kernel with this patch, and will submit that to
rtorrent with mem=256M on a 1k ext3 filesystem on x86_64 smp preempt.

---
 fs/buffer.c |   32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2798,11 +2798,38 @@ static inline int buffer_busy(struct buf
 		(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
 }
 
+/*
+ * AKPM sayeth:
+ *
+ * - a process does a one-byte-write to a file on a 64k pagesize, 4k
+ *   blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
+ *   has one dirty buffer and 15 not uptodate buffers.
+ *
+ * - kjournald writes the dirty buffer.  The page is now PageDirty,
+ *   !PageUptodate and has a mix of clean and not uptodate buffers.
+ *
+ * - try_to_free_buffers() removes the page's buffers.  It MUST now clear
+ *   PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
+ *   uptodate page with no buffer_heads.
+ *
+ *   We're screwed: we cannot write the page because we don't know which
+ *   sections of it contain garbage.  We cannot read the page because we don't
+ *   know which sections of it contain modified data.  We cannot free the page
+ *   because it is dirty.
+ *
+ * However for mapped pages this is not true; mapped pages will be fully
+ * loaded and thus cannot have not uptodate buffers.
+ *
+ * Hence allow the PG_dirty bit to stay for pages that had no not uptodate
+ * buffers (and assert that mapped pages never have those).
+ */
+
 static int
 drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 {
 	struct buffer_head *head = page_buffers(page);
 	struct buffer_head *bh;
+	int uptodate = 1;
 
 	bh = head;
 	do {
@@ -2818,11 +2845,14 @@ drop_buffers(struct page *page, struct b
 
 		if (!list_empty(&bh->b_assoc_buffers))
 			__remove_assoc_queue(bh);
+		if (!buffer_uptodate(bh))
+			uptodate = 0;
 		bh = next;
 	} while (bh != head);
 	*buffers_to_free = head;
 	__clear_page_buffers(page);
-	return 1;
+	VM_BUG_ON(page_mapped(page) && !uptodate);
+	return !uptodate;
 failed:
 	return 0;
 }



Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Nick Piggin


Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
>
>>Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
>>pages.  But it turns out that we don't feed it mapped pages, apart from
>>pagevec_strip() and possibly races against pagefaults.
>
> So how about this:

Well that's still racy. Anyway several earlier patches (including
the one I posted) closed this race. Some were still reported to
trigger corruption IIRC.

> Index: linux-2.6-git/mm/page-writeback.c
> ===
> --- linux-2.6-git.orig/mm/page-writeback.c  2006-12-19 08:24:48.0 +0100
> +++ linux-2.6-git/mm/page-writeback.c   2006-12-19 11:43:31.0 +0100
> @@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p
>  	struct address_space *mapping = page_mapping(page);
>  	unsigned long flags;
>  
> +	if (page_mapped(page))
> +		return 0;
> +
>  	if (!mapping)
>  		return TestClearPageDirty(page);

--
SUSE Labs, Novell Inc.

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Nick Piggin


Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>>I think it could be very likely that indeed the bug is a latent one in
>>a clear_page_dirty caller, rather than dirty-tracking itself.
>
> The only callers are try_to_free_buffers(), truncate and a few scruffy
> possibly-wrong-for-fsync filesystems which aren't being used here.


Well truncate/invalidate will not operate on mapped pages (barring the
very-unlikely truncate/invalidate vs fault races). We can ignore those
filesystems as they don't include ext3. Which brings us back to
try_to_free_buffers().

Maybe it is something else entirely, but did try_to_free_buffers ever
get completely cleared? Or was some of Andrei's corruption possibly
leftover on-disk corruption from a previous kernel?

--
SUSE Labs, Novell Inc.

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Peter Zijlstra

On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> 
> > Linus Torvalds wrote:
> > 
> > > NOTICE? First you make a BIG DEAL about how dirty bits should never get 
> > > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
> > > the dirty bit for when it's not in the page tables.
> > 
> > try_to_free_buffers is quite a special case, where we're transferring
> > the page dirty metadata from the buffers to the page. I think Andrew
> > would have a better grasp of it so he could correct me, but what it
> > does is legitimate.
> 
> Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
> pages.  But it turns out that we don't feed it mapped pages, apart from
> pagevec_strip() and possibly races against pagefaults.

So how about this:

Index: linux-2.6-git/mm/page-writeback.c
===
--- linux-2.6-git.orig/mm/page-writeback.c  2006-12-19 08:24:48.0 +0100
+++ linux-2.6-git/mm/page-writeback.c   2006-12-19 11:43:31.0 +0100
@@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
 
+	if (page_mapped(page))
+		return 0;
+
 	if (!mapping)
 		return TestClearPageDirty(page);
 



Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Andrew Morton

On Tue, 19 Dec 2006 02:32:55 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> 
> 
> If a write-fault races with a read-fault and the write-fault loses, we forget
> to mark the page dirty.

No that isn't right, is it.  The writer just retakes the fault and
all the right things happen.  Ho hum.
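
A rough way to see why the losing writer is harmless is the toy model below. It is a userspace sketch, not kernel code: `toy_pte`, `toy_mapping` and the fault helpers are invented names, and it assumes that a read fault on a shared file mapping installs a read-only PTE, so the store that lost the race faults a second time.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pte {
	bool present;
	bool writable;
	bool dirty;
};

struct toy_mapping {
	struct toy_pte pte;
	bool page_dirty;		/* stand-in for PG_dirty */
};

enum toy_fault_result { TOY_INSTALLED, TOY_BACKED_OUT, TOY_UPGRADED };

/* Not-present fault: installs a PTE unless a sibling thread already did. */
static enum toy_fault_result
toy_nopage_fault(struct toy_mapping *m, bool write_access)
{
	if (m->pte.present)
		return TOY_BACKED_OUT;	/* lost the race, change nothing */
	m->pte.present = true;
	m->pte.writable = write_access;
	if (write_access) {
		m->pte.dirty = true;
		m->page_dirty = true;
	}
	return TOY_INSTALLED;
}

/* Write-protect fault on a present, read-only PTE: this is where dirtying happens. */
static enum toy_fault_result toy_wp_fault(struct toy_mapping *m)
{
	assert(m->pte.present && !m->pte.writable);
	m->pte.writable = true;
	m->pte.dirty = true;
	m->page_dirty = true;
	return TOY_UPGRADED;
}

/* A store keeps faulting until the PTE permits it; returns the faults taken. */
static int toy_store(struct toy_mapping *m)
{
	int faults = 0;

	while (!(m->pte.present && m->pte.writable)) {
		if (m->pte.present)
			toy_wp_fault(m);
		else
			toy_nopage_fault(m, true);
		faults++;
	}
	m->pte.dirty = true;
	return faults;
}
```

If the reader wins with toy_nopage_fault(&m, false), a racing toy_nopage_fault(&m, true) backs out and leaves the page clean, but the retried toy_store(&m) takes one write-protect fault and the page ends up dirty.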

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Nick Piggin


Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>>Linus Torvalds wrote:
>>
>>>NOTICE? First you make a BIG DEAL about how dirty bits should never get 
>>>lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
>>>the dirty bit for when it's not in the page tables.
>>
>>try_to_free_buffers is quite a special case, where we're transferring
>>the page dirty metadata from the buffers to the page. I think Andrew
>>would have a better grasp of it so he could correct me, but what it
>>does is legitimate.
>
> Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
> pages.


Yes, that is what I was trying to get at.


> But it turns out that we don't feed it mapped pages, apart from
> pagevec_strip() and possibly races against pagefaults.


True, and I think we have pretty well established that this isn't the
cause of Andrei's problem, but I think we all agree it is *a* bug?

And surely Andrei's data corruption will be of the same flavour in
that test_clear_page_dirty somewhere is now stripping pte dirty bits
where it shouldn't? (because it went away after Peter nooped that
behaviour)


>>I think it could be very likely that indeed the bug is a latent one in
>>a clear_page_dirty caller, rather than dirty-tracking itself.
>
> The only callers are try_to_free_buffers(), truncate and a few scruffy
> possibly-wrong-for-fsync filesystems which aren't being used here.
>
> If a write-fault races with a read-fault and the write-fault loses, we forget
> to mark the page dirty.


Hmm.. in that case will the pte still be readonly, and thus the write
faulter will have to try again I think?



> Something like this, but it's probably wrong - I didn't try very hard (am
> feeling ill, and vaguely grumpy)
>
> From: Andrew Morton <[EMAIL PROTECTED]>
>
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
> ---
>
>  mm/memory.c |   12 
>  1 file changed, 12 insertions(+)
>
> diff -puN mm/memory.c~a mm/memory.c
> --- a/mm/memory.c~a
> +++ a/mm/memory.c
> @@ -2264,10 +2264,22 @@ retry:
>  		}
>  	} else {
>  		/* One of our sibling threads was faster, back out. */
> +		if (write_access) {
> +			/*
> +			 * We might have raced against a read-fault.  We still
> +			 * need to dirty the page.
> +			 */
> +			dirty_page = vm_normal_page(vma, address, *page_table);
> +			if (dirty_page) {
> +				get_page(dirty_page);
> +				goto dirty_it;
> +			}
> +		}
>  		page_cache_release(new_page);
>  		goto unlock;
>  	}
>  
> +dirty_it:
>  	/* no need to invalidate: a not-present page shouldn't be cached */
>  	update_mmu_cache(vma, address, entry);
>  	lazy_mmu_prot_update(entry);
> _





--
SUSE Labs, Novell Inc.

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Andrew Morton

On Tue, 19 Dec 2006 20:56:50 +1100
Nick Piggin <[EMAIL PROTECTED]> wrote:

> Linus Torvalds wrote:
> 
> > NOTICE? First you make a BIG DEAL about how dirty bits should never get 
> > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop 
> > the dirty bit for when it's not in the page tables.
> 
> try_to_free_buffers is quite a special case, where we're transferring
> the page dirty metadata from the buffers to the page. I think Andrew
> would have a better grasp of it so he could correct me, but what it
> does is legitimate.

Well it used to be.  After 2.6.19 it can do the wrong thing for mapped
pages.  But it turns out that we don't feed it mapped pages, apart from
pagevec_strip() and possibly races against pagefaults.

> I think it could be very likely that indeed the bug is a latent one in
> a clear_page_dirty caller, rather than dirty-tracking itself.

The only callers are try_to_free_buffers(), truncate and a few scruffy
possibly-wrong-for-fsync filesystems which aren't being used here.




If a write-fault races with a read-fault and the write-fault loses, we forget
to mark the page dirty.

Something like this, but it's probably wrong - I didn't try very hard (am
feeling ill, and vaguely grumpy)


From: Andrew Morton <[EMAIL PROTECTED]>

Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 mm/memory.c |   12 
 1 file changed, 12 insertions(+)

diff -puN mm/memory.c~a mm/memory.c
--- a/mm/memory.c~a
+++ a/mm/memory.c
@@ -2264,10 +2264,22 @@ retry:
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
+		if (write_access) {
+			/*
+			 * We might have raced against a read-fault.  We still
+			 * need to dirty the page.
+			 */
+			dirty_page = vm_normal_page(vma, address, *page_table);
+			if (dirty_page) {
+				get_page(dirty_page);
+				goto dirty_it;
+			}
+		}
 		page_cache_release(new_page);
 		goto unlock;
 	}
 
+dirty_it:
 	/* no need to invalidate: a not-present page shouldn't be cached */
 	update_mmu_cache(vma, address, entry);
 	lazy_mmu_prot_update(entry);
_


Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Nick Piggin


Linus Torvalds wrote:
>
> On Tue, 19 Dec 2006, Nick Piggin wrote:
>
>>>Anyway it has the same issues as the others. See what happens when you
>>>run two test_clear_page_dirty_sync_ptes() consecutively, you still lose
>>>PG_dirty even though the page might actually be dirty.
>>
>>How can this happen? We'll only test_clear_page_dirty_sync_ptes again
>>after buffers have been reattached, and subsequently cleaned. And in
>>that case if the ptes are still clean at this point then the page really
>>is clean.
>
> Why do you talk about buffers being reattached? Are you still in some 
> world where "try_to_free_buffers()" matters? Have you not followed the 

I'm talking about fixing just the race Andrew noticed via inspection. No
it doesn't appear to fix Andrei's problem, unfortunately. But it needs
to be fixed all the same, doesn't it?

> discussion? Why do you ignore my MUCH SIMPLER patch that just removed all 
> this crap ENTIRELY from "try_to_free_buffers()", and the exact same 
> corruption happened?
>
> Forget about "try_to_free_buffers()". Please apply this patch to your tree 
> first. That gets rid of _one_ copy of totally insane code that did all the 
> wrong things.
>
> Only after you have applied this patch should you look at the code again. 
> Realizing that the corruption still happens.
>
> So forget about buffers already. That piece of code was crap.


Now I'm not exactly sure how ext3 (or any other) filesystems make use
of this particular feature of try_to_free_buffers(), but it is clear
from the comments what it is for. So your patch isn't really a minimal
fix (ie. it would require an OK from all filesystems, wouldn't it?)

Or did I miss a mail where you reasoned that it is safe to make this
change (/me goes to reread the thread)...



> Linus
>
> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
>  	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
> -	if (PageWriteback(page))
> +	if (PageDirty(page) || PageWriteback(page))
>  		return 0;
>  
>  	if (mapping == NULL) {		/* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
>  	spin_lock(&mapping->private_lock);
>  	ret = drop_buffers(page, &buffers_to_free);
>  	spin_unlock(&mapping->private_lock);
> -	if (ret) {
> -		/*
> -		 * If the filesystem writes its buffers by hand (eg ext3)
> -		 * then we can have clean buffers against a dirty page.  We
> -		 * clean the page here; otherwise later reattachment of buffers
> -		 * could encounter a non-uptodate page, which is unresolvable.
> -		 * This only applies in the rare case where try_to_free_buffers
> -		 * succeeds but the page is not freed.
> -		 *
> -		 * Also, during truncate, discard_buffer will have marked all
> -		 * the page's buffers clean.  We discover that here and clean
> -		 * the page also.
> -		 */
> -		if (test_clear_page_dirty(page))
> -			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> -	}
>  out:
>  	if (buffers_to_free) {
>  		struct buffer_head *bh = buffers_to_free;
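
The behavioural change in the quoted patch can be summarised with a small userspace model. This is only a sketch under simplifying assumptions: `toy_page` and `toy_try_to_free_buffers()` are invented names, and the per-buffer busy checks done by the real drop_buffers() are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few page flags the patch cares about. */
struct toy_page {
	bool locked;
	bool dirty;
	bool writeback;
	bool has_buffers;
};

/*
 * Returns true if the buffers were dropped.  With the patch applied the
 * function refuses dirty pages up front and never touches the dirty flag.
 */
static bool toy_try_to_free_buffers(struct toy_page *p)
{
	assert(p->locked);		/* mirrors BUG_ON(!PageLocked(page)) */
	if (p->dirty || p->writeback)
		return false;
	p->has_buffers = false;
	return true;
}
```

A dirty page keeps both its buffers and its dirty state, while a clean page simply loses its buffers, so there is no longer any path in this function that cleans a page.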




--
SUSE Labs, Novell Inc.

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Martin Michlmayr

* Marc Haber <[EMAIL PROTECTED]> [2006-12-19 09:51]:
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.

FWIW, the ARM box I see this on has only 32 MB memory (and a 133 or
266 MHz CPU).  I don't see it on another ARM box (different ARM
sub-arch) with 128 MB memory and a 600 MHz CPU.
-- 
Martin Michlmayr
http://www.cyrius.com/

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Marc Haber

On Tue, Dec 19, 2006 at 12:24:16AM -0800, Andrew Morton wrote:
> Wow.  I didn't expect that, because Marc Haber reported that ext3's
> data=writeback fixed it.   Maybe he didn't run it for long enough?

My test case is Debian's "aptitude update" running once an hour, and
it was always the same file getting corrupted. With 2.6.19, I had this
corruption like every third hour (but -only- if run from cron, running
from a shell was always fine), data=writeback made the issue disappear
for about two days before I booted into 2.6.19.1 without
data=writeback (defaults chosen then), after which the issue only
shows up like every other day.

So I feel rather out of the loop, since rtorrent seems much better at
reproducing this.

I notice, though, that both aptitude and rtorrent do downloads from
the net, so there might be a relation to tcp/ip and/or the network
driver. My box has a Linksys NC100 network card running with the tulip
driver.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany  |  lose things."Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature |  How to make an American Quilt | Fax: *49 621 72739835

Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Peter Zijlstra

On Tue, 2006-12-19 at 10:00 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:
> 
> > Nobody has actually ever explained why "test_clear_page_dirty()" is good 
> > at all.
> > 
> >  - Why is it ever used instead of "clear_page_dirty_for_io()"?
> > 
> >  - What is the difference?
> > 
> >  - Why would you EVER want to clear bits just in the "struct page *" or 
> >just in the PTE's?
> > 
> >  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
> > 
> > In other words, I have a theory:
> > 
> >  "A lot of this is actually historical cruft. Some of it may even be code 
> >   that was never supposed to work, but because we maintained _other_ dirty 
> >   bits in the PTE's, and never touched them before, we never even realized 
> >   that the code that played with PG_dirty was totally insane"
> > 
> > Now, that's just a theory. And yeah, it may be stated a bit provocatively. 
> > It may not be entirely correct. I'm just saying.. maybe it is?
> 
> On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
> 
> > try_to_free_buffers() clears the page's dirty state if it successfully
> > removed the page's buffers.
> > 
> >   Background for this:
> > 
> >   - a process does a one-byte-write to a file on a 64k pagesize, 4k
> > blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
> > has one dirty buffer and 15 not uptodate buffers.
> > 
> >   - kjournald writes the dirty buffer.  The page is now PageDirty,
> > !PageUptodate and has a mix of clean and not uptodate buffers.
> > 
> >   - try_to_free_buffers() removes the page's buffers.  It MUST now clear
> > PageDirty.  If we were to leave the page dirty then we'd have a dirty,
> > not uptodate page with no buffer_heads.
> > 
> > We're screwed: we cannot write the page because we don't know which
> > sections of it contain garbage.  We cannot read the page because we
> > don't know which sections of it contain modified data.  We cannot free
> > the page because it is dirty.
> 
> However!! this is not true for mapped pages because mapped pages must
> have the whole (16k in akpm's example) page loaded. Hence I suspect that
> what Andrei did by accident - remove the if (mapping) case in
> test_clear_page_dirty() - is actually totally correct.

Obviously I need my morning shot, 64k of course.


Re: 2.6.19 file content corruption on ext3

2006-12-19 Thread Peter Zijlstra

On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:

> Nobody has actually ever explained why "test_clear_page_dirty()" is good 
> at all.
> 
>  - Why is it ever used instead of "clear_page_dirty_for_io()"?
> 
>  - What is the difference?
> 
>  - Why would you EVER want to clear bits just in the "struct page *" or 
>just in the PTE's?
> 
>  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
> 
> In other words, I have a theory:
> 
>  "A lot of this is actually historical cruft. Some of it may even be code 
>   that was never supposed to work, but because we maintained _other_ dirty 
>   bits in the PTE's, and never touched them before, we never even realized 
>   that the code that played with PG_dirty was totally insane"
> 
> Now, that's just a theory. And yeah, it may be stated a bit provocatively. 
> It may not be entirely correct. I'm just saying.. maybe it is?

On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:

> try_to_free_buffers() clears the page's dirty state if it successfully removed
> the page's buffers.
> 
>   Background for this:
> 
>   - a process does a one-byte-write to a file on a 64k pagesize, 4k
> blocksize ext3 filesystem.  The page is now PageDirty, !PageUptodate and
> has one dirty buffer and 15 not uptodate buffers.
> 
>   - kjournald writes the dirty buffer.  The page is now PageDirty,
> !PageUptodate and has a mix of clean and not uptodate buffers.
> 
>   - try_to_free_buffers() removes the page's buffers.  It MUST now clear
> PageDirty.  If we were to leave the page dirty then we'd have a dirty, not
> uptodate page with no buffer_heads.
> 
> We're screwed: we cannot write the page because we don't know which
> sections of it contain garbage.  We cannot read the page because we don't
> know which sections of it contain modified data.  We cannot free the page
> because it is dirty.

However!! this is not true for mapped pages because mapped pages must
have the whole (16k in akpm's example) page loaded. Hence I suspect that
what Andrei did by accident - remove the if (mapping) case in
test_clear_page_dirty() - is actually totally correct.
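
The scenario quoted above can be walked through with a toy userspace model. It is a sketch only, not kernel code: `toy_page`, `toy_buffer` and the helpers are invented names, and the 64k page with 4k blocks is taken straight from the example, giving 16 buffers per page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE	(64 * 1024)
#define TOY_BLOCK_SIZE	(4 * 1024)
#define TOY_NR_BUFFERS	(TOY_PAGE_SIZE / TOY_BLOCK_SIZE)	/* 16 */

struct toy_buffer {
	bool uptodate;
	bool dirty;
};

struct toy_page {
	bool dirty;
	bool uptodate;
	bool has_buffers;
	struct toy_buffer bufs[TOY_NR_BUFFERS];
};

/* One-byte write: only the block holding that byte becomes uptodate and dirty. */
static void toy_write_byte(struct toy_page *p, size_t offset)
{
	struct toy_buffer *b;

	assert(offset < TOY_PAGE_SIZE);
	b = &p->bufs[offset / TOY_BLOCK_SIZE];
	p->has_buffers = true;
	b->uptodate = true;
	b->dirty = true;
	p->dirty = true;
}

/* kjournald-style writeout: buffers become clean, the page flag is untouched. */
static void toy_write_dirty_buffers(struct toy_page *p)
{
	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		p->bufs[i].dirty = false;
}

static size_t toy_count_not_uptodate(const struct toy_page *p)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		if (!p->bufs[i].uptodate)
			n++;
	return n;
}

/* Drop the buffers, optionally clearing the page dirty flag as well. */
static void toy_drop_buffers(struct toy_page *p, bool clear_page_dirty)
{
	p->has_buffers = false;
	if (clear_page_dirty)
		p->dirty = false;
}

/* The unresolvable state: dirty, not uptodate, and no buffers left to say why. */
static bool toy_page_is_stuck(const struct toy_page *p)
{
	return p->dirty && !p->uptodate && !p->has_buffers;
}
```

In this model a fully loaded page (uptodate set, as assumed for mapped pages) can keep its dirty flag after toy_drop_buffers() without ever reaching the stuck state, which is the distinction drawn in the paragraph above.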


