Re: [PATCH] mm: fix page_mkclean_one

2007-02-02 Thread Evgeniy Polyakov
On Fri, Feb 02, 2007 at 07:42:52PM +1100, Nick Piggin ([EMAIL PROTECTED]) wrote:
> Anyway, I had a look at your bugzilla test-case and managed to slim it
> down to something that easily shows what the problem is (available on
> request) -- the problem is that recipient of the sendfile is seeing
> modifications that occur to the source file _after_ the sender has
> completed the sendfile, because the file pages are not copied but
> queued.
> 
> I think the usual approach to what you are trying to do is to set TCP_CORK,
> then write(2) the header into the socket, then sendfile directly from the
> file you want.
> 
> Another approach I guess is to implement an ack in your userland protocol
> so you do not modify the sendfile source file until the client acks that
> it has all the data.

Mark, don't you use e1000 or other scatter-gather capable nic with
checksum offload? Likely yes.

Actual data sucking in that case happens when packet is supposed to be
transmitted by the NIC, not when sendfile() is returned. The same
applies to the case, when you have fancy egress filtering.

It is not allowed to modify pages until they are really transmitted, if
you want data integrity.

There are _no_ bugs in network or VFS cache in this test case.

> I'm not sure if there are any other usual ways to do this (ie. a barrier
> for sendfile, to ensure it will not pick up "future" modifications to the
> file). netdev cc'ed, someone there might have additional comments.
> 
> Please close this bug if/when you are satisfied it is not a kernel problem.
> 
> Thanks,
> Nick
> 
> -- 
> SUSE Labs, Novell Inc.
> Send instant messages to your online friends http://au.messenger.yahoo.com 

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2007-02-02 Thread Nick Piggin

Mark Groves wrote:

Hi,


I have been been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.

I have applied the patch Linus posted to a 2.6.19 kernel but am still
getting the problem. So I am wondering if there are any other places in
the kernel which mark pages as dirty which might require a similar
patch?


Your issue is not related, firstly because the page_mkclean bug did not
exist before 2.6.19 kernels.

Anyway, I had a look at your bugzilla test-case and managed to slim it
down to something that easily shows what the problem is (available on
request) -- the problem is that recipient of the sendfile is seeing
modifications that occur to the source file _after_ the sender has
completed the sendfile, because the file pages are not copied but
queued.

I think the usual approach to what you are trying to do is to set TCP_CORK,
then write(2) the header into the socket, then sendfile directly from the
file you want.

Another approach I guess is to implement an ack in your userland protocol
so you do not modify the sendfile source file until the client acks that
it has all the data.

I'm not sure if there are any other usual ways to do this (ie. a barrier
for sendfile, to ensure it will not pick up "future" modifications to the
file). netdev cc'ed, someone there might have additional comments.

Please close this bug if/when you are satisfied it is not a kernel problem.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2007-02-01 Thread Mark Groves
Hi,


I have been been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.

I have applied the patch Linus posted to a 2.6.19 kernel but am still
getting the problem. So I am wondering if there are any other places in
the kernel which mark pages as dirty which might require a similar
patch?


Regards,

Mark Groves
Researcher
University of Waterloo
Waterloo, Ontario, Canada

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2007-01-06 Thread Andrew Morton
On Sun, 7 Jan 2007 12:36:18 +1030
"Tom Lanyon" <[EMAIL PROTECTED]> wrote:

> On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > What would also actually be interesting is whether somebody can reproduce
> > this on Reiserfs, for example. I _think_ all the reports I've seen are on
> > ext2 or ext3, and if this is somehow writeback-related, it could be some
> > bug that is just shared between the two by virtue of them still having a
> > lot of stuff in common.
> >
> > Linus
> 
> I've been following this thread for a while now as I started
> experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
> am using reiserfs.

reiserfs defaults to data=ordered, so it's quite possibly the same bug.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2007-01-06 Thread Tom Lanyon

On 1/7/07, Tom Lanyon <[EMAIL PROTECTED]> wrote:

I've been following this thread for a while now as I started
experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
am using reiserfs.


However, moving to 2.6.20-rc3 does indeed seem to fix the issue thus far...

--
Tom Lanyon
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2007-01-06 Thread Tom Lanyon

On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:

What would also actually be interesting is whether somebody can reproduce
this on Reiserfs, for example. I _think_ all the reports I've seen are on
ext2 or ext3, and if this is somehow writeback-related, it could be some
bug that is just shared between the two by virtue of them still having a
lot of stuff in common.

Linus


I've been following this thread for a while now as I started
experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
am using reiserfs.

--
Tom Lanyon
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 16:58:41 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 29 Dec 2006, Andrew Morton wrote:
> >
> > > > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > > > that modifications to them be done in terms of buffer_heads (ie: 
> > > > blocks).
> > > 
> > > No. There is nothing "appropriate" about using buffer_heads for metadata. 
> > 
> > I said "modification".
> 
> You said "metadata".
> 
> Why do you think directories are any different from files? Yes, they are 
> metadata. So what? What does that have to do with anything?

We journal the contents of directories.  Fully.  So we handle their dirty
data at the block (ie: buffer_head) level.  When someone tries to dirty
part of a directory we need to cheat and not mark that part of the page as
dirty and we need to then write the block to the journal and then mark the
block as really dirty for checkpointing (but still attached to the journal)
and all that goop.

The regular page-based writeback doesn't apply until the block has been
written to the journal.  At that stage the block is considered dirty
against its real position on disk.  It will then be written back by pdflush
via the blockdev inode -> blkdev_writepage().  Unless kjournald needs to do
an early flush to reclaim the journal space, in which case kjournald will
write the block itself.

> 
> So I really don't understand why you make excuses for ext3 and talk about 
> "modifications" and "metadata". It was a fine design ten years ago. It's 
> not really very good any longer.
> 

As I said in another apparently-neglected email:

: We could possibly move ext3/4 directories out of the blockdev pagecache and
: into per-directory pagecache, but that wouldn't change anything - the
: journalling would still be block-based.

We already have all the code in place to journal blocks which are cached in
an address_space other than the blockdev inode's: ext3_journalled_aops.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
>
> > > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > > that modifications to them be done in terms of buffer_heads (ie: blocks).
> > 
> > No. There is nothing "appropriate" about using buffer_heads for metadata. 
> 
> I said "modification".

You said "metadata".

Why do you think directories are any different from files? Yes, they are 
metadata. So what? What does that have to do with anything?

They should still use virtual indexes, the way files do. That doesn't 
preclude them from using buffer-heads to mark their (partial-page) 
modifications and for keeping the virtual->physical translations cached.

I mean, really. Look at ext2. It does exactly that. It keeps the 
directories in the page cache - virtually indexed. And it even modifies 
them there. Exactly the same way it modifies regular file data.

It all works exactly the same way it works for regular files. It uses

page->mapping->a_ops->prepare_write(NULL, page, from, to);
... do modification ...
ext2_commit_chunk(page, from, to);

exactly the way regular file data works. 

That's why I'm saying there is absolutely _zero_ thing about "metadata" 
here, or even about "modifications". It all works better in a virtual 
cache, because you get all the support that we give to page caches.

So I really don't understand why you make excuses for ext3 and talk about 
"modifications" and "metadata". It was a fine design ten years ago. It's 
not really very good any longer.

I suspect we're stuck with the design, but that doesn't make it any 
_better_.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
> 
> Adam Richter spent considerable time a few years ago trying to make the
> mpage code go direct-to-BIO in all cases and we eventually gave up.  The
> conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
> and ugly to fully optimise away the "block" bit in the middle.

Using the buffer cache as a translation layer to the physical address is 
fine. That's what _any_ block device will do.

I'm not at all sayign that "buffer heads must go away". They work fine.

What I'm saying is that

 - if you index by buffer heads, you're screwed.
 - if you do IO by starting at buffer heads, you're screwed.

Both indexing and writeback decisions should be done at the page cache 
layer. Then, when you actually need to do IO, you look at the buffers. But 
you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own 
merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance.

So by all means keep the buffer heads as a way to keep the 
"virtual->physical" translation. It's what they were designed for. But 
they were _originally_ also designed for "lookup" and "driving the start 
of IO", and that is wrong, and has been wrong for a long time now, because

 - lookup based on physical address is fundamentally slow and inefficient. 
   You have to look up the virtual->physical translation somewhere else, 
   so it's by design an unnecessary indirection _and_ that "somewere 
   else" is also by definition filesystem-specific, so you can't do any 
   of these things at the VFS layer.

   Ergo: anything that needs to look up the physical address in order to 
   find the buffer head is BROKEN in this day and age. We look up the 
   _virtual_ page cache page, and then we can trivially find the buffer 
   heads within that page thanks to page->buffers.

   Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't. 

 - starting IO based on the physical entity is insane. It's insane exactly 
   _because_ the VM doesn't actually think in physical addresses, or in 
   buffer-sized blocks. The VM only really knows about whole pages, and 
   all the VM decisions fundamentally have to be page-based. We don't ever 
   "free a buffer". We free a whole page, and as such, doing writeback 
   based on buffers is pointless, because it doesn't actually say anything 
   about the "page state" which is what the VM tracks.

But neither of these means that "buffer_head" itself has to go away. They 
both really boil down to the same thing: you should never KEY things by 
the buffer head. All actions should be based on virtual indexes as far as 
at all humanly possible.

Once you do lookup and locking and writeback _starting_ from the page, 
it's then easy to look up the actual buffer head within the page, and use 
that as a way to do the actual _IO_ on the physical address. So the buffer 
heads still exist in ext2, for example, but they don't drive the show 
quite as much.

(They still do in some areas: the allocation bitmaps, the xattr code etc. 
But as long as none of those have big VM footprints, and as long as no 
_common_ operations really care deeply, and as long as those data 
structures never need to be touched by the VM or VFS layer, nobody will 
ever really care).

The directory case comes up just because "readdir()" actually is very 
common, and sometimes very slow. And it can have a big VM working set 
footprint ("find"), so trying to be page-based actually really helps, 
because it all drives things like writeback on the _right_ issues, and we 
can do things like LRU's and writeback decisions on the level that really 
matters.

I actually suspect that the inode tables could benefit from being in the 
page cache too (although I think that the inode buffer address is actually 
"physical", so there's no indirection for inode tables, which means that 
the virtual vs physical addressing doesn't matter). For directories, there 
definitely is a big cost to continually doing the virtual->physical 
translation all the time.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 16:11:44 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> > JBD implements physical block-based journalling, so it is 100% appropriate
> > that JBD deal with these disk blocks using their buffer_head
> > representation.
> 
> And as long as it does that, you just have to face the fact that it's 
> going to perform like crap, including what you call "extra" writes, and 
> what I call "deal with it".

It is quite tiresome to delete things which your interlocutor said and to
then restate them as if it were some sort of relevation.

> > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > that modifications to them be done in terms of buffer_heads (ie: blocks).
> 
> No. There is nothing "appropriate" about using buffer_heads for metadata. 

I said "modification".

> [stuff about directory reads elided]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
> 
> They're extra.  As in "can be optimised away".

Sure. Don't use buffer heads.

> The buffer_head is not an IO container.  It is the kernel's core
> representation of a disk block.

Please come back from the 90's.

The buffer heads are nothing but a mapping of where the hardware block is. 
If you use it for anything else, you're basically screwed.

> JBD implements physical block-based journalling, so it is 100% appropriate
> that JBD deal with these disk blocks using their buffer_head
> representation.

And as long as it does that, you just have to face the fact that it's 
going to perform like crap, including what you call "extra" writes, and 
what I call "deal with it".

Btw, you can make pages be physically indexed too, but they obviously
 (a) won't be coherent with any virtual mapping laid on top of it
 (b) will be _physical_, so any readahead etc will be based on physical 
 addresses too.

> I thought I fixed the performance problem?

No, you papered over it, for the reasonably common case where things were 
physically contiguous - exactly by using a physical page cache, so now it 
can do read-ahead based on that. Then, because the pages contain buffer 
heads, the directory accesses can look up buffers, and if it was all 
physically contiguous, it all works fine.

But if you actually want virtualluy indexed caching (and all _users_ want 
it), it really doesn't work.

> Somewhat nastily, but as ext3 directories are metadata it is appropriate
> that modifications to them be done in terms of buffer_heads (ie: blocks).

No. There is nothing "appropriate" about using buffer_heads for metadata. 

It's quite proper - and a hell of a lot more efficient - to use virtual 
page-caching for metadata too.

Look at the ext2 readdir() implementation, and compare it to the crapola 
horror that is ext3. Guess what? ext2 uses virtually indexed metadata, and 
as a result it is both simpler, smaller and a LOT faster than ext3 in 
accessing that metadata.

Face it, Andrew, you're wrong on this one. Really. Just take a look at 
ext2_readdir(). 

[ I'm not saying that ext2_readdir() is _beautiful_. If it had been 
  written with the page cache in mind, it would probably have been done 
  very differently. And it doesn't do any readahead, probably because 
  nobody cared enough, but it should be trivial to add, and it would 
  automatically "do the right thing" just because it's much easier at the 
  page cache level.

  But I _am_ saying that compared to ext3, the ext2 readdir is a work of 
  art. ]

"metadata" has _zero_ to do with "physically indexed". There is no 
correlation what-so-ever. If you think there is a correlation, it's all in 
your mind.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 18:32:07 -0500
Theodore Tso <[EMAIL PROTECTED]> wrote:

> On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> > I think ext3 is terminally crap by now. It still uses buffer heads in 
> > places where it really really shouldn't, and as a result, things like 
> > directory accesses are simply slower than they should be. Sadly, I don't 
> > think ext4 is going to fix any of this, either.
> 
> Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
> going to have to put this (a rework of jbd2 to use the page cache) on
> the ext4 todo list, and work with the ocfs2 folks to try to come up
> with something that suits their needs as well.  Fortunately we have
> this filesystem/storage summit thing coming up in the next few months,
> and we can try to get some discussion going on the linux-ext4 mailing
> list in the meantime.  Unfortunately, I don't think this is going to
> be trivial.

I suspect it would be insane to move any part of JBD (apart from the
ordered-data flush) to use pagecache.  The whole thing is fundamentally
block-based.  But only for metadata - there's no strong reason why ext3/4
needs to manipulate file data via buffer_heads if data=journal and chattr
+j aren't in use.

We could possibly move ext3/4 directories out of the blockdev pagecache and
into per-directory pagecache, but that wouldn't change anything - the
journalling would still be block-based.

Adam Richter spent considerable time a few years ago trying to make the
mpage code go direct-to-BIO in all cases and we eventually gave up.  The
conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
and ugly to fully optimise away the "block" bit in the middle.

buffer_heads become more important with large PAGE_CACHE_SIZE.  I'd expect
nobh mode to be quite inefficient with some workloads on 64k pages.  We
need that representation of the state (and location) of the block-sized
hunks which make up the page.

> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> the grief this is causing the page I/O and VM routines.  OTOH, reiser3
> probably has the same problems, and I suspect the changes to ext3 to
> cause it to avoid buffer heads, especially in order to support for
> filesystem blocksizes < pagesize, are going to be sufficiently risky
> in terms of introducing regressions to ext3 that they would probably
> be rejected on those grounds.  So unfortunately, we probably are going
> to have to support flushes via buffer heads for the foreseeable
> future.

We'll see.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Theodore Tso wrote:
>
> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> the grief this is causing the page I/O and VM routines.

I don't think backporting is the smartest option (unless it's done _way_ 
later), but the real problem with it isn't actually the VM behaviour, but 
simply the fact that cached performance absoluely _sucks_ with the buffer 
cache.

With the physically indexed buffer cache thing, you end up always having 
to do these complicated translations into block numbers for every single 
access, and at some point when I benchmarked it, it was a huge overhead 
for doing simple things like readdir.

It's also a major pain for read-ahead, exactly partly due to the high cost 
of translation - because you can't cheaply check whether the next block is 
there, the cost of even asking the question "should I try to read ahead?" 
is much much higher. As a result, read-ahead is seriously limited, because 
it's so expensive for the cached case (which is still hopefully the 
_common_ case).

So because read-ahead is limited, the non-cached case then _really_ sucks.

It was somewhat fixed in a really god-awful fashion by having 
ext3_readdir() actually do _readahead_ though the page cache, even though 
it does everything else through the buffer cache. And that just happens to 
work because we hopefully have physically contiguous blocks, but when that 
isn't true, the readahead doesn't do squat.

It's really quite fundamentally broken. But none of that causes any 
problems for the VM, since directories cannot be mmap'ed anyway. But it's 
really pitiful, and it really doesn't work very well. Of course, other 
filesystems _also_ suck at this, and other operating systems haev even 
MORE problems, so people don't always seem to realize how horribly 
horribly broken this all is.

I really wish somebody would write a filesystem that did large cold-cache 
directories well. Open some horrible file manager on /usr/bin with cold 
caches, and weep. The biggest problem is the inode indirection, but at 
some point when I looked at why it sucked, it was doing basically 
synchronous single-buffer reads on the directory too, because readahead 
didn't work properly.

I was hoping that something like SpadFS would actually take off, because 
it seemed to do a lot of good design choices (having inodes in-line in the 
directory for when there are no hardlinks is probably a requirement for a 
good filesystem these days. The separate inode table had its uses, but 
indirection in a filesystem really does suck, and stat information is too 
important to be indirect unless it absolutely has to).

But I suspect it needs more than somebody who just wants to get his thesis 
written ;)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 14:42:51 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 29 Dec 2006, Andrew Morton wrote:
> > 
> > - The above change means that we do extra writeout.  If a page is dirtied
> >   once, kjournald will write it and then pdflush will come along and
> >   needlessly write it again.
> 
> There's zero extra writeout for any flushing that flushes BY PAGES.
> 
> Only broken flushers that flush by buffer heads (which really really 
> really shouldn't be done any more: welcome to the 21st century) will cause 
> extra writeouts. And those extra writeouts are obviously required for all 
> the dirty state to actually hit the disk - which is the point of the 
> patch.
> 
> So they're not "extra" - they are "required for correct working".

They're extra.  As in "can be optimised away".

> But I can't stress the fact enough that people SHOULD NOT do writeback by 
> buffer heads. The buffer head has been purely an "IO entity" for the last 
> several years now, and it's not a cache entity.

The buffer_head is not an IO container.  It is the kernel's core
representation of a disk block.  Usually (but not always) it is backed by
some memory which is in pagecache.  We can feed buffer_heads into IO
containers via submit_bh(), but that's far from the only thing we use
buffer_heads for.  We should have done s/buffer_head/block/g years ago.

JBD implements physical block-based journalling, so it is 100% appropriate
that JBD deal with these disk blocks using their buffer_head
representation.

That being said, ordered-data mode isn't really part of the JBD journalling
system at all (the data doesn't get journalled!) - ordered-mode is an
add-on to the JBD journal to make the metadata which we're about to journal
point at more-likely-to-be-correct data.

JBD's ordered-mode writeback is just a sync and I see no conceptual
problems with killing its old buffer_head based sync and moving it into the
21st century.

> Anybody who does writeback 
> by buffer heads is basically bypassing the real cache (the page cache), 
> and that's why all the problems happen.
> 
> I think ext3 is terminally crap by now. It still uses buffer heads in 
> places where it really really shouldn't,

The ordered-data mode flush: sure.  The rest of JBD's use of buffer_heads
is quite appropriate.

> and as a result, things like 
> directory accesses are simply slower than they should be. Sadly, I don't 
> think ext4 is going to fix any of this, either.

I thought I fixed the performance problem?

Somewhat nastily, but as ext3 directories are metadata it is appropriate
that modifications to them be done in terms of buffer_heads (ie: blocks).

> It's all just too inherently wrongly designed around the buffer head 
> (which was correct in 1995, but hasn't been correct for a long time in the 
> kernel any more).
> 
> > - Poor old IO accounting broke again.
> 
> No. That's why I used "set_page_dirty()" and did it that strange ugly way 
> ("set page dirty, even though it's already dirty, and even though the very 
> next thing we will do is TestClearPageDirty???").

nfs_set_page_dirty() and reiserfs_set_page_dirty() should now bail if
PageDirty() to avoid needless work.

> > - For a long time I've wanted to nuke the current ext3/jbd ordered-data
> >   implementation altogether, and just make kjournald call into the
> >   standard writeback code to do a standard suberblock->inodes->pages walk.
> 
> I really would like to see less of the buffer-head-based stuff, and yes, 
> more of the normal inode page walking. I don't think you can "order" 
> accesses within a page anyway, exactly because of memory mapping issues, 
> so any page ordering is not about buffer heads on the page itself, it 
> should be purely about metadata.

In this context ext3's "ordered" mode means "sync the file contents before
journalling the metadata which points at it".

> > - It's pretty obnoxious that the VM now sets a clean page "dirty" and
> >   then proceeds to modify its contents.  It would be nice to stop doing
> >   that.
> 
> No. I think this really the fundamental confusion people had. People 
> thought that setting the page dirty meant that it was no longer being 
> modified.

No.  Setting a page (or bh, or inode) dirty means "this is known to have
been modified".  ie: this cached entity is now out of sync with backing
store.

Ho hum.  I don't care much, really.  But then, I understand how all this
stuff works.  Try explaining to someone the relationship between
pte-dirtiness, page-dirtiness, radix-tree-dirtiness and
buffer_head-dirtiness.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Theodore Tso
On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> I think ext3 is terminally crap by now. It still uses buffer heads in 
> places where it really really shouldn't, and as a result, things like 
> directory accesses are simply slower than they should be. Sadly, I don't 
> think ext4 is going to fix any of this, either.

Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
going to have to put this (a rework of jbd2 to use the page cache) on
the ext4 todo list, and work with the ocfs2 folks to try to come up
with something that suits their needs as well.  Fortunately we have
this filesystem/storage summit thing coming up in the next few months,
and we can try to get some discussion going on the linux-ext4 mailing
list in the meantime.  Unfortunately, I don't think this is going to
be trivial.

If we do get this fixed for ext4, one interesting question is whether
people would accept a patch to backport the fixes to ext3, given the
the grief this is causing the page I/O and VM routines.  OTOH, reiser3
probably has the same problems, and I suspect the changes to ext3 to
cause it to avoid buffer heads, especially in order to support for
filesystem blocksizes < pagesize, are going to be sufficiently risky
in terms of introducing regressions to ext3 that they would probably
be rejected on those grounds.  So unfortunately, we probably are going
to have to support flushes via buffer heads for the foreseeable
future.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
> 
> - The above change means that we do extra writeout.  If a page is dirtied
>   once, kjournald will write it and then pdflush will come along and
>   needlessly write it again.

There's zero extra writeout for any flushing that flushes BY PAGES.

Only broken flushers that flush by buffer heads (which really really 
really shouldn't be done any more: welcome to the 21st century) will cause 
extra writeouts. And those extra writeouts are obviously required for all 
the dirty state to actually hit the disk - which is the point of the 
patch.

So they're not "extra" - they are "required for correct working".

But I can't stress the fact enough that people SHOULD NOT do writeback by 
buffer heads. The buffer head has been purely an "IO entity" for the last 
several years now, and it's not a cache entity. Anybody who does writeback 
by buffer heads is basically bypassing the real cache (the page cache), 
and that's why all the problems happen.

I think ext3 is terminally crap by now. It still uses buffer heads in 
places where it really really shouldn't, and as a result, things like 
directory accesses are simply slower than they should be. Sadly, I don't 
think ext4 is going to fix any of this, either.

It's all just too inherently wrongly designed around the buffer head 
(which was correct in 1995, but hasn't been correct for a long time in the 
kernel any more).

> - Poor old IO accounting broke again.

No. That's why I used "set_page_dirty()" and did it that strange ugly way 
("set page dirty, even though it's already dirty, and even though the very 
next thing we will do is TestClearPageDirty???").

That code looks strange as a result, which is why it now has more comments 
on it than actual code ;)

> - People were saying that ext2 and ext3,data=writeback were also showing
>   corruption.  What's up with that?

I thought the "ext3,data=writeback" case was reported to be fine by 
several people?

I'm not sure about ext2. I didn't look at what it did based on buffer 
heads. I would have expected it to be ok.

That said, at least one report was later shown to be bogus (errors due to 
out of disk, not due to actual errors ;).

> - For a long time I've wanted to nuke the current ext3/jbd ordered-data
>   implementation altogether, and just make kjournald call into the
>   standard writeback code to do a standard suberblock->inodes->pages walk.

I really would like to see less of the buffer-head-based stuff, and yes, 
more of the normal inode page walking. I don't think you can "order" 
accesses within a page anyway, exactly because of memory mapping issues, 
so any page ordering is not about buffer heads on the page itself, it 
should be purely about metadata.

> - It's pretty obnoxious that the VM now sets a clean page "dirty" and
>   then proceeds to modify its contents.  It would be nice to stop doing
>   that.

No. I think this really the fundamental confusion people had. People 
thought that setting the page dirty meant that it was no longer being 
modified. It hasn't meant that in a LONG time - ever since the whole 
DIRTY_TAG thing, the most important part of the PG_dirty thing has really 
been that it's now efficiently findable by the writeout logic.

And that is very much what the whole page accounting _depends_ on. When we 
mmap a page, we need to mark it "findable" as dirty _before_ people 
actually start writing to it, because it's too late afterwards.

>   We could stop marking the page dirty in do_wp_page() and create a new
>   VM counter "NR_PTE_DIRTY", which means
> 
> "number of mapping_cap_account_dirty() pages which have a dirty pte
> pointing at them".

Well, then you need to change what PAGE_MAPPING_TAG_DIRTY means too.

That's very fundamental. That DIRTY _tag_ is now even more important than 
the PG_dirty bit itself, since that's what we actually use to _access_ 
those things.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 14:16:32 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> - Poor old IO accounting broke again.

No it didn't - we're relying upon the behaviour of __set_page_dirty_buffers()
against an already-dirty page.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 02:48:35 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> + if (mapping && mapping_cap_account_dirty(mapping)) {
> + /*
> +  * Yes, Virginia, this is indeed insane.
> +  *
> +  * We use this sequence to make sure that
> +  *  (a) we account for dirty stats properly
> +  *  (b) we tell the low-level filesystem to
> +  *  mark the whole page dirty if it was
> +  *  dirty in a pagetable. Only to then
> +  *  (c) clean the page again and return 1 to
> +  *  cause the writeback.
> +  *
> +  * This way we avoid all nasty races with the
> +  * dirty bit in multiple places and clearing
> +  * them concurrently from different threads.
> +  *
> +  * Note! Normally the "set_page_dirty(page)"
> +  * has no effect on the actual dirty bit - since
> +  * that will already usually be set. But we
> +  * need the side effects, and it can help us
> +  * avoid races.
> +  *
> +  * We basically use the page "master dirty bit"
> +  * as a serialization point for all the different
> +  * threds doing their things.
> +  *
> +  * FIXME! We still have a race here: if somebody
> +  * adds the page back to the page tables in
> +  * between the "page_mkclean()" and the "TestClearPageDirty()",
> +  * we might have it mapped without the dirty bit set.
> +  */
> + if (page_mkclean(page))
> + set_page_dirty(page);
> + if (TestClearPageDirty(page)) {
>   dec_zone_page_state(page, NR_FILE_DIRTY);
> + return 1;
>   }

- Presumably reiser3's ordered-data mode has the same problem.  And ext4,
  of course.  Dunno about other filesytems.

- The above change means that we do extra writeout.  If a page is dirtied
  once, kjournald will write it and then pdflush will come along and
  needlessly write it again.

  But otoh, if a mapping is being repeatedly dirtied, kjournald will
  write the page once per 30 seconds (dirty_expire_centisecs) and pdflush
  will write the page once per 30 seconds as well.  But we _should_ be
  writing it once per five seconds (kjournald commit interval).  So we're
  still ahead ;)

- Poor old IO accounting broke again.

- People were saying that ext2 and ext3,data=writeback were also showing
  corruption.  What's up with that?

- For a long time I've wanted to nuke the current ext3/jbd ordered-data
  implementation altogether, and just make kjournald call into the
  standard writeback code to do a standard suberblock->inodes->pages walk.

  I think it'd be fairly straightforward to do.  We'd need to teach the
  writeback code to be able to skip dirty pages which don't have a disk
  mapping, so that kjournald doesn't end up waiting for kjournald to free
  up journal space..

  Would need to avoid possible deadlocks where someone calls
  ext3_force_commit() or otherwise does a synchronous commit while holding
  VFS locks.

  reiser3 and ext4 could be converted too.

  Not a short-term project, but this would avoid the problem.

- It's pretty obnoxious that the VM now sets a clean page "dirty" and
  then proceeds to modify its contents.  It would be nice to stop doing
  that.

  We could stop marking the page dirty in do_wp_page() and create a new
  VM counter "NR_PTE_DIRTY", which means

"number of mapping_cap_account_dirty() pages which have a dirty pte
pointing at them".

  Or, perhaps

"number of dirty ptes which point at mapping_cap_account_dirty() pages".

  Which can be larger, but the writeout code will probably cope.

  Then we take NR_PTE_DIRTY into account in the dirty-page balancing act.
  So

  - do_wp_page() will still run balance_dirty_pages()

  - but it would no longer run set_page_dirty().

  - But it needs to run mark_inode_dirty() so the fs-writeback code
notices the file.

  - And mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) becomes insufficient.

  The tricky part here is "how do we do the writeback"?  The
  pte-dirty,!PageDirty pages aren't tagged as dirty in the radix-tree and
  writeback needs to find them so that it can effectively do an msync() on
  them.  Walking all the mm's and vma's would be insane.  Visiting all the
  pages in the file would also probably be insane.

  Perhaps this can be solved by adding a new radix-tree tag which means
  "this page might have dirty ptes pointing at it".  For each file
  writeback would do a radix-tree walk of these pages,
  cleaning-and-write-protecting ptes, marking the corresponding pages
  dirty and clearing their PAGECACHE_TAG_PTE_DIRTY tags.

  Then we can fix the mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)
  problem by doing

mapping_tagg

Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Theodore Tso wrote:
> 
> I'm confused.  Does this mean that if "fs blocksize"=="VM pagesize"
> this bug can't trigger?

No. Even if there is just a single buffer-head, if the filesystem ever 
writes out that _single_ buffer-head out of turn (ie before the VM 
actually asks it to, with "->writepage()"), then the same issue will 
happen.

In fact, a bigger fs blocksize will likely just make this easier to 
trigger (although I doubt it makes a big difference), since any 
out-of-order buffer flushback will happen for the whole page, rather than 
just a part of the page.

So the "problem" really ends up being that the filesystem does flushing 
that the VM isn't aware of, so when the VM did "set_page_dirty()" at an 
earlier time, the VM _expected_ the "->writepages()" call that happened 
much later to write the whole page - but because the FS had flushed things 
behind it backs even _before_ the "->writepage" happens, by the time the 
VM actually asks for the page to be written out, the FS layer won't 
actually write it all out any more.

Blocksize doesn't matter, the only thing that matters is whether something 
writes out data on a buffer-cache level, not on a "page cache" level. Ext3 
apparently does this in "ordered" data more at least (and hey, I suspect 
that the code that tries to release buffer head data might try to do it on 
its own too).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Nick Piggin wrote:
> 
> > It still has a tiny tiny race (see the comment), but I bet nobody can really
> > hit it in real life anyway, and I know several ways to fix it, so I'm not
> > really _that_ worried about it.
> 
> Well the race isn't a data loss one, is it? Just a case where the
> pte may be dirty but the page dirty state not accounted for.

Right. We should be picking it up eventually, since it's still in the page 
tables, but if we've lost sight of the page dirtyness we won't react 
correctly to msync() and/or fdatasync(). So we don't _lose_ the data, we 
just might not write it out in a timely manner if we ever hit the race.

> Can we fix it by just putting the page_mkclean back inside the
> TestClearPageDirty check, and re-clearing PG_dirty after redoing
> the set_page_dirty?

I considered it, but quite frankly, if we did it that way, I'd really like 
to just fix the whole insane "set_page_dirty()" instead.

I think set_page_dirty() should be split up. One thing that confused me 
mentally was that almost all of the dirty handling was actualyl done only 
if PG_dirty wasn't already set, so the _bulk_ of set_page_dirty() really 
ends up being

if (!TestSetPageDirty(page)) {
.. we just marked the page dirty, it was clean before,
   so we need to add it to the queues etc ..
}

and that's the part that I (and probably others) always really thought 
about.

But then we have the _one_ thing that runs outside of that "do only once 
per dirty bit" logic, and it's the buffer dirtying. If we had had two 
separate operations for this all: "set_dirty_every_time()" and the regular 
"set_dirty()", I don't think this would have been nearly as confusing.

(And then the difference between "__set_page_dirty_nobuffers()" and 
"__set_page_dirty_buffers()" really boils down to one doing the 
"everytime" _and_ the "once per dirty" checks and the other one doing just 
the "once per dirty bit" act - and we could rename the damn things to 
something saner too).

If we split it up that way, then the whole clear_page_dirty_for_io() logic 
would boil down to

if (TestClearPageDirty(page)) {
if (page_mkclean(page))
set_dirty_every_time();
return 1;
}
return 0;

and we wouldn't even need to do any of the "clear dirty again" kind of 
idiocy, because the "set_dirty_every_time()" stuff is the one that doesn't 
even care about the state of the PG_dirty bit - it's done regardless, and 
doesn't really touch it.

That's what I wanted to do, but with the current "set_page_dirty()" setup, 
I think my patch makes reasonable sense.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Martin Michlmayr
* Stephen Clark <[EMAIL PROTECTED]> [2006-12-29 10:17]:
> >It works for me now, both your testcase as well as an installation of
> >Debian on this ARM device.  I manually applied the patch to 2.6.19.
> 
> Can you post a diff against 2.6.19?

--- a/mm/page-writeback.c   2006-11-29 21:57:37.0 +
+++ b/mm/page-writeback.c   2006-12-29 11:02:55.555147896 +
@@ -893,16 +893,45 @@
 {
struct address_space *mapping = page_mapping(page);
 
-   if (mapping) {
+   if (mapping && mapping_cap_account_dirty(mapping)) {
+   /*
+* Yes, Virginia, this is indeed insane.
+*
+* We use this sequence to make sure that
+*  (a) we account for dirty stats properly
+*  (b) we tell the low-level filesystem to
+*  mark the whole page dirty if it was
+*  dirty in a pagetable. Only to then
+*  (c) clean the page again and return 1 to
+*  cause the writeback.
+*
+* This way we avoid all nasty races with the
+* dirty bit in multiple places and clearing
+* them concurrently from different threads.
+*
+* Note! Normally the "set_page_dirty(page)"
+* has no effect on the actual dirty bit - since
+* that will already usually be set. But we
+* need the side effects, and it can help us
+* avoid races.
+*
+* We basically use the page "master dirty bit"
+* as a serialization point for all the different
+* threds doing their things.
+*
+* FIXME! We still have a race here: if somebody
+* adds the page back to the page tables in
+* between the "page_mkclean()" and the "TestClearPageDirty()",
+* we might have it mapped without the dirty bit set.
+*/
+   if (page_mkclean(page))
+   set_page_dirty(page);
if (TestClearPageDirty(page)) {
-   if (mapping_cap_account_dirty(mapping)) {
-   page_mkclean(page);
-   dec_zone_page_state(page, NR_FILE_DIRTY);
-   }
+   dec_zone_page_state(page, NR_FILE_DIRTY);
return 1;
}
return 0;
-   }
+   }
return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);

-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Theodore Tso
On Fri, Dec 29, 2006 at 12:58:12AM -0800, Linus Torvalds wrote:
> Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE 
> "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as 
> dirty. That may _sound_ like what we want, but it really isn't. Because by 
> the time "writepage()" is actually called (which can be MUCH MUCH later), 
> some internal filesystem activity may actually have cleaned one or more of 
> those buffers in the meantime, and now we call "writepage()" (which really 
> wants to write them _all_), and it will write only part of them, or none 
> at all.

I'm confused.  Does this mean that if "fs blocksize"=="VM pagesize"
this bug can't trigger?  But I thought at least one of people
reporting corruption was using a filesystem with a 4k block size on an
i386?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Stephen Clark

Martin Michlmayr wrote:


* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-29 02:48]:
 


Can anybody get corruption with this thing applied? It goes on top
of plain v2.6.20-rc2.
   



It works for me now, both your testcase as well as an installation of
Debian on this ARM device.  I manually applied the patch to 2.6.19.

Thanks.
 


Hi Martin,

Can you post a diff against 2.6.19?

Thanks,
Steve

--

"They that give up essential liberty to obtain temporary safety, 
deserve neither liberty nor safety."  (Ben Franklin)


"The course of history shows that as a government grows, liberty 
decreases."  (Thomas Jefferson)




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Martin Michlmayr
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-29 02:48]:
> Can anybody get corruption with this thing applied? It goes on top
> of plain v2.6.20-rc2.

It works for me now, both your testcase as well as an installation of
Debian on this ARM device.  I manually applied the patch to 2.6.19.

Thanks.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Martin Johansson

Linus Torvalds wrote:
[...] 
The patch is mostly a comment. The "real" meat of it is actually just a 
few lines.


Can anybody get corruption with this thing applied? It goes on top of 
plain v2.6.20-rc2. 


No corruption with the testcase here. Will check with rtorrent too later 
today but I suppose it will work just fine.
Nice work! It has been interesting (and educating) to follow this 
bug-hunt :)


/Martin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > Hmm? I'd love it if somebody else wrote the patch and tested it, 
> > because I'm getting sick and tired of this bug ;)
> 
> Who the hell am I kidding? I haven't been able to sleep right for the 
> last few days over this bug. It was really getting to me.
> 
> And putting on the thinking cap, there's actually a fairly simple an 
> nonintrusive patch. [...]

ok, your patch seems to fix the testcase here too on -rc2-rt.

[ Damn, i should have slept a bit more, that would have saved me a ~4
  hour debug and tracing session today to analyze your testcase, just to
  find your patch and your explanation on lkml, right after i sent my
  analysis and workaround patch ;-) At least now we know it from two
  independent tracing results that the suspect code is the same. ]

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Nick Piggin

Hey nice work Linus!

Linus Torvalds wrote:


On Fri, 29 Dec 2006, Linus Torvalds wrote:

Hmm? I'd love it if somebody else wrote the patch and tested it, because 
I'm getting sick and tired of this bug ;)



Who the hell am I kidding? I haven't been able to sleep right for the last 
few days over this bug. It was really getting to me.


And putting on the thinking cap, there's actually a fairly simple an 
nonintrusive patch.


Yeah *this* makes more sense. And in retrospect it was simple, we
can't just throw out pte dirtiness information if the page doesn't
have all buffers dirtied.

It still has a tiny tiny race (see the comment), but I 
bet nobody can really hit it in real life anyway, and I know several ways 
to fix it, so I'm not really _that_ worried about it.


Well the race isn't a data loss one, is it? Just a case where the
pte may be dirty but the page dirty state not accounted for.

Can we fix it by just putting the page_mkclean back inside the
TestClearPageDirty check, and re-clearing PG_dirty after redoing
the set_page_dirty?




The patch is mostly a comment. The "real" meat of it is actually just a 
few lines.


Can anybody get corruption with this thing applied? It goes on top of 
plain v2.6.20-rc2.


Linus


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..ec01da1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
 {
struct address_space *mapping = page_mapping(page);
 
-	if (!mapping)

-   return TestClearPageDirty(page);
-
-   if (TestClearPageDirty(page)) {
-   if (mapping_cap_account_dirty(mapping)) {
-   page_mkclean(page);
+   if (mapping && mapping_cap_account_dirty(mapping)) {
+   /*
+* Yes, Virginia, this is indeed insane.
+*
+* We use this sequence to make sure that
+*  (a) we account for dirty stats properly
+*  (b) we tell the low-level filesystem to
+*  mark the whole page dirty if it was
+*  dirty in a pagetable. Only to then
+*  (c) clean the page again and return 1 to
+*  cause the writeback.
+*
+* This way we avoid all nasty races with the
+* dirty bit in multiple places and clearing
+* them concurrently from different threads.
+*
+* Note! Normally the "set_page_dirty(page)"
+* has no effect on the actual dirty bit - since
+* that will already usually be set. But we
+* need the side effects, and it can help us
+* avoid races.
+*
+* We basically use the page "master dirty bit"
+* as a serialization point for all the different
+* threds doing their things.
+*
+* FIXME! We still have a race here: if somebody
+* adds the page back to the page tables in
+* between the "page_mkclean()" and the "TestClearPageDirty()",
+* we might have it mapped without the dirty bit set.
+*/
+   if (page_mkclean(page))
+   set_page_dirty(page);
+   if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   return 1;
}
-   return 1;
+   return 0;
}
-   return 0;
+   return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 




--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrei Popa
On Fri, 2006-12-29 at 02:48 -0800, Linus Torvalds wrote:
> 
> On Fri, 29 Dec 2006, Linus Torvalds wrote:
> > 
> > Hmm? I'd love it if somebody else wrote the patch and tested it, because 
> > I'm getting sick and tired of this bug ;)
> 
> Who the hell am I kidding? I haven't been able to sleep right for the last 
> few days over this bug. It was really getting to me.
> 
> And putting on the thinking cap, there's actually a fairly simple an 
> nonintrusive patch. It still has a tiny tiny race (see the comment), but I 
> bet nobody can really hit it in real life anyway, and I know several ways 
> to fix it, so I'm not really _that_ worried about it.
> 
> The patch is mostly a comment. The "real" meat of it is actually just a 
> few lines.
> 
> Can anybody get corruption with this thing applied? It goes on top of 
> plain v2.6.20-rc2.

Tested with rtorrent and there is no corruption.


> 
>   Linus
> 
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b3a198c..ec01da1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
>  {
>   struct address_space *mapping = page_mapping(page);
>  
> - if (!mapping)
> - return TestClearPageDirty(page);
> -
> - if (TestClearPageDirty(page)) {
> - if (mapping_cap_account_dirty(mapping)) {
> - page_mkclean(page);
> + if (mapping && mapping_cap_account_dirty(mapping)) {
> + /*
> +  * Yes, Virginia, this is indeed insane.
> +  *
> +  * We use this sequence to make sure that
> +  *  (a) we account for dirty stats properly
> +  *  (b) we tell the low-level filesystem to
> +  *  mark the whole page dirty if it was
> +  *  dirty in a pagetable. Only to then
> +  *  (c) clean the page again and return 1 to
> +  *  cause the writeback.
> +  *
> +  * This way we avoid all nasty races with the
> +  * dirty bit in multiple places and clearing
> +  * them concurrently from different threads.
> +  *
> +  * Note! Normally the "set_page_dirty(page)"
> +  * has no effect on the actual dirty bit - since
> +  * that will already usually be set. But we
> +  * need the side effects, and it can help us
> +  * avoid races.
> +  *
> +  * We basically use the page "master dirty bit"
> +  * as a serialization point for all the different
> +  * threds doing their things.
> +  *
> +  * FIXME! We still have a race here: if somebody
> +  * adds the page back to the page tables in
> +  * between the "page_mkclean()" and the "TestClearPageDirty()",
> +  * we might have it mapped without the dirty bit set.
> +  */
> + if (page_mkclean(page))
> + set_page_dirty(page);
> + if (TestClearPageDirty(page)) {
>   dec_zone_page_state(page, NR_FILE_DIRTY);
> + return 1;
>   }
> - return 1;
> + return 0;
>   }
> - return 0;
> + return TestClearPageDirty(page);
>  }
>  EXPORT_SYMBOL(clear_page_dirty_for_io);
>  

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Linus Torvalds wrote:
> 
> Hmm? I'd love it if somebody else wrote the patch and tested it, because 
> I'm getting sick and tired of this bug ;)

Who the hell am I kidding? I haven't been able to sleep right for the last 
few days over this bug. It was really getting to me.

And putting on the thinking cap, there's actually a fairly simple an 
nonintrusive patch. It still has a tiny tiny race (see the comment), but I 
bet nobody can really hit it in real life anyway, and I know several ways 
to fix it, so I'm not really _that_ worried about it.

The patch is mostly a comment. The "real" meat of it is actually just a 
few lines.

Can anybody get corruption with this thing applied? It goes on top of 
plain v2.6.20-rc2.

Linus


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..ec01da1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
 {
struct address_space *mapping = page_mapping(page);
 
-   if (!mapping)
-   return TestClearPageDirty(page);
-
-   if (TestClearPageDirty(page)) {
-   if (mapping_cap_account_dirty(mapping)) {
-   page_mkclean(page);
+   if (mapping && mapping_cap_account_dirty(mapping)) {
+   /*
+* Yes, Virginia, this is indeed insane.
+*
+* We use this sequence to make sure that
+*  (a) we account for dirty stats properly
+*  (b) we tell the low-level filesystem to
+*  mark the whole page dirty if it was
+*  dirty in a pagetable. Only to then
+*  (c) clean the page again and return 1 to
+*  cause the writeback.
+*
+* This way we avoid all nasty races with the
+* dirty bit in multiple places and clearing
+* them concurrently from different threads.
+*
+* Note! Normally the "set_page_dirty(page)"
+* has no effect on the actual dirty bit - since
+* that will already usually be set. But we
+* need the side effects, and it can help us
+* avoid races.
+*
+* We basically use the page "master dirty bit"
+* as a serialization point for all the different
+* threds doing their things.
+*
+* FIXME! We still have a race here: if somebody
+* adds the page back to the page tables in
+* between the "page_mkclean()" and the "TestClearPageDirty()",
+* we might have it mapped without the dirty bit set.
+*/
+   if (page_mkclean(page))
+   set_page_dirty(page);
+   if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   return 1;
}
-   return 1;
+   return 0;
}
-   return 0;
+   return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Thu, 28 Dec 2006, Linus Torvalds wrote:
> 
> So everything I have ever seen says that the VM layer is actually doing 
> everything right.

That was true, but at the same time, it's not. Let me explain.

> That to me says: "somebody didn't actually write out out". The VM layer 
> asked the filesystem to do the write, but the filesystem just didn't do 
> it. I personally think it's because some buffer-head BH_dirty bit got 
> scrogged

Ok, I have proof of this now.

Here's a trace (with cycle counts), and with a new trace event added: this 
is for another corrupted page. I have:

 49105  PG 15d8 (14800): mm/page-writeback.c:872 clean_for_io
 49106  PG 15d8 (6900): mm/rmap.c:451 cleaning PTE b7fa6000
 49107  PG 15d8 (9900): mm/page-writeback.c:914 set writeback
 49108  PG 15d8 (6480): mm/page-writeback.c:916 setting TAG_WRITEBACK
 49109  PG 15d8 (7110): mm/page-writeback.c:922 clearing TAG_DIRTY
 49110  PG 15d8 (7190): fs/buffer.c:1713 no IO underway
 49111  PG 15d8 (6180): mm/page-writeback.c:891 end writeback
 49112  PG 15d8 (6460): mm/page-writeback.c:893 clearing TAG_WRITEBACK

where that first column is the trace event number again, and the "PG 
15d8" is that corrupted page. The thing in the parenthesis is "CPU 
cycles since last event), and the important part to note is that this is 
indeed all one single thing with no actual IO anywhere (~7000 CPU cycles 
may sound like a lot, but (a) it's not that many cache misses and (b) a 
lot of it is the logging overhead - back-to-back log events will take 
about 3500 cycles) just because it does the actual ASCII printk() etc.

Also, the new event is:

fs/buffer.c:1713 no IO underway

which is just the 

if (nr_underway == 0)

case in fs/buffer.c

And I now finally really believe that I fully understand the corruption, 
but I don't have a simple solution, much less a patch.

What the problem basically boils down to is that "set_page_dirty()" is 
supposed to be a mark for dirtying THE WHOLE PAGE, but it really is not 
"the whole page when the 'set_page_dirty()' itself happens", but more of a 
"the next writepage() needs to write back the whole page" thing.

And that's not what "__set_page_dirty_buffers()" really does.

Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE 
"set_page_dirty()" IS CALLED, it will mark all the buffers on that page as 
dirty. That may _sound_ like what we want, but it really isn't. Because by 
the time "writepage()" is actually called (which can be MUCH MUCH later), 
some internal filesystem activity may actually have cleaned one or more of 
those buffers in the meantime, and now we call "writepage()" (which really 
wants to write them _all_), and it will write only part of them, or none 
at all.

So the VM thought that since it did a "writepage()", all the dirty state 
at that point got written back. But it didn't - the filesystem could have 
written back part or all of the page much earlier, and the writepage() 
actually does nothing at all.

Both filesystem and VM actually _think_ they do the right thing, because 
they simply have totally different expectations. The filesystem thinks 
that it should care about dirty buffers (that got marked dirty _after_ 
they were dirtied), while the filesystem thinks that it cares about dirty 
_pages_ (that got dirted at any time _before_ "writepage()" was called).

Neither is really "wrong", per se, it's just that the two parts have 
different expectations, and the _combination_ just doesn't work. 
"set_page_dirty()" at some point meant "the writes have been done", but 
these days it really means something else.

Now, the reason there is no trivial patch is not that this is conceptually 
really hard to fix. I can see several different approaches to fixing it, 
but they all really boil down to two alternatives:

 (a) splitting the one "PG_dirty" bit up into two bits: the 
 "PG_writescheduled" bit and the "PG_alldirty" bit.

 The "PG_write_scheduled" bit would be the bit that the filesystem 
 would set when it has pending dirty data that it wrote itself (and 
 that may not cover the whole page), and is the part of PG_dirty that 
 sets the PAGECACHE_TAG_DIRTY. It's also what forces "writepage()" to 
 be called.

 The "PG_alldirty" bit is just an additional "somebody else dirtied 
 random parts of this page, and we don't know what" flag, which is set 
 by "set_page_dirty()" in addition to doing the PG_write_scheduled 
 stuff. We would test-and-clear it at "writepage()" time, and pass it 
 in to "writepages()" to tell the writepage() function that it can't 
 just write out its own small limited notion of what is dirty.

 (There are various variations on this whole theme: instead of having 
 a flag to "writepage()", we could split the "whole page" case out as 
 a separate callback or similar)

 (b) making sure that all "set_page_dirty()" calls are _after_ the page 
 

Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Fri, 29 Dec 2006, Segher Boessenkool wrote:
>
> > I think what might be happening is that pdflush writes them out fine,
> > however we don't trap writes by the application _during_ that writeout.
> 
> Yeah.  I believe that more exactly it happens if the very last
> write to the page causes a writeback (due to dirty balancing)
> while another writeback for the page is already happening.
> 
> As usual in these cases, I have zero proof.

I actually have proof to the contrary, ie I have traces that say "the 
write was started" after the last write.

And the VM layer in this area is actually fairly sane and civilized. It 
has a bit that says "writeback in progress", and if that bit is set, it 
simply _will_not_ start a new write. It even has various BUG_ON()'s to 
that effect.

So everything I have ever seen says that the VM layer is actually doing 
everything right.

> It's the do_wp_page -> balance_dirty_pages -> generic_writepages
> path for sure.  Maybe it's enough to change
> 
> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
> 
> if (PageWriteback(page) ||
> !clear_page_dirty_for_io(page)) {
> unlock_page(page);
> continue;
> }

Notive how this one basically says:

 - if it's under writeback, don't even clear the page dirty flag.

Your suggested change:

> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
> 
> if (PageWriteback(page)) {
>   redirty_page_for_writepage(wbc, page);

makes no sense, because we simply never _did_ the "clear_page_dirty()" if 
the thing was under writeback in the first place. That's how C 
conditionals work.  So there's no reason to "redirty" it, because it 
wasn't cleaned in the first place.

I've double- and triple-checked the dirty bits, including having traces 
that actually say that the IO was started (from a VM perspective) _after_ 
the last write was done. The IO just didn't hit the disk.

I'm personally fairly convinced that it's not a VM issue, but a "IO 
issue". Either in a low-level filesystem or in some of the fs/buffer.c 
helper routines.

But I'd love to be proven wrong. 

I do have a few interesting details from the trace I haven't really 
analyzed yet. Here's the trace for events on one of the pages that was 
corrupted. Note how the events are numbered (there were 171640 events 
total), so the thing you see is just a small set of events from the whole 
big trace, but it's the ones that talk about _that_ particular page.

I've grouped them so hat "consecutive" events group together. That just 
means that no events on any other pages happened in between those events, 
and it is usually a sign that it's really one single call-chain that 
causes all the events.

For example, for the first group of three events (44366-44368), it's the 
page fault that brings in the page, and since it's a write-fault, it will 
not only map the page, it will mark the page itself dirty and then also 
set the TAG_DIRTY on the mapping. So the "group" is just really a result 
of one single event happening, which causes several things to happen to 
that page. That's exactly what you'd expect.

Anyway, here is the list of events that page went through:

   44366  PG 0f6d: mm/memory.c:2254 mapping at b789fc54 (write)
   44367  PG 0f6d: mm/page-writeback.c:817 setting dirty
   44368  PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY

   64231  PG 0f6d: mm/page-writeback.c:872 clean_for_io
   64232  PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000
   64233  PG 0f6d: mm/page-writeback.c:914 set writeback
   64234  PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
   64235  PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY

   67570  PG 0f6d: mm/page-writeback.c:891 end writeback
   67571  PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

   76705  PG 0f6d: mm/page-writeback.c:817 setting dirty
   76706  PG 0f6d: fs/buffer.c:725 dirtied buffers
   76707  PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY

  105267  PG 0f6d: mm/page-writeback.c:872 clean_for_io
  105268  PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000
  105269  PG 0f6d: mm/page-writeback.c:914 set writeback
  105270  PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
  105271  PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY
  105272  PG 0f6d: mm/page-writeback.c:891 end writeback
  105273  PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

  128032  PG 0f6d: mm/memory.c:670 unmapped at b789f000

  132662  PG 0f6d: mm/filemap.c:119 Removing page cache

  148278  PG 0f6d: mm/memory.c:2254 mapping at b789f000 (read)

  166326  PG 0f6d: mm/memory.c:670 unmapped at b789f000

  171958  PG 

Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Segher Boessenkool

I think what might be happening is that pdflush writes them out fine,
however we don't trap writes by the application _during_ that writeout.


Yeah.  I believe that more exactly it happens if the very last
write to the page causes a writeback (due to dirty balancing)
while another writeback for the page is already happening.

As usual in these cases, I have zero proof.


It's something that will only occur with writeback and MAP_SHARED
writable access to the file pages.


It's the do_wp_page -> balance_dirty_pages -> generic_writepages
path for sure.  Maybe it's enough to change

if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);

if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}

to

if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);

if (PageWriteback(page)) {
redirty_page_for_writepage(wbc, page);
unlock_page(page);
continue;
}

if (!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}

or something along those lines.  Completely untested of course :-)


Segher

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread David Miller
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST)

> I get corruption - but the whole point is that it's very much pdflush that 
> should be writing these pages out.

I think what might be happening is that pdflush writes them out fine,
however we don't trap writes by the application _during_ that writeout.

These corruptions look exactly as if:

1) pdflush begins writeback of page X
2) page goes to disk
3) application writes a chunk to the page
4) pdflush et al. think the page is clean, so it gets tossed, losing
   the writes done in #3

So there's a missing PTE change in there, so that we never get proper
re-dirtying of the page if the application tries to write to the page
during the writeback.

It's something that will only occur with writeback and MAP_SHARED
writable access to the file pages.  That's why we never see this
with normal filesystem writes, since those explicitly manage the
page dirty state.

I think the dirty balancing logic etc. isn't where the problems are,
to me it's a PTE state update issue for sure.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Mike Galbraith
On Thu, 2006-12-28 at 11:45 -0800, Andrew Morton wrote:
> On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > 
> > 
> > On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> > > 
> > > The attached patch fixes the corruption for me.
> > 
> > Well, that's a good hint, but it's really just a symptom. You effectively 
> > just made the test-program not even try to flush the data to disk, so the 
> > page cache would stay in memory, and you'd not see the corruption as well.
> > 
> > So you basically disabled the code that tried to trigger the bug more 
> > easily.
> > 
> > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much 
> > implicated in mm/page-writeback.c, and if there is a bug triggered by 
> > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also 
> > fails..
> > 
> 
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED.  That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).

I did fdatasync(), tried remapping before unmapping... nogo here.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Andrew Morton wrote:
> 
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED.  That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).

I get corruption - but the whole point is that it's very much pdflush that 
should be writing these pages out.

Andrew - give my test-program a try. It can run in about 1 minute if you 
have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems 
to very consistently cause corruption.

What I do is:

# Make sure we write back aggressively
echo 5 > /proc/sys/vm/dirty_ratio

as root, and then just run the thing. Tons of corruption. But the 
corruption goes away if I just leave the default dirty ratio alone (but 
then I can increse the file size to trigger it, of course - but that 
also makes the test run a lot slower).

Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as 
easily (at least with the "fsync()"), but that's less to do with anything 
new, and probably just because then you simply won't have any pdflushing 
going on - since the kernel won't even notice that you have tons of dirty 
pages ;)

It might also depend on the speed of your disk drive - the machine I test 
this on has a slow 4200 rpm laptop drive in it, and that probably makes 
things go south more easily. That's _especially_ true if this is related 
to any "bdi_write_congested()" logic.

Now, it could also be related to various code snippets like

...
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);

if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}
...

where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do 
the writeback at all (but it also won't clear the dirty bit, so it's 
certainly not an *OBVIOUS* bug).

We also have code like this ("pageout()"):

if (clear_page_dirty_for_io(page)) {
int res;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
..
}
...
res = mapping->a_ops->writepage(page, &wbc);

and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call 
won't do anything at all because of congestion, then that would be a _bad_ 
thing, and would certainly explain how something didn't get written out.

But that particular path should only trigger for the "shrink_page_list()" 
case, and it's not the case I seem to be testing with my "low dirty_ratio" 
testing.

Linus#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned 
long)(mapping))

static int chunkorder[NRCHUNKS];
static char *mapping;

static int order(int nr)
{
int i;
if (nr < 0 || nr >= NRCHUNKS)
return -1;
for (i = 0; i < NRCHUNKS; i++)
if (chunkorder[i] == nr)
return i;
return -2;
}

static void checkmem(void *buf, int nr)
{
unsigned int start = ~0u, end = 0;
unsigned char c = nr, *p = buf, differs = 0;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
unsigned char got = *p++;
if (got != c) {
if (i < start)
start = i;
if (i > end)
end = i;
differs = got;
}
}
if (start < end) {
printf("Chunk %d corrupted (%u-%u)  (%x-%x)\n", nr, 
start, end,
page_offset(buf, start), page_offset(buf, end));
printf("Expected %u, got %u\n", c, differs);
printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), 
order(nr+1));
}
}

static char *remap(int fd, char *mapping)
{
if (mapping) {
munmap(mapping, SIZE);
//  fsync(fd);
posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
}
return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
int fd, i;

/*
 * Make some random ordering of writing the chunks to the
 * memory map..
 *
 * Start with fully ordered..
 */
for (i = 0; i < NRCHUNKS; i++)
chunkorder[i] = i;

/* ..and then mix it up randomly */
srandom(time(NULL));
for (i = 0; i < NRCHUNKS; i++) {
int index = (unsigned int) random() % NRCH

Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Andrew Morton
On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> > 
> > The attached patch fixes the corruption for me.
> 
> Well, that's a good hint, but it's really just a symptom. You effectively 
> just made the test-program not even try to flush the data to disk, so the 
> page cache would stay in memory, and you'd not see the corruption as well.
> 
> So you basically disabled the code that tried to trigger the bug more 
> easily.
> 
> But the reason I say it's interesting is that "WB_SYNC_NONE" is very much 
> implicated in mm/page-writeback.c, and if there is a bug triggered by 
> WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also 
> fails..
> 

It would be interesting to convert your app to do fsync() before
FADV_DONTNEED.  That would take WB_SYNC_NONE out of the picture as well
(apart from pdflush activity).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> 
> The attached patch fixes the corruption for me.

Well, that's a good hint, but it's really just a symptom. You effectively 
just made the test-program not even try to flush the data to disk, so the 
page cache would stay in memory, and you'd not see the corruption as well.

So you basically disabled the code that tried to trigger the bug more 
easily.

But the reason I say it's interesting is that "WB_SYNC_NONE" is very much 
implicated in mm/page-writeback.c, and if there is a bug triggered by 
WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also 
fails..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Guillaume Chazarain

Guillaume Chazarain a écrit :
I get this kind of corruption: 
http://guichaz.free.fr/linux-bug/corruption.png


Actually in qemu, I get three different behaviours:
- no corruption at all : with linux-2.4
- corruption only on the first chunks: before  [PATCH] mm: balance dirty 
pages as identified by Kenneth

- corruption of all chunks: after the balance dirty pages patch

Bisecting in linux-2.5 land I found 
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm3/broken-out/fadvise-flush-data.patch 
to cause the corruption for me.


The attached patch fixes the corruption for me.

--
Guillaume

diff -r 3859b1144d3a mm/fadvise.c
--- a/mm/fadvise.c	Sun Dec 24 05:00:03 2006 +
+++ b/mm/fadvise.c	Thu Dec 28 19:53:40 2006 +0100
@@ -96,9 +96,6 @@ asmlinkage long sys_fadvise64_64(int fd,
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!bdi_write_congested(mapping->backing_dev_info))
-			filemap_flush(mapping);
-
 		/* First and last FULL page! */
 		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
 		end_index = (endbyte >> PAGE_CACHE_SHIFT);


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Russell King wrote:
> 
> Yup, but I have nothing to do with glibc because I refuse to do that
> silly copyright assignment FSF thing.  Hopefully someone else can
> resolve it, but...

Yeah, me too.

> _is_ a fix whether _you_ like it or not to work around the issue so
> people can at least run your test program.  I'm not saying it's a
> proper fix though.

My point was that it wasn't a "fix", it's a "workaround". The _fix_ would 
be in glibc.

Nothing more..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Russell King
On Thu, Dec 28, 2006 at 09:27:12AM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Russell King wrote:
> > and if you look at glibc's memset() function, you'll notice that's exactly
> > what you expect if you pass a non-8bit value to it.  Ergo, what you're
> > seeing is utterly expected given glibc's memset() implementation on ARM.
> 
> Guys, you _really_ should fix memset(). What you describe is a _bug_. 

Yup, but I have nothing to do with glibc because I refuse to do that
silly copyright assignment FSF thing.  Hopefully someone else can
resolve it, but...

> > Fixing Linus' test program to pass nr & 255 to memset
> 
> No. I'm almost certain that that is not a "fix", it's a workaround for a 
> serious bug in your glibc crap.

_is_ a fix whether _you_ like it or not to work around the issue so
people can at least run your test program.  I'm not saying it's a
proper fix though.

Of course, if you prefer to be mislead by incorrect bug reports...

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Russell King wrote:
> 
> and if you look at glibc's memset() function, you'll notice that's exactly
> what you expect if you pass a non-8bit value to it.  Ergo, what you're
> seeing is utterly expected given glibc's memset() implementation on ARM.

Guys, you _really_ should fix memset(). What you describe is a _bug_. 

"memset()" takes an "int" as its argument (always has), and has to convert 
it to a byte _itself_. It may not be common, but it's perfectly normal, to 
pass it values outside 0-255 (negative values that still fit in a "signed 
char" in particular are very normal, but my usage of "let the thing 
truncate it itself" is also quite fine).

> Fixing Linus' test program to pass nr & 255 to memset

No. I'm almost certain that that is not a "fix", it's a workaround for a 
serious bug in your glibc crap.

But it does explain all the unexpected strange behaviour (and the really 
small writeback size - now it doesn't need any /proc/sys/vm/dirty_ratio 
assumptions to be explicable.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Zhang, Yanmin wrote:
> 
> The test program is a process to write/read data. pdflush might write data
> to disk asynchronously. After pdflush writes a page to disk, it will call 
> (either by
> softirq) clear_page_dirty to clear the dirty bit after getting the interrupt
> notification.

That would indeed be a horrible bug. However, we don't do 
"clear_page_dirty()" _after_ the IO has completed, we do it _before_ the 
IO starts.

If you can actually find a place that does clear_page_dirty _after_ IO, 
then yes, you've just found the bug. But I haven't found it.

So the rule is _always_:

 - call "clear_page_dirty_for_io()" with the page lock held, and _before_ 
   the IO starts.
 - do "set_page_writeback()" before unlocking the page again
 - do a "end_page_writeback()" when the IO actually finishes.

and any code sequence that doesn't honor those rules would be buggy. A 
beer for anybody that finds it..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Wed, 27 Dec 2006, Chen, Kenneth W wrote:
> > 
> > Running the test code, git bisect points its finger at this commit. 
> > Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> > 
> > [PATCH] mm: balance dirty pages
> > 
> > Now that we can detect writers of shared mappings, throttle them.  
> > Avoids OOM
> > by surprise.
> 
> Oh, never mind :-(  I just didn't create enough write out pressure when
> test this. I just saw bug got triggered on a kernel I previously thought
> was OK.

Btw, this is an important point - people have long felt that the new page 
balancing in 2.6.19 was to blame, but you've just confirmed the long-held 
suspicion (at least by me) that it's not actually a new bug at all, it's 
just that the dirty page balancing causes writeback to happen _earlier_, 
and thus is better able to _show_ a bug that we've likely had for a long 
long time.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Wed, 27 Dec 2006, Gordon Farquharson wrote:
> 
> 100kB and 200kB files always succeed on the ARM system. 400kB and
> larger always seem to fail.

Oh, wow. Yeah, I've just repressed how tiny 32MB is. And especially if you 
lowered the /proc/sys/vm/dirty_ratio to a smaller percentage, I guess 
400kB should be enough to cause writeback.

Ugh. I tested a 128MB machine a few weeks ago, and found it painful.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread [EMAIL PROTECTED]

On Thu Dec 28 15:09 , Guillaume Chazarain sent:

>I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
>I have corruption with every Fedora release kernel except the first, that is
>2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit 
>some corruption.

Confirm that I see corruption with Fedora kernel 2.6.18-1.2239.fc5:

...
Chunk 142969 corrupted (0-1459)  (2580-4039)
Expected 121, got 0
Written as (89632)127652(124721)
Chunk 142976 corrupted (0-1459)  (512-1971)
Expected 128, got 0
Written as (121734)128324(108589)
Checking chunk 143639/143640 (99%)
$ uname -a
Linux 2.6.18-1.2239.fc5 #1 Fri Nov 10 13:04:06 EST 2006 i686 athlon i386 
GNU/Linux

/Martin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-28 07:15]:
> Thanks for the fix, Russell.
> 
> I can now trigger the (real) problem by using a 25 MB file (100 << 18)
> and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).

Me too (using 100 << 18).  Interestingly, I don't seem to get any
corruption on a different ARM board, an IOP32x based machine with 128
MB RAM.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Guillaume Chazarain

I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
I have corruption with every Fedora release kernel except the first, that is
2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit 
some

corruption.

Command line to test:

qemu root_fs -snapshot -kernel FC-kernels/FC2-vmlinuz-2.6.5-1.358 -append 'rw 
root=/dev/hda'

I get this kind of corruption: 
http://guichaz.free.fr/linux-bug/corruption.png


--
Guillaume

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Russell King <[EMAIL PROTECTED]> [2006-12-28 10:49]:
> > By the way, I just tried it with TARGETSIZE (100 << 12) on a different
> > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
> > memory) and I see similar results to that from Gordon:
> 
> Work around the glibc memset() problem by passing nr & 255, and re-run
> the test.  You're getting false positives.

Yeah, I saw your message about this problem after I sent mine.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Gordon Farquharson

On 12/28/06, Russell King <[EMAIL PROTECTED]> wrote:


Fixing Linus' test program to pass nr & 255 to memset results in clean
passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
machines (as would be expected.)


Thanks for the fix, Russell.

I can now trigger the (real) problem by using a 25 MB file (100 << 18)
and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).

$ ./linus-test
Writing chunk 17954/17955 (99%)
Chunk 514 corrupted (0-1459)  (872-2331)
Expected 2, got 0
Written as (8479)11160(10312)
Chunk 516 corrupted (0-303)  (3792-4095)
Expected 4, got 0
Written as (10312)10569(4426)
Chunk 959 corrupted (0-691)  (3404-4095)
Expected 191, got 0
Written as (687)4881(1522)
Chunk 1895 corrupted (0-1459)  (1900-3359)
Expected 103, got 0
Written as (7746)8389(6231)
Chunk 2702 corrupted (0-1459)  (472-1931)
Expected 142, got 0
Written as (4866)7103(2409)
Chunk 3314 corrupted (0-1459)  (1064-2523)
Expected 242, got 0
Written as (4287)7064(1730)
Chunk 4043 corrupted (0-1459)  (444-1903)
Expected 203, got 0
Written as (6495)8509(4464)
Chunk 5180 corrupted (0-1459)  (1584-3043)
Expected 60, got 0
Written as (11056)12826(10797)
Chunk 5672 corrupted (0-991)  (3104-4095)
Expected 40, got 0
Written as (9944)4872(41)
Chunk 5793 corrupted (460-1459)  (0-999)
Expected 161, got 0
Written as (7059)5038(4377)
Chunk 6089 corrupted (0-1459)  (1620-3079)
Expected 201, got 0
Written as (4672)5230(4403)
Chunk 6545 corrupted (268-1459)  (0-1191)
Expected 145, got 0
Written as (3701)5969(4668)
Chunk 7578 corrupted (0-1459)  (584-2043)
Expected 154, got 0
Written as (10015)5082(1648)
Chunk 7880 corrupted (864-1459)  (0-595)
Expected 200, got 0
Written as (17869)5064(4745)
Chunk 8086 corrupted (0-1459)  (888-2347)
Expected 150, got 0
Written as (10206)11050(10374)
Chunk 8749 corrupted (0-1459)  (2212-3671)
Expected 45, got 0
Written as (15263)7132(4825)
Chunk 9068 corrupted (0-1459)  (1008-2467)
Expected 108, got 0
Written as (5557)7571(6771)
Chunk 9193 corrupted (812-1459)  (0-647)
Expected 233, got 0
Written as (9238)7277(4757)
Chunk 10032 corrupted (576-1459)  (0-883)
Expected 48, got 0
Written as (15741)10012(1753)
Chunk 10056 corrupted (0-1459)  (1696-3155)
Expected 72, got 0
Written as (5379)7431(262)
Chunk 10395 corrupted (0-1459)  (1020-2479)
Expected 155, got 0
Written as (21)7442(5902)
Chunk 10791 corrupted (0-1459)  (1644-3103)
Expected 39, got 0
Written as (4753)5925(5926)
Chunk 10792 corrupted (0-991)  (3104-4095)
Expected 40, got 0
Written as (5925)5926(8555)
Chunk 11036 corrupted (0-1103)  (2992-4095)
Expected 28, got 0
Written as (13755)14449(7458)
Chunk 11387 corrupted (644-1459)  (0-815)
Expected 123, got 0
Written as (10853)11459(9445)
Chunk 11586 corrupted (920-1459)  (0-539)
Expected 66, got 0
Written as (3769)11691(11123)
Chunk 11882 corrupted (0-1459)  (1160-2619)
Expected 106, got 0
Written as (10736)11696(2788)
Chunk 12397 corrupted (0-603)  (3492-4095)
Expected 109, got 0
Written as (2352)7515(2437)
Chunk 12669 corrupted (0-795)  (3300-4095)
Expected 125, got 0
Written as (1191)7661(5266)
Chunk 13162 corrupted (0-1459)  (2184-3643)
Expected 106, got 0
Written as (9383)13662(11544)
Chunk 14653 corrupted (0-27)  (4068-4095)
Expected 61, got 0
Written as (8100)9456(1275)
Chunk 17332 corrupted (0-367)  (3728-4095)
Expected 180, got 0
Written as (760)12247(1244)
Chunk 17445 corrupted (0-1459)  (772-2231)
Expected 37, got 0
Written as (8007)16481(14439)
Chunk 17556 corrupted (0-1007)  (3088-4095)
Expected 148, got 0
Written as (10113)10657(10477)
Chunk 17859 corrupted (0-995)  (3100-4095)
Expected 195, got 0
Written as (14472)14767(11426)
Checking chunk 17954/17955 (99%)

Gordon

--
Gordon Farquharson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Petri Kaukasoina
On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote:
> [ Modified test-program that tells you where the corruption happens (and 
>   when the missing parts were supposed to be written out) appended, in 
>   case people care. ]

Hi

2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots
between them but they hung before shell (2.6.18-git11, 2.6.18-git16,
2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken.

(UP, no preempt)

-Petri
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Russell King
On Thu, Dec 28, 2006 at 11:16:59AM +0100, Martin Michlmayr wrote:
> * Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]:
> > >> #define TARGETSIZE (100 << 12)
> > >
> > >That's just 400kB!
> > >
> > >There's no way you should see corruption with that kind of value. It
> > >should all stay solidly in the cache.
> > >
> > >Is this perhaps with ARM nommu or something else strange? It may be that
> > >the program just doesn't work at all if mmap() is faked out with a malloc
> > >or similar.
> > 
> > Definitely a question for the ARM gurus. I'm out of my depth.
> 
> By the way, I just tried it with TARGETSIZE (100 << 12) on a different
> ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
> memory) and I see similar results to that from Gordon:

Work around the glibc memset() problem by passing nr & 255, and re-run
the test.  You're getting false positives.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]:
> >> #define TARGETSIZE (100 << 12)
> >
> >That's just 400kB!
> >
> >There's no way you should see corruption with that kind of value. It
> >should all stay solidly in the cache.
> >
> >Is this perhaps with ARM nommu or something else strange? It may be that
> >the program just doesn't work at all if mmap() is faked out with a malloc
> >or similar.
> 
> Definitely a question for the ARM gurus. I'm out of my depth.

By the way, I just tried it with TARGETSIZE (100 << 12) on a different
ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
memory) and I see similar results to that from Gordon:

Writing chunk 279/280 (99%)
Chunk 256 corrupted (1-1455)  (1025-2479)
Expected 0, got 1
Written as (199)43(184)
Chunk 258 corrupted (1-1455)  (3945-1303)
Expected 2, got 3
Written as (184)74(145)
Chunk 260 corrupted (1-1455)  (2769-127)
Expected 4, got 5
Written as (145)89(237)
Chunk 262 corrupted (1-1455)  (1593-3047)
Expected 6, got 7
Written as (237)168(174)
Chunk 264 corrupted (1-1455)  (417-1871)
Expected 8, got 9
Written as (174)135(161)
Chunk 266 corrupted (1-1455)  (3337-695)
Expected 10, got 11
Written as (161)123(180)
Chunk 268 corrupted (1-1455)  (2161-3615)
Expected 12, got 13
Written as (180)13(19)
Chunk 270 corrupted (1-1455)  (985-2439)
Expected 14, got 15
Written as (19)172(106)
Chunk 272 corrupted (1-1455)  (3905-1263)
Expected 16, got 17
Written as (106)212(140)
Chunk 274 corrupted (1-1455)  (2729-87)
Expected 18, got 19
Written as (140)124(73)
Chunk 276 corrupted (1-1455)  (1553-3007)
Expected 20, got 21
Written as (73)151(52)
Chunk 278 corrupted (1-1455)  (377-1831)
Expected 22, got 23
Written as (52)215(99)
Checking chunk 279/280 (99%)

-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Russell King
On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote:
> I have run the program a few times, and the output is pretty
> consistent. However, when I increase the target size, the difference
> between the expected and actual values is larger.
> 
> Written as (749)935(738)
> Chunk 1113 corrupted (1-1455)  (2965-323)
> Expected 89, got 93

This is not the corruption Linus is after.  Note that the corruption starts
at offset '1'.  Also note that:

89 = 1113 & 255
93 = 1113 & 255 | (1113 >> 8)

and if you look at glibc's memset() function, you'll notice that's exactly
what you expect if you pass a non-8bit value to it.  Ergo, what you're
seeing is utterly expected given glibc's memset() implementation on ARM.

Fixing Linus' test program to pass nr & 255 to memset results in clean
passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
machines (as would be expected.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]:
> >That's just 400kB!
> >
> >There's no way you should see corruption with that kind of value. It
> >should all stay solidly in the cache.
> >
> >Is this perhaps with ARM nommu or something else strange? It may be that
> >the program just doesn't work at all if mmap() is faked out with a malloc
> >or similar.
> 
> Definitely a question for the ARM gurus. I'm out of my depth.

The CPU has a MMU.  For reference, it's a IXP4xx based device with 32 MB
of memory.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Zhang, Yanmin
On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote:
> 
> On Wed, 27 Dec 2006, David Miller wrote:
> > > 
> > > I still don't see _why_, though. But maybe smarter people than me can see 
> > > it..
> > 
> > FWIW this program definitely triggers the bug for me.
> 
> Ok, now that I have something simple to do repeatable stuff with, I can 
> say what the pattern is.. It's not all that surprising, but it's still 
> worth just stating for the record.
> 
> What happens is that when I do the "packetized writes" in random order, 
> the _last_ write to a page occasionally just goes missing. It's not always 
> at the end of a page, as shown by for example:
> 
>  - A whole chunk got dropped:
> 
>   Chunk 2094 corrupted (0-1459)  (1624-3083)
>   Expected 46, got 0
>   Written as (30912)55414(1)
> 
>That "Written as (x)y(z)" line means that the corrupted chunk was 
>written as chunk #y, and the preceding and following chunks (that were 
>_not_ corrupt) on the page was written as #x and #z respectively.
> 
>In other words, the missing chunk (which is still zero) was written 
>much later than the ones that were ok, and never hit the disk. It's a 
>contiguous chunk in the middle of the page (chunks are 1460 bytes in 
>size)
> 
>The first line means that all bytes of the chunk (0-1459) were 
>corrupted, and the values in parenthesis are the offsets within a page. 
>In other words, this was a chunk in the _middle_ of a page.
> 
>  - The missing data can also be at the beginning or ends of pages:
> 
>Beginning of the chunk missing, it was at the end of a page (page 
>offsets 3288-4095) and the _next_ page got written out fine:
> 
>   Chunk 2126 corrupted (0-807)  (3288-4095)
>   Expected 78, got 0
>   Written as (32713)55573(14301)
> 
>End of a chunk missing, it was the beginning of a page (and the 
>_previous_ page that contained the beginning of the chunk was written 
>out fine)
> 
>   Chunk 2179 corrupted (1252-1459)  (0-207)
>   Expected 131, got 0
>   Written as (45189)55489(15515)
> 
> Now, the reason I say this isn't surprising is that this is entirely 
> consistent with the dirty bit being dropped on the floor somewhere, and 
> likely through some interaction with the previous changes being in the 
> process of being written out.
> 
> Something (incorrectly) ends up deciding that it doesn't need to write the 
> page, since it's already written, or alternatively clears the dirty bit 
> too late (clears it because an _earlier_ write finished, never mind that 
> the new dirty data didn't make it).
There might be a narrow race between set_page_dirty and clear_page_dirty.

The test program is a process to write/read data. pdflush might write data
to disk asynchronously. After pdflush writes a page to disk, it will call 
(either by
softirq) clear_page_dirty to clear the dirty bit after getting the interrupt
notification. But just after the page is written to disk and before the 
interrupt
reports the result, the test process might change the data and unmap the area. 
When
the area is unmapped, the page is marked as dirty again, but just after that, 
the
interrupt arrives and the dirty bit is cleared, so the late data will not be 
written
to disk.

Function zap_pte_range checks pte to set page dirty if needed, but it doesn't
hold page lock. If the page lock is held before set page dirty, the race might
be avoided.

Yanmin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread David Miller
From: "Chen, Kenneth W" <[EMAIL PROTECTED]>
Date: Wed, 27 Dec 2006 22:10:52 -0800

> Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> > Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > > On Wed, 27 Dec 2006, David Miller wrote:
> > > > > 
> > > > > I still don't see _why_, though. But maybe smarter people than me can 
> > > > > see 
> > > > > it..
> > > > 
> > > > FWIW this program definitely triggers the bug for me.
> > > 
> > > Ok, now that I have something simple to do repeatable stuff with, I can 
> > > say what the pattern is.. It's not all that surprising, but it's still 
> > > worth just stating for the record.
> > 
> > 
> > Running the test code, git bisect points its finger at this commit. 
> > Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> > 
> > edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> > commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> > Author: Peter Zijlstra <[EMAIL PROTECTED]>
> > Date:   Mon Sep 25 23:30:58 2006 -0700
> > 
> > [PATCH] mm: balance dirty pages
> > 
> > Now that we can detect writers of shared mappings, throttle them.  
> > Avoids OOM
> > by surprise.
> 
> 
> Oh, never mind :-(  I just didn't create enough write out pressure when
> test this. I just saw bug got triggered on a kernel I previously thought
> was OK.

Besides, I'm pretty sure that from the Debian bug entry it's been
established that the dirty-page tracking changes from a few releases
ago introduced this problem.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Chen, Kenneth W
Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > On Wed, 27 Dec 2006, David Miller wrote:
> > > > 
> > > > I still don't see _why_, though. But maybe smarter people than me can 
> > > > see 
> > > > it..
> > > 
> > > FWIW this program definitely triggers the bug for me.
> > 
> > Ok, now that I have something simple to do repeatable stuff with, I can 
> > say what the pattern is.. It's not all that surprising, but it's still 
> > worth just stating for the record.
> 
> 
> Running the test code, git bisect points its finger at this commit. Reverting
> this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> 
> edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> Author: Peter Zijlstra <[EMAIL PROTECTED]>
> Date:   Mon Sep 25 23:30:58 2006 -0700
> 
> [PATCH] mm: balance dirty pages
> 
> Now that we can detect writers of shared mappings, throttle them.  Avoids 
> OOM
> by surprise.


Oh, never mind :-(  I just didn't create enough write out pressure when
test this. I just saw bug got triggered on a kernel I previously thought
was OK.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Gordon Farquharson

On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:


That's just 400kB!

There's no way you should see corruption with that kind of value. It
should all stay solidly in the cache.


100kB and 200kB files always succeed on the ARM system. 400kB and
larger always seem to fail.

Does the following help interpret the results on ARM at all ?

$ free
total   used   free sharedbuffers cached
Mem: 3  23620   6380  0808  15676
-/+ buffers/cache:   7136  22864
Swap:88316   3664  84652

Gordon

--
Gordon Farquharson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Chen, Kenneth W
Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> On Wed, 27 Dec 2006, David Miller wrote:
> > > 
> > > I still don't see _why_, though. But maybe smarter people than me can see 
> > > it..
> > 
> > FWIW this program definitely triggers the bug for me.
> 
> Ok, now that I have something simple to do repeatable stuff with, I can 
> say what the pattern is.. It's not all that surprising, but it's still 
> worth just stating for the record.


Running the test code, git bisect points its finger at this commit. Reverting
this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.


edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
Author: Peter Zijlstra <[EMAIL PROTECTED]>
Date:   Mon Sep 25 23:30:58 2006 -0700

[PATCH] mm: balance dirty pages

Now that we can detect writers of shared mappings, throttle them.  Avoids 
OOM
by surprise.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Cc: Hugh Dickins <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Gordon Farquharson

Hi David

On 12/27/06, David Miller <[EMAIL PROTECTED]> wrote:


Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this.


That works for me. Thanks for the tip.

Gordon

--
Gordon Farquharson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread David Miller
From: "Gordon Farquharson" <[EMAIL PROTECTED]>
Date: Wed, 27 Dec 2006 22:20:20 -0700

> and for some reason I get
> 
> linus-test.c: In function 'remap':
> linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in
> this function)
> 
> when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4
> as defined in /usr/include/bits/fcntl.h.

Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this.

Perhaps Linus's GCC sets that by default and our's doesn't.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Gordon Farquharson

On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:


On Wed, 27 Dec 2006, Gordon Farquharson wrote:
>
> I don't think so. I did reduce the target size
>
> #define TARGETSIZE (100 << 12)

That's just 400kB!

There's no way you should see corruption with that kind of value. It
should all stay solidly in the cache.

Is this perhaps with ARM nommu or something else strange? It may be that
the program just doesn't work at all if mmap() is faked out with a malloc
or similar.


Definitely a question for the ARM gurus. I'm out of my depth.

Gordon

--
Gordon Farquharson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Gordon Farquharson

[Oops - forgot to hit "Reply to All" first time round.]

Hi Linus

On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:


For all I know, my test-program is buggy wrt the ordering printouts,
though. Did you perhaps change the logic in any way?


I don't think so. I did reduce the target size

#define TARGETSIZE (100 << 12)

to make the program finish a little quicker, and for some reason I get

linus-test.c: In function 'remap':
linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in
this function)

when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4
as defined in /usr/include/bits/fcntl.h.

Other than these two changes, the program is identical to the version
you posted.

I have run the program a few times, and the output is pretty
consistent. However, when I increase the target size, the difference
between the expected and actual values is larger.

Written as (749)935(738)
Chunk 1113 corrupted (1-1455)  (2965-323)
Expected 89, got 93
Written as (935)738(538)
Chunk 1114 corrupted (1-1455)  (329-1783)
Expected 90, got 94
Written as (738)538(678)
Chunk 1115 corrupted (1-1455)  (1789-3243)
Expected 91, got 95
Written as (538)678(989)
Chunk 1120 corrupted (1-1455)  (897-2351)
Expected 96, got 100
Written as (537)265(1005)
Chunk 1121 corrupted (1-1455)  (2357-3811)
Expected 97, got 101
Written as (265)1005(-1)

--- linus-test.c.orig   2006-12-28 06:17:24.0 +0100
+++ linus-test.c2006-12-28 06:18:24.0 +0100
@@ -6,7 +6,7 @@
#include 
#include 

-#define TARGETSIZE (100 << 20)
+#define TARGETSIZE (100 << 14)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)
@@ -61,7 +61,7 @@
{
   if (mapping) {
   munmap(mapping, SIZE);
-posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
+posix_fadvise(fd, 0, SIZE, 4);
   }
   return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

Gordon

--
Gordon Farquharson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Linus Torvalds


On Wed, 27 Dec 2006, Gordon Farquharson wrote:
> 
> It is at all suprising that the second offset within a page can be
> less than the first offset within a page ? e.g.
> 
> Chunk 260 corrupted (1-1455)  (2769-127)

No, that just means that it went over to the next page (so you actually 
had two consecutive pages that weren't written out).

That said, your output is very different from mine in another way. You 
don't have zeroes in your pages, rather the thing seems to have data from 
the next block (ie the chunk that should have 20 is reported as having 21 
etc). You also have your offsets shifted up by one (ie offset 0 looks ok 
for you, and then you have a strange pattern of corruption at bytes 
1...1455 instead of 0..1459.

You also seem to have an example of the _earlier_ writes being corrupted, 
rather than the later ones. For example (but it's also a page-crosser, so 
maybe that's part of it):

Chunk 274 corrupted (1-1455)  (2729-87)
Expected 18, got 19
Written as (154)11(85)

says that block chunk 274 is the corrupt one, but it was written fairly 
early as #11, and the blocks around it (chunks 273 and 275) were actually 
written later.

For all I know, my test-program is buggy wrt the ordering printouts, 
though. Did you perhaps change the logic in any way?

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Gordon Farquharson

On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:


[ Modified test-program that tells you where the corruption happens (and
  when the missing parts were supposed to be written out) appended, in
  case people care. ]


For the record, this is the output from a run on our ARM machine (32
MB RAM) with 2.6.18 + the following patches:

  mm: tracking shared dirty pages
  mm: balance dirty pages
  mm: optimize the new mprotect() code a bit
  mm: small cleanup of install_page()
  mm: fixup do_wp_page()
  mm: msync() cleanup

It is at all suprising that the second offset within a page can be
less than the first offset within a page ? e.g.

Chunk 260 corrupted (1-1455)  (2769-127)

$ ./linus-test
Writing chunk 279/280 (99%)
Chunk 256 corrupted (1-1455)  (1025-2479)
Expected 0, got 1
Written as (82)175(56)
Chunk 258 corrupted (1-1455)  (3945-1303)
Expected 2, got 3
Written as (56)51(20)
Chunk 260 corrupted (1-1455)  (2769-127)
Expected 4, got 5
Written as (20)30(18)
Chunk 262 corrupted (1-1455)  (1593-3047)
Expected 6, got 7
Written as (18)196(158)
Chunk 264 corrupted (1-1455)  (417-1871)
Expected 8, got 9
Written as (158)133(146)
Chunk 266 corrupted (1-1455)  (3337-695)
Expected 10, got 11
Written as (146)43(77)
Chunk 268 corrupted (1-1455)  (2161-3615)
Expected 12, got 13
Written as (77)251(211)
Chunk 270 corrupted (1-1455)  (985-2439)
Expected 14, got 15
Written as (211)257(231)
Chunk 272 corrupted (1-1455)  (3905-1263)
Expected 16, got 17
Written as (231)254(154)
Chunk 274 corrupted (1-1455)  (2729-87)
Expected 18, got 19
Written as (154)11(85)
Chunk 276 corrupted (1-1455)  (1553-3007)
Expected 20, got 21
Written as (85)230(134)
Chunk 278 corrupted (1-1455)  (377-1831)
Expected 22, got 23
Written as (134)233(103)
Checking chunk 279/280 (99%)

Gordon

--
Gordon Farquharson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Linus Torvalds


On Wed, 27 Dec 2006, David Miller wrote:
> > 
> > I still don't see _why_, though. But maybe smarter people than me can see 
> > it..
> 
> FWIW this program definitely triggers the bug for me.

Ok, now that I have something simple to do repeatable stuff with, I can 
say what the pattern is.. It's not all that surprising, but it's still 
worth just stating for the record.

What happens is that when I do the "packetized writes" in random order, 
the _last_ write to a page occasionally just goes missing. It's not always 
at the end of a page, as shown by for example:

 - A whole chunk got dropped:

Chunk 2094 corrupted (0-1459)  (1624-3083)
Expected 46, got 0
Written as (30912)55414(1)

   That "Written as (x)y(z)" line means that the corrupted chunk was 
   written as chunk #y, and the preceding and following chunks (that were 
   _not_ corrupt) on the page was written as #x and #z respectively.

   In other words, the missing chunk (which is still zero) was written 
   much later than the ones that were ok, and never hit the disk. It's a 
   contiguous chunk in the middle of the page (chunks are 1460 bytes in 
   size)

   The first line means that all bytes of the chunk (0-1459) were 
   corrupted, and the values in parenthesis are the offsets within a page. 
   In other words, this was a chunk in the _middle_ of a page.

 - The missing data can also be at the beginning or ends of pages:

   Beginning of the chunk missing, it was at the end of a page (page 
   offsets 3288-4095) and the _next_ page got written out fine:

Chunk 2126 corrupted (0-807)  (3288-4095)
Expected 78, got 0
Written as (32713)55573(14301)

   End of a chunk missing, it was the beginning of a page (and the 
   _previous_ page that contained the beginning of the chunk was written 
   out fine)

Chunk 2179 corrupted (1252-1459)  (0-207)
Expected 131, got 0
Written as (45189)55489(15515)

Now, the reason I say this isn't surprising is that this is entirely 
consistent with the dirty bit being dropped on the floor somewhere, and 
likely through some interaction with the previous changes being in the 
process of being written out.

Something (incorrectly) ends up deciding that it doesn't need to write the 
page, since it's already written, or alternatively clears the dirty bit 
too late (clears it because an _earlier_ write finished, never mind that 
the new dirty data didn't make it).

I also figured out that it's not the low-memory situation that does it, it 
really must be the "page_mkclean()" triggering. Becuase I can do

echo 5 > /proc/sys/vm/dirty_ratio
echo 3 > /proc/sys/vm/dirty_background_ratio

to make it clean the pages much more aggressively than the default, and I 
can see corruption on my 256MB machine with just a 40MB shared file, and 
70MB of memory consistently free.

So this thing is definitely giving some answers. It's NOT about low 
memory, and it very much seems to be about the whole "balance_dirty_ratio" 
thing. I don't think I triggered the actual low-memory stuff once in that 
situation..

So I have some more data on the behaviour, but I _still_ don't see the 
reason behind it. It's probably something really obvious once it's pointed 
out..

[ Modified test-program that tells you where the corruption happens (and 
  when the missing parts were supposed to be written out) appended, in 
  case people care. ]

Linus

---
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (0xfff & ((unsigned)(unsigned long)(buf)+(off)))

static int chunkorder[NRCHUNKS];

static int order(int nr)
{
int i;
if (nr < 0 || nr >= NRCHUNKS)
return -1;
for (i = 0; i < NRCHUNKS; i++)
if (chunkorder[i] == nr)
return i;
return -2;
}

static void checkmem(void *buf, int nr)
{
unsigned int start = ~0u, end = 0;
unsigned char c = nr, *p = buf, differs = 0;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
unsigned char got = *p++;
if (got != c) {
if (i < start)
start = i;
if (i > end)
end = i;
differs = got;
}
}
if (start < end) {
printf("Chunk %d corrupted (%u-%u)  (%u-%u)\n", nr, 
start, end,
page_offset(buf, start), page_offset(buf, end));
printf("Expected %u, got %u\n", c, differs);
printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), 
order(nr+1));
}
}

static char 

Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread David Miller
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Wed, 27 Dec 2006 16:39:43 -0800 (PST)

> 
> 
> On Wed, 27 Dec 2006, Linus Torvalds wrote:
> > 
> > I think the test-case could probably be improved by having a munmap() and 
> > page-cache flush in between the writing and the checking, to see whether 
> > that shows the corruption easier (and possibly without having to start 
> > paging in order to throw the pages out, which would simplify testing a 
> > lot).
> 
> I think the page-writeout is implicated, because I do seem to need it, but 
> the page-cache flush does seem to make corruption _easier_ to see. I now 
> seem about to trigger it with a 100MB file on a 256MB machine in a minute 
> or so, with this slight modification.
> 
> I still don't see _why_, though. But maybe smarter people than me can see 
> it..

FWIW this program definitely triggers the bug for me.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread David Miller
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Wed, 27 Dec 2006 16:42:40 -0800 (PST)

> That's fine. In that situation, you shouldn't need any atomic ops at all, 
> I think all our sw page-table operations are already done under the pte 
> lock. 

This is true, but there is one subtlety to this I want to
point out in passing.

That lock can possibly only protect a page of PTEs.

When NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS, the locking is done per page
of PTEs, not for all of the page tables of an address space at once.

What this means is that it's really difficult to forcefully block out
all page table operations for a given mm, and I actually needed to do
something like this on sparc64 (when growing the TLB lookup hash
table, you can't let any PTEs change state while the table is
changing).  For my case, I added a spinlock to mm->context since
actually what I need is to block modifications to the hash table
itself during PTE changes.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-27 Thread Linus Torvalds


On Thu, 28 Dec 2006, Martin Schwidefsky wrote:
> 
> For s390 there are two aspects to consider:
> 1) the pte values are 100% software controlled.

That's fine. In that situation, you shouldn't need any atomic ops at all, 
I think all our sw page-table operations are already done under the pte 
lock. 

The reason x86 needs to be careful is exactly the fact that the hardware 
will obviously do a lot on its own, and the hardware is _not_ going to 
honor our page table locking ;)

In an all-sw situation, a lot of this should be easier. S390 has _other_ 
things that are inconvenient (the strange "dirty bit is not in the page 
tables" thing that makes it look different from everybody else), but hey, 
it's a balance..

So for s390, ptep_exchange() in my example should be able to be a simple 
"load old value and store new one", assuming everybody honors the pte lock 
(and they _should_).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Linus Torvalds


On Wed, 27 Dec 2006, Linus Torvalds wrote:
> 
> I think the test-case could probably be improved by having a munmap() and 
> page-cache flush in between the writing and the checking, to see whether 
> that shows the corruption easier (and possibly without having to start 
> paging in order to throw the pages out, which would simplify testing a 
> lot).

I think the page-writeout is implicated, because I do seem to need it, but 
the page-cache flush does seem to make corruption _easier_ to see. I now 
seem about to trigger it with a 100MB file on a 256MB machine in a minute 
or so, with this slight modification.

I still don't see _why_, though. But maybe smarter people than me can see 
it..

Linus

---
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
unsigned char c = nr, *p = start;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
if (*p++ != c) {
printf("Chunk %d corrupted   \n", nr);
return;
}
}
}

static char *remap(int fd, char *mapping)
{
if (mapping) {
munmap(mapping, SIZE);
posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
}
return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
char *mapping;
int fd, i;
static int chunkorder[NRCHUNKS];

/*
 * Make some random ordering of writing the chunks to the
 * memory map..
 *
 * Start with fully ordered..
 */
for (i = 0; i < NRCHUNKS; i++)
chunkorder[i] = i;

/* ..and then mix it up randomly */
srandom(time(NULL));
for (i = 0; i < NRCHUNKS; i++) {
int index = (unsigned int) random() % NRCHUNKS;
int nr = chunkorder[index];
chunkorder[index] = chunkorder[i];
chunkorder[i] = nr;
}

fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd < 0)
return -1;
if (ftruncate(fd, SIZE) < 0)
return -1;
mapping = remap(fd, NULL);
if (-1 == (int)(long)mapping)
return -1;

for (i = 0; i < NRCHUNKS; i++) {
int chunk = chunkorder[i];
printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 
100*i/NRCHUNKS);
fillmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");

/* Unmap, drop, and remap.. */
mapping = remap(fd, mapping);

/* .. and check */
for (i = 0; i < NRCHUNKS; i++) {
int chunk = i;
printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 
100*i/NRCHUNKS);
checkmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");

return 0;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Linus Torvalds

On Tue, 26 Dec 2006, David Miller wrote:
> 
> I've seen it on sparc64, UP kernel, no preempt.

Ok, I still don't have a clue, but I think I at least have a new 
test-case.

It can probably be improved upon, but this would _seem_ to trigger the 
problem. Can people check?

You'd want to make sure you get page-put activity, by making TARGETSIZE be 
big enough to cause memory pressure (and rather than making it bigger, you 
might want to make your memory smaller instead, to make it run more 
quickly. Either using "mem=128M" or big compiles or something...).

If it finds corruption, you'll see something like

Writing chunk 183858/183859 (99%)
Chunk ..
Chunk 120887 corrupted
Chunk 122372 corrupted
Chunk ...
Checking chunk 183858/183859 (99%)

otherwise it will just say

Writing chunk 183858/183859 (99%)
Checking chunk 183858/183859 (99%)

and exit.

I didn't spend a lot of time verifying this, but I _was_ able to cause 
those "Chunk xxx corrupted" messages with this. There's probably a more 
efficient better way to do it, but this is better than trying to use 
rtorrent, and also makes any worries about what rtorrent does go away.

Of course, maybe it's this test-program that is buggy now, although it 
looks trivial enough that I don't think it is.

I think my earlier stress-tester may not have triggered this, because it 
just did all its writing in a linear order, so any LRU logic will happen 
to write back old pages that we are no longer touching. The randomization 
(and using a chunksize that isn't a multiple of a page-size) makes sure 
that we're actually going to have lots of rewriting going on.

I think the test-case could probably be improved by having a munmap() and 
page-cache flush in between the writing and the checking, to see whether 
that shows the corruption easier (and possibly without having to start 
paging in order to throw the pages out, which would simplify testing a 
lot). But I haven't tested. I decided to post this asap, now that I've 
recreated the corruption with something else, and something that is 
possibly easier to analyze..

Linus


#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TARGETSIZE (256 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
unsigned char c = nr, *p = start;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
if (*p++ != c) {
printf("Chunk %d corrupted   \n", nr);
return;
}
}
}

int main(int argc, char **argv)
{
char *mapping;
int fd, i;
static int chunkorder[NRCHUNKS];

/*
 * Make some random ordering of writing the chunks to the
 * memory map..
 *
 * Start with fully ordered..
 */
for (i = 0; i < NRCHUNKS; i++)
chunkorder[i] = i;

/* ..and then mix it up randomly */
srandom(time(NULL));
for (i = 0; i < NRCHUNKS; i++) {
int index = (unsigned int) random() % NRCHUNKS;
int nr = chunkorder[index];
chunkorder[index] = chunkorder[i];
chunkorder[i] = nr;
}

fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd < 0)
return -1;
if (ftruncate(fd, SIZE) < 0)
return -1;
mapping = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (-1 == (int)(long)mapping)
return -1;

for (i = 0; i < NRCHUNKS; i++) {
int chunk = chunkorder[i];
printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 
100*i/NRCHUNKS);
fillmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");
for (i = 0; i < NRCHUNKS; i++) {
int chunk = i;
printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 
100*i/NRCHUNKS);
checkmem(mapping + chunk * CHUNKSIZE, chunk);
}
printf("\n");

return 0;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-27 Thread Martin Schwidefsky
On Thu, 2006-12-21 at 12:01 -0800, Linus Torvalds wrote:
> What do you guys think? Does something like this work out for S/390 too? I
> tried to make that "ptep_flush_dirty()" concept work for architectures
> that hide the dirty bit somewhere else too, but..

For s390 there are two aspects to consider:
1) the pte values are 100% software controlled. They only change because
a cpu stored a value to it or issued one of the specialized instructions
(csp, ipte and idte). The ptep_flush_dirty would be a nop for s390.
2) ptep_exchange is a bit dangerous. For s390 we need a lock that
protects the software controlled updates of the ptes. The reason is the
ipte instruction. It is implemented by the machine microcode in a
non-atomic way in regard to the memory. It reads the byte of the pte
that contains the invalid bit, flushes the tlb entries for it and then
writes back the byte with the invalid bit set. The microcode makes sure
that this pte cannot be used for form a new tlb on any cpu while the
ipte is in progress.
That means a compare-and-swap semantics on ptes won't work together with
the ipte optimization. As long as there is the pte lock that protects
all software accesses to the pte we are fine. But if any code expects
that ptep_exchange does something like an xchg things break.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-27 Thread Jari Sundell

On 12/27/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

I do get this error on reiserfs ( old one, didn't try on reiser4 ).
Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the
debian bts.


I've had reports of corrupted data on earlier kernel releases with
reiserfs3, which were fixed by upgrading to reiserfs4.

Jari Sundell
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-27 Thread valdyn
On Tue, Dec 26, 2006 at 11:26:50AM -0800, Linus Torvalds wrote:
> What would also actually be interesting is whether somebody can reproduce 
> this on Reiserfs, for example. I _think_ all the reports I've seen are on 
> ext2 or ext3, and if this is somehow writeback-related, it could be some 
> bug that is just shared between the two by virtue of them still having a 
> lot of stuff in common. 
> 
>   Linus
I do get this error on reiserfs ( old one, didn't try on reiser4 ). 
Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the
debian bts.

flo attenberger

---
Linux master 2.6.19 #1 PREEMPT Thu Dec 21 10:55:34 CET 2006 x86_64
GNU/Linux

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.19
# Thu Dec 21 10:45:05 2006
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y

#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=m
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
# CONFIG_SMP is not set
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x20
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_REORDER=y
CONFIG_K8_NB=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y

#
# Power management options
#
CONFIG_PM=y
CONFIG_PM_LEGACY=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SYSFS_DEPRECATED=y
# CONFIG_SOFTWARE_SUSPEND is not set

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_AC=m
# CONFIG_ACPI_BATTERY is not set
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_HOTKEY=m
CONFIG_ACPI_FAN=m
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_THERMAL=m
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_P

Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-27 Thread Jari Sundell

On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:


 - It never uses mprotect on the shared mappings, but it _does_ do:
"mincore()" - but the return values don't much matter (it's used
  as a heuristic on which parts to hash, apparently)

  I double- and triple-checked this one, because I
  did make changes to "mincore()", but those didn't go
  into the affected kernels anyway (ie they are not in
  plain 2.6.19, nor in 2.6.18.3 either)


Correct, mincore is only used to check if it should delay the hash checking.


"madvise(MADV_WILLNEED)"
"msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag)
"munmap()" of course

 - it never seems to mix mmap() and write() - it does _only_ mmap.

 - it seems to mmap/munmap the shared files in nice 64-page chunks, all
   64-page aligned in the file (ie it does NOT create one big mapping, it
   has some kind of LRU of thse 64-page chunks). The only exception being
   the last chunk, which it maps byte-accurate to the size.


The length of the chunks is only page aligned on single file torrents,
not so on multi-file torrents. I've attached a patch for rtorrent that
will extend the length to the page boundary.


 - I haven't checked whether it only ever has the same chunk mapped once
   at a time.


This should be the case, but two mapped chunks may share a page,
sometimes with different r/w permissions.

Jari Sundell


extend_mapping.diff
Description: Binary data


Re: [PATCH] mm: fix page_mkclean_one

2006-12-27 Thread Andrei Popa
I have corrupted files...

> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 263f88e..4652ef1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode 
> *inode, struct page *page,
>   do {
>   if (!buffer_mapped(bh))
>   continue;
> - /*
> -  * If it's a fully non-blocking write attempt and we cannot
> -  * lock the buffer then redirty the page.  Note that this can
> -  * potentially cause a busy-wait loop from pdflush and kswapd
> -  * activity, but those code paths have their own higher-level
> -  * throttling.
> -  */
> - if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
> - lock_buffer(bh);
> - } else if (test_set_buffer_locked(bh)) {
> - redirty_page_for_writepage(wbc, page);
> - continue;
> - }
> + lock_buffer(bh);
>   if (test_clear_buffer_dirty(bh)) {
>   mark_buffer_async_write(bh);
>   } else {

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-26 Thread Linus Torvalds


On Tue, 26 Dec 2006, David Miller wrote:
> 
> I've seen it on sparc64, UP kernel, no preempt.

Btw, having tried to debug the writeback code, there's one very special 
case that just makes me go "hmm".

If we have a buffer that is "busy" when we try to write back a page, we 
have this magic "wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking" mode, 
where we won't wait for it, but instead we'll redirty the page and redo 
the whole thing.

Looking at the code, that should all work, but at the same time, it 
triggers some of my debug messages about having a dirty page during 
writeback, and one way to trigger that debug message is to try to run 
rtorrent on the machine.. 

I dunno. Witht he writeback being suspicious, and the normal 
"block_write_full_page()" path being implicated in at least ext2, I just 
wonder. This is one of those "let's see if behaviour changes" patches, 
that I'm just throwing out there..

Linus

---
diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..4652ef1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, 
struct page *page,
do {
if (!buffer_mapped(bh))
continue;
-   /*
-* If it's a fully non-blocking write attempt and we cannot
-* lock the buffer then redirty the page.  Note that this can
-* potentially cause a busy-wait loop from pdflush and kswapd
-* activity, but those code paths have their own higher-level
-* throttling.
-*/
-   if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
-   lock_buffer(bh);
-   } else if (test_set_buffer_locked(bh)) {
-   redirty_page_for_writepage(wbc, page);
-   continue;
-   }
+   lock_buffer(bh);
if (test_clear_buffer_dirty(bh)) {
mark_buffer_async_write(bh);
} else {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-26 Thread David Miller
From: Tobias Diedrich <[EMAIL PROTECTED]>
Date: Tue, 26 Dec 2006 17:17:00 +0100

> Linus Torvalds wrote:
> > I don't think it's a page table issue any more, it just doesn't look 
> > likely with the ARM UP corruption. It's also not apparently even on a 
> > cacheline boundary, so it probably is really a dirty bit that got cleared 
> > wrogn due to some race with IO.
> 
> So, until now it's only been reported for SMP on i386?
> I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
> here, UP kernel, no preempt.

I've seen it on sparc64, UP kernel, no preempt.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-26 Thread Linus Torvalds


On Tue, 26 Dec 2006, Nick Piggin wrote:

> Linus Torvalds wrote:
> > 
> > Ok, so how about this diff.
> > 
> > I'm actually feeling good about this one. It really looks like
> > "do_no_page()" was simply buggy, and that this explains everything.
> 
> Still trying to catch up here, so I'm not going to reply to any old
> stuff and just start at the tip of the thread... Other than to say
> that I really like cancel_page_dirty ;)

Yeah, I think that part is a bit clearer about what's going on now.

> I think your patch is quite right so that's a good catch.

Actually, since people told me it didn't matter, I went back and looked at 
_why_ - the thing is, "vma->vm_page_prot" should always be read-only 
anyway, except for mappings that don't do dirty accounting at all, so I 
think my patch only found cases that are unimportant (ie pages that get 
faulted on on filesystems like ramfs that doesn't do any dirty page 
accounting because they're all dirty anyway).

> But I'm not too surprised that it does not help the problem, because I 
> don't think we have started shedding any old pte_dirty tests at 
> unmap/reclaim-time, have we? So the dirty bit isn't going to get lost, 
> as such.

True. We should no longer _need_ those dirty bit reclaims at 
unmap/reclaim, but we still do them, so you're right, even if we were 
buggy in this area, it should only really matter for the dirty page 
counting, not for any lost data.

> I was hoping that you've almost narrowed it down to the filesystem
> writeback code, with the last few mails?

I think so, yes.

However, I've checked, and "rtorrent" really does seem to be fairly 
well-behaved wrt any filesystem activity. It does

 - no threading. It's 100% single-threaded, and doesn't even appear to use 
   signals.

 - exactly _one_ "ftruncate()", and it does it at the beginning, for the 
   full final size.

   IOW, it's not anything subtle with truncate and dirty page cancel.

 - It never uses mprotect on the shared mappings, but it _does_ do:
"mincore()" - but the return values don't much matter (it's used 
  as a heuristic on which parts to hash, apparently)

  I double- and triple-checked this one, because I
  did make changes to "mincore()", but those didn't go 
  into the affected kernels anyway (ie they are not in 
  plain 2.6.19, nor in 2.6.18.3 either)

"madvise(MADV_WILLNEED)"
"msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag)
"munmap()" of course

 - it never seems to mix mmap() and write() - it does _only_ mmap.

 - it seems to mmap/munmap the shared files in nice 64-page chunks, all 
   64-page aligned in the file (ie it does NOT create one big mapping, it 
   has some kind of LRU of thse 64-page chunks). The only exception being 
   the last chunk, which it maps byte-accurate to the size.

 - I haven't checked whether it only ever has the same chunk mapped once 
   at a time.

Anyway, the _one_ half-way interesting thing is the fact that it doesn't 
allocate any backing store at all for the file, and as such the page 
writeback needs to create all the underlying buffers on the filesystem. I 
really don't see why that would be a problem either, but I could imagine 
that if we have some writeback bug where we can end up writing back the 
_same_ page concurrently, we'd actually end up racing in the kernel, and 
allocating two different backing stores, and then maybe the other one 
would effectively "get lost" (and the earlier writeback would win the 
race, explaining why we'd end up with zeroes at the end of a block).

Or something.

However, all the codepaths _seem_ to test for PG_writeback, and not even 
try to start another writeback while the first one is still active.

What would also actually be interesting is whether somebody can reproduce 
this on Reiserfs, for example. I _think_ all the reports I've seen are on 
ext2 or ext3, and if this is somehow writeback-related, it could be some 
bug that is just shared between the two by virtue of them still having a 
lot of stuff in common. 

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-26 Thread Al Viro
On Tue, Dec 26, 2006 at 05:51:55PM +, Al Viro wrote:
> On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
> > 
> > 
> > On Sun, 24 Dec 2006, Andrei Popa wrote:
> > > 
> > > Hash check on download completion found bad chunks, consider using
> > > "safe_sync".
> > 
> > Dang. Did you get any warning messages from the kernel?
> > 
> > Linus
> 
> BTW, rmap.c patch is broken - needs at least

... but that doesn't affect most of the architectures - only sparc64 and
some of powerpc.  So it's definitely not enough.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-26 Thread Al Viro
On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
> 
> 
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> > 
> > Hash check on download completion found bad chunks, consider using
> > "safe_sync".
> 
> Dang. Did you get any warning messages from the kernel?
> 
>   Linus

BTW, rmap.c patch is broken - needs at least

Signed-off-by: Al Viro <[EMAIL PROTECTED]>
---
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..669acb2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -452,7 +452,7 @@ static int page_mkclean_one(struct page 
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
-   set_pte_at(vma, address, pte, entry);
+   set_pte_at(mm, address, pte, entry);
lazy_mmu_prot_update(entry);
ret = 1;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-26 Thread Nick Piggin

Linus Torvalds wrote:


On Sun, 24 Dec 2006, Linus Torvalds wrote:

Peter, tell me I'm crazy, but with the new rules, the following condition 
is a bug:


- shared mapping
- writable
- not already marked dirty in the PTE



Ok, so how about this diff.

I'm actually feeling good about this one. It really looks like 
"do_no_page()" was simply buggy, and that this explains everything.


Still trying to catch up here, so I'm not going to reply to any old
stuff and just start at the tip of the thread... Other than to say
that I really like cancel_page_dirty ;)

I think your patch is quite right so that's a good catch. But I'm not
too surprised that it does not help the problem, because I don't
think we have started shedding any old pte_dirty tests at
unmap/reclaim-time, have we? So the dirty bit isn't going to get lost,
as such.

I was hoping that you've almost narrowed it down to the filesystem
writeback code, with the last few mails?

Nick

Please please please test. Throw all the other patches away (with the 
possible exception of the "update_mmu_cache()" sanity checker, which is 
still interesting in case some _other_ place does this too).


Don't do the "wait_on_page_writeback()" thing, because it changes timings 
and might hide thngs for the wrong reasons.  Just apply this on top of a 
known failing kernel, and test.


Linus

---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..cf429c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2247,21 +2249,23 @@ retry:
if (pte_none(*page_table)) {
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
-   if (write_access)
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-   set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
+   if (write_access)
+   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
+   entry = pte_wrprotect(entry);
if (write_access) {
dirty_page = new_page;
get_page(dirty_page);
+   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}
}
+   set_pte_at(mm, address, page_table, entry);
} else {
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);




--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Martin Michlmayr
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-24 11:35]:
> And if this doesn't fix it, I don't know what will..

Sorry, but it still fails (on top of plain 2.6.19).
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Michael S. Tsirkin
> Quoting Linus Torvalds <[EMAIL PROTECTED]>:
> Subject: Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content 
> corruption on ext3)
>
> Peter, tell me I'm crazy, but with the new rules, the following condition 
> is a bug:
> 
>  - shared mapping
>  - writable
>  - not already marked dirty in the PTE
> 
> because that combination means that the hardware can mark the PTE dirty 
> without us even realizing (and thus not marking the "struct page *" 
> dirty).

Er.
Sorry about bumping in, and I'm not sure I understand all of the discussion,
but this reminded me of an old issue with COW that created what looks
like a vaguely similiar data corruption on infiniband. We solved this for
infiniband with MADV_DONTFORK, but I always wondered why does it not affect
other parts of kernel.  Small reminder from that discussion:

down mmap sem
get user pages
up mmap sem
page becomes shared, and COW (e.g. fork)
process writes to first byte of page <- gets a copy
Now we had a problem: struct page that we got from get user pages
does not point to a correct page in our process.
For example: if at some point we map this page for DMA, and
hardware writes to last byte of page -> process does not
see this data.

So for infiniband, what we do is a combination of
- prevent page from becoming COW while hardware might DMA to this page, and
- ask users not to write to page if hardware might DMA to same page
  (even if its using different bytes).

I just wandered - is there some chance something like this could be happening in
the fs code?

HTH,

-- 
MST
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Gordon Farquharson

On 12/24/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:


Ok, so how about this diff.

I'm actually feeling good about this one. It really looks like
"do_no_page()" was simply buggy, and that this explains everything.


I tested with just this patch and 2.6.19 and no change. Sorry Linus,
no early Christmas present :-(

Gordon

--
Gordon Farquharson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Andrei Popa
On Sun, 2006-12-24 at 12:24 -0800, Linus Torvalds wrote:
> 
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> > 
> > Hash check on download completion found bad chunks, consider using
> > "safe_sync".
> 
> Dang. Did you get any warning messages from the kernel?
> 

only these:
ACPI: EC: evaluating _Q80
ACPI: EC: evaluating _Q80
ACPI: EC: evaluating _Q80

but I don't think has anything to do with...

>   Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Linus Torvalds


On Sun, 24 Dec 2006, Andrei Popa wrote:
> 
> Hash check on download completion found bad chunks, consider using
> "safe_sync".

Dang. Did you get any warning messages from the kernel?

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Andrei Popa
On Sun, 2006-12-24 at 11:35 -0800, Linus Torvalds wrote:
> 
> On Sun, 24 Dec 2006, Gordon Farquharson wrote:
> > 
> > The apt cache files (/var/cache/apt/*.bin) still get corrupted with
> > this patch and 2.6.19.
> 
> Yeah, if my guess about do_no_page() is right, _none_ of the previous 
> patches should have ANY effect what-so-ever. In fact, I'd say that even 
> the "ext3 works in writeback mode" thing that Andrei reports is probably a 
> total fluke brought on by timing changes rather than anything else.
> 
> So please try the latest patch instead (on top of anything that shows 
> corruption reliably - the patch should be _totally_ independent of all the 
> other issues, and I think it will apply cleanly on top of 2.6.18.3 and 
> 2.6.19 too, so anything that shows corruption is a fine target - but try 
> to choose something that has been the "best" at corrupting things for you, 
> to make the testing as good as possible).
> 
> Patch included here again (although I think you were cc'd on my previous 
> email too, so you should already have it, and our emails just crossed)
> 
> And if this doesn't fix it, I don't know what will..

With latest git and patches:
http://lkml.org/lkml/diff/2006/12/24/56/1
http://lkml.org/lkml/diff/2006/12/24/61/1

Hash check on download completion found bad chunks, consider using
"safe_sync".

> 
>   Linus
> 
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..cf429c4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2247,21 +2249,23 @@ retry:
>   if (pte_none(*page_table)) {
>   flush_icache_page(vma, new_page);
>   entry = mk_pte(new_page, vma->vm_page_prot);
> - if (write_access)
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> - set_pte_at(mm, address, page_table, entry);
>   if (anon) {
>   inc_mm_counter(mm, anon_rss);
>   lru_cache_add_active(new_page);
>   page_add_new_anon_rmap(new_page, vma, address);
> + if (write_access)
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>   } else {
>   inc_mm_counter(mm, file_rss);
>   page_add_file_rmap(new_page);
> + entry = pte_wrprotect(entry);
>   if (write_access) {
>   dirty_page = new_page;
>   get_page(dirty_page);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>   }
>   }
> + set_pte_at(mm, address, page_table, entry);
>   } else {
>   /* One of our sibling threads was faster, back out. */
>   page_cache_release(new_page);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Linus Torvalds


On Sun, 24 Dec 2006, Gordon Farquharson wrote:
> 
> The apt cache files (/var/cache/apt/*.bin) still get corrupted with
> this patch and 2.6.19.

Yeah, if my guess about do_no_page() is right, _none_ of the previous 
patches should have ANY effect what-so-ever. In fact, I'd say that even 
the "ext3 works in writeback mode" thing that Andrei reports is probably a 
total fluke brought on by timing changes rather than anything else.

So please try the latest patch instead (on top of anything that shows 
corruption reliably - the patch should be _totally_ independent of all the 
other issues, and I think it will apply cleanly on top of 2.6.18.3 and 
2.6.19 too, so anything that shows corruption is a fine target - but try 
to choose something that has been the "best" at corrupting things for you, 
to make the testing as good as possible).

Patch included here again (although I think you were cc'd on my previous 
email too, so you should already have it, and our emails just crossed)

And if this doesn't fix it, I don't know what will..

Linus

---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..cf429c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2247,21 +2249,23 @@ retry:
if (pte_none(*page_table)) {
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
-   if (write_access)
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-   set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
+   if (write_access)
+   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
+   entry = pte_wrprotect(entry);
if (write_access) {
dirty_page = new_page;
get_page(dirty_page);
+   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}
}
+   set_pte_at(mm, address, page_table, entry);
} else {
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Gordon Farquharson

On 12/24/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:


How about this particularly stupid diff? (please test with something that
_would_ cause corruption normally).

It is _entirely_ untested, but what it tries to do is to simply serialize
any writeback in progress with any process that tries to re-map a shared
page into its address space and dirty it. I haven't tested it, and maybe
it misses some case, but it looks likea good way to try to avoid races
with marking pages dirty and the writeback phase ..


The apt cache files (/var/cache/apt/*.bin) still get corrupted with
this patch and 2.6.19.

Gordon

diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c
--- linux-2.6.19.orig/fs/buffer.c   2006-11-29 14:57:37.0 -0700
+++ linux-2.6.19/fs/buffer.c2006-12-21 01:16:31.0 -0700
@@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
   int ret = 0;

   BUG_ON(!PageLocked(page));
-   if (PageWriteback(page))
+   if (PageDirty(page) || PageWriteback(page))
   return 0;

   if (mapping == NULL) {  /* can this still happen? */
@@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag
   spin_lock(&mapping->private_lock);
   ret = drop_buffers(page, &buffers_to_free);
   spin_unlock(&mapping->private_lock);
-   if (ret) {
-   /*
-* If the filesystem writes its buffers by hand (eg ext3)
-* then we can have clean buffers against a dirty page.  We
-* clean the page here; otherwise later reattachment of buffers
-* could encounter a non-uptodate page, which is unresolvable.
-* This only applies in the rare case where try_to_free_buffers
-* succeeds but the page is not freed.
-*/
-   clear_page_dirty(page);
-   }
out:
   if (buffers_to_free) {
   struct buffer_head *bh = buffers_to_free;
diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c
linux-2.6.19/fs/hugetlbfs/inode.c
--- linux-2.6.19.orig/fs/hugetlbfs/inode.c  2006-11-29
14:57:37.0 -0700
+++ linux-2.6.19/fs/hugetlbfs/inode.c   2006-12-21 01:15:21.0 -0700
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct

static void truncate_huge_page(struct page *page)
{
-   clear_page_dirty(page);
+   cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
   ClearPageUptodate(page);
   remove_from_page_cache(page);
   put_page(page);
diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h
linux-2.6.19/include/linux/page-flags.h
--- linux-2.6.19.orig/include/linux/page-flags.h2006-11-29
14:57:37.0 -0700
+++ linux-2.6.19/include/linux/page-flags.h 2006-12-21
01:15:21.0 -0700
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc

struct page;   /* forward declaration */

-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);

-static inline void clear_page_dirty(struct page *page)
-{
-   test_clear_page_dirty(page);
-}
-
static inline void set_page_writeback(struct page *page)
{
   test_set_page_writeback(page);
diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c
--- linux-2.6.19.orig/mm/memory.c   2006-11-29 14:57:37.0 -0700
+++ linux-2.6.19/mm/memory.c2006-12-24 11:04:03.0 -0700
@@ -1534,6 +1534,7 @@ static int do_wp_page(struct mm_struct *
   if (!pte_same(*page_table, orig_pte))
   goto unlock;
   }
+   wait_on_page_writeback(old_page);
   dirty_page = old_page;
   get_page(dirty_page);
   reuse = 1;
@@ -1832,6 +1833,33 @@ void unmap_mapping_range(struct address_
}
EXPORT_SYMBOL(unmap_mapping_range);

+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+   pgoff_t index;
+   unsigned int offset;
+   struct page *page;
+
+   if (!mapping)
+   return;
+   offset = size & ~PAGE_MASK;
+   if (!offset)
+   return;
+   index = size >> PAGE_SHIFT;
+   page = find_lock_page(mapping, index);
+   if (page) {
+   unsigned int check = 0;
+   unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+   do {
+   check += kaddr[offset++];
+   } while (offset < PAGE_SIZE);
+   kunmap_atomic(kaddr,KM_USER0);
+   unlock_page(page);
+   page_cache_release(page);
+   if (check)
+   printk("%s: BADNESS: truncate check %u\n",
current->comm, check);
+   }
+}
+
/**
 * vmtruncate - unmap mappings "freed" by truncate() syscall
 * @inode: inode of the file used
@@ -1865,6 +1893,7 @@ do_expand:
   goto ou

Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Linus Torvalds


On Sun, 24 Dec 2006, Linus Torvalds wrote:
> 
> Peter, tell me I'm crazy, but with the new rules, the following condition 
> is a bug:
> 
>  - shared mapping
>  - writable
>  - not already marked dirty in the PTE

Ok, so how about this diff.

I'm actually feeling good about this one. It really looks like 
"do_no_page()" was simply buggy, and that this explains everything.

Please please please test. Throw all the other patches away (with the 
possible exception of the "update_mmu_cache()" sanity checker, which is 
still interesting in case some _other_ place does this too).

Don't do the "wait_on_page_writeback()" thing, because it changes timings 
and might hide thngs for the wrong reasons.  Just apply this on top of a 
known failing kernel, and test.

Linus

---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..cf429c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2247,21 +2249,23 @@ retry:
if (pte_none(*page_table)) {
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
-   if (write_access)
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-   set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
+   if (write_access)
+   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
+   entry = pte_wrprotect(entry);
if (write_access) {
dirty_page = new_page;
get_page(dirty_page);
+   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}
}
+   set_pte_at(mm, address, page_table, entry);
} else {
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Linus Torvalds



On Sun, 24 Dec 2006, Linus Torvalds wrote:
>
> How about this particularly stupid diff? (please test with something that 
> _would_ cause corruption normally).

Actually, here's an even more stupid diff, which actually to some degree 
seems to capture the real problem better.

Peter, tell me I'm crazy, but with the new rules, the following condition 
is a bug:

 - shared mapping
 - writable
 - not already marked dirty in the PTE

because that combination means that the hardware can mark the PTE dirty 
without us even realizing (and thus not marking the "struct page *" 
dirty).

(The above is actually a valid situation for IO mappings, but not for 
"real" mappings. And IO mappings should never take page faults, I think).

So, with that in mind, I wrote this stupid patch (for 32-bit x86, since I 
used my Mac Mini for testing ratehr than my main machine - but the x86-64 
version should be pretty much identcal)..

And you know what, Peter? It triggers for me. I get

WARNING at mm/memory.c:2274 do_no_page()
 [] show_trace_log_lvl+0x1a/0x2f
 [] show_trace+0x12/0x14
 [] dump_stack+0x16/0x18
 [] __handle_mm_fault+0x38d/0x919
 [] do_page_fault+0x1ff/0x507
 [] error_code+0x7c/0x84

which seems to say that do_no_page() can be used to insert shared and 
non-dirty, but still writable, pages.

But maybe my patch is just bogus, and I didn't think it through.

Peter, I realize it's Christmas Eve, but let's face it, Santa appreciates 
good boys and girls, and we all want tons of loot. So please be good, and 
waste some time looking at this and tell me why I'm either wrong, or 
there's a real smoking gun here.. ;)

Linus

---
diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
index e6a4723..1389bb7 100644
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -494,7 +494,13 @@ do {   
\
  * The i386 doesn't have any external MMU info: the kernel page
  * tables contain all the necessary information.
  */
-#define update_mmu_cache(vma,address,pte) do { } while (0)
+#define bad_shared_pte(pte) (pte_write(pte) && !pte_dirty(pte))
+#define update_mmu_cache(vma,address,pte) do { \
+   static int __cnt;   \
+   WARN_ON(((vma)->vm_flags & VM_SHARED)   \
+&& bad_shared_pte(pte) \
+&& ++__cnt < 5);   \
+} while (0)
 #endif /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_FLATMEM
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Andrew Morton
On Sun, 24 Dec 2006 09:16:06 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> 
> > On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> > > Andrei Popa <[EMAIL PROTECTED]> wrote:
> > > > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > > > 
> > > > I don't have corruption. I tested twice.
> > > 
> > > This is a surprising result.  Can you pleas retest ext3 
> > > data=writeback,nobh?
> > 
> > Yes, no corruption. Also tested only with data=writeback and had no
> > corruption.
> 
> Ok, so it would seem to be writeback related _somehow_. However, most of 
> the differences (I _thought_) in ext3 actually show up only if you have 
> *both* "nobh" and "data=writeback", and as far as I can tell, just a 
> simple "data=writeback" should still use the bog-standard 
> "block_write_full_page()".
> 
> Andrew?
> 
> Although as far as I can see, then ext2 should work as-is too (since it 
> too also just uses "block_write_full_page()" without anything fancy).

ext2 uses the multipage-bio assembly code for writeback whereas ext3
doesn't.  But ext3 doesn't use that code in data=ordered mode, of course.

Still, this:

--- a/fs/ext2/inode.c~a
+++ a/fs/ext2/inode.c
@@ -693,7 +693,7 @@ const struct address_space_operations ex
.commit_write   = generic_commit_write,
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
-   .writepages = ext2_writepages,
+// .writepages = ext2_writepages,
.migratepage= buffer_migrate_page,
 };
 
@@ -711,7 +711,7 @@ const struct address_space_operations ex
.commit_write   = nobh_commit_write,
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
-   .writepages = ext2_writepages,
+// .writepages = ext2_writepages,
.migratepage= buffer_migrate_page,
 };
 
_

will switch it off for ext2.


> Strange.
> 
> How about this particularly stupid diff? (please test with something that 
> _would_ cause corruption normally).
> 
> It is _entirely_ untested, but what it tries to do is to simply serialize 
> any writeback in progress with any process that tries to re-map a shared 
> page into its address space and dirty it. I haven't tested it, and maybe 
> it misses some case, but it looks likea good way to try to avoid races 
> with marking pages dirty and the writeback phase ..
> 
>   Linus
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..64ed10b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   if (!pte_same(*page_table, orig_pte))
>   goto unlock;
>   }
> + wait_on_page_writeback(old_page);
>   dirty_page = old_page;
>   get_page(dirty_page);
>   reuse = 1;
> @@ -2215,6 +2216,7 @@ retry:
>   page_cache_release(new_page);
>   return VM_FAULT_SIGBUS;
>   }
> + wait_on_page_writeback(new_page);
>   }
>   }

yup.  Also, we could perhaps lock the target page during pagefaults..

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Linus Torvalds


On Sun, 24 Dec 2006, Andrei Popa wrote:

> On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> > Andrei Popa <[EMAIL PROTECTED]> wrote:
> > > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > > 
> > > I don't have corruption. I tested twice.
> > 
> > This is a surprising result.  Can you pleas retest ext3 data=writeback,nobh?
> 
> Yes, no corruption. Also tested only with data=writeback and had no
> corruption.

Ok, so it would seem to be writeback related _somehow_. However, most of 
the differences (I _thought_) in ext3 actually show up only if you have 
*both* "nobh" and "data=writeback", and as far as I can tell, just a 
simple "data=writeback" should still use the bog-standard 
"block_write_full_page()".

Andrew?

Although as far as I can see, then ext2 should work as-is too (since it 
too also just uses "block_write_full_page()" without anything fancy).

Strange.

How about this particularly stupid diff? (please test with something that 
_would_ cause corruption normally).

It is _entirely_ untested, but what it tries to do is to simply serialize 
any writeback in progress with any process that tries to re-map a shared 
page into its address space and dirty it. I haven't tested it, and maybe 
it misses some case, but it looks likea good way to try to avoid races 
with marking pages dirty and the writeback phase ..

Linus
---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..64ed10b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (!pte_same(*page_table, orig_pte))
goto unlock;
}
+   wait_on_page_writeback(old_page);
dirty_page = old_page;
get_page(dirty_page);
reuse = 1;
@@ -2215,6 +2216,7 @@ retry:
page_cache_release(new_page);
return VM_FAULT_SIGBUS;
}
+   wait_on_page_writeback(new_page);
}
}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Andrei Popa
On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> On Sun, 24 Dec 2006 14:14:38 +0200
> Andrei Popa <[EMAIL PROTECTED]> wrote:
> 
> > > - mount the fs with ext2 with the no-buffer-head option.  That means 
> > > either:
> > > 
> > >   grub.conf:  rootfstype=ext2 rootflags=nobh
> > >   /etc/fstab: ext2 nobh
> > 
> > ierdnac ~ # mount
> > /dev/sda7 on / type ext2 (rw,noatime,nobh)
> > 
> > I have corruption.
> > 
> > > 
> > > - mount the fs with ext3 data=writeback, nobh
> > > 
> > >   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this 
> > > works)
> > >   /etc/fstab: ext2 data=writeback,nobh
> > 
> > ierdnac ~ # mount
> > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > 
> > ierdnac ~ # dmesg|grep EXT3
> > EXT3-fs: mounted filesystem with writeback data mode.
> > EXT3 FS on sda7, internal journal
> > 
> > I don't have corruption. I tested twice.
> 
> This is a surprising result.  Can you pleas retest ext3 data=writeback,nobh?

Yes, no corruption. Also tested only with data=writeback and had no
corruption.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Martin Michlmayr
* Andrew Morton <[EMAIL PROTECTED]> [2006-12-24 00:57]:
>   /etc/fstab: ext2 nobh
>   /etc/fstab: ext3 data=writeback,nobh

It seems that busybox mount ignores the nobh option but both ext2 and
ext3 data=writeback work for me.  This is with plain 2.6.19 which
normally always fails.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Andrew Morton
On Sun, 24 Dec 2006 14:14:38 +0200
Andrei Popa <[EMAIL PROTECTED]> wrote:

> > - mount the fs with ext2 with the no-buffer-head option.  That means either:
> > 
> >   grub.conf:  rootfstype=ext2 rootflags=nobh
> >   /etc/fstab: ext2 nobh
> 
> ierdnac ~ # mount
> /dev/sda7 on / type ext2 (rw,noatime,nobh)
> 
> I have corruption.
> 
> > 
> > - mount the fs with ext3 data=writeback, nobh
> > 
> >   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this 
> > works)
> >   /etc/fstab: ext2 data=writeback,nobh
> 
> ierdnac ~ # mount
> /dev/sda7 on / type ext3 (rw,noatime,nobh)
> 
> ierdnac ~ # dmesg|grep EXT3
> EXT3-fs: mounted filesystem with writeback data mode.
> EXT3 FS on sda7, internal journal
> 
> I don't have corruption. I tested twice.

This is a surprising result.  Can you pleas retest ext3 data=writeback,nobh?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Andrew Morton
On Sun, 24 Dec 2006 14:26:01 +0200
Andrei Popa <[EMAIL PROTECTED]> wrote:

> I also tested with ext3 ordered, nobh  and I have file corruption...

ordered+nobh isn't a possible combination.  The filesystem probably ignored
nobh.  nobh mode only makes sense with data=writeback.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Andrei Popa
On Sun, 2006-12-24 at 14:14 +0200, Andrei Popa wrote:
> On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: 
> > On Sun, 24 Dec 2006 00:43:54 -0800 (PST)
> > Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > 
> > > I now _suspect_ that we're talking about something like
> > > 
> > >  - we started a writeout. The IO is still pending, and the page was 
> > >marked clean and is now in the "writeback" phase.
> > >  - a write happens to the page, and the page gets marked dirty again. 
> > >Marking the page dirty also marks all the _buffers_ in the page dirty, 
> > >but they were actually already dirty, because the IO hasn't completed 
> > >yet.
> > >  - the IO from the _previous_ write completes, and marks the buffers 
> > > clean 
> > >again.
> > 
> > Some things for the testers to try, please:
> > 
> > - mount the fs with ext2 with the no-buffer-head option.  That means either:
> > 
> >   grub.conf:  rootfstype=ext2 rootflags=nobh
> >   /etc/fstab: ext2 nobh
> 
> ierdnac ~ # mount
> /dev/sda7 on / type ext2 (rw,noatime,nobh)
> 
> I have corruption.
> 
> > 
> > - mount the fs with ext3 data=writeback, nobh
> > 
> >   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this 
> > works)
> >   /etc/fstab: ext2 data=writeback,nobh
> 
> ierdnac ~ # mount
> /dev/sda7 on / type ext3 (rw,noatime,nobh)
> 
> ierdnac ~ # dmesg|grep EXT3
> EXT3-fs: mounted filesystem with writeback data mode.
> EXT3 FS on sda7, internal journal
> 
> I don't have corruption. I tested twice.
> 

I also tested with ext3 ordered, nobh  and I have file corruption...

> > 
> > if that still fails we can rule out buffer_head funnies.
> > 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2006-12-24 Thread Andrei Popa
On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: 
> On Sun, 24 Dec 2006 00:43:54 -0800 (PST)
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > I now _suspect_ that we're talking about something like
> > 
> >  - we started a writeout. The IO is still pending, and the page was 
> >marked clean and is now in the "writeback" phase.
> >  - a write happens to the page, and the page gets marked dirty again. 
> >Marking the page dirty also marks all the _buffers_ in the page dirty, 
> >but they were actually already dirty, because the IO hasn't completed 
> >yet.
> >  - the IO from the _previous_ write completes, and marks the buffers clean 
> >again.
> 
> Some things for the testers to try, please:
> 
> - mount the fs with ext2 with the no-buffer-head option.  That means either:
> 
>   grub.conf:  rootfstype=ext2 rootflags=nobh
>   /etc/fstab: ext2 nobh

ierdnac ~ # mount
/dev/sda7 on / type ext2 (rw,noatime,nobh)

I have corruption.

> 
> - mount the fs with ext3 data=writeback, nobh
> 
>   grub.conf:  rootfstype=ext3 rootflags=nobh,data=writeback  (I hope this 
> works)
>   /etc/fstab: ext2 data=writeback,nobh

ierdnac ~ # mount
/dev/sda7 on / type ext3 (rw,noatime,nobh)

ierdnac ~ # dmesg|grep EXT3
EXT3-fs: mounted filesystem with writeback data mode.
EXT3 FS on sda7, internal journal

I don't have corruption. I tested twice.

> 
> if that still fails we can rule out buffer_head funnies.
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   >