Re: [PATCH] mm: fix page_mkclean_one

2007-02-02 Thread Evgeniy Polyakov
On Fri, Feb 02, 2007 at 07:42:52PM +1100, Nick Piggin ([EMAIL PROTECTED]) wrote:
> Anyway, I had a look at your bugzilla test-case and managed to slim it
> down to something that easily shows what the problem is (available on
> request) -- the problem is that recipient of the sendfile is seeing
> modifications that occur to the source file _after_ the sender has
> completed the sendfile, because the file pages are not copied but
> queued.
> 
> I think the usual approach to what you are trying to do is to set TCP_CORK,
> then write(2) the header into the socket, then sendfile directly from the
> file you want.
> 
> Another approach I guess is to implement an ack in your userland protocol
> so you do not modify the sendfile source file until the client acks that
> it has all the data.

Mark, don't you use e1000 or other scatter-gather capable nic with
checksum offload? Likely yes.

Actual data sucking in that case happens when packet is supposed to be
transmitted by the NIC, not when sendfile() is returned. The same
applies to the case, when you have fancy egress filtering.

It is not allowed to modify pages until they are really transmitted, if
you want data integrity.

There are _no_ bugs in network or VFS cache in this test case.

> I'm not sure if there are any other usual ways to do this (ie. a barrier
> for sendfile, to ensure it will not pick up "future" modifications to the
> file). netdev cc'ed, someone there might have additional comments.
> 
> Please close this bug if/when you are satisfied it is not a kernel problem.
> 
> Thanks,
> Nick
> 
> -- 
> SUSE Labs, Novell Inc.
> Send instant messages to your online friends http://au.messenger.yahoo.com 

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2007-02-02 Thread Nick Piggin

Mark Groves wrote:

Hi,


I have been been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.

I have applied the patch Linus posted to a 2.6.19 kernel but am still
getting the problem. So I am wondering if there are any other places in
the kernel which mark pages as dirty which might require a similar
patch?


Your issue is not related, firstly because the page_mkclean bug did not
exist before 2.6.19 kernels.

Anyway, I had a look at your bugzilla test-case and managed to slim it
down to something that easily shows what the problem is (available on
request) -- the problem is that recipient of the sendfile is seeing
modifications that occur to the source file _after_ the sender has
completed the sendfile, because the file pages are not copied but
queued.

I think the usual approach to what you are trying to do is to set TCP_CORK,
then write(2) the header into the socket, then sendfile directly from the
file you want.

Another approach I guess is to implement an ack in your userland protocol
so you do not modify the sendfile source file until the client acks that
it has all the data.

I'm not sure if there are any other usual ways to do this (ie. a barrier
for sendfile, to ensure it will not pick up "future" modifications to the
file). netdev cc'ed, someone there might have additional comments.

Please close this bug if/when you are satisfied it is not a kernel problem.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2007-02-02 Thread Nick Piggin

Mark Groves wrote:

Hi,


I have been been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.

I have applied the patch Linus posted to a 2.6.19 kernel but am still
getting the problem. So I am wondering if there are any other places in
the kernel which mark pages as dirty which might require a similar
patch?


Your issue is not related, firstly because the page_mkclean bug did not
exist before 2.6.19 kernels.

Anyway, I had a look at your bugzilla test-case and managed to slim it
down to something that easily shows what the problem is (available on
request) -- the problem is that recipient of the sendfile is seeing
modifications that occur to the source file _after_ the sender has
completed the sendfile, because the file pages are not copied but
queued.

I think the usual approach to what you are trying to do is to set TCP_CORK,
then write(2) the header into the socket, then sendfile directly from the
file you want.

Another approach I guess is to implement an ack in your userland protocol
so you do not modify the sendfile source file until the client acks that
it has all the data.

I'm not sure if there are any other usual ways to do this (ie. a barrier
for sendfile, to ensure it will not pick up future modifications to the
file). netdev cc'ed, someone there might have additional comments.

Please close this bug if/when you are satisfied it is not a kernel problem.

Thanks,
Nick

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2007-02-02 Thread Evgeniy Polyakov
On Fri, Feb 02, 2007 at 07:42:52PM +1100, Nick Piggin ([EMAIL PROTECTED]) wrote:
 Anyway, I had a look at your bugzilla test-case and managed to slim it
 down to something that easily shows what the problem is (available on
 request) -- the problem is that recipient of the sendfile is seeing
 modifications that occur to the source file _after_ the sender has
 completed the sendfile, because the file pages are not copied but
 queued.
 
 I think the usual approach to what you are trying to do is to set TCP_CORK,
 then write(2) the header into the socket, then sendfile directly from the
 file you want.
 
 Another approach I guess is to implement an ack in your userland protocol
 so you do not modify the sendfile source file until the client acks that
 it has all the data.

Mark, don't you use e1000 or other scatter-gather capable nic with
checksum offload? Likely yes.

Actual data sucking in that case happens when packet is supposed to be
transmitted by the NIC, not when sendfile() is returned. The same
applies to the case, when you have fancy egress filtering.

It is not allowed to modify pages until they are really transmitted, if
you want data integrity.

There are _no_ bugs in network or VFS cache in this test case.

 I'm not sure if there are any other usual ways to do this (ie. a barrier
 for sendfile, to ensure it will not pick up future modifications to the
 file). netdev cc'ed, someone there might have additional comments.
 
 Please close this bug if/when you are satisfied it is not a kernel problem.
 
 Thanks,
 Nick
 
 -- 
 SUSE Labs, Novell Inc.
 Send instant messages to your online friends http://au.messenger.yahoo.com 

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2007-02-01 Thread Mark Groves
Hi,


I have been been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.

I have applied the patch Linus posted to a 2.6.19 kernel but am still
getting the problem. So I am wondering if there are any other places in
the kernel which mark pages as dirty which might require a similar
patch?


Regards,

Mark Groves
Researcher
University of Waterloo
Waterloo, Ontario, Canada

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2007-02-01 Thread Mark Groves
Hi,


I have been been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.

I have applied the patch Linus posted to a 2.6.19 kernel but am still
getting the problem. So I am wondering if there are any other places in
the kernel which mark pages as dirty which might require a similar
patch?


Regards,

Mark Groves
Researcher
University of Waterloo
Waterloo, Ontario, Canada

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2007-01-06 Thread Andrew Morton
On Sun, 7 Jan 2007 12:36:18 +1030
"Tom Lanyon" <[EMAIL PROTECTED]> wrote:

> On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > What would also actually be interesting is whether somebody can reproduce
> > this on Reiserfs, for example. I _think_ all the reports I've seen are on
> > ext2 or ext3, and if this is somehow writeback-related, it could be some
> > bug that is just shared between the two by virtue of them still having a
> > lot of stuff in common.
> >
> > Linus
> 
> I've been following this thread for a while now as I started
> experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
> am using reiserfs.

reiserfs defaults to data=ordered, so it's quite possibly the same bug.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2007-01-06 Thread Tom Lanyon

On 1/7/07, Tom Lanyon <[EMAIL PROTECTED]> wrote:

I've been following this thread for a while now as I started
experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
am using reiserfs.


However, moving to 2.6.20-rc3 does indeed seem to fix the issue thus far...

--
Tom Lanyon
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2007-01-06 Thread Tom Lanyon

On 12/27/06, Linus Torvalds <[EMAIL PROTECTED]> wrote:

What would also actually be interesting is whether somebody can reproduce
this on Reiserfs, for example. I _think_ all the reports I've seen are on
ext2 or ext3, and if this is somehow writeback-related, it could be some
bug that is just shared between the two by virtue of them still having a
lot of stuff in common.

Linus


I've been following this thread for a while now as I started
experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
am using reiserfs.

--
Tom Lanyon
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2007-01-06 Thread Tom Lanyon

On 12/27/06, Linus Torvalds [EMAIL PROTECTED] wrote:

What would also actually be interesting is whether somebody can reproduce
this on Reiserfs, for example. I _think_ all the reports I've seen are on
ext2 or ext3, and if this is somehow writeback-related, it could be some
bug that is just shared between the two by virtue of them still having a
lot of stuff in common.

Linus


I've been following this thread for a while now as I started
experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
am using reiserfs.

--
Tom Lanyon
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2007-01-06 Thread Tom Lanyon

On 1/7/07, Tom Lanyon [EMAIL PROTECTED] wrote:

I've been following this thread for a while now as I started
experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
am using reiserfs.


However, moving to 2.6.20-rc3 does indeed seem to fix the issue thus far...

--
Tom Lanyon
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

2007-01-06 Thread Andrew Morton
On Sun, 7 Jan 2007 12:36:18 +1030
Tom Lanyon [EMAIL PROTECTED] wrote:

 On 12/27/06, Linus Torvalds [EMAIL PROTECTED] wrote:
  What would also actually be interesting is whether somebody can reproduce
  this on Reiserfs, for example. I _think_ all the reports I've seen are on
  ext2 or ext3, and if this is somehow writeback-related, it could be some
  bug that is just shared between the two by virtue of them still having a
  lot of stuff in common.
 
  Linus
 
 I've been following this thread for a while now as I started
 experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
 am using reiserfs.

reiserfs defaults to data=ordered, so it's quite possibly the same bug.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 16:58:41 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 29 Dec 2006, Andrew Morton wrote:
> >
> > > > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > > > that modifications to them be done in terms of buffer_heads (ie: 
> > > > blocks).
> > > 
> > > No. There is nothing "appropriate" about using buffer_heads for metadata. 
> > 
> > I said "modification".
> 
> You said "metadata".
> 
> Why do you think directories are any different from files? Yes, they are 
> metadata. So what? What does that have to do with anything?

We journal the contents of directories.  Fully.  So we handle their dirty
data at the block (ie: buffer_head) level.  When someone tries to dirty
part of a directory we need to cheat and not mark that part of the page as
dirty and we need to then write the block to the journal and then mark the
block as really dirty for checkpointing (but still attached to the journal)
and all that goop.

The regular page-based writeback doesn't apply until the block has been
written to the journal.  At that stage the block is considered dirty
against its real position on disk.  It will then be written back by pdflush
via the blockdev inode -> blkdev_writepage().  Unless kjournald needs to do
an early flush to reclaim the journal space, in which case kjournald will
write the block itself.

> 
> So I really don't understand why you make excuses for ext3 and talk about 
> "modifications" and "metadata". It was a fine design ten years ago. It's 
> not really very good any longer.
> 

As I said in another apparently-neglected email:

: We could possibly move ext3/4 directories out of the blockdev pagecache and
: into per-directory pagecache, but that wouldn't change anything - the
: journalling would still be block-based.

We already have all the code in place to journal blocks which are cached in
an address_space other than the blockdev inode's: ext3_journalled_aops.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
>
> > > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > > that modifications to them be done in terms of buffer_heads (ie: blocks).
> > 
> > No. There is nothing "appropriate" about using buffer_heads for metadata. 
> 
> I said "modification".

You said "metadata".

Why do you think directories are any different from files? Yes, they are 
metadata. So what? What does that have to do with anything?

They should still use virtual indexes, the way files do. That doesn't 
preclude them from using buffer-heads to mark their (partial-page) 
modifications and for keeping the virtual->physical translations cached.

I mean, really. Look at ext2. It does exactly that. It keeps the 
directories in the page cache - virtually indexed. And it even modifies 
them there. Exactly the same way it modifies regular file data.

It all works exactly the same way it works for regular files. It uses

page->mapping->a_ops->prepare_write(NULL, page, from, to);
... do modification ...
ext2_commit_chunk(page, from, to);

exactly the way regular file data works. 

That's why I'm saying there is absolutely _zero_ thing about "metadata" 
here, or even about "modifications". It all works better in a virtual 
cache, because you get all the support that we give to page caches.

So I really don't understand why you make excuses for ext3 and talk about 
"modifications" and "metadata". It was a fine design ten years ago. It's 
not really very good any longer.

I suspect we're stuck with the design, but that doesn't make it any 
_better_.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
> 
> Adam Richter spent considerable time a few years ago trying to make the
> mpage code go direct-to-BIO in all cases and we eventually gave up.  The
> conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
> and ugly to fully optimise away the "block" bit in the middle.

Using the buffer cache as a translation layer to the physical address is 
fine. That's what _any_ block device will do.

I'm not at all sayign that "buffer heads must go away". They work fine.

What I'm saying is that

 - if you index by buffer heads, you're screwed.
 - if you do IO by starting at buffer heads, you're screwed.

Both indexing and writeback decisions should be done at the page cache 
layer. Then, when you actually need to do IO, you look at the buffers. But 
you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own 
merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance.

So by all means keep the buffer heads as a way to keep the 
"virtual->physical" translation. It's what they were designed for. But 
they were _originally_ also designed for "lookup" and "driving the start 
of IO", and that is wrong, and has been wrong for a long time now, because

 - lookup based on physical address is fundamentally slow and inefficient. 
   You have to look up the virtual->physical translation somewhere else, 
   so it's by design an unnecessary indirection _and_ that "somewere 
   else" is also by definition filesystem-specific, so you can't do any 
   of these things at the VFS layer.

   Ergo: anything that needs to look up the physical address in order to 
   find the buffer head is BROKEN in this day and age. We look up the 
   _virtual_ page cache page, and then we can trivially find the buffer 
   heads within that page thanks to page->buffers.

   Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't. 

 - starting IO based on the physical entity is insane. It's insane exactly 
   _because_ the VM doesn't actually think in physical addresses, or in 
   buffer-sized blocks. The VM only really knows about whole pages, and 
   all the VM decisions fundamentally have to be page-based. We don't ever 
   "free a buffer". We free a whole page, and as such, doing writeback 
   based on buffers is pointless, because it doesn't actually say anything 
   about the "page state" which is what the VM tracks.

But neither of these means that "buffer_head" itself has to go away. They 
both really boil down to the same thing: you should never KEY things by 
the buffer head. All actions should be based on virtual indexes as far as 
at all humanly possible.

Once you do lookup and locking and writeback _starting_ from the page, 
it's then easy to look up the actual buffer head within the page, and use 
that as a way to do the actual _IO_ on the physical address. So the buffer 
heads still exist in ext2, for example, but they don't drive the show 
quite as much.

(They still do in some areas: the allocation bitmaps, the xattr code etc. 
But as long as none of those have big VM footprints, and as long as no 
_common_ operations really care deeply, and as long as those data 
structures never need to be touched by the VM or VFS layer, nobody will 
ever really care).

The directory case comes up just because "readdir()" actually is very 
common, and sometimes very slow. And it can have a big VM working set 
footprint ("find"), so trying to be page-based actually really helps, 
because it all drives things like writeback on the _right_ issues, and we 
can do things like LRU's and writeback decisions on the level that really 
matters.

I actually suspect that the inode tables could benefit from being in the 
page cache too (although I think that the inode buffer address is actually 
"physical", so there's no indirection for inode tables, which means that 
the virtual vs physical addressing doesn't matter). For directories, there 
definitely is a big cost to continually doing the virtual->physical 
translation all the time.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 16:11:44 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> > JBD implements physical block-based journalling, so it is 100% appropriate
> > that JBD deal with these disk blocks using their buffer_head
> > representation.
> 
> And as long as it does that, you just have to face the fact that it's 
> going to perform like crap, including what you call "extra" writes, and 
> what I call "deal with it".

It is quite tiresome to delete things which your interlocutor said and to
then restate them as if it were some sort of relevation.

> > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > that modifications to them be done in terms of buffer_heads (ie: blocks).
> 
> No. There is nothing "appropriate" about using buffer_heads for metadata. 

I said "modification".

> [stuff about directory reads elided]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
> 
> They're extra.  As in "can be optimised away".

Sure. Don't use buffer heads.

> The buffer_head is not an IO container.  It is the kernel's core
> representation of a disk block.

Please come back from the 90's.

The buffer heads are nothing but a mapping of where the hardware block is. 
If you use it for anything else, you're basically screwed.

> JBD implements physical block-based journalling, so it is 100% appropriate
> that JBD deal with these disk blocks using their buffer_head
> representation.

And as long as it does that, you just have to face the fact that it's 
going to perform like crap, including what you call "extra" writes, and 
what I call "deal with it".

Btw, you can make pages be physically indexed too, but they obviously
 (a) won't be coherent with any virtual mapping laid on top of it
 (b) will be _physical_, so any readahead etc will be based on physical 
 addresses too.

> I thought I fixed the performance problem?

No, you papered over it, for the reasonably common case where things were 
physically contiguous - exactly by using a physical page cache, so now it 
can do read-ahead based on that. Then, because the pages contain buffer 
heads, the directory accesses can look up buffers, and if it was all 
physically contiguous, it all works fine.

But if you actually want virtualluy indexed caching (and all _users_ want 
it), it really doesn't work.

> Somewhat nastily, but as ext3 directories are metadata it is appropriate
> that modifications to them be done in terms of buffer_heads (ie: blocks).

No. There is nothing "appropriate" about using buffer_heads for metadata. 

It's quite proper - and a hell of a lot more efficient - to use virtual 
page-caching for metadata too.

Look at the ext2 readdir() implementation, and compare it to the crapola 
horror that is ext3. Guess what? ext2 uses virtually indexed metadata, and 
as a result it is both simpler, smaller and a LOT faster than ext3 in 
accessing that metadata.

Face it, Andrew, you're wrong on this one. Really. Just take a look at 
ext2_readdir(). 

[ I'm not saying that ext2_readdir() is _beautiful_. If it had been 
  written with the page cache in mind, it would probably have been done 
  very differently. And it doesn't do any readahead, probably because 
  nobody cared enough, but it should be trivial to add, and it would 
  automatically "do the right thing" just because it's much easier at the 
  page cache level.

  But I _am_ saying that compared to ext3, the ext2 readdir is a work of 
  art. ]

"metadata" has _zero_ to do with "physically indexed". There is no 
correlation what-so-ever. If you think there is a correlation, it's all in 
your mind.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 18:32:07 -0500
Theodore Tso <[EMAIL PROTECTED]> wrote:

> On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> > I think ext3 is terminally crap by now. It still uses buffer heads in 
> > places where it really really shouldn't, and as a result, things like 
> > directory accesses are simply slower than they should be. Sadly, I don't 
> > think ext4 is going to fix any of this, either.
> 
> Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
> going to have to put this (a rework of jbd2 to use the page cache) on
> the ext4 todo list, and work with the ocfs2 folks to try to come up
> with something that suits their needs as well.  Fortunately we have
> this filesystem/storage summit thing coming up in the next few months,
> and we can try to get some discussion going on the linux-ext4 mailing
> list in the meantime.  Unfortunately, I don't think this is going to
> be trivial.

I suspect it would be insane to move any part of JBD (apart from the
ordered-data flush) to use pagecache.  The whole thing is fundamentally
block-based.  But only for metadata - there's no strong reason why ext3/4
needs to manipulate file data via buffer_heads if data=journal and chattr
+j aren't in use.

We could possibly move ext3/4 directories out of the blockdev pagecache and
into per-directory pagecache, but that wouldn't change anything - the
journalling would still be block-based.

Adam Richter spent considerable time a few years ago trying to make the
mpage code go direct-to-BIO in all cases and we eventually gave up.  The
conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
and ugly to fully optimise away the "block" bit in the middle.

buffer_heads become more important with large PAGE_CACHE_SIZE.  I'd expect
nobh mode to be quite inefficient with some workloads on 64k pages.  We
need that representation of the state (and location) of the block-sized
hunks which make up the page.

> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> the grief this is causing the page I/O and VM routines.  OTOH, reiser3
> probably has the same problems, and I suspect the changes to ext3 to
> cause it to avoid buffer heads, especially in order to support for
> filesystem blocksizes < pagesize, are going to be sufficiently risky
> in terms of introducing regressions to ext3 that they would probably
> be rejected on those grounds.  So unfortunately, we probably are going
> to have to support flushes via buffer heads for the foreseeable
> future.

We'll see.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Theodore Tso wrote:
>
> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> the grief this is causing the page I/O and VM routines.

I don't think backporting is the smartest option (unless it's done _way_ 
later), but the real problem with it isn't actually the VM behaviour, but 
simply the fact that cached performance absoluely _sucks_ with the buffer 
cache.

With the physically indexed buffer cache thing, you end up always having 
to do these complicated translations into block numbers for every single 
access, and at some point when I benchmarked it, it was a huge overhead 
for doing simple things like readdir.

It's also a major pain for read-ahead, exactly partly due to the high cost 
of translation - because you can't cheaply check whether the next block is 
there, the cost of even asking the question "should I try to read ahead?" 
is much much higher. As a result, read-ahead is seriously limited, because 
it's so expensive for the cached case (which is still hopefully the 
_common_ case).

So because read-ahead is limited, the non-cached case then _really_ sucks.

It was somewhat fixed in a really god-awful fashion by having 
ext3_readdir() actually do _readahead_ though the page cache, even though 
it does everything else through the buffer cache. And that just happens to 
work because we hopefully have physically contiguous blocks, but when that 
isn't true, the readahead doesn't do squat.

It's really quite fundamentally broken. But none of that causes any 
problems for the VM, since directories cannot be mmap'ed anyway. But it's 
really pitiful, and it really doesn't work very well. Of course, other 
filesystems _also_ suck at this, and other operating systems haev even 
MORE problems, so people don't always seem to realize how horribly 
horribly broken this all is.

I really wish somebody would write a filesystem that did large cold-cache 
directories well. Open some horrible file manager on /usr/bin with cold 
caches, and weep. The biggest problem is the inode indirection, but at 
some point when I looked at why it sucked, it was doing basically 
synchronous single-buffer reads on the directory too, because readahead 
didn't work properly.

I was hoping that something like SpadFS would actually take off, because 
it seemed to do a lot of good design choices (having inodes in-line in the 
directory for when there are no hardlinks is probably a requirement for a 
good filesystem these days. The separate inode table had its uses, but 
indirection in a filesystem really does suck, and stat information is too 
important to be indirect unless it absolutely has to).

But I suspect it needs more than somebody who just wants to get his thesis 
written ;)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 14:42:51 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Fri, 29 Dec 2006, Andrew Morton wrote:
> > 
> > - The above change means that we do extra writeout.  If a page is dirtied
> >   once, kjournald will write it and then pdflush will come along and
> >   needlessly write it again.
> 
> There's zero extra writeout for any flushing that flushes BY PAGES.
> 
> Only broken flushers that flush by buffer heads (which really really 
> really shouldn't be done any more: welcome to the 21st century) will cause 
> extra writeouts. And those extra writeouts are obviously required for all 
> the dirty state to actually hit the disk - which is the point of the 
> patch.
> 
> So they're not "extra" - they are "required for correct working".

They're extra.  As in "can be optimised away".

> But I can't stress the fact enough that people SHOULD NOT do writeback by 
> buffer heads. The buffer head has been purely an "IO entity" for the last 
> several years now, and it's not a cache entity.

The buffer_head is not an IO container.  It is the kernel's core
representation of a disk block.  Usually (but not always) it is backed by
some memory which is in pagecache.  We can feed buffer_heads into IO
containers via submit_bh(), but that's far from the only thing we use
buffer_heads for.  We should have done s/buffer_head/block/g years ago.

JBD implements physical block-based journalling, so it is 100% appropriate
that JBD deal with these disk blocks using their buffer_head
representation.

That being said, ordered-data mode isn't really part of the JBD journalling
system at all (the data doesn't get journalled!) - ordered-mode is an
add-on to the JBD journal to make the metadata which we're about to journal
point at more-likely-to-be-correct data.

JBD's ordered-mode writeback is just a sync and I see no conceptual
problems with killing its old buffer_head based sync and moving it into the
21st century.

> Anybody who does writeback 
> by buffer heads is basically bypassing the real cache (the page cache), 
> and that's why all the problems happen.
> 
> I think ext3 is terminally crap by now. It still uses buffer heads in 
> places where it really really shouldn't,

The ordered-data mode flush: sure.  The rest of JBD's use of buffer_heads
is quite appropriate.

> and as a result, things like 
> directory accesses are simply slower than they should be. Sadly, I don't 
> think ext4 is going to fix any of this, either.

I thought I fixed the performance problem?

Somewhat nastily, but as ext3 directories are metadata it is appropriate
that modifications to them be done in terms of buffer_heads (ie: blocks).

> It's all just too inherently wrongly designed around the buffer head 
> (which was correct in 1995, but hasn't been correct for a long time in the 
> kernel any more).
> 
> > - Poor old IO accounting broke again.
> 
> No. That's why I used "set_page_dirty()" and did it that strange ugly way 
> ("set page dirty, even though it's already dirty, and even though the very 
> next thing we will do is TestClearPageDirty???").

nfs_set_page_dirty() and reiserfs_set_page_dirty() should now bail if
PageDirty() to avoid needless work.

> > - For a long time I've wanted to nuke the current ext3/jbd ordered-data
> >   implementation altogether, and just make kjournald call into the
> >   standard writeback code to do a standard suberblock->inodes->pages walk.
> 
> I really would like to see less of the buffer-head-based stuff, and yes, 
> more of the normal inode page walking. I don't think you can "order" 
> accesses within a page anyway, exactly because of memory mapping issues, 
> so any page ordering is not about buffer heads on the page itself, it 
> should be purely about metadata.

In this context ext3's "ordered" mode means "sync the file contents before
journalling the metadata which points at it".

> > - It's pretty obnoxious that the VM now sets a clean page "dirty" and
> >   then proceeds to modify its contents.  It would be nice to stop doing
> >   that.
> 
> No. I think this really the fundamental confusion people had. People 
> thought that setting the page dirty meant that it was no longer being 
> modified.

No.  Setting a page (or bh, or inode) dirty means "this is known to have
been modified".  ie: this cached entity is now out of sync with backing
store.

Ho hum.  I don't care much, really.  But then, I understand how all this
stuff works.  Try explaining to someone the relationship between
pte-dirtiness, page-dirtiness, radix-tree-dirtiness and
buffer_head-dirtiness.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Theodore Tso
On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> I think ext3 is terminally crap by now. It still uses buffer heads in 
> places where it really really shouldn't, and as a result, things like 
> directory accesses are simply slower than they should be. Sadly, I don't 
> think ext4 is going to fix any of this, either.

Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
going to have to put this (a rework of jbd2 to use the page cache) on
the ext4 todo list, and work with the ocfs2 folks to try to come up
with something that suits their needs as well.  Fortunately we have
this filesystem/storage summit thing coming up in the next few months,
and we can try to get some discussion going on the linux-ext4 mailing
list in the meantime.  Unfortunately, I don't think this is going to
be trivial.

If we do get this fixed for ext4, one interesting question is whether
people would accept a patch to backport the fixes to ext3, given the
the grief this is causing the page I/O and VM routines.  OTOH, reiser3
probably has the same problems, and I suspect the changes to ext3 to
cause it to avoid buffer heads, especially in order to support for
filesystem blocksizes < pagesize, are going to be sufficiently risky
in terms of introducing regressions to ext3 that they would probably
be rejected on those grounds.  So unfortunately, we probably are going
to have to support flushes via buffer heads for the foreseeable
future.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
> 
> - The above change means that we do extra writeout.  If a page is dirtied
>   once, kjournald will write it and then pdflush will come along and
>   needlessly write it again.

There's zero extra writeout for any flushing that flushes BY PAGES.

Only broken flushers that flush by buffer heads (which really really 
really shouldn't be done any more: welcome to the 21st century) will cause 
extra writeouts. And those extra writeouts are obviously required for all 
the dirty state to actually hit the disk - which is the point of the 
patch.

So they're not "extra" - they are "required for correct working".

But I can't stress the fact enough that people SHOULD NOT do writeback by 
buffer heads. The buffer head has been purely an "IO entity" for the last 
several years now, and it's not a cache entity. Anybody who does writeback 
by buffer heads is basically bypassing the real cache (the page cache), 
and that's why all the problems happen.

I think ext3 is terminally crap by now. It still uses buffer heads in 
places where it really really shouldn't, and as a result, things like 
directory accesses are simply slower than they should be. Sadly, I don't 
think ext4 is going to fix any of this, either.

It's all just too inherently wrongly designed around the buffer head 
(which was correct in 1995, but hasn't been correct for a long time in the 
kernel any more).

> - Poor old IO accounting broke again.

No. That's why I used "set_page_dirty()" and did it that strange ugly way 
("set page dirty, even though it's already dirty, and even though the very 
next thing we will do is TestClearPageDirty???").

That code looks strange as a result, which is why it now has more comments 
on it than actual code ;)

> - People were saying that ext2 and ext3,data=writeback were also showing
>   corruption.  What's up with that?

I thought the "ext3,data=writeback" case was reported to be fine by 
several people?

I'm not sure about ext2. I didn't look at what it did based on buffer 
heads. I would have expected it to be ok.

That said, at least one report was later shown to be bogus (errors due to 
out of disk, not due to actual errors ;).

> - For a long time I've wanted to nuke the current ext3/jbd ordered-data
>   implementation altogether, and just make kjournald call into the
>   standard writeback code to do a standard suberblock->inodes->pages walk.

I really would like to see less of the buffer-head-based stuff, and yes, 
more of the normal inode page walking. I don't think you can "order" 
accesses within a page anyway, exactly because of memory mapping issues, 
so any page ordering is not about buffer heads on the page itself, it 
should be purely about metadata.

> - It's pretty obnoxious that the VM now sets a clean page "dirty" and
>   then proceeds to modify its contents.  It would be nice to stop doing
>   that.

No. I think this really the fundamental confusion people had. People 
thought that setting the page dirty meant that it was no longer being 
modified. It hasn't meant that in a LONG time - ever since the whole 
DIRTY_TAG thing, the most important part of the PG_dirty thing has really 
been that it's now efficiently findable by the writeout logic.

And that is very much what the whole page accounting _depends_ on. When we 
mmap a page, we need to mark it "findable" as dirty _before_ people 
actually start writing to it, because it's too late afterwards.

>   We could stop marking the page dirty in do_wp_page() and create a new
>   VM counter "NR_PTE_DIRTY", which means
> 
> "number of mapping_cap_account_dirty() pages which have a dirty pte
> pointing at them".

Well, then you need to change what PAGE_MAPPING_TAG_DIRTY means too.

That's very fundamental. That DIRTY _tag_ is now even more important than 
the PG_dirty bit itself, since that's what we actually use to _access_ 
those things.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 14:16:32 -0800
Andrew Morton <[EMAIL PROTECTED]> wrote:

> - Poor old IO accounting broke again.

No it didn't - we're relying upon the behaviour of __set_page_dirty_buffers()
against an already-dirty page.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 02:48:35 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> + if (mapping && mapping_cap_account_dirty(mapping)) {
> + /*
> +  * Yes, Virginia, this is indeed insane.
> +  *
> +  * We use this sequence to make sure that
> +  *  (a) we account for dirty stats properly
> +  *  (b) we tell the low-level filesystem to
> +  *  mark the whole page dirty if it was
> +  *  dirty in a pagetable. Only to then
> +  *  (c) clean the page again and return 1 to
> +  *  cause the writeback.
> +  *
> +  * This way we avoid all nasty races with the
> +  * dirty bit in multiple places and clearing
> +  * them concurrently from different threads.
> +  *
> +  * Note! Normally the "set_page_dirty(page)"
> +  * has no effect on the actual dirty bit - since
> +  * that will already usually be set. But we
> +  * need the side effects, and it can help us
> +  * avoid races.
> +  *
> +  * We basically use the page "master dirty bit"
> +  * as a serialization point for all the different
> +  * threds doing their things.
> +  *
> +  * FIXME! We still have a race here: if somebody
> +  * adds the page back to the page tables in
> +  * between the "page_mkclean()" and the "TestClearPageDirty()",
> +  * we might have it mapped without the dirty bit set.
> +  */
> + if (page_mkclean(page))
> + set_page_dirty(page);
> + if (TestClearPageDirty(page)) {
>   dec_zone_page_state(page, NR_FILE_DIRTY);
> + return 1;
>   }

- Presumably reiser3's ordered-data mode has the same problem.  And ext4,
  of course.  Dunno about other filesytems.

- The above change means that we do extra writeout.  If a page is dirtied
  once, kjournald will write it and then pdflush will come along and
  needlessly write it again.

  But otoh, if a mapping is being repeatedly dirtied, kjournald will
  write the page once per 30 seconds (dirty_expire_centisecs) and pdflush
  will write the page once per 30 seconds as well.  But we _should_ be
  writing it once per five seconds (kjournald commit interval).  So we're
  still ahead ;)

- Poor old IO accounting broke again.

- People were saying that ext2 and ext3,data=writeback were also showing
  corruption.  What's up with that?

- For a long time I've wanted to nuke the current ext3/jbd ordered-data
  implementation altogether, and just make kjournald call into the
  standard writeback code to do a standard suberblock->inodes->pages walk.

  I think it'd be fairly straightforward to do.  We'd need to teach the
  writeback code to be able to skip dirty pages which don't have a disk
  mapping, so that kjournald doesn't end up waiting for kjournald to free
  up journal space..

  Would need to avoid possible deadlocks where someone calls
  ext3_force_commit() or otherwise does a synchronous commit while holding
  VFS locks.

  reiser3 and ext4 could be converted too.

  Not a short-term project, but this would avoid the problem.

- It's pretty obnoxious that the VM now sets a clean page "dirty" and
  then proceeds to modify its contents.  It would be nice to stop doing
  that.

  We could stop marking the page dirty in do_wp_page() and create a new
  VM counter "NR_PTE_DIRTY", which means

"number of mapping_cap_account_dirty() pages which have a dirty pte
pointing at them".

  Or, perhaps

"number of dirty ptes which point at mapping_cap_account_dirty() pages".

  Which can be larger, but the writeout code will probably cope.

  Then we take NR_PTE_DIRTY into account in the dirty-page balancing act.
  So

  - do_wp_page() will still run balance_dirty_pages()

  - but it would no longer run set_page_dirty().

  - But it needs to run mark_inode_dirty() so the fs-writeback code
notices the file.

  - And mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) becomes insufficient.

  The tricky part here is "how do we do the writeback"?  The
  pte-dirty,!PageDirty pages aren't tagged as dirty in the radix-tree and
  writeback needs to find them so that it can effectively do an msync() on
  them.  Walking all the mm's and vma's would be insane.  Visiting all the
  pages in the file would also probably be insane.

  Perhaps this can be solved by adding a new radix-tree tag which means
  "this page might have dirty ptes pointing at it".  For each file
  writeback would do a radix-tree walk of these pages,
  cleaning-and-write-protecting ptes, marking the corresponding pages
  dirty and clearing their PAGECACHE_TAG_PTE_DIRTY tags.

  Then we can fix the mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)
  problem by doing


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Theodore Tso wrote:
> 
> I'm confused.  Does this mean that if "fs blocksize"=="VM pagesize"
> this bug can't trigger?

No. Even if there is just a single buffer-head, if the filesystem ever 
writes out that _single_ buffer-head out of turn (ie before the VM 
actually asks it to, with "->writepage()"), then the same issue will 
happen.

In fact, a bigger fs blocksize will likely just make this easier to 
trigger (although I doubt it makes a big difference), since any 
out-of-order buffer flushback will happen for the whole page, rather than 
just a part of the page.

So the "problem" really ends up being that the filesystem does flushing 
that the VM isn't aware of, so when the VM did "set_page_dirty()" at an 
earlier time, the VM _expected_ the "->writepages()" call that happened 
much later to write the whole page - but because the FS had flushed things 
behind it backs even _before_ the "->writepage" happens, by the time the 
VM actually asks for the page to be written out, the FS layer won't 
actually write it all out any more.

Blocksize doesn't matter, the only thing that matters is whether something 
writes out data on a buffer-cache level, not on a "page cache" level. Ext3 
apparently does this in "ordered" data more at least (and hey, I suspect 
that the code that tries to release buffer head data might try to do it on 
its own too).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Nick Piggin wrote:
> 
> > It still has a tiny tiny race (see the comment), but I bet nobody can really
> > hit it in real life anyway, and I know several ways to fix it, so I'm not
> > really _that_ worried about it.
> 
> Well the race isn't a data loss one, is it? Just a case where the
> pte may be dirty but the page dirty state not accounted for.

Right. We should be picking it up eventually, since it's still in the page 
tables, but if we've lost sight of the page dirtyness we won't react 
correctly to msync() and/or fdatasync(). So we don't _lose_ the data, we 
just might not write it out in a timely manner if we ever hit the race.

> Can we fix it by just putting the page_mkclean back inside the
> TestClearPageDirty check, and re-clearing PG_dirty after redoing
> the set_page_dirty?

I considered it, but quite frankly, if we did it that way, I'd really like 
to just fix the whole insane "set_page_dirty()" instead.

I think set_page_dirty() should be split up. One thing that confused me 
mentally was that almost all of the dirty handling was actualyl done only 
if PG_dirty wasn't already set, so the _bulk_ of set_page_dirty() really 
ends up being

if (!TestSetPageDirty(page)) {
.. we just marked the page dirty, it was clean before,
   so we need to add it to the queues etc ..
}

and that's the part that I (and probably others) always really thought 
about.

But then we have the _one_ thing that runs outside of that "do only once 
per dirty bit" logic, and it's the buffer dirtying. If we had had two 
separate operations for this all: "set_dirty_every_time()" and the regular 
"set_dirty()", I don't think this would have been nearly as confusing.

(And then the difference between "__set_page_dirty_nobuffers()" and 
"__set_page_dirty_buffers()" really boils down to one doing the 
"everytime" _and_ the "once per dirty" checks and the other one doing just 
the "once per dirty bit" act - and we could rename the damn things to 
something saner too).

If we split it up that way, then the whole clear_page_dirty_for_io() logic 
would boil down to

if (TestClearPageDirty(page)) {
if (page_mkclean(page))
set_dirty_every_time();
return 1;
}
return 0;

and we wouldn't even need to do any of the "clear dirty again" kind of 
idiocy, because the "set_dirty_every_time()" stuff is the one that doesn't 
even care about the state of the PG_dirty bit - it's done regardless, and 
doesn't really touch it.

That's what I wanted to do, but with the current "set_page_dirty()" setup, 
I think my patch makes reasonable sense.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Martin Michlmayr
* Stephen Clark <[EMAIL PROTECTED]> [2006-12-29 10:17]:
> >It works for me now, both your testcase as well as an installation of
> >Debian on this ARM device.  I manually applied the patch to 2.6.19.
> 
> Can you post a diff against 2.6.19?

--- a/mm/page-writeback.c   2006-11-29 21:57:37.0 +
+++ b/mm/page-writeback.c   2006-12-29 11:02:55.555147896 +
@@ -893,16 +893,45 @@
 {
struct address_space *mapping = page_mapping(page);
 
-   if (mapping) {
+   if (mapping && mapping_cap_account_dirty(mapping)) {
+   /*
+* Yes, Virginia, this is indeed insane.
+*
+* We use this sequence to make sure that
+*  (a) we account for dirty stats properly
+*  (b) we tell the low-level filesystem to
+*  mark the whole page dirty if it was
+*  dirty in a pagetable. Only to then
+*  (c) clean the page again and return 1 to
+*  cause the writeback.
+*
+* This way we avoid all nasty races with the
+* dirty bit in multiple places and clearing
+* them concurrently from different threads.
+*
+* Note! Normally the "set_page_dirty(page)"
+* has no effect on the actual dirty bit - since
+* that will already usually be set. But we
+* need the side effects, and it can help us
+* avoid races.
+*
+* We basically use the page "master dirty bit"
+* as a serialization point for all the different
+* threds doing their things.
+*
+* FIXME! We still have a race here: if somebody
+* adds the page back to the page tables in
+* between the "page_mkclean()" and the "TestClearPageDirty()",
+* we might have it mapped without the dirty bit set.
+*/
+   if (page_mkclean(page))
+   set_page_dirty(page);
if (TestClearPageDirty(page)) {
-   if (mapping_cap_account_dirty(mapping)) {
-   page_mkclean(page);
-   dec_zone_page_state(page, NR_FILE_DIRTY);
-   }
+   dec_zone_page_state(page, NR_FILE_DIRTY);
return 1;
}
return 0;
-   }
+   }
return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);

-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Theodore Tso
On Fri, Dec 29, 2006 at 12:58:12AM -0800, Linus Torvalds wrote:
> Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE 
> "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as 
> dirty. That may _sound_ like what we want, but it really isn't. Because by 
> the time "writepage()" is actually called (which can be MUCH MUCH later), 
> some internal filesystem activity may actually have cleaned one or more of 
> those buffers in the meantime, and now we call "writepage()" (which really 
> wants to write them _all_), and it will write only part of them, or none 
> at all.

I'm confused.  Does this mean that if "fs blocksize"=="VM pagesize"
this bug can't trigger?  But I thought at least one of people
reporting corruption was using a filesystem with a 4k block size on an
i386?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Stephen Clark

Martin Michlmayr wrote:


* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-29 02:48]:
 


Can anybody get corruption with this thing applied? It goes on top
of plain v2.6.20-rc2.
   



It works for me now, both your testcase as well as an installation of
Debian on this ARM device.  I manually applied the patch to 2.6.19.

Thanks.
 


Hi Martin,

Can you post a diff against 2.6.19?

Thanks,
Steve

--

"They that give up essential liberty to obtain temporary safety, 
deserve neither liberty nor safety."  (Ben Franklin)


"The course of history shows that as a government grows, liberty 
decreases."  (Thomas Jefferson)




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Martin Michlmayr
* Linus Torvalds <[EMAIL PROTECTED]> [2006-12-29 02:48]:
> Can anybody get corruption with this thing applied? It goes on top
> of plain v2.6.20-rc2.

It works for me now, both your testcase as well as an installation of
Debian on this ARM device.  I manually applied the patch to 2.6.19.

Thanks.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Martin Johansson

Linus Torvalds wrote:
[...] 
The patch is mostly a comment. The "real" meat of it is actually just a 
few lines.


Can anybody get corruption with this thing applied? It goes on top of 
plain v2.6.20-rc2. 


No corruption with the testcase here. Will check with rtorrent too later 
today but I suppose it will work just fine.
Nice work! It has been interesting (and educating) to follow this 
bug-hunt :)


/Martin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > Hmm? I'd love it if somebody else wrote the patch and tested it, 
> > because I'm getting sick and tired of this bug ;)
> 
> Who the hell am I kidding? I haven't been able to sleep right for the 
> last few days over this bug. It was really getting to me.
> 
> And putting on the thinking cap, there's actually a fairly simple an 
> nonintrusive patch. [...]

ok, your patch seems to fix the testcase here too on -rc2-rt.

[ Damn, i should have slept a bit more, that would have saved me a ~4
  hour debug and tracing session today to analyze your testcase, just to
  find your patch and your explanation on lkml, right after i sent my
  analysis and workaround patch ;-) At least now we know it from two
  independent tracing results that the suspect code is the same. ]

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Nick Piggin

Hey nice work Linus!

Linus Torvalds wrote:


On Fri, 29 Dec 2006, Linus Torvalds wrote:

Hmm? I'd love it if somebody else wrote the patch and tested it, because 
I'm getting sick and tired of this bug ;)



Who the hell am I kidding? I haven't been able to sleep right for the last 
few days over this bug. It was really getting to me.


And putting on the thinking cap, there's actually a fairly simple an 
nonintrusive patch.


Yeah *this* makes more sense. And in retrospect it was simple, we
can't just throw out pte dirtiness information if the page doesn't
have all buffers dirtied.

It still has a tiny tiny race (see the comment), but I 
bet nobody can really hit it in real life anyway, and I know several ways 
to fix it, so I'm not really _that_ worried about it.


Well the race isn't a data loss one, is it? Just a case where the
pte may be dirty but the page dirty state not accounted for.

Can we fix it by just putting the page_mkclean back inside the
TestClearPageDirty check, and re-clearing PG_dirty after redoing
the set_page_dirty?




The patch is mostly a comment. The "real" meat of it is actually just a 
few lines.


Can anybody get corruption with this thing applied? It goes on top of 
plain v2.6.20-rc2.


Linus


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..ec01da1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
 {
struct address_space *mapping = page_mapping(page);
 
-	if (!mapping)

-   return TestClearPageDirty(page);
-
-   if (TestClearPageDirty(page)) {
-   if (mapping_cap_account_dirty(mapping)) {
-   page_mkclean(page);
+   if (mapping && mapping_cap_account_dirty(mapping)) {
+   /*
+* Yes, Virginia, this is indeed insane.
+*
+* We use this sequence to make sure that
+*  (a) we account for dirty stats properly
+*  (b) we tell the low-level filesystem to
+*  mark the whole page dirty if it was
+*  dirty in a pagetable. Only to then
+*  (c) clean the page again and return 1 to
+*  cause the writeback.
+*
+* This way we avoid all nasty races with the
+* dirty bit in multiple places and clearing
+* them concurrently from different threads.
+*
+* Note! Normally the "set_page_dirty(page)"
+* has no effect on the actual dirty bit - since
+* that will already usually be set. But we
+* need the side effects, and it can help us
+* avoid races.
+*
+* We basically use the page "master dirty bit"
+* as a serialization point for all the different
+* threds doing their things.
+*
+* FIXME! We still have a race here: if somebody
+* adds the page back to the page tables in
+* between the "page_mkclean()" and the "TestClearPageDirty()",
+* we might have it mapped without the dirty bit set.
+*/
+   if (page_mkclean(page))
+   set_page_dirty(page);
+   if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   return 1;
}
-   return 1;
+   return 0;
}
-   return 0;
+   return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 




--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrei Popa
On Fri, 2006-12-29 at 02:48 -0800, Linus Torvalds wrote:
> 
> On Fri, 29 Dec 2006, Linus Torvalds wrote:
> > 
> > Hmm? I'd love it if somebody else wrote the patch and tested it, because 
> > I'm getting sick and tired of this bug ;)
> 
> Who the hell am I kidding? I haven't been able to sleep right for the last 
> few days over this bug. It was really getting to me.
> 
> And putting on the thinking cap, there's actually a fairly simple an 
> nonintrusive patch. It still has a tiny tiny race (see the comment), but I 
> bet nobody can really hit it in real life anyway, and I know several ways 
> to fix it, so I'm not really _that_ worried about it.
> 
> The patch is mostly a comment. The "real" meat of it is actually just a 
> few lines.
> 
> Can anybody get corruption with this thing applied? It goes on top of 
> plain v2.6.20-rc2.

Tested with rtorrent and there is no corruption.


> 
>   Linus
> 
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b3a198c..ec01da1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
>  {
>   struct address_space *mapping = page_mapping(page);
>  
> - if (!mapping)
> - return TestClearPageDirty(page);
> -
> - if (TestClearPageDirty(page)) {
> - if (mapping_cap_account_dirty(mapping)) {
> - page_mkclean(page);
> + if (mapping && mapping_cap_account_dirty(mapping)) {
> + /*
> +  * Yes, Virginia, this is indeed insane.
> +  *
> +  * We use this sequence to make sure that
> +  *  (a) we account for dirty stats properly
> +  *  (b) we tell the low-level filesystem to
> +  *  mark the whole page dirty if it was
> +  *  dirty in a pagetable. Only to then
> +  *  (c) clean the page again and return 1 to
> +  *  cause the writeback.
> +  *
> +  * This way we avoid all nasty races with the
> +  * dirty bit in multiple places and clearing
> +  * them concurrently from different threads.
> +  *
> +  * Note! Normally the "set_page_dirty(page)"
> +  * has no effect on the actual dirty bit - since
> +  * that will already usually be set. But we
> +  * need the side effects, and it can help us
> +  * avoid races.
> +  *
> +  * We basically use the page "master dirty bit"
> +  * as a serialization point for all the different
> +  * threds doing their things.
> +  *
> +  * FIXME! We still have a race here: if somebody
> +  * adds the page back to the page tables in
> +  * between the "page_mkclean()" and the "TestClearPageDirty()",
> +  * we might have it mapped without the dirty bit set.
> +  */
> + if (page_mkclean(page))
> + set_page_dirty(page);
> + if (TestClearPageDirty(page)) {
>   dec_zone_page_state(page, NR_FILE_DIRTY);
> + return 1;
>   }
> - return 1;
> + return 0;
>   }
> - return 0;
> + return TestClearPageDirty(page);
>  }
>  EXPORT_SYMBOL(clear_page_dirty_for_io);
>  

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Linus Torvalds wrote:
> 
> Hmm? I'd love it if somebody else wrote the patch and tested it, because 
> I'm getting sick and tired of this bug ;)

Who the hell am I kidding? I haven't been able to sleep right for the last 
few days over this bug. It was really getting to me.

And putting on the thinking cap, there's actually a fairly simple an 
nonintrusive patch. It still has a tiny tiny race (see the comment), but I 
bet nobody can really hit it in real life anyway, and I know several ways 
to fix it, so I'm not really _that_ worried about it.

The patch is mostly a comment. The "real" meat of it is actually just a 
few lines.

Can anybody get corruption with this thing applied? It goes on top of 
plain v2.6.20-rc2.

Linus


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..ec01da1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
 {
struct address_space *mapping = page_mapping(page);
 
-   if (!mapping)
-   return TestClearPageDirty(page);
-
-   if (TestClearPageDirty(page)) {
-   if (mapping_cap_account_dirty(mapping)) {
-   page_mkclean(page);
+   if (mapping && mapping_cap_account_dirty(mapping)) {
+   /*
+* Yes, Virginia, this is indeed insane.
+*
+* We use this sequence to make sure that
+*  (a) we account for dirty stats properly
+*  (b) we tell the low-level filesystem to
+*  mark the whole page dirty if it was
+*  dirty in a pagetable. Only to then
+*  (c) clean the page again and return 1 to
+*  cause the writeback.
+*
+* This way we avoid all nasty races with the
+* dirty bit in multiple places and clearing
+* them concurrently from different threads.
+*
+* Note! Normally the "set_page_dirty(page)"
+* has no effect on the actual dirty bit - since
+* that will already usually be set. But we
+* need the side effects, and it can help us
+* avoid races.
+*
+* We basically use the page "master dirty bit"
+* as a serialization point for all the different
+* threds doing their things.
+*
+* FIXME! We still have a race here: if somebody
+* adds the page back to the page tables in
+* between the "page_mkclean()" and the "TestClearPageDirty()",
+* we might have it mapped without the dirty bit set.
+*/
+   if (page_mkclean(page))
+   set_page_dirty(page);
+   if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   return 1;
}
-   return 1;
+   return 0;
}
-   return 0;
+   return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Thu, 28 Dec 2006, Linus Torvalds wrote:
> 
> So everything I have ever seen says that the VM layer is actually doing 
> everything right.

That was true, but at the same time, it's not. Let me explain.

> That to me says: "somebody didn't actually write out out". The VM layer 
> asked the filesystem to do the write, but the filesystem just didn't do 
> it. I personally think it's because some buffer-head BH_dirty bit got 
> scrogged

Ok, I have proof of this now.

Here's a trace (with cycle counts), and with a new trace event added: this 
is for another corrupted page. I have:

 49105  PG 15d8 (14800): mm/page-writeback.c:872 clean_for_io
 49106  PG 15d8 (6900): mm/rmap.c:451 cleaning PTE b7fa6000
 49107  PG 15d8 (9900): mm/page-writeback.c:914 set writeback
 49108  PG 15d8 (6480): mm/page-writeback.c:916 setting TAG_WRITEBACK
 49109  PG 15d8 (7110): mm/page-writeback.c:922 clearing TAG_DIRTY
 49110  PG 15d8 (7190): fs/buffer.c:1713 no IO underway
 49111  PG 15d8 (6180): mm/page-writeback.c:891 end writeback
 49112  PG 15d8 (6460): mm/page-writeback.c:893 clearing TAG_WRITEBACK

where that first column is the trace event number again, and the "PG 
15d8" is that corrupted page. The thing in the parenthesis is "CPU 
cycles since last event), and the important part to note is that this is 
indeed all one single thing with no actual IO anywhere (~7000 CPU cycles 
may sound like a lot, but (a) it's not that many cache misses and (b) a 
lot of it is the logging overhead - back-to-back log events will take 
about 3500 cycles) just because it does the actual ASCII printk() etc.

Also, the new event is:

fs/buffer.c:1713 no IO underway

which is just the 

if (nr_underway == 0)

case in fs/buffer.c

And I now finally really believe that I fully understand the corruption, 
but I don't have a simple solution, much less a patch.

What the problem basically boils down to is that "set_page_dirty()" is 
supposed to be a mark for dirtying THE WHOLE PAGE, but it really is not 
"the whole page when the 'set_page_dirty()' itself happens", but more of a 
"the next writepage() needs to write back the whole page" thing.

And that's not what "__set_page_dirty_buffers()" really does.

Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE 
"set_page_dirty()" IS CALLED, it will mark all the buffers on that page as 
dirty. That may _sound_ like what we want, but it really isn't. Because by 
the time "writepage()" is actually called (which can be MUCH MUCH later), 
some internal filesystem activity may actually have cleaned one or more of 
those buffers in the meantime, and now we call "writepage()" (which really 
wants to write them _all_), and it will write only part of them, or none 
at all.

So the VM thought that since it did a "writepage()", all the dirty state 
at that point got written back. But it didn't - the filesystem could have 
written back part or all of the page much earlier, and the writepage() 
actually does nothing at all.

Both filesystem and VM actually _think_ they do the right thing, because 
they simply have totally different expectations. The filesystem thinks 
that it should care about dirty buffers (that got marked dirty _after_ 
they were dirtied), while the filesystem thinks that it cares about dirty 
_pages_ (that got dirted at any time _before_ "writepage()" was called).

Neither is really "wrong", per se, it's just that the two parts have 
different expectations, and the _combination_ just doesn't work. 
"set_page_dirty()" at some point meant "the writes have been done", but 
these days it really means something else.

Now, the reason there is no trivial patch is not that this is conceptually 
really hard to fix. I can see several different approaches to fixing it, 
but they all really boil down to two alternatives:

 (a) splitting the one "PG_dirty" bit up into two bits: the 
 "PG_writescheduled" bit and the "PG_alldirty" bit.

 The "PG_write_scheduled" bit would be the bit that the filesystem 
 would set when it has pending dirty data that it wrote itself (and 
 that may not cover the whole page), and is the part of PG_dirty that 
 sets the PAGECACHE_TAG_DIRTY. It's also what forces "writepage()" to 
 be called.

 The "PG_alldirty" bit is just an additional "somebody else dirtied 
 random parts of this page, and we don't know what" flag, which is set 
 by "set_page_dirty()" in addition to doing the PG_write_scheduled 
 stuff. We would test-and-clear it at "writepage()" time, and pass it 
 in to "writepages()" to tell the writepage() function that it can't 
 just write out its own small limited notion of what is dirty.

 (There are various variations on this whole theme: instead of having 
 a flag to "writepage()", we could split the "whole page" case out as 
 a separate callback or similar)

 (b) making sure that all "set_page_dirty()" calls are _after_ the page 
 

Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Thu, 28 Dec 2006, Linus Torvalds wrote:
 
 So everything I have ever seen says that the VM layer is actually doing 
 everything right.

That was true, but at the same time, it's not. Let me explain.

 That to me says: somebody didn't actually write out out. The VM layer 
 asked the filesystem to do the write, but the filesystem just didn't do 
 it. I personally think it's because some buffer-head BH_dirty bit got 
 scrogged

Ok, I have proof of this now.

Here's a trace (with cycle counts), and with a new trace event added: this 
is for another corrupted page. I have:

 49105  PG 15d8 (14800): mm/page-writeback.c:872 clean_for_io
 49106  PG 15d8 (6900): mm/rmap.c:451 cleaning PTE b7fa6000
 49107  PG 15d8 (9900): mm/page-writeback.c:914 set writeback
 49108  PG 15d8 (6480): mm/page-writeback.c:916 setting TAG_WRITEBACK
 49109  PG 15d8 (7110): mm/page-writeback.c:922 clearing TAG_DIRTY
 49110  PG 15d8 (7190): fs/buffer.c:1713 no IO underway
 49111  PG 15d8 (6180): mm/page-writeback.c:891 end writeback
 49112  PG 15d8 (6460): mm/page-writeback.c:893 clearing TAG_WRITEBACK

where that first column is the trace event number again, and the PG 
15d8 is that corrupted page. The thing in the parenthesis is CPU 
cycles since last event), and the important part to note is that this is 
indeed all one single thing with no actual IO anywhere (~7000 CPU cycles 
may sound like a lot, but (a) it's not that many cache misses and (b) a 
lot of it is the logging overhead - back-to-back log events will take 
about 3500 cycles) just because it does the actual ASCII printk() etc.

Also, the new event is:

fs/buffer.c:1713 no IO underway

which is just the 

if (nr_underway == 0)

case in fs/buffer.c

And I now finally really believe that I fully understand the corruption, 
but I don't have a simple solution, much less a patch.

What the problem basically boils down to is that set_page_dirty() is 
supposed to be a mark for dirtying THE WHOLE PAGE, but it really is not 
the whole page when the 'set_page_dirty()' itself happens, but more of a 
the next writepage() needs to write back the whole page thing.

And that's not what __set_page_dirty_buffers() really does.

Because what __set_page_dirty_buffers() does is that AT THE TIME THE 
set_page_dirty() IS CALLED, it will mark all the buffers on that page as 
dirty. That may _sound_ like what we want, but it really isn't. Because by 
the time writepage() is actually called (which can be MUCH MUCH later), 
some internal filesystem activity may actually have cleaned one or more of 
those buffers in the meantime, and now we call writepage() (which really 
wants to write them _all_), and it will write only part of them, or none 
at all.

So the VM thought that since it did a writepage(), all the dirty state 
at that point got written back. But it didn't - the filesystem could have 
written back part or all of the page much earlier, and the writepage() 
actually does nothing at all.

Both filesystem and VM actually _think_ they do the right thing, because 
they simply have totally different expectations. The filesystem thinks 
that it should care about dirty buffers (that got marked dirty _after_ 
they were dirtied), while the filesystem thinks that it cares about dirty 
_pages_ (that got dirted at any time _before_ writepage() was called).

Neither is really wrong, per se, it's just that the two parts have 
different expectations, and the _combination_ just doesn't work. 
set_page_dirty() at some point meant the writes have been done, but 
these days it really means something else.

Now, the reason there is no trivial patch is not that this is conceptually 
really hard to fix. I can see several different approaches to fixing it, 
but they all really boil down to two alternatives:

 (a) splitting the one PG_dirty bit up into two bits: the 
 PG_writescheduled bit and the PG_alldirty bit.

 The PG_write_scheduled bit would be the bit that the filesystem 
 would set when it has pending dirty data that it wrote itself (and 
 that may not cover the whole page), and is the part of PG_dirty that 
 sets the PAGECACHE_TAG_DIRTY. It's also what forces writepage() to 
 be called.

 The PG_alldirty bit is just an additional somebody else dirtied 
 random parts of this page, and we don't know what flag, which is set 
 by set_page_dirty() in addition to doing the PG_write_scheduled 
 stuff. We would test-and-clear it at writepage() time, and pass it 
 in to writepages() to tell the writepage() function that it can't 
 just write out its own small limited notion of what is dirty.

 (There are various variations on this whole theme: instead of having 
 a flag to writepage(), we could split the whole page case out as 
 a separate callback or similar)

 (b) making sure that all set_page_dirty() calls are _after_ the page 
 has been marked dirty (which in the case of memory mapped pages 

Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Linus Torvalds wrote:
 
 Hmm? I'd love it if somebody else wrote the patch and tested it, because 
 I'm getting sick and tired of this bug ;)

Who the hell am I kidding? I haven't been able to sleep right for the last 
few days over this bug. It was really getting to me.

And putting on the thinking cap, there's actually a fairly simple an 
nonintrusive patch. It still has a tiny tiny race (see the comment), but I 
bet nobody can really hit it in real life anyway, and I know several ways 
to fix it, so I'm not really _that_ worried about it.

The patch is mostly a comment. The real meat of it is actually just a 
few lines.

Can anybody get corruption with this thing applied? It goes on top of 
plain v2.6.20-rc2.

Linus


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..ec01da1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
 {
struct address_space *mapping = page_mapping(page);
 
-   if (!mapping)
-   return TestClearPageDirty(page);
-
-   if (TestClearPageDirty(page)) {
-   if (mapping_cap_account_dirty(mapping)) {
-   page_mkclean(page);
+   if (mapping  mapping_cap_account_dirty(mapping)) {
+   /*
+* Yes, Virginia, this is indeed insane.
+*
+* We use this sequence to make sure that
+*  (a) we account for dirty stats properly
+*  (b) we tell the low-level filesystem to
+*  mark the whole page dirty if it was
+*  dirty in a pagetable. Only to then
+*  (c) clean the page again and return 1 to
+*  cause the writeback.
+*
+* This way we avoid all nasty races with the
+* dirty bit in multiple places and clearing
+* them concurrently from different threads.
+*
+* Note! Normally the set_page_dirty(page)
+* has no effect on the actual dirty bit - since
+* that will already usually be set. But we
+* need the side effects, and it can help us
+* avoid races.
+*
+* We basically use the page master dirty bit
+* as a serialization point for all the different
+* threds doing their things.
+*
+* FIXME! We still have a race here: if somebody
+* adds the page back to the page tables in
+* between the page_mkclean() and the TestClearPageDirty(),
+* we might have it mapped without the dirty bit set.
+*/
+   if (page_mkclean(page))
+   set_page_dirty(page);
+   if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   return 1;
}
-   return 1;
+   return 0;
}
-   return 0;
+   return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrei Popa
On Fri, 2006-12-29 at 02:48 -0800, Linus Torvalds wrote:
 
 On Fri, 29 Dec 2006, Linus Torvalds wrote:
  
  Hmm? I'd love it if somebody else wrote the patch and tested it, because 
  I'm getting sick and tired of this bug ;)
 
 Who the hell am I kidding? I haven't been able to sleep right for the last 
 few days over this bug. It was really getting to me.
 
 And putting on the thinking cap, there's actually a fairly simple an 
 nonintrusive patch. It still has a tiny tiny race (see the comment), but I 
 bet nobody can really hit it in real life anyway, and I know several ways 
 to fix it, so I'm not really _that_ worried about it.
 
 The patch is mostly a comment. The real meat of it is actually just a 
 few lines.
 
 Can anybody get corruption with this thing applied? It goes on top of 
 plain v2.6.20-rc2.

Tested with rtorrent and there is no corruption.


 
   Linus
 
 
 diff --git a/mm/page-writeback.c b/mm/page-writeback.c
 index b3a198c..ec01da1 100644
 --- a/mm/page-writeback.c
 +++ b/mm/page-writeback.c
 @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
  {
   struct address_space *mapping = page_mapping(page);
  
 - if (!mapping)
 - return TestClearPageDirty(page);
 -
 - if (TestClearPageDirty(page)) {
 - if (mapping_cap_account_dirty(mapping)) {
 - page_mkclean(page);
 + if (mapping  mapping_cap_account_dirty(mapping)) {
 + /*
 +  * Yes, Virginia, this is indeed insane.
 +  *
 +  * We use this sequence to make sure that
 +  *  (a) we account for dirty stats properly
 +  *  (b) we tell the low-level filesystem to
 +  *  mark the whole page dirty if it was
 +  *  dirty in a pagetable. Only to then
 +  *  (c) clean the page again and return 1 to
 +  *  cause the writeback.
 +  *
 +  * This way we avoid all nasty races with the
 +  * dirty bit in multiple places and clearing
 +  * them concurrently from different threads.
 +  *
 +  * Note! Normally the set_page_dirty(page)
 +  * has no effect on the actual dirty bit - since
 +  * that will already usually be set. But we
 +  * need the side effects, and it can help us
 +  * avoid races.
 +  *
 +  * We basically use the page master dirty bit
 +  * as a serialization point for all the different
 +  * threds doing their things.
 +  *
 +  * FIXME! We still have a race here: if somebody
 +  * adds the page back to the page tables in
 +  * between the page_mkclean() and the TestClearPageDirty(),
 +  * we might have it mapped without the dirty bit set.
 +  */
 + if (page_mkclean(page))
 + set_page_dirty(page);
 + if (TestClearPageDirty(page)) {
   dec_zone_page_state(page, NR_FILE_DIRTY);
 + return 1;
   }
 - return 1;
 + return 0;
   }
 - return 0;
 + return TestClearPageDirty(page);
  }
  EXPORT_SYMBOL(clear_page_dirty_for_io);
  

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Nick Piggin

Hey nice work Linus!

Linus Torvalds wrote:


On Fri, 29 Dec 2006, Linus Torvalds wrote:

Hmm? I'd love it if somebody else wrote the patch and tested it, because 
I'm getting sick and tired of this bug ;)



Who the hell am I kidding? I haven't been able to sleep right for the last 
few days over this bug. It was really getting to me.


And putting on the thinking cap, there's actually a fairly simple an 
nonintrusive patch.


Yeah *this* makes more sense. And in retrospect it was simple, we
can't just throw out pte dirtiness information if the page doesn't
have all buffers dirtied.

It still has a tiny tiny race (see the comment), but I 
bet nobody can really hit it in real life anyway, and I know several ways 
to fix it, so I'm not really _that_ worried about it.


Well the race isn't a data loss one, is it? Just a case where the
pte may be dirty but the page dirty state not accounted for.

Can we fix it by just putting the page_mkclean back inside the
TestClearPageDirty check, and re-clearing PG_dirty after redoing
the set_page_dirty?




The patch is mostly a comment. The real meat of it is actually just a 
few lines.


Can anybody get corruption with this thing applied? It goes on top of 
plain v2.6.20-rc2.


Linus


diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..ec01da1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
 {
struct address_space *mapping = page_mapping(page);
 
-	if (!mapping)

-   return TestClearPageDirty(page);
-
-   if (TestClearPageDirty(page)) {
-   if (mapping_cap_account_dirty(mapping)) {
-   page_mkclean(page);
+   if (mapping  mapping_cap_account_dirty(mapping)) {
+   /*
+* Yes, Virginia, this is indeed insane.
+*
+* We use this sequence to make sure that
+*  (a) we account for dirty stats properly
+*  (b) we tell the low-level filesystem to
+*  mark the whole page dirty if it was
+*  dirty in a pagetable. Only to then
+*  (c) clean the page again and return 1 to
+*  cause the writeback.
+*
+* This way we avoid all nasty races with the
+* dirty bit in multiple places and clearing
+* them concurrently from different threads.
+*
+* Note! Normally the set_page_dirty(page)
+* has no effect on the actual dirty bit - since
+* that will already usually be set. But we
+* need the side effects, and it can help us
+* avoid races.
+*
+* We basically use the page master dirty bit
+* as a serialization point for all the different
+* threds doing their things.
+*
+* FIXME! We still have a race here: if somebody
+* adds the page back to the page tables in
+* between the page_mkclean() and the TestClearPageDirty(),
+* we might have it mapped without the dirty bit set.
+*/
+   if (page_mkclean(page))
+   set_page_dirty(page);
+   if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+   return 1;
}
-   return 1;
+   return 0;
}
-   return 0;
+   return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 




--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

  Hmm? I'd love it if somebody else wrote the patch and tested it, 
  because I'm getting sick and tired of this bug ;)
 
 Who the hell am I kidding? I haven't been able to sleep right for the 
 last few days over this bug. It was really getting to me.
 
 And putting on the thinking cap, there's actually a fairly simple an 
 nonintrusive patch. [...]

ok, your patch seems to fix the testcase here too on -rc2-rt.

[ Damn, i should have slept a bit more, that would have saved me a ~4
  hour debug and tracing session today to analyze your testcase, just to
  find your patch and your explanation on lkml, right after i sent my
  analysis and workaround patch ;-) At least now we know it from two
  independent tracing results that the suspect code is the same. ]

Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Martin Johansson

Linus Torvalds wrote:
[...] 
The patch is mostly a comment. The real meat of it is actually just a 
few lines.


Can anybody get corruption with this thing applied? It goes on top of 
plain v2.6.20-rc2. 


No corruption with the testcase here. Will check with rtorrent too later 
today but I suppose it will work just fine.
Nice work! It has been interesting (and educating) to follow this 
bug-hunt :)


/Martin
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Martin Michlmayr
* Linus Torvalds [EMAIL PROTECTED] [2006-12-29 02:48]:
 Can anybody get corruption with this thing applied? It goes on top
 of plain v2.6.20-rc2.

It works for me now, both your testcase as well as an installation of
Debian on this ARM device.  I manually applied the patch to 2.6.19.

Thanks.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Stephen Clark

Martin Michlmayr wrote:


* Linus Torvalds [EMAIL PROTECTED] [2006-12-29 02:48]:
 


Can anybody get corruption with this thing applied? It goes on top
of plain v2.6.20-rc2.
   



It works for me now, both your testcase as well as an installation of
Debian on this ARM device.  I manually applied the patch to 2.6.19.

Thanks.
 


Hi Martin,

Can you post a diff against 2.6.19?

Thanks,
Steve

--

They that give up essential liberty to obtain temporary safety, 
deserve neither liberty nor safety.  (Ben Franklin)


The course of history shows that as a government grows, liberty 
decreases.  (Thomas Jefferson)




-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Theodore Tso
On Fri, Dec 29, 2006 at 12:58:12AM -0800, Linus Torvalds wrote:
 Because what __set_page_dirty_buffers() does is that AT THE TIME THE 
 set_page_dirty() IS CALLED, it will mark all the buffers on that page as 
 dirty. That may _sound_ like what we want, but it really isn't. Because by 
 the time writepage() is actually called (which can be MUCH MUCH later), 
 some internal filesystem activity may actually have cleaned one or more of 
 those buffers in the meantime, and now we call writepage() (which really 
 wants to write them _all_), and it will write only part of them, or none 
 at all.

I'm confused.  Does this mean that if fs blocksize==VM pagesize
this bug can't trigger?  But I thought at least one of people
reporting corruption was using a filesystem with a 4k block size on an
i386?

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Martin Michlmayr
* Stephen Clark [EMAIL PROTECTED] [2006-12-29 10:17]:
 It works for me now, both your testcase as well as an installation of
 Debian on this ARM device.  I manually applied the patch to 2.6.19.
 
 Can you post a diff against 2.6.19?

--- a/mm/page-writeback.c   2006-11-29 21:57:37.0 +
+++ b/mm/page-writeback.c   2006-12-29 11:02:55.555147896 +
@@ -893,16 +893,45 @@
 {
struct address_space *mapping = page_mapping(page);
 
-   if (mapping) {
+   if (mapping  mapping_cap_account_dirty(mapping)) {
+   /*
+* Yes, Virginia, this is indeed insane.
+*
+* We use this sequence to make sure that
+*  (a) we account for dirty stats properly
+*  (b) we tell the low-level filesystem to
+*  mark the whole page dirty if it was
+*  dirty in a pagetable. Only to then
+*  (c) clean the page again and return 1 to
+*  cause the writeback.
+*
+* This way we avoid all nasty races with the
+* dirty bit in multiple places and clearing
+* them concurrently from different threads.
+*
+* Note! Normally the set_page_dirty(page)
+* has no effect on the actual dirty bit - since
+* that will already usually be set. But we
+* need the side effects, and it can help us
+* avoid races.
+*
+* We basically use the page master dirty bit
+* as a serialization point for all the different
+* threds doing their things.
+*
+* FIXME! We still have a race here: if somebody
+* adds the page back to the page tables in
+* between the page_mkclean() and the TestClearPageDirty(),
+* we might have it mapped without the dirty bit set.
+*/
+   if (page_mkclean(page))
+   set_page_dirty(page);
if (TestClearPageDirty(page)) {
-   if (mapping_cap_account_dirty(mapping)) {
-   page_mkclean(page);
-   dec_zone_page_state(page, NR_FILE_DIRTY);
-   }
+   dec_zone_page_state(page, NR_FILE_DIRTY);
return 1;
}
return 0;
-   }
+   }
return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);

-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Nick Piggin wrote:
 
  It still has a tiny tiny race (see the comment), but I bet nobody can really
  hit it in real life anyway, and I know several ways to fix it, so I'm not
  really _that_ worried about it.
 
 Well the race isn't a data loss one, is it? Just a case where the
 pte may be dirty but the page dirty state not accounted for.

Right. We should be picking it up eventually, since it's still in the page 
tables, but if we've lost sight of the page dirtyness we won't react 
correctly to msync() and/or fdatasync(). So we don't _lose_ the data, we 
just might not write it out in a timely manner if we ever hit the race.

 Can we fix it by just putting the page_mkclean back inside the
 TestClearPageDirty check, and re-clearing PG_dirty after redoing
 the set_page_dirty?

I considered it, but quite frankly, if we did it that way, I'd really like 
to just fix the whole insane set_page_dirty() instead.

I think set_page_dirty() should be split up. One thing that confused me 
mentally was that almost all of the dirty handling was actualyl done only 
if PG_dirty wasn't already set, so the _bulk_ of set_page_dirty() really 
ends up being

if (!TestSetPageDirty(page)) {
.. we just marked the page dirty, it was clean before,
   so we need to add it to the queues etc ..
}

and that's the part that I (and probably others) always really thought 
about.

But then we have the _one_ thing that runs outside of that do only once 
per dirty bit logic, and it's the buffer dirtying. If we had had two 
separate operations for this all: set_dirty_every_time() and the regular 
set_dirty(), I don't think this would have been nearly as confusing.

(And then the difference between __set_page_dirty_nobuffers() and 
__set_page_dirty_buffers() really boils down to one doing the 
everytime _and_ the once per dirty checks and the other one doing just 
the once per dirty bit act - and we could rename the damn things to 
something saner too).

If we split it up that way, then the whole clear_page_dirty_for_io() logic 
would boil down to

if (TestClearPageDirty(page)) {
if (page_mkclean(page))
set_dirty_every_time();
return 1;
}
return 0;

and we wouldn't even need to do any of the clear dirty again kind of 
idiocy, because the set_dirty_every_time() stuff is the one that doesn't 
even care about the state of the PG_dirty bit - it's done regardless, and 
doesn't really touch it.

That's what I wanted to do, but with the current set_page_dirty() setup, 
I think my patch makes reasonable sense.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Theodore Tso wrote:
 
 I'm confused.  Does this mean that if fs blocksize==VM pagesize
 this bug can't trigger?

No. Even if there is just a single buffer-head, if the filesystem ever 
writes out that _single_ buffer-head out of turn (ie before the VM 
actually asks it to, with -writepage()), then the same issue will 
happen.

In fact, a bigger fs blocksize will likely just make this easier to 
trigger (although I doubt it makes a big difference), since any 
out-of-order buffer flushback will happen for the whole page, rather than 
just a part of the page.

So the problem really ends up being that the filesystem does flushing 
that the VM isn't aware of, so when the VM did set_page_dirty() at an 
earlier time, the VM _expected_ the -writepages() call that happened 
much later to write the whole page - but because the FS had flushed things 
behind it backs even _before_ the -writepage happens, by the time the 
VM actually asks for the page to be written out, the FS layer won't 
actually write it all out any more.

Blocksize doesn't matter, the only thing that matters is whether something 
writes out data on a buffer-cache level, not on a page cache level. Ext3 
apparently does this in ordered data more at least (and hey, I suspect 
that the code that tries to release buffer head data might try to do it on 
its own too).

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 02:48:35 -0800 (PST)
Linus Torvalds [EMAIL PROTECTED] wrote:

 + if (mapping  mapping_cap_account_dirty(mapping)) {
 + /*
 +  * Yes, Virginia, this is indeed insane.
 +  *
 +  * We use this sequence to make sure that
 +  *  (a) we account for dirty stats properly
 +  *  (b) we tell the low-level filesystem to
 +  *  mark the whole page dirty if it was
 +  *  dirty in a pagetable. Only to then
 +  *  (c) clean the page again and return 1 to
 +  *  cause the writeback.
 +  *
 +  * This way we avoid all nasty races with the
 +  * dirty bit in multiple places and clearing
 +  * them concurrently from different threads.
 +  *
 +  * Note! Normally the set_page_dirty(page)
 +  * has no effect on the actual dirty bit - since
 +  * that will already usually be set. But we
 +  * need the side effects, and it can help us
 +  * avoid races.
 +  *
 +  * We basically use the page master dirty bit
 +  * as a serialization point for all the different
 +  * threds doing their things.
 +  *
 +  * FIXME! We still have a race here: if somebody
 +  * adds the page back to the page tables in
 +  * between the page_mkclean() and the TestClearPageDirty(),
 +  * we might have it mapped without the dirty bit set.
 +  */
 + if (page_mkclean(page))
 + set_page_dirty(page);
 + if (TestClearPageDirty(page)) {
   dec_zone_page_state(page, NR_FILE_DIRTY);
 + return 1;
   }

- Presumably reiser3's ordered-data mode has the same problem.  And ext4,
  of course.  Dunno about other filesytems.

- The above change means that we do extra writeout.  If a page is dirtied
  once, kjournald will write it and then pdflush will come along and
  needlessly write it again.

  But otoh, if a mapping is being repeatedly dirtied, kjournald will
  write the page once per 30 seconds (dirty_expire_centisecs) and pdflush
  will write the page once per 30 seconds as well.  But we _should_ be
  writing it once per five seconds (kjournald commit interval).  So we're
  still ahead ;)

- Poor old IO accounting broke again.

- People were saying that ext2 and ext3,data=writeback were also showing
  corruption.  What's up with that?

- For a long time I've wanted to nuke the current ext3/jbd ordered-data
  implementation altogether, and just make kjournald call into the
  standard writeback code to do a standard suberblock-inodes-pages walk.

  I think it'd be fairly straightforward to do.  We'd need to teach the
  writeback code to be able to skip dirty pages which don't have a disk
  mapping, so that kjournald doesn't end up waiting for kjournald to free
  up journal space..

  Would need to avoid possible deadlocks where someone calls
  ext3_force_commit() or otherwise does a synchronous commit while holding
  VFS locks.

  reiser3 and ext4 could be converted too.

  Not a short-term project, but this would avoid the problem.

- It's pretty obnoxious that the VM now sets a clean page dirty and
  then proceeds to modify its contents.  It would be nice to stop doing
  that.

  We could stop marking the page dirty in do_wp_page() and create a new
  VM counter NR_PTE_DIRTY, which means

number of mapping_cap_account_dirty() pages which have a dirty pte
pointing at them.

  Or, perhaps

number of dirty ptes which point at mapping_cap_account_dirty() pages.

  Which can be larger, but the writeout code will probably cope.

  Then we take NR_PTE_DIRTY into account in the dirty-page balancing act.
  So

  - do_wp_page() will still run balance_dirty_pages()

  - but it would no longer run set_page_dirty().

  - But it needs to run mark_inode_dirty() so the fs-writeback code
notices the file.

  - And mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) becomes insufficient.

  The tricky part here is how do we do the writeback?  The
  pte-dirty,!PageDirty pages aren't tagged as dirty in the radix-tree and
  writeback needs to find them so that it can effectively do an msync() on
  them.  Walking all the mm's and vma's would be insane.  Visiting all the
  pages in the file would also probably be insane.

  Perhaps this can be solved by adding a new radix-tree tag which means
  this page might have dirty ptes pointing at it.  For each file
  writeback would do a radix-tree walk of these pages,
  cleaning-and-write-protecting ptes, marking the corresponding pages
  dirty and clearing their PAGECACHE_TAG_PTE_DIRTY tags.

  Then we can fix the mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)
  problem by doing

mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||

Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 14:16:32 -0800
Andrew Morton [EMAIL PROTECTED] wrote:

 - Poor old IO accounting broke again.

No it didn't - we're relying upon the behaviour of __set_page_dirty_buffers()
against an already-dirty page.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
 
 - The above change means that we do extra writeout.  If a page is dirtied
   once, kjournald will write it and then pdflush will come along and
   needlessly write it again.

There's zero extra writeout for any flushing that flushes BY PAGES.

Only broken flushers that flush by buffer heads (which really really 
really shouldn't be done any more: welcome to the 21st century) will cause 
extra writeouts. And those extra writeouts are obviously required for all 
the dirty state to actually hit the disk - which is the point of the 
patch.

So they're not extra - they are required for correct working.

But I can't stress the fact enough that people SHOULD NOT do writeback by 
buffer heads. The buffer head has been purely an IO entity for the last 
several years now, and it's not a cache entity. Anybody who does writeback 
by buffer heads is basically bypassing the real cache (the page cache), 
and that's why all the problems happen.

I think ext3 is terminally crap by now. It still uses buffer heads in 
places where it really really shouldn't, and as a result, things like 
directory accesses are simply slower than they should be. Sadly, I don't 
think ext4 is going to fix any of this, either.

It's all just too inherently wrongly designed around the buffer head 
(which was correct in 1995, but hasn't been correct for a long time in the 
kernel any more).

 - Poor old IO accounting broke again.

No. That's why I used set_page_dirty() and did it that strange ugly way 
(set page dirty, even though it's already dirty, and even though the very 
next thing we will do is TestClearPageDirty???).

That code looks strange as a result, which is why it now has more comments 
on it than actual code ;)

 - People were saying that ext2 and ext3,data=writeback were also showing
   corruption.  What's up with that?

I thought the ext3,data=writeback case was reported to be fine by 
several people?

I'm not sure about ext2. I didn't look at what it did based on buffer 
heads. I would have expected it to be ok.

That said, at least one report was later shown to be bogus (errors due to 
out of disk, not due to actual errors ;).

 - For a long time I've wanted to nuke the current ext3/jbd ordered-data
   implementation altogether, and just make kjournald call into the
   standard writeback code to do a standard suberblock-inodes-pages walk.

I really would like to see less of the buffer-head-based stuff, and yes, 
more of the normal inode page walking. I don't think you can order 
accesses within a page anyway, exactly because of memory mapping issues, 
so any page ordering is not about buffer heads on the page itself, it 
should be purely about metadata.

 - It's pretty obnoxious that the VM now sets a clean page dirty and
   then proceeds to modify its contents.  It would be nice to stop doing
   that.

No. I think this really the fundamental confusion people had. People 
thought that setting the page dirty meant that it was no longer being 
modified. It hasn't meant that in a LONG time - ever since the whole 
DIRTY_TAG thing, the most important part of the PG_dirty thing has really 
been that it's now efficiently findable by the writeout logic.

And that is very much what the whole page accounting _depends_ on. When we 
mmap a page, we need to mark it findable as dirty _before_ people 
actually start writing to it, because it's too late afterwards.

   We could stop marking the page dirty in do_wp_page() and create a new
   VM counter NR_PTE_DIRTY, which means
 
 number of mapping_cap_account_dirty() pages which have a dirty pte
 pointing at them.

Well, then you need to change what PAGE_MAPPING_TAG_DIRTY means too.

That's very fundamental. That DIRTY _tag_ is now even more important than 
the PG_dirty bit itself, since that's what we actually use to _access_ 
those things.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 14:42:51 -0800 (PST)
Linus Torvalds [EMAIL PROTECTED] wrote:

 
 
 On Fri, 29 Dec 2006, Andrew Morton wrote:
  
  - The above change means that we do extra writeout.  If a page is dirtied
once, kjournald will write it and then pdflush will come along and
needlessly write it again.
 
 There's zero extra writeout for any flushing that flushes BY PAGES.
 
 Only broken flushers that flush by buffer heads (which really really 
 really shouldn't be done any more: welcome to the 21st century) will cause 
 extra writeouts. And those extra writeouts are obviously required for all 
 the dirty state to actually hit the disk - which is the point of the 
 patch.
 
 So they're not extra - they are required for correct working.

They're extra.  As in can be optimised away.

 But I can't stress the fact enough that people SHOULD NOT do writeback by 
 buffer heads. The buffer head has been purely an IO entity for the last 
 several years now, and it's not a cache entity.

The buffer_head is not an IO container.  It is the kernel's core
representation of a disk block.  Usually (but not always) it is backed by
some memory which is in pagecache.  We can feed buffer_heads into IO
containers via submit_bh(), but that's far from the only thing we use
buffer_heads for.  We should have done s/buffer_head/block/g years ago.

JBD implements physical block-based journalling, so it is 100% appropriate
that JBD deal with these disk blocks using their buffer_head
representation.

That being said, ordered-data mode isn't really part of the JBD journalling
system at all (the data doesn't get journalled!) - ordered-mode is an
add-on to the JBD journal to make the metadata which we're about to journal
point at more-likely-to-be-correct data.

JBD's ordered-mode writeback is just a sync and I see no conceptual
problems with killing its old buffer_head based sync and moving it into the
21st century.

 Anybody who does writeback 
 by buffer heads is basically bypassing the real cache (the page cache), 
 and that's why all the problems happen.
 
 I think ext3 is terminally crap by now. It still uses buffer heads in 
 places where it really really shouldn't,

The ordered-data mode flush: sure.  The rest of JBD's use of buffer_heads
is quite appropriate.

 and as a result, things like 
 directory accesses are simply slower than they should be. Sadly, I don't 
 think ext4 is going to fix any of this, either.

I thought I fixed the performance problem?

Somewhat nastily, but as ext3 directories are metadata it is appropriate
that modifications to them be done in terms of buffer_heads (ie: blocks).

 It's all just too inherently wrongly designed around the buffer head 
 (which was correct in 1995, but hasn't been correct for a long time in the 
 kernel any more).
 
  - Poor old IO accounting broke again.
 
 No. That's why I used set_page_dirty() and did it that strange ugly way 
 (set page dirty, even though it's already dirty, and even though the very 
 next thing we will do is TestClearPageDirty???).

nfs_set_page_dirty() and reiserfs_set_page_dirty() should now bail if
PageDirty() to avoid needless work.

  - For a long time I've wanted to nuke the current ext3/jbd ordered-data
implementation altogether, and just make kjournald call into the
standard writeback code to do a standard suberblock-inodes-pages walk.
 
 I really would like to see less of the buffer-head-based stuff, and yes, 
 more of the normal inode page walking. I don't think you can order 
 accesses within a page anyway, exactly because of memory mapping issues, 
 so any page ordering is not about buffer heads on the page itself, it 
 should be purely about metadata.

In this context ext3's ordered mode means sync the file contents before
journalling the metadata which points at it.

  - It's pretty obnoxious that the VM now sets a clean page dirty and
then proceeds to modify its contents.  It would be nice to stop doing
that.
 
 No. I think this really the fundamental confusion people had. People 
 thought that setting the page dirty meant that it was no longer being 
 modified.

No.  Setting a page (or bh, or inode) dirty means this is known to have
been modified.  ie: this cached entity is now out of sync with backing
store.

Ho hum.  I don't care much, really.  But then, I understand how all this
stuff works.  Try explaining to someone the relationship between
pte-dirtiness, page-dirtiness, radix-tree-dirtiness and
buffer_head-dirtiness.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
 
 They're extra.  As in can be optimised away.

Sure. Don't use buffer heads.

 The buffer_head is not an IO container.  It is the kernel's core
 representation of a disk block.

Please come back from the 90's.

The buffer heads are nothing but a mapping of where the hardware block is. 
If you use it for anything else, you're basically screwed.

 JBD implements physical block-based journalling, so it is 100% appropriate
 that JBD deal with these disk blocks using their buffer_head
 representation.

And as long as it does that, you just have to face the fact that it's 
going to perform like crap, including what you call extra writes, and 
what I call deal with it.

Btw, you can make pages be physically indexed too, but they obviously
 (a) won't be coherent with any virtual mapping laid on top of it
 (b) will be _physical_, so any readahead etc will be based on physical 
 addresses too.

 I thought I fixed the performance problem?

No, you papered over it, for the reasonably common case where things were 
physically contiguous - exactly by using a physical page cache, so now it 
can do read-ahead based on that. Then, because the pages contain buffer 
heads, the directory accesses can look up buffers, and if it was all 
physically contiguous, it all works fine.

But if you actually want virtualluy indexed caching (and all _users_ want 
it), it really doesn't work.

 Somewhat nastily, but as ext3 directories are metadata it is appropriate
 that modifications to them be done in terms of buffer_heads (ie: blocks).

No. There is nothing appropriate about using buffer_heads for metadata. 

It's quite proper - and a hell of a lot more efficient - to use virtual 
page-caching for metadata too.

Look at the ext2 readdir() implementation, and compare it to the crapola 
horror that is ext3. Guess what? ext2 uses virtually indexed metadata, and 
as a result it is both simpler, smaller and a LOT faster than ext3 in 
accessing that metadata.

Face it, Andrew, you're wrong on this one. Really. Just take a look at 
ext2_readdir(). 

[ I'm not saying that ext2_readdir() is _beautiful_. If it had been 
  written with the page cache in mind, it would probably have been done 
  very differently. And it doesn't do any readahead, probably because 
  nobody cared enough, but it should be trivial to add, and it would 
  automatically do the right thing just because it's much easier at the 
  page cache level.

  But I _am_ saying that compared to ext3, the ext2 readdir is a work of 
  art. ]

metadata has _zero_ to do with physically indexed. There is no 
correlation what-so-ever. If you think there is a correlation, it's all in 
your mind.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Theodore Tso
On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
 I think ext3 is terminally crap by now. It still uses buffer heads in 
 places where it really really shouldn't, and as a result, things like 
 directory accesses are simply slower than they should be. Sadly, I don't 
 think ext4 is going to fix any of this, either.

Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
going to have to put this (a rework of jbd2 to use the page cache) on
the ext4 todo list, and work with the ocfs2 folks to try to come up
with something that suits their needs as well.  Fortunately we have
this filesystem/storage summit thing coming up in the next few months,
and we can try to get some discussion going on the linux-ext4 mailing
list in the meantime.  Unfortunately, I don't think this is going to
be trivial.

If we do get this fixed for ext4, one interesting question is whether
people would accept a patch to backport the fixes to ext3, given the
the grief this is causing the page I/O and VM routines.  OTOH, reiser3
probably has the same problems, and I suspect the changes to ext3 to
cause it to avoid buffer heads, especially in order to support for
filesystem blocksizes  pagesize, are going to be sufficiently risky
in terms of introducing regressions to ext3 that they would probably
be rejected on those grounds.  So unfortunately, we probably are going
to have to support flushes via buffer heads for the foreseeable
future.

- Ted
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Theodore Tso wrote:

 If we do get this fixed for ext4, one interesting question is whether
 people would accept a patch to backport the fixes to ext3, given the
 the grief this is causing the page I/O and VM routines.

I don't think backporting is the smartest option (unless it's done _way_ 
later), but the real problem with it isn't actually the VM behaviour, but 
simply the fact that cached performance absoluely _sucks_ with the buffer 
cache.

With the physically indexed buffer cache thing, you end up always having 
to do these complicated translations into block numbers for every single 
access, and at some point when I benchmarked it, it was a huge overhead 
for doing simple things like readdir.

It's also a major pain for read-ahead, exactly partly due to the high cost 
of translation - because you can't cheaply check whether the next block is 
there, the cost of even asking the question should I try to read ahead? 
is much much higher. As a result, read-ahead is seriously limited, because 
it's so expensive for the cached case (which is still hopefully the 
_common_ case).

So because read-ahead is limited, the non-cached case then _really_ sucks.

It was somewhat fixed in a really god-awful fashion by having 
ext3_readdir() actually do _readahead_ though the page cache, even though 
it does everything else through the buffer cache. And that just happens to 
work because we hopefully have physically contiguous blocks, but when that 
isn't true, the readahead doesn't do squat.

It's really quite fundamentally broken. But none of that causes any 
problems for the VM, since directories cannot be mmap'ed anyway. But it's 
really pitiful, and it really doesn't work very well. Of course, other 
filesystems _also_ suck at this, and other operating systems haev even 
MORE problems, so people don't always seem to realize how horribly 
horribly broken this all is.

I really wish somebody would write a filesystem that did large cold-cache 
directories well. Open some horrible file manager on /usr/bin with cold 
caches, and weep. The biggest problem is the inode indirection, but at 
some point when I looked at why it sucked, it was doing basically 
synchronous single-buffer reads on the directory too, because readahead 
didn't work properly.

I was hoping that something like SpadFS would actually take off, because 
it seemed to do a lot of good design choices (having inodes in-line in the 
directory for when there are no hardlinks is probably a requirement for a 
good filesystem these days. The separate inode table had its uses, but 
indirection in a filesystem really does suck, and stat information is too 
important to be indirect unless it absolutely has to).

But I suspect it needs more than somebody who just wants to get his thesis 
written ;)

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 18:32:07 -0500
Theodore Tso [EMAIL PROTECTED] wrote:

 On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
  I think ext3 is terminally crap by now. It still uses buffer heads in 
  places where it really really shouldn't, and as a result, things like 
  directory accesses are simply slower than they should be. Sadly, I don't 
  think ext4 is going to fix any of this, either.
 
 Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
 going to have to put this (a rework of jbd2 to use the page cache) on
 the ext4 todo list, and work with the ocfs2 folks to try to come up
 with something that suits their needs as well.  Fortunately we have
 this filesystem/storage summit thing coming up in the next few months,
 and we can try to get some discussion going on the linux-ext4 mailing
 list in the meantime.  Unfortunately, I don't think this is going to
 be trivial.

I suspect it would be insane to move any part of JBD (apart from the
ordered-data flush) to use pagecache.  The whole thing is fundamentally
block-based.  But only for metadata - there's no strong reason why ext3/4
needs to manipulate file data via buffer_heads if data=journal and chattr
+j aren't in use.

We could possibly move ext3/4 directories out of the blockdev pagecache and
into per-directory pagecache, but that wouldn't change anything - the
journalling would still be block-based.

Adam Richter spent considerable time a few years ago trying to make the
mpage code go direct-to-BIO in all cases and we eventually gave up.  The
conceptual layering of page-blocks-bio is pretty clean, and it is hard
and ugly to fully optimise away the block bit in the middle.

buffer_heads become more important with large PAGE_CACHE_SIZE.  I'd expect
nobh mode to be quite inefficient with some workloads on 64k pages.  We
need that representation of the state (and location) of the block-sized
hunks which make up the page.

 If we do get this fixed for ext4, one interesting question is whether
 people would accept a patch to backport the fixes to ext3, given the
 the grief this is causing the page I/O and VM routines.  OTOH, reiser3
 probably has the same problems, and I suspect the changes to ext3 to
 cause it to avoid buffer heads, especially in order to support for
 filesystem blocksizes  pagesize, are going to be sufficiently risky
 in terms of introducing regressions to ext3 that they would probably
 be rejected on those grounds.  So unfortunately, we probably are going
 to have to support flushes via buffer heads for the foreseeable
 future.

We'll see.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 16:11:44 -0800 (PST)
Linus Torvalds [EMAIL PROTECTED] wrote:

 
 
  JBD implements physical block-based journalling, so it is 100% appropriate
  that JBD deal with these disk blocks using their buffer_head
  representation.
 
 And as long as it does that, you just have to face the fact that it's 
 going to perform like crap, including what you call extra writes, and 
 what I call deal with it.

It is quite tiresome to delete things which your interlocutor said and to
then restate them as if it were some sort of relevation.

  Somewhat nastily, but as ext3 directories are metadata it is appropriate
  that modifications to them be done in terms of buffer_heads (ie: blocks).
 
 No. There is nothing appropriate about using buffer_heads for metadata. 

I said modification.

 [stuff about directory reads elided]

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:

   Somewhat nastily, but as ext3 directories are metadata it is appropriate
   that modifications to them be done in terms of buffer_heads (ie: blocks).
  
  No. There is nothing appropriate about using buffer_heads for metadata. 
 
 I said modification.

You said metadata.

Why do you think directories are any different from files? Yes, they are 
metadata. So what? What does that have to do with anything?

They should still use virtual indexes, the way files do. That doesn't 
preclude them from using buffer-heads to mark their (partial-page) 
modifications and for keeping the virtual-physical translations cached.

I mean, really. Look at ext2. It does exactly that. It keeps the 
directories in the page cache - virtually indexed. And it even modifies 
them there. Exactly the same way it modifies regular file data.

It all works exactly the same way it works for regular files. It uses

page-mapping-a_ops-prepare_write(NULL, page, from, to);
... do modification ...
ext2_commit_chunk(page, from, to);

exactly the way regular file data works. 

That's why I'm saying there is absolutely _zero_ thing about metadata 
here, or even about modifications. It all works better in a virtual 
cache, because you get all the support that we give to page caches.

So I really don't understand why you make excuses for ext3 and talk about 
modifications and metadata. It was a fine design ten years ago. It's 
not really very good any longer.

I suspect we're stuck with the design, but that doesn't make it any 
_better_.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Andrew Morton
On Fri, 29 Dec 2006 16:58:41 -0800 (PST)
Linus Torvalds [EMAIL PROTECTED] wrote:

 
 
 On Fri, 29 Dec 2006, Andrew Morton wrote:
 
Somewhat nastily, but as ext3 directories are metadata it is appropriate
that modifications to them be done in terms of buffer_heads (ie: 
blocks).
   
   No. There is nothing appropriate about using buffer_heads for metadata. 
  
  I said modification.
 
 You said metadata.
 
 Why do you think directories are any different from files? Yes, they are 
 metadata. So what? What does that have to do with anything?

We journal the contents of directories.  Fully.  So we handle their dirty
data at the block (ie: buffer_head) level.  When someone tries to dirty
part of a directory we need to cheat and not mark that part of the page as
dirty and we need to then write the block to the journal and then mark the
block as really dirty for checkpointing (but still attached to the journal)
and all that goop.

The regular page-based writeback doesn't apply until the block has been
written to the journal.  At that stage the block is considered dirty
against its real position on disk.  It will then be written back by pdflush
via the blockdev inode - blkdev_writepage().  Unless kjournald needs to do
an early flush to reclaim the journal space, in which case kjournald will
write the block itself.

 
 So I really don't understand why you make excuses for ext3 and talk about 
 modifications and metadata. It was a fine design ten years ago. It's 
 not really very good any longer.
 

As I said in another apparently-neglected email:

: We could possibly move ext3/4 directories out of the blockdev pagecache and
: into per-directory pagecache, but that wouldn't change anything - the
: journalling would still be block-based.

We already have all the code in place to journal blocks which are cached in
an address_space other than the blockdev inode's: ext3_journalled_aops.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

2006-12-29 Thread Linus Torvalds


On Fri, 29 Dec 2006, Andrew Morton wrote:
 
 Adam Richter spent considerable time a few years ago trying to make the
 mpage code go direct-to-BIO in all cases and we eventually gave up.  The
 conceptual layering of page-blocks-bio is pretty clean, and it is hard
 and ugly to fully optimise away the block bit in the middle.

Using the buffer cache as a translation layer to the physical address is 
fine. That's what _any_ block device will do.

I'm not at all sayign that buffer heads must go away. They work fine.

What I'm saying is that

 - if you index by buffer heads, you're screwed.
 - if you do IO by starting at buffer heads, you're screwed.

Both indexing and writeback decisions should be done at the page cache 
layer. Then, when you actually need to do IO, you look at the buffers. But 
you start from the page. YOU SHOULD NEVER LOOK UP a buffer on its own 
merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance.

So by all means keep the buffer heads as a way to keep the 
virtual-physical translation. It's what they were designed for. But 
they were _originally_ also designed for lookup and driving the start 
of IO, and that is wrong, and has been wrong for a long time now, because

 - lookup based on physical address is fundamentally slow and inefficient. 
   You have to look up the virtual-physical translation somewhere else, 
   so it's by design an unnecessary indirection _and_ that somewere 
   else is also by definition filesystem-specific, so you can't do any 
   of these things at the VFS layer.

   Ergo: anything that needs to look up the physical address in order to 
   find the buffer head is BROKEN in this day and age. We look up the 
   _virtual_ page cache page, and then we can trivially find the buffer 
   heads within that page thanks to page-buffers.

   Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't. 

 - starting IO based on the physical entity is insane. It's insane exactly 
   _because_ the VM doesn't actually think in physical addresses, or in 
   buffer-sized blocks. The VM only really knows about whole pages, and 
   all the VM decisions fundamentally have to be page-based. We don't ever 
   free a buffer. We free a whole page, and as such, doing writeback 
   based on buffers is pointless, because it doesn't actually say anything 
   about the page state which is what the VM tracks.

But neither of these means that buffer_head itself has to go away. They 
both really boil down to the same thing: you should never KEY things by 
the buffer head. All actions should be based on virtual indexes as far as 
at all humanly possible.

Once you do lookup and locking and writeback _starting_ from the page, 
it's then easy to look up the actual buffer head within the page, and use 
that as a way to do the actual _IO_ on the physical address. So the buffer 
heads still exist in ext2, for example, but they don't drive the show 
quite as much.

(They still do in some areas: the allocation bitmaps, the xattr code etc. 
But as long as none of those have big VM footprints, and as long as no 
_common_ operations really care deeply, and as long as those data 
structures never need to be touched by the VM or VFS layer, nobody will 
ever really care).

The directory case comes up just because readdir() actually is very 
common, and sometimes very slow. And it can have a big VM working set 
footprint (find), so trying to be page-based actually really helps, 
because it all drives things like writeback on the _right_ issues, and we 
can do things like LRU's and writeback decisions on the level that really 
matters.

I actually suspect that the inode tables could benefit from being in the 
page cache too (although I think that the inode buffer address is actually 
physical, so there's no indirection for inode tables, which means that 
the virtual vs physical addressing doesn't matter). For directories, there 
definitely is a big cost to continually doing the virtual-physical 
translation all the time.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Fri, 29 Dec 2006, Segher Boessenkool wrote:
>
> > I think what might be happening is that pdflush writes them out fine,
> > however we don't trap writes by the application _during_ that writeout.
> 
> Yeah.  I believe that more exactly it happens if the very last
> write to the page causes a writeback (due to dirty balancing)
> while another writeback for the page is already happening.
> 
> As usual in these cases, I have zero proof.

I actually have proof to the contrary, ie I have traces that say "the 
write was started" after the last write.

And the VM layer in this area is actually fairly sane and civilized. It 
has a bit that says "writeback in progress", and if that bit is set, it 
simply _will_not_ start a new write. It even has various BUG_ON()'s to 
that effect.

So everything I have ever seen says that the VM layer is actually doing 
everything right.

> It's the do_wp_page -> balance_dirty_pages -> generic_writepages
> path for sure.  Maybe it's enough to change
> 
> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
> 
> if (PageWriteback(page) ||
> !clear_page_dirty_for_io(page)) {
> unlock_page(page);
> continue;
> }

Notive how this one basically says:

 - if it's under writeback, don't even clear the page dirty flag.

Your suggested change:

> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
> 
> if (PageWriteback(page)) {
>   redirty_page_for_writepage(wbc, page);

makes no sense, because we simply never _did_ the "clear_page_dirty()" if 
the thing was under writeback in the first place. That's how C 
conditionals work.  So there's no reason to "redirty" it, because it 
wasn't cleaned in the first place.

I've double- and triple-checked the dirty bits, including having traces 
that actually say that the IO was started (from a VM perspective) _after_ 
the last write was done. The IO just didn't hit the disk.

I'm personally fairly convinced that it's not a VM issue, but a "IO 
issue". Either in a low-level filesystem or in some of the fs/buffer.c 
helper routines.

But I'd love to be proven wrong. 

I do have a few interesting details from the trace I haven't really 
analyzed yet. Here's the trace for events on one of the pages that was 
corrupted. Note how the events are numbered (there were 171640 events 
total), so the thing you see is just a small set of events from the whole 
big trace, but it's the ones that talk about _that_ particular page.

I've grouped them so hat "consecutive" events group together. That just 
means that no events on any other pages happened in between those events, 
and it is usually a sign that it's really one single call-chain that 
causes all the events.

For example, for the first group of three events (44366-44368), it's the 
page fault that brings in the page, and since it's a write-fault, it will 
not only map the page, it will mark the page itself dirty and then also 
set the TAG_DIRTY on the mapping. So the "group" is just really a result 
of one single event happening, which causes several things to happen to 
that page. That's exactly what you'd expect.

Anyway, here is the list of events that page went through:

   44366  PG 0f6d: mm/memory.c:2254 mapping at b789fc54 (write)
   44367  PG 0f6d: mm/page-writeback.c:817 setting dirty
   44368  PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY

   64231  PG 0f6d: mm/page-writeback.c:872 clean_for_io
   64232  PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000
   64233  PG 0f6d: mm/page-writeback.c:914 set writeback
   64234  PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
   64235  PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY

   67570  PG 0f6d: mm/page-writeback.c:891 end writeback
   67571  PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

   76705  PG 0f6d: mm/page-writeback.c:817 setting dirty
   76706  PG 0f6d: fs/buffer.c:725 dirtied buffers
   76707  PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY

  105267  PG 0f6d: mm/page-writeback.c:872 clean_for_io
  105268  PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000
  105269  PG 0f6d: mm/page-writeback.c:914 set writeback
  105270  PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
  105271  PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY
  105272  PG 0f6d: mm/page-writeback.c:891 end writeback
  105273  PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

  128032  PG 0f6d: mm/memory.c:670 unmapped at b789f000

  132662  PG 0f6d: mm/filemap.c:119 Removing page cache

  148278  PG 0f6d: mm/memory.c:2254 mapping at b789f000 (read)

  166326  PG 0f6d: mm/memory.c:670 unmapped at b789f000

  171958  PG 

Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Segher Boessenkool

I think what might be happening is that pdflush writes them out fine,
however we don't trap writes by the application _during_ that writeout.


Yeah.  I believe that more exactly it happens if the very last
write to the page causes a writeback (due to dirty balancing)
while another writeback for the page is already happening.

As usual in these cases, I have zero proof.


It's something that will only occur with writeback and MAP_SHARED
writable access to the file pages.


It's the do_wp_page -> balance_dirty_pages -> generic_writepages
path for sure.  Maybe it's enough to change

if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);

if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}

to

if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);

if (PageWriteback(page)) {
redirty_page_for_writepage(wbc, page);
unlock_page(page);
continue;
}

if (!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}

or something along those lines.  Completely untested of course :-)


Segher

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread David Miller
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST)

> I get corruption - but the whole point is that it's very much pdflush that 
> should be writing these pages out.

I think what might be happening is that pdflush writes them out fine,
however we don't trap writes by the application _during_ that writeout.

These corruptions look exactly as if:

1) pdflush begins writeback of page X
2) page goes to disk
3) application writes a chunk to the page
4) pdflush et al. think the page is clean, so it gets tossed, losing
   the writes done in #3

So there's a missing PTE change in there, so that we never get proper
re-dirtying of the page if the application tries to write to the page
during the writeback.

It's something that will only occur with writeback and MAP_SHARED
writable access to the file pages.  That's why we never see this
with normal filesystem writes, since those explicitly manage the
page dirty state.

I think the dirty balancing logic etc. isn't where the problems are,
to me it's a PTE state update issue for sure.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Mike Galbraith
On Thu, 2006-12-28 at 11:45 -0800, Andrew Morton wrote:
> On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
> Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > 
> > 
> > On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> > > 
> > > The attached patch fixes the corruption for me.
> > 
> > Well, that's a good hint, but it's really just a symptom. You effectively 
> > just made the test-program not even try to flush the data to disk, so the 
> > page cache would stay in memory, and you'd not see the corruption as well.
> > 
> > So you basically disabled the code that tried to trigger the bug more 
> > easily.
> > 
> > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much 
> > implicated in mm/page-writeback.c, and if there is a bug triggered by 
> > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also 
> > fails..
> > 
> 
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED.  That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).

I did fdatasync(), tried remapping before unmapping... nogo here.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Andrew Morton wrote:
> 
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED.  That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).

I get corruption - but the whole point is that it's very much pdflush that 
should be writing these pages out.

Andrew - give my test-program a try. It can run in about 1 minute if you 
have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems 
to very consistently cause corruption.

What I do is:

# Make sure we write back aggressively
echo 5 > /proc/sys/vm/dirty_ratio

as root, and then just run the thing. Tons of corruption. But the 
corruption goes away if I just leave the default dirty ratio alone (but 
then I can increse the file size to trigger it, of course - but that 
also makes the test run a lot slower).

Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as 
easily (at least with the "fsync()"), but that's less to do with anything 
new, and probably just because then you simply won't have any pdflushing 
going on - since the kernel won't even notice that you have tons of dirty 
pages ;)

It might also depend on the speed of your disk drive - the machine I test 
this on has a slow 4200 rpm laptop drive in it, and that probably makes 
things go south more easily. That's _especially_ true if this is related 
to any "bdi_write_congested()" logic.

Now, it could also be related to various code snippets like

...
if (wbc->sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);

if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}
...

where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do 
the writeback at all (but it also won't clear the dirty bit, so it's 
certainly not an *OBVIOUS* bug).

We also have code like this ("pageout()"):

if (clear_page_dirty_for_io(page)) {
int res;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
..
}
...
res = mapping->a_ops->writepage(page, );

and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call 
won't do anything at all because of congestion, then that would be a _bad_ 
thing, and would certainly explain how something didn't get written out.

But that particular path should only trigger for the "shrink_page_list()" 
case, and it's not the case I seem to be testing with my "low dirty_ratio" 
testing.

Linus#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned 
long)(mapping))

static int chunkorder[NRCHUNKS];
static char *mapping;

static int order(int nr)
{
int i;
if (nr < 0 || nr >= NRCHUNKS)
return -1;
for (i = 0; i < NRCHUNKS; i++)
if (chunkorder[i] == nr)
return i;
return -2;
}

static void checkmem(void *buf, int nr)
{
unsigned int start = ~0u, end = 0;
unsigned char c = nr, *p = buf, differs = 0;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
unsigned char got = *p++;
if (got != c) {
if (i < start)
start = i;
if (i > end)
end = i;
differs = got;
}
}
if (start < end) {
printf("Chunk %d corrupted (%u-%u)  (%x-%x)\n", nr, 
start, end,
page_offset(buf, start), page_offset(buf, end));
printf("Expected %u, got %u\n", c, differs);
printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), 
order(nr+1));
}
}

static char *remap(int fd, char *mapping)
{
if (mapping) {
munmap(mapping, SIZE);
//  fsync(fd);
posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
}
return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
int fd, i;

/*
 * Make some random ordering of writing the chunks to the
 * memory map..
 *
 * Start with fully ordered..
 */
for (i = 0; i < NRCHUNKS; i++)
chunkorder[i] = i;

/* ..and then mix it up randomly */
srandom(time(NULL));
for (i = 0; i < NRCHUNKS; i++) {
int index = (unsigned int) random() % 

Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Andrew Morton
On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> 
> 
> On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> > 
> > The attached patch fixes the corruption for me.
> 
> Well, that's a good hint, but it's really just a symptom. You effectively 
> just made the test-program not even try to flush the data to disk, so the 
> page cache would stay in memory, and you'd not see the corruption as well.
> 
> So you basically disabled the code that tried to trigger the bug more 
> easily.
> 
> But the reason I say it's interesting is that "WB_SYNC_NONE" is very much 
> implicated in mm/page-writeback.c, and if there is a bug triggered by 
> WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also 
> fails..
> 

It would be interesting to convert your app to do fsync() before
FADV_DONTNEED.  That would take WB_SYNC_NONE out of the picture as well
(apart from pdflush activity).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> 
> The attached patch fixes the corruption for me.

Well, that's a good hint, but it's really just a symptom. You effectively 
just made the test-program not even try to flush the data to disk, so the 
page cache would stay in memory, and you'd not see the corruption as well.

So you basically disabled the code that tried to trigger the bug more 
easily.

But the reason I say it's interesting is that "WB_SYNC_NONE" is very much 
implicated in mm/page-writeback.c, and if there is a bug triggered by 
WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also 
fails..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Guillaume Chazarain

Guillaume Chazarain a écrit :
I get this kind of corruption: 
http://guichaz.free.fr/linux-bug/corruption.png


Actually in qemu, I get three different behaviours:
- no corruption at all : with linux-2.4
- corruption only on the first chunks: before  [PATCH] mm: balance dirty 
pages as identified by Kenneth

- corruption of all chunks: after the balance dirty pages patch

Bisecting in linux-2.5 land I found 
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm3/broken-out/fadvise-flush-data.patch 
to cause the corruption for me.


The attached patch fixes the corruption for me.

--
Guillaume

diff -r 3859b1144d3a mm/fadvise.c
--- a/mm/fadvise.c	Sun Dec 24 05:00:03 2006 +
+++ b/mm/fadvise.c	Thu Dec 28 19:53:40 2006 +0100
@@ -96,9 +96,6 @@ asmlinkage long sys_fadvise64_64(int fd,
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!bdi_write_congested(mapping->backing_dev_info))
-			filemap_flush(mapping);
-
 		/* First and last FULL page! */
 		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
 		end_index = (endbyte >> PAGE_CACHE_SHIFT);


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Russell King wrote:
> 
> Yup, but I have nothing to do with glibc because I refuse to do that
> silly copyright assignment FSF thing.  Hopefully someone else can
> resolve it, but...

Yeah, me too.

> _is_ a fix whether _you_ like it or not to work around the issue so
> people can at least run your test program.  I'm not saying it's a
> proper fix though.

My point was that it wasn't a "fix", it's a "workaround". The _fix_ would 
be in glibc.

Nothing more..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Russell King
On Thu, Dec 28, 2006 at 09:27:12AM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Russell King wrote:
> > and if you look at glibc's memset() function, you'll notice that's exactly
> > what you expect if you pass a non-8bit value to it.  Ergo, what you're
> > seeing is utterly expected given glibc's memset() implementation on ARM.
> 
> Guys, you _really_ should fix memset(). What you describe is a _bug_. 

Yup, but I have nothing to do with glibc because I refuse to do that
silly copyright assignment FSF thing.  Hopefully someone else can
resolve it, but...

> > Fixing Linus' test program to pass nr & 255 to memset
> 
> No. I'm almost certain that that is not a "fix", it's a workaround for a 
> serious bug in your glibc crap.

_is_ a fix whether _you_ like it or not to work around the issue so
people can at least run your test program.  I'm not saying it's a
proper fix though.

Of course, if you prefer to be mislead by incorrect bug reports...

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Russell King wrote:
> 
> and if you look at glibc's memset() function, you'll notice that's exactly
> what you expect if you pass a non-8bit value to it.  Ergo, what you're
> seeing is utterly expected given glibc's memset() implementation on ARM.

Guys, you _really_ should fix memset(). What you describe is a _bug_. 

"memset()" takes an "int" as its argument (always has), and has to convert 
it to a byte _itself_. It may not be common, but it's perfectly normal, to 
pass it values outside 0-255 (negative values that still fit in a "signed 
char" in particular are very normal, but my usage of "let the thing 
truncate it itself" is also quite fine).

> Fixing Linus' test program to pass nr & 255 to memset

No. I'm almost certain that that is not a "fix", it's a workaround for a 
serious bug in your glibc crap.

But it does explain all the unexpected strange behaviour (and the really 
small writeback size - now it doesn't need any /proc/sys/vm/dirty_ratio 
assumptions to be explicable.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Thu, 28 Dec 2006, Zhang, Yanmin wrote:
> 
> The test program is a process to write/read data. pdflush might write data
> to disk asynchronously. After pdflush writes a page to disk, it will call 
> (either by
> softirq) clear_page_dirty to clear the dirty bit after getting the interrupt
> notification.

That would indeed be a horrible bug. However, we don't do 
"clear_page_dirty()" _after_ the IO has completed, we do it _before_ the 
IO starts.

If you can actually find a place that does clear_page_dirty _after_ IO, 
then yes, you've just found the bug. But I haven't found it.

So the rule is _always_:

 - call "clear_page_dirty_for_io()" with the page lock held, and _before_ 
   the IO starts.
 - do "set_page_writeback()" before unlocking the page again
 - do a "end_page_writeback()" when the IO actually finishes.

and any code sequence that doesn't honor those rules would be buggy. A 
beer for anybody that finds it..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Wed, 27 Dec 2006, Chen, Kenneth W wrote:
> > 
> > Running the test code, git bisect points its finger at this commit. 
> > Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> > 
> > [PATCH] mm: balance dirty pages
> > 
> > Now that we can detect writers of shared mappings, throttle them.  
> > Avoids OOM
> > by surprise.
> 
> Oh, never mind :-(  I just didn't create enough write out pressure when
> test this. I just saw bug got triggered on a kernel I previously thought
> was OK.

Btw, this is an important point - people have long felt that the new page 
balancing in 2.6.19 was to blame, but you've just confirmed the long-held 
suspicion (at least by me) that it's not actually a new bug at all, it's 
just that the dirty page balancing causes writeback to happen _earlier_, 
and thus is better able to _show_ a bug that we've likely had for a long 
long time.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Wed, 27 Dec 2006, Gordon Farquharson wrote:
> 
> 100kB and 200kB files always succeed on the ARM system. 400kB and
> larger always seem to fail.

Oh, wow. Yeah, I've just repressed how tiny 32MB is. And especially if you 
lowered the /proc/sys/vm/dirty_ratio to a smaller percentage, I guess 
400kB should be enough to cause writeback.

Ugh. I tested a 128MB machine a few weeks ago, and found it painful.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread [EMAIL PROTECTED]

On Thu Dec 28 15:09 , Guillaume Chazarain sent:

>I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
>I have corruption with every Fedora release kernel except the first, that is
>2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit 
>some corruption.

Confirm that I see corruption with Fedora kernel 2.6.18-1.2239.fc5:

...
Chunk 142969 corrupted (0-1459)  (2580-4039)
Expected 121, got 0
Written as (89632)127652(124721)
Chunk 142976 corrupted (0-1459)  (512-1971)
Expected 128, got 0
Written as (121734)128324(108589)
Checking chunk 143639/143640 (99%)
$ uname -a
Linux 2.6.18-1.2239.fc5 #1 Fri Nov 10 13:04:06 EST 2006 i686 athlon i386 
GNU/Linux

/Martin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-28 07:15]:
> Thanks for the fix, Russell.
> 
> I can now trigger the (real) problem by using a 25 MB file (100 << 18)
> and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).

Me too (using 100 << 18).  Interestingly, I don't seem to get any
corruption on a different ARM board, an IOP32x based machine with 128
MB RAM.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Guillaume Chazarain

I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
I have corruption with every Fedora release kernel except the first, that is
2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit 
some

corruption.

Command line to test:

qemu root_fs -snapshot -kernel FC-kernels/FC2-vmlinuz-2.6.5-1.358 -append 'rw 
root=/dev/hda'

I get this kind of corruption: 
http://guichaz.free.fr/linux-bug/corruption.png


--
Guillaume

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Russell King <[EMAIL PROTECTED]> [2006-12-28 10:49]:
> > By the way, I just tried it with TARGETSIZE (100 << 12) on a different
> > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
> > memory) and I see similar results to that from Gordon:
> 
> Work around the glibc memset() problem by passing nr & 255, and re-run
> the test.  You're getting false positives.

Yeah, I saw your message about this problem after I sent mine.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Gordon Farquharson

On 12/28/06, Russell King <[EMAIL PROTECTED]> wrote:


Fixing Linus' test program to pass nr & 255 to memset results in clean
passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
machines (as would be expected.)


Thanks for the fix, Russell.

I can now trigger the (real) problem by using a 25 MB file (100 << 18)
and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).

$ ./linus-test
Writing chunk 17954/17955 (99%)
Chunk 514 corrupted (0-1459)  (872-2331)
Expected 2, got 0
Written as (8479)11160(10312)
Chunk 516 corrupted (0-303)  (3792-4095)
Expected 4, got 0
Written as (10312)10569(4426)
Chunk 959 corrupted (0-691)  (3404-4095)
Expected 191, got 0
Written as (687)4881(1522)
Chunk 1895 corrupted (0-1459)  (1900-3359)
Expected 103, got 0
Written as (7746)8389(6231)
Chunk 2702 corrupted (0-1459)  (472-1931)
Expected 142, got 0
Written as (4866)7103(2409)
Chunk 3314 corrupted (0-1459)  (1064-2523)
Expected 242, got 0
Written as (4287)7064(1730)
Chunk 4043 corrupted (0-1459)  (444-1903)
Expected 203, got 0
Written as (6495)8509(4464)
Chunk 5180 corrupted (0-1459)  (1584-3043)
Expected 60, got 0
Written as (11056)12826(10797)
Chunk 5672 corrupted (0-991)  (3104-4095)
Expected 40, got 0
Written as (9944)4872(41)
Chunk 5793 corrupted (460-1459)  (0-999)
Expected 161, got 0
Written as (7059)5038(4377)
Chunk 6089 corrupted (0-1459)  (1620-3079)
Expected 201, got 0
Written as (4672)5230(4403)
Chunk 6545 corrupted (268-1459)  (0-1191)
Expected 145, got 0
Written as (3701)5969(4668)
Chunk 7578 corrupted (0-1459)  (584-2043)
Expected 154, got 0
Written as (10015)5082(1648)
Chunk 7880 corrupted (864-1459)  (0-595)
Expected 200, got 0
Written as (17869)5064(4745)
Chunk 8086 corrupted (0-1459)  (888-2347)
Expected 150, got 0
Written as (10206)11050(10374)
Chunk 8749 corrupted (0-1459)  (2212-3671)
Expected 45, got 0
Written as (15263)7132(4825)
Chunk 9068 corrupted (0-1459)  (1008-2467)
Expected 108, got 0
Written as (5557)7571(6771)
Chunk 9193 corrupted (812-1459)  (0-647)
Expected 233, got 0
Written as (9238)7277(4757)
Chunk 10032 corrupted (576-1459)  (0-883)
Expected 48, got 0
Written as (15741)10012(1753)
Chunk 10056 corrupted (0-1459)  (1696-3155)
Expected 72, got 0
Written as (5379)7431(262)
Chunk 10395 corrupted (0-1459)  (1020-2479)
Expected 155, got 0
Written as (21)7442(5902)
Chunk 10791 corrupted (0-1459)  (1644-3103)
Expected 39, got 0
Written as (4753)5925(5926)
Chunk 10792 corrupted (0-991)  (3104-4095)
Expected 40, got 0
Written as (5925)5926(8555)
Chunk 11036 corrupted (0-1103)  (2992-4095)
Expected 28, got 0
Written as (13755)14449(7458)
Chunk 11387 corrupted (644-1459)  (0-815)
Expected 123, got 0
Written as (10853)11459(9445)
Chunk 11586 corrupted (920-1459)  (0-539)
Expected 66, got 0
Written as (3769)11691(11123)
Chunk 11882 corrupted (0-1459)  (1160-2619)
Expected 106, got 0
Written as (10736)11696(2788)
Chunk 12397 corrupted (0-603)  (3492-4095)
Expected 109, got 0
Written as (2352)7515(2437)
Chunk 12669 corrupted (0-795)  (3300-4095)
Expected 125, got 0
Written as (1191)7661(5266)
Chunk 13162 corrupted (0-1459)  (2184-3643)
Expected 106, got 0
Written as (9383)13662(11544)
Chunk 14653 corrupted (0-27)  (4068-4095)
Expected 61, got 0
Written as (8100)9456(1275)
Chunk 17332 corrupted (0-367)  (3728-4095)
Expected 180, got 0
Written as (760)12247(1244)
Chunk 17445 corrupted (0-1459)  (772-2231)
Expected 37, got 0
Written as (8007)16481(14439)
Chunk 17556 corrupted (0-1007)  (3088-4095)
Expected 148, got 0
Written as (10113)10657(10477)
Chunk 17859 corrupted (0-995)  (3100-4095)
Expected 195, got 0
Written as (14472)14767(11426)
Checking chunk 17954/17955 (99%)

Gordon

--
Gordon Farquharson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Petri Kaukasoina
On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote:
> [ Modified test-program that tells you where the corruption happens (and 
>   when the missing parts were supposed to be written out) appended, in 
>   case people care. ]

Hi

2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots
between them but they hung before shell (2.6.18-git11, 2.6.18-git16,
2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken.

(UP, no preempt)

-Petri
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Russell King
On Thu, Dec 28, 2006 at 11:16:59AM +0100, Martin Michlmayr wrote:
> * Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]:
> > >> #define TARGETSIZE (100 << 12)
> > >
> > >That's just 400kB!
> > >
> > >There's no way you should see corruption with that kind of value. It
> > >should all stay solidly in the cache.
> > >
> > >Is this perhaps with ARM nommu or something else strange? It may be that
> > >the program just doesn't work at all if mmap() is faked out with a malloc
> > >or similar.
> > 
> > Definitely a question for the ARM gurus. I'm out of my depth.
> 
> By the way, I just tried it with TARGETSIZE (100 << 12) on a different
> ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
> memory) and I see similar results to that from Gordon:

Work around the glibc memset() problem by passing nr & 255, and re-run
the test.  You're getting false positives.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]:
> >> #define TARGETSIZE (100 << 12)
> >
> >That's just 400kB!
> >
> >There's no way you should see corruption with that kind of value. It
> >should all stay solidly in the cache.
> >
> >Is this perhaps with ARM nommu or something else strange? It may be that
> >the program just doesn't work at all if mmap() is faked out with a malloc
> >or similar.
> 
> Definitely a question for the ARM gurus. I'm out of my depth.

By the way, I just tried it with TARGETSIZE (100 << 12) on a different
ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
memory) and I see similar results to that from Gordon:

Writing chunk 279/280 (99%)
Chunk 256 corrupted (1-1455)  (1025-2479)
Expected 0, got 1
Written as (199)43(184)
Chunk 258 corrupted (1-1455)  (3945-1303)
Expected 2, got 3
Written as (184)74(145)
Chunk 260 corrupted (1-1455)  (2769-127)
Expected 4, got 5
Written as (145)89(237)
Chunk 262 corrupted (1-1455)  (1593-3047)
Expected 6, got 7
Written as (237)168(174)
Chunk 264 corrupted (1-1455)  (417-1871)
Expected 8, got 9
Written as (174)135(161)
Chunk 266 corrupted (1-1455)  (3337-695)
Expected 10, got 11
Written as (161)123(180)
Chunk 268 corrupted (1-1455)  (2161-3615)
Expected 12, got 13
Written as (180)13(19)
Chunk 270 corrupted (1-1455)  (985-2439)
Expected 14, got 15
Written as (19)172(106)
Chunk 272 corrupted (1-1455)  (3905-1263)
Expected 16, got 17
Written as (106)212(140)
Chunk 274 corrupted (1-1455)  (2729-87)
Expected 18, got 19
Written as (140)124(73)
Chunk 276 corrupted (1-1455)  (1553-3007)
Expected 20, got 21
Written as (73)151(52)
Chunk 278 corrupted (1-1455)  (377-1831)
Expected 22, got 23
Written as (52)215(99)
Checking chunk 279/280 (99%)

-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Russell King
On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote:
> I have run the program a few times, and the output is pretty
> consistent. However, when I increase the target size, the difference
> between the expected and actual values is larger.
> 
> Written as (749)935(738)
> Chunk 1113 corrupted (1-1455)  (2965-323)
> Expected 89, got 93

This is not the corruption Linus is after.  Note that the corruption starts
at offset '1'.  Also note that:

89 = 1113 & 255
93 = 1113 & 255 | (1113 >> 8)

and if you look at glibc's memset() function, you'll notice that's exactly
what you expect if you pass a non-8bit value to it.  Ergo, what you're
seeing is utterly expected given glibc's memset() implementation on ARM.

Fixing Linus' test program to pass nr & 255 to memset results in clean
passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
machines (as would be expected.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Gordon Farquharson <[EMAIL PROTECTED]> [2006-12-27 22:38]:
> >That's just 400kB!
> >
> >There's no way you should see corruption with that kind of value. It
> >should all stay solidly in the cache.
> >
> >Is this perhaps with ARM nommu or something else strange? It may be that
> >the program just doesn't work at all if mmap() is faked out with a malloc
> >or similar.
> 
> Definitely a question for the ARM gurus. I'm out of my depth.

The CPU has a MMU.  For reference, it's a IXP4xx based device with 32 MB
of memory.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Zhang, Yanmin
On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote:
> 
> On Wed, 27 Dec 2006, David Miller wrote:
> > > 
> > > I still don't see _why_, though. But maybe smarter people than me can see 
> > > it..
> > 
> > FWIW this program definitely triggers the bug for me.
> 
> Ok, now that I have something simple to do repeatable stuff with, I can 
> say what the pattern is.. It's not all that surprising, but it's still 
> worth just stating for the record.
> 
> What happens is that when I do the "packetized writes" in random order, 
> the _last_ write to a page occasionally just goes missing. It's not always 
> at the end of a page, as shown by for example:
> 
>  - A whole chunk got dropped:
> 
>   Chunk 2094 corrupted (0-1459)  (1624-3083)
>   Expected 46, got 0
>   Written as (30912)55414(1)
> 
>That "Written as (x)y(z)" line means that the corrupted chunk was 
>written as chunk #y, and the preceding and following chunks (that were 
>_not_ corrupt) on the page was written as #x and #z respectively.
> 
>In other words, the missing chunk (which is still zero) was written 
>much later than the ones that were ok, and never hit the disk. It's a 
>contiguous chunk in the middle of the page (chunks are 1460 bytes in 
>size)
> 
>The first line means that all bytes of the chunk (0-1459) were 
>corrupted, and the values in parenthesis are the offsets within a page. 
>In other words, this was a chunk in the _middle_ of a page.
> 
>  - The missing data can also be at the beginning or ends of pages:
> 
>Beginning of the chunk missing, it was at the end of a page (page 
>offsets 3288-4095) and the _next_ page got written out fine:
> 
>   Chunk 2126 corrupted (0-807)  (3288-4095)
>   Expected 78, got 0
>   Written as (32713)55573(14301)
> 
>End of a chunk missing, it was the beginning of a page (and the 
>_previous_ page that contained the beginning of the chunk was written 
>out fine)
> 
>   Chunk 2179 corrupted (1252-1459)  (0-207)
>   Expected 131, got 0
>   Written as (45189)55489(15515)
> 
> Now, the reason I say this isn't surprising is that this is entirely 
> consistent with the dirty bit being dropped on the floor somewhere, and 
> likely through some interaction with the previous changes being in the 
> process of being written out.
> 
> Something (incorrectly) ends up deciding that it doesn't need to write the 
> page, since it's already written, or alternatively clears the dirty bit 
> too late (clears it because an _earlier_ write finished, never mind that 
> the new dirty data didn't make it).
There might be a narrow race between set_page_dirty and clear_page_dirty.

The test program is a process to write/read data. pdflush might write data
to disk asynchronously. After pdflush writes a page to disk, it will call 
(either by
softirq) clear_page_dirty to clear the dirty bit after getting the interrupt
notification. But just after the page is written to disk and before the 
interrupt
reports the result, the test process might change the data and unmap the area. 
When
the area is unmapped, the page is marked as dirty again, but just after that, 
the
interrupt arrives and the dirty bit is cleared, so the late data will not be 
written
to disk.

Function zap_pte_range checks pte to set page dirty if needed, but it doesn't
hold page lock. If the page lock is held before set page dirty, the race might
be avoided.

Yanmin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Segher Boessenkool

I think what might be happening is that pdflush writes them out fine,
however we don't trap writes by the application _during_ that writeout.


Yeah.  I believe that more exactly it happens if the very last
write to the page causes a writeback (due to dirty balancing)
while another writeback for the page is already happening.

As usual in these cases, I have zero proof.


It's something that will only occur with writeback and MAP_SHARED
writable access to the file pages.


It's the do_wp_page - balance_dirty_pages - generic_writepages
path for sure.  Maybe it's enough to change

if (wbc-sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);

if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}

to

if (wbc-sync_mode != WB_SYNC_NONE)
wait_on_page_writeback(page);

if (PageWriteback(page)) {
redirty_page_for_writepage(wbc, page);
unlock_page(page);
continue;
}

if (!clear_page_dirty_for_io(page)) {
unlock_page(page);
continue;
}

or something along those lines.  Completely untested of course :-)


Segher

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Fri, 29 Dec 2006, Segher Boessenkool wrote:

  I think what might be happening is that pdflush writes them out fine,
  however we don't trap writes by the application _during_ that writeout.
 
 Yeah.  I believe that more exactly it happens if the very last
 write to the page causes a writeback (due to dirty balancing)
 while another writeback for the page is already happening.
 
 As usual in these cases, I have zero proof.

I actually have proof to the contrary, ie I have traces that say the 
write was started after the last write.

And the VM layer in this area is actually fairly sane and civilized. It 
has a bit that says writeback in progress, and if that bit is set, it 
simply _will_not_ start a new write. It even has various BUG_ON()'s to 
that effect.

So everything I have ever seen says that the VM layer is actually doing 
everything right.

 It's the do_wp_page - balance_dirty_pages - generic_writepages
 path for sure.  Maybe it's enough to change
 
 if (wbc-sync_mode != WB_SYNC_NONE)
 wait_on_page_writeback(page);
 
 if (PageWriteback(page) ||
 !clear_page_dirty_for_io(page)) {
 unlock_page(page);
 continue;
 }

Notive how this one basically says:

 - if it's under writeback, don't even clear the page dirty flag.

Your suggested change:

 if (wbc-sync_mode != WB_SYNC_NONE)
 wait_on_page_writeback(page);
 
 if (PageWriteback(page)) {
   redirty_page_for_writepage(wbc, page);

makes no sense, because we simply never _did_ the clear_page_dirty() if 
the thing was under writeback in the first place. That's how C 
conditionals work.  So there's no reason to redirty it, because it 
wasn't cleaned in the first place.

I've double- and triple-checked the dirty bits, including having traces 
that actually say that the IO was started (from a VM perspective) _after_ 
the last write was done. The IO just didn't hit the disk.

I'm personally fairly convinced that it's not a VM issue, but a IO 
issue. Either in a low-level filesystem or in some of the fs/buffer.c 
helper routines.

But I'd love to be proven wrong. 

I do have a few interesting details from the trace I haven't really 
analyzed yet. Here's the trace for events on one of the pages that was 
corrupted. Note how the events are numbered (there were 171640 events 
total), so the thing you see is just a small set of events from the whole 
big trace, but it's the ones that talk about _that_ particular page.

I've grouped them so hat consecutive events group together. That just 
means that no events on any other pages happened in between those events, 
and it is usually a sign that it's really one single call-chain that 
causes all the events.

For example, for the first group of three events (44366-44368), it's the 
page fault that brings in the page, and since it's a write-fault, it will 
not only map the page, it will mark the page itself dirty and then also 
set the TAG_DIRTY on the mapping. So the group is just really a result 
of one single event happening, which causes several things to happen to 
that page. That's exactly what you'd expect.

Anyway, here is the list of events that page went through:

   44366  PG 0f6d: mm/memory.c:2254 mapping at b789fc54 (write)
   44367  PG 0f6d: mm/page-writeback.c:817 setting dirty
   44368  PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY

   64231  PG 0f6d: mm/page-writeback.c:872 clean_for_io
   64232  PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000
   64233  PG 0f6d: mm/page-writeback.c:914 set writeback
   64234  PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
   64235  PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY

   67570  PG 0f6d: mm/page-writeback.c:891 end writeback
   67571  PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

   76705  PG 0f6d: mm/page-writeback.c:817 setting dirty
   76706  PG 0f6d: fs/buffer.c:725 dirtied buffers
   76707  PG 0f6d: fs/buffer.c:738 setting TAG_DIRTY

  105267  PG 0f6d: mm/page-writeback.c:872 clean_for_io
  105268  PG 0f6d: mm/rmap.c:451 cleaning PTE b789f000
  105269  PG 0f6d: mm/page-writeback.c:914 set writeback
  105270  PG 0f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
  105271  PG 0f6d: mm/page-writeback.c:922 clearing TAG_DIRTY
  105272  PG 0f6d: mm/page-writeback.c:891 end writeback
  105273  PG 0f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

  128032  PG 0f6d: mm/memory.c:670 unmapped at b789f000

  132662  PG 0f6d: mm/filemap.c:119 Removing page cache

  148278  PG 0f6d: mm/memory.c:2254 mapping at b789f000 (read)

  166326  PG 0f6d: mm/memory.c:670 unmapped at b789f000

  171958  PG 0f6d: mm/filemap.c:119 Removing page cache


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Zhang, Yanmin
On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote:
 
 On Wed, 27 Dec 2006, David Miller wrote:
   
   I still don't see _why_, though. But maybe smarter people than me can see 
   it..
  
  FWIW this program definitely triggers the bug for me.
 
 Ok, now that I have something simple to do repeatable stuff with, I can 
 say what the pattern is.. It's not all that surprising, but it's still 
 worth just stating for the record.
 
 What happens is that when I do the packetized writes in random order, 
 the _last_ write to a page occasionally just goes missing. It's not always 
 at the end of a page, as shown by for example:
 
  - A whole chunk got dropped:
 
   Chunk 2094 corrupted (0-1459)  (1624-3083)
   Expected 46, got 0
   Written as (30912)55414(1)
 
That Written as (x)y(z) line means that the corrupted chunk was 
written as chunk #y, and the preceding and following chunks (that were 
_not_ corrupt) on the page was written as #x and #z respectively.
 
In other words, the missing chunk (which is still zero) was written 
much later than the ones that were ok, and never hit the disk. It's a 
contiguous chunk in the middle of the page (chunks are 1460 bytes in 
size)
 
The first line means that all bytes of the chunk (0-1459) were 
corrupted, and the values in parenthesis are the offsets within a page. 
In other words, this was a chunk in the _middle_ of a page.
 
  - The missing data can also be at the beginning or ends of pages:
 
Beginning of the chunk missing, it was at the end of a page (page 
offsets 3288-4095) and the _next_ page got written out fine:
 
   Chunk 2126 corrupted (0-807)  (3288-4095)
   Expected 78, got 0
   Written as (32713)55573(14301)
 
End of a chunk missing, it was the beginning of a page (and the 
_previous_ page that contained the beginning of the chunk was written 
out fine)
 
   Chunk 2179 corrupted (1252-1459)  (0-207)
   Expected 131, got 0
   Written as (45189)55489(15515)
 
 Now, the reason I say this isn't surprising is that this is entirely 
 consistent with the dirty bit being dropped on the floor somewhere, and 
 likely through some interaction with the previous changes being in the 
 process of being written out.
 
 Something (incorrectly) ends up deciding that it doesn't need to write the 
 page, since it's already written, or alternatively clears the dirty bit 
 too late (clears it because an _earlier_ write finished, never mind that 
 the new dirty data didn't make it).
There might be a narrow race between set_page_dirty and clear_page_dirty.

The test program is a process to write/read data. pdflush might write data
to disk asynchronously. After pdflush writes a page to disk, it will call 
(either by
softirq) clear_page_dirty to clear the dirty bit after getting the interrupt
notification. But just after the page is written to disk and before the 
interrupt
reports the result, the test process might change the data and unmap the area. 
When
the area is unmapped, the page is marked as dirty again, but just after that, 
the
interrupt arrives and the dirty bit is cleared, so the late data will not be 
written
to disk.

Function zap_pte_range checks pte to set page dirty if needed, but it doesn't
hold page lock. If the page lock is held before set page dirty, the race might
be avoided.

Yanmin
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Gordon Farquharson [EMAIL PROTECTED] [2006-12-27 22:38]:
 That's just 400kB!
 
 There's no way you should see corruption with that kind of value. It
 should all stay solidly in the cache.
 
 Is this perhaps with ARM nommu or something else strange? It may be that
 the program just doesn't work at all if mmap() is faked out with a malloc
 or similar.
 
 Definitely a question for the ARM gurus. I'm out of my depth.

The CPU has a MMU.  For reference, it's a IXP4xx based device with 32 MB
of memory.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Russell King
On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote:
 I have run the program a few times, and the output is pretty
 consistent. However, when I increase the target size, the difference
 between the expected and actual values is larger.
 
 Written as (749)935(738)
 Chunk 1113 corrupted (1-1455)  (2965-323)
 Expected 89, got 93

This is not the corruption Linus is after.  Note that the corruption starts
at offset '1'.  Also note that:

89 = 1113  255
93 = 1113  255 | (1113  8)

and if you look at glibc's memset() function, you'll notice that's exactly
what you expect if you pass a non-8bit value to it.  Ergo, what you're
seeing is utterly expected given glibc's memset() implementation on ARM.

Fixing Linus' test program to pass nr  255 to memset results in clean
passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
machines (as would be expected.)

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Gordon Farquharson [EMAIL PROTECTED] [2006-12-27 22:38]:
  #define TARGETSIZE (100  12)
 
 That's just 400kB!
 
 There's no way you should see corruption with that kind of value. It
 should all stay solidly in the cache.
 
 Is this perhaps with ARM nommu or something else strange? It may be that
 the program just doesn't work at all if mmap() is faked out with a malloc
 or similar.
 
 Definitely a question for the ARM gurus. I'm out of my depth.

By the way, I just tried it with TARGETSIZE (100  12) on a different
ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
memory) and I see similar results to that from Gordon:

Writing chunk 279/280 (99%)
Chunk 256 corrupted (1-1455)  (1025-2479)
Expected 0, got 1
Written as (199)43(184)
Chunk 258 corrupted (1-1455)  (3945-1303)
Expected 2, got 3
Written as (184)74(145)
Chunk 260 corrupted (1-1455)  (2769-127)
Expected 4, got 5
Written as (145)89(237)
Chunk 262 corrupted (1-1455)  (1593-3047)
Expected 6, got 7
Written as (237)168(174)
Chunk 264 corrupted (1-1455)  (417-1871)
Expected 8, got 9
Written as (174)135(161)
Chunk 266 corrupted (1-1455)  (3337-695)
Expected 10, got 11
Written as (161)123(180)
Chunk 268 corrupted (1-1455)  (2161-3615)
Expected 12, got 13
Written as (180)13(19)
Chunk 270 corrupted (1-1455)  (985-2439)
Expected 14, got 15
Written as (19)172(106)
Chunk 272 corrupted (1-1455)  (3905-1263)
Expected 16, got 17
Written as (106)212(140)
Chunk 274 corrupted (1-1455)  (2729-87)
Expected 18, got 19
Written as (140)124(73)
Chunk 276 corrupted (1-1455)  (1553-3007)
Expected 20, got 21
Written as (73)151(52)
Chunk 278 corrupted (1-1455)  (377-1831)
Expected 22, got 23
Written as (52)215(99)
Checking chunk 279/280 (99%)

-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Russell King
On Thu, Dec 28, 2006 at 11:16:59AM +0100, Martin Michlmayr wrote:
 * Gordon Farquharson [EMAIL PROTECTED] [2006-12-27 22:38]:
   #define TARGETSIZE (100  12)
  
  That's just 400kB!
  
  There's no way you should see corruption with that kind of value. It
  should all stay solidly in the cache.
  
  Is this perhaps with ARM nommu or something else strange? It may be that
  the program just doesn't work at all if mmap() is faked out with a malloc
  or similar.
  
  Definitely a question for the ARM gurus. I'm out of my depth.
 
 By the way, I just tried it with TARGETSIZE (100  12) on a different
 ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
 memory) and I see similar results to that from Gordon:

Work around the glibc memset() problem by passing nr  255, and re-run
the test.  You're getting false positives.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Petri Kaukasoina
On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote:
 [ Modified test-program that tells you where the corruption happens (and 
   when the missing parts were supposed to be written out) appended, in 
   case people care. ]

Hi

2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots
between them but they hung before shell (2.6.18-git11, 2.6.18-git16,
2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken.

(UP, no preempt)

-Petri
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Gordon Farquharson

On 12/28/06, Russell King [EMAIL PROTECTED] wrote:


Fixing Linus' test program to pass nr  255 to memset results in clean
passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
machines (as would be expected.)


Thanks for the fix, Russell.

I can now trigger the (real) problem by using a 25 MB file (100  18)
and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).

$ ./linus-test
Writing chunk 17954/17955 (99%)
Chunk 514 corrupted (0-1459)  (872-2331)
Expected 2, got 0
Written as (8479)11160(10312)
Chunk 516 corrupted (0-303)  (3792-4095)
Expected 4, got 0
Written as (10312)10569(4426)
Chunk 959 corrupted (0-691)  (3404-4095)
Expected 191, got 0
Written as (687)4881(1522)
Chunk 1895 corrupted (0-1459)  (1900-3359)
Expected 103, got 0
Written as (7746)8389(6231)
Chunk 2702 corrupted (0-1459)  (472-1931)
Expected 142, got 0
Written as (4866)7103(2409)
Chunk 3314 corrupted (0-1459)  (1064-2523)
Expected 242, got 0
Written as (4287)7064(1730)
Chunk 4043 corrupted (0-1459)  (444-1903)
Expected 203, got 0
Written as (6495)8509(4464)
Chunk 5180 corrupted (0-1459)  (1584-3043)
Expected 60, got 0
Written as (11056)12826(10797)
Chunk 5672 corrupted (0-991)  (3104-4095)
Expected 40, got 0
Written as (9944)4872(41)
Chunk 5793 corrupted (460-1459)  (0-999)
Expected 161, got 0
Written as (7059)5038(4377)
Chunk 6089 corrupted (0-1459)  (1620-3079)
Expected 201, got 0
Written as (4672)5230(4403)
Chunk 6545 corrupted (268-1459)  (0-1191)
Expected 145, got 0
Written as (3701)5969(4668)
Chunk 7578 corrupted (0-1459)  (584-2043)
Expected 154, got 0
Written as (10015)5082(1648)
Chunk 7880 corrupted (864-1459)  (0-595)
Expected 200, got 0
Written as (17869)5064(4745)
Chunk 8086 corrupted (0-1459)  (888-2347)
Expected 150, got 0
Written as (10206)11050(10374)
Chunk 8749 corrupted (0-1459)  (2212-3671)
Expected 45, got 0
Written as (15263)7132(4825)
Chunk 9068 corrupted (0-1459)  (1008-2467)
Expected 108, got 0
Written as (5557)7571(6771)
Chunk 9193 corrupted (812-1459)  (0-647)
Expected 233, got 0
Written as (9238)7277(4757)
Chunk 10032 corrupted (576-1459)  (0-883)
Expected 48, got 0
Written as (15741)10012(1753)
Chunk 10056 corrupted (0-1459)  (1696-3155)
Expected 72, got 0
Written as (5379)7431(262)
Chunk 10395 corrupted (0-1459)  (1020-2479)
Expected 155, got 0
Written as (21)7442(5902)
Chunk 10791 corrupted (0-1459)  (1644-3103)
Expected 39, got 0
Written as (4753)5925(5926)
Chunk 10792 corrupted (0-991)  (3104-4095)
Expected 40, got 0
Written as (5925)5926(8555)
Chunk 11036 corrupted (0-1103)  (2992-4095)
Expected 28, got 0
Written as (13755)14449(7458)
Chunk 11387 corrupted (644-1459)  (0-815)
Expected 123, got 0
Written as (10853)11459(9445)
Chunk 11586 corrupted (920-1459)  (0-539)
Expected 66, got 0
Written as (3769)11691(11123)
Chunk 11882 corrupted (0-1459)  (1160-2619)
Expected 106, got 0
Written as (10736)11696(2788)
Chunk 12397 corrupted (0-603)  (3492-4095)
Expected 109, got 0
Written as (2352)7515(2437)
Chunk 12669 corrupted (0-795)  (3300-4095)
Expected 125, got 0
Written as (1191)7661(5266)
Chunk 13162 corrupted (0-1459)  (2184-3643)
Expected 106, got 0
Written as (9383)13662(11544)
Chunk 14653 corrupted (0-27)  (4068-4095)
Expected 61, got 0
Written as (8100)9456(1275)
Chunk 17332 corrupted (0-367)  (3728-4095)
Expected 180, got 0
Written as (760)12247(1244)
Chunk 17445 corrupted (0-1459)  (772-2231)
Expected 37, got 0
Written as (8007)16481(14439)
Chunk 17556 corrupted (0-1007)  (3088-4095)
Expected 148, got 0
Written as (10113)10657(10477)
Chunk 17859 corrupted (0-995)  (3100-4095)
Expected 195, got 0
Written as (14472)14767(11426)
Checking chunk 17954/17955 (99%)

Gordon

--
Gordon Farquharson
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Russell King [EMAIL PROTECTED] [2006-12-28 10:49]:
  By the way, I just tried it with TARGETSIZE (100  12) on a different
  ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of
  memory) and I see similar results to that from Gordon:
 
 Work around the glibc memset() problem by passing nr  255, and re-run
 the test.  You're getting false positives.

Yeah, I saw your message about this problem after I sent mine.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Guillaume Chazarain

I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
I have corruption with every Fedora release kernel except the first, that is
2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit 
some

corruption.

Command line to test:

qemu root_fs -snapshot -kernel FC-kernels/FC2-vmlinuz-2.6.5-1.358 -append 'rw 
root=/dev/hda'

I get this kind of corruption: 
http://guichaz.free.fr/linux-bug/corruption.png


--
Guillaume

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Martin Michlmayr
* Gordon Farquharson [EMAIL PROTECTED] [2006-12-28 07:15]:
 Thanks for the fix, Russell.
 
 I can now trigger the (real) problem by using a 25 MB file (100  18)
 and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).

Me too (using 100  18).  Interestingly, I don't seem to get any
corruption on a different ARM board, an IOP32x based machine with 128
MB RAM.
-- 
Martin Michlmayr
http://www.cyrius.com/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread [EMAIL PROTECTED]

On Thu Dec 28 15:09 , Guillaume Chazarain sent:

I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
I have corruption with every Fedora release kernel except the first, that is
2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit 
some corruption.

Confirm that I see corruption with Fedora kernel 2.6.18-1.2239.fc5:

...
Chunk 142969 corrupted (0-1459)  (2580-4039)
Expected 121, got 0
Written as (89632)127652(124721)
Chunk 142976 corrupted (0-1459)  (512-1971)
Expected 128, got 0
Written as (121734)128324(108589)
Checking chunk 143639/143640 (99%)
$ uname -a
Linux 2.6.18-1.2239.fc5 #1 Fri Nov 10 13:04:06 EST 2006 i686 athlon i386 
GNU/Linux

/Martin
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Wed, 27 Dec 2006, Gordon Farquharson wrote:
 
 100kB and 200kB files always succeed on the ARM system. 400kB and
 larger always seem to fail.

Oh, wow. Yeah, I've just repressed how tiny 32MB is. And especially if you 
lowered the /proc/sys/vm/dirty_ratio to a smaller percentage, I guess 
400kB should be enough to cause writeback.

Ugh. I tested a 128MB machine a few weeks ago, and found it painful.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] mm: fix page_mkclean_one

2006-12-28 Thread Linus Torvalds


On Wed, 27 Dec 2006, Chen, Kenneth W wrote:
  
  Running the test code, git bisect points its finger at this commit. 
  Reverting
  this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
  
  [PATCH] mm: balance dirty pages
  
  Now that we can detect writers of shared mappings, throttle them.  
  Avoids OOM
  by surprise.
 
 Oh, never mind :-(  I just didn't create enough write out pressure when
 test this. I just saw bug got triggered on a kernel I previously thought
 was OK.

Btw, this is an important point - people have long felt that the new page 
balancing in 2.6.19 was to blame, but you've just confirmed the long-held 
suspicion (at least by me) that it's not actually a new bug at all, it's 
just that the dirty page balancing causes writeback to happen _earlier_, 
and thus is better able to _show_ a bug that we've likely had for a long 
long time.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   >