Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-25 Thread Nick Piggin

David Chinner wrote:


Only if we leave the page in the page cache. If we toss the page,
the time it takes to do the I/O for the page fault is enough for
the direct I/O to complete. Sure it's not an absolute guarantee,
but if you want an absolute guarantee:


So I guess you *could* relax it in theory... Anyway, don't take
my pestering as advocacy for wanting XFS to do something more
clever in such a corner case. I think you're quite right to be
conservative and share codepaths between direct IO read and
write.

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 03:25:29PM +1100, Nick Piggin wrote:
> David Chinner wrote:
> >On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:
> >
> >>David Chinner wrote:
> 
> >>>No. The only thing that will happen here is that the direct read
> >>>will see _none_ of the write because the mmap write occurred during
> >>>the DIO read to a different set of pages in memory. There is no
> >>>"some" or "all" case here.
> >>
> >>But if the buffers get partially or completely written back in the
> >>meantime, then the DIO read could see that.
> >
> >
> >Only if you can dirty them and flush them to disk while the direct
> >read is waiting in the I/O queue (remember, the direct read flushes
> >dirty cached data before being issued). Given that we don't lock the
> >inode in the buffered I/O *writeback* path, we have to stop pages being
> >dirtied in the page cache up front so we don't have mmap writeback
> >over the top of the direct read.
> 
> However unlikely it may be, that is what I'm talking about in my "some"
> or "all" cases. Note that I'm not talking about a specific implementation
> (eg. XFS I guess avoids "some"), but just the possible scenarios.

Unfortunately, behaviour is different for different filesystems. I
can only answer for XFS, which is different to most of the other
filesystems in both locking and the way it treats the page cache.
IOWs, if you want to talk about details, then you have to talk about
specific implementations.


> >Hence we have to prevent mmap from dirtying the same file offset we
> >are doing direct reads on until the direct read has been issued.
> >
> >i.e. we need a barrier.
> 
> So you need to eliminate the "some" case? Because of course "none" and
> "all" are unavoidable.

"all" is avoidable, too, once you've kicked the pages out of
the page cache - you just have to block the buffered read that
refaulting the page will trigger until the direct I/O completes.
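
(Sketch: one way to picture that blocking, using a purely hypothetical
reader/writer lock held across the direct read -- this is not the actual
XFS locking, just an illustration of the idea.)

#include <pthread.h>

/* Hypothetical lock covering the DIO range; XFS does not literally do this. */
static pthread_rwlock_t dio_range_lock = PTHREAD_RWLOCK_INITIALIZER;

static void direct_read_range(void)
{
        pthread_rwlock_wrlock(&dio_range_lock);  /* exclude refaulted reads  */
        /* flush + toss pages, then issue and wait for the direct read */
        pthread_rwlock_unlock(&dio_range_lock);
}

static void buffered_read_after_refault(void)
{
        pthread_rwlock_rdlock(&dio_range_lock);  /* blocks until DIO is done */
        /* re-read the page from disk */
        pthread_rwlock_unlock(&dio_range_lock);
}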

> >>>IOWs, at a single point in time we have 2 different views
> >>>of the one file which are both apparently valid and that is what
> >>>we are trying to avoid. We have a coherency problem here which is
> >>>solved by forcing the mmap write to reread the data off disk
> >>
> >>I don't see why the mmap write needs to reread data off disk. The
> >>data on disk won't get changed by the DIO read.
> >
> >
> >No, but the data _in memory_ will, and now when the direct read
> >completes it will contain data that is different to what is in the page
> >cache. For direct I/O we define the correct data to be what is on
> >disk, not what is in memory, so any time we bypass what is in
> >memory, we need to ensure that we prevent the data being changed
> >again in memory before we issue the disk I/O.
> 
> But when you drop your locks, before the direct IO read returns, some
> guy can mmap and dirty the pagecache anyway.
> By the time the read returns, the data is stale.

Only if we leave the page in the page cache. If we toss the page,
the time it takes to do the I/O for the page fault is enough for
the direct I/O to complete. Sure it's not an absolute guarantee,
but if you want an absolute guarantee:

> This obviously must be synchronised in
> userspace.

As I said earlier.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

David Chinner wrote:

On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:


David Chinner wrote:



No. The only thing that will happen here is that the direct read
will see _none_ of the write because the mmap write occurred during
the DIO read to a different set of pages in memory. There is no
"some" or "all" case here.


But if the buffers get partially or completely written back in the
meantime, then the DIO read could see that.



Only if you can dirty them and flush them to disk while the direct
read is waiting in the I/O queue (remember, the direct read flushes
dirty cached data before being issued). Given that we don't lock the
inode in the buffered I/O *writeback* path, we have to stop pages being
dirtied in the page cache up front so we don't have mmap writeback
over the top of the direct read.


However unlikely it may be, that is what I'm talking about in my "some"
or "all" cases. Note that I'm not talking about a specific implementation
(eg. XFS I guess avoids "some"), but just the possible scenarios.


Hence we have to prevent mmap from dirtying the same file offset we
are doing direct reads on until the direct read has been issued.

i.e. we need a barrier.


So you need to eliminate the "some" case? Because of course "none" and
"all" are unavoidable.


IOWs, at a single point in time we have 2 different views
of the one file which are both apparently valid and that is what
we are trying to avoid. We have a coherency problem here which is
solved by forcing the mmap write to reread the data off disk


I don't see why the mmap write needs to reread data off disk. The
data on disk won't get changed by the DIO read.



No, but the data _in memory_ will, and now when the direct read
completes it will contain data that is different to what is in the page
cache. For direct I/O we define the correct data to be what is on
disk, not what is in memory, so any time we bypass what is in
memory, we need to ensure that we prevent the data being changed
again in memory before we issue the disk I/O.


But when you drop your locks, before the direct IO read returns, some
guy can mmap and dirty the pagecache anyway. By the time the read
returns, the data is stale. This obviously must be synchronised in
userspace.

As I said, you can't avoid "none" or "all", and you can't say that
userspace will see the most uptodate copy of the data. All you can
say is that it will be no older than when the syscall is first made.
Which is what you get if you simply writeback but do not invalidate
pagecache for direct IO reads.

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:
> David Chinner wrote:
> >On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote:
> >
> >>David Chinner wrote:
> >>
> >>>On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
> >>But I'm just interested about DIO reads. I think you can get pretty
> >>reasonable semantics without discarding pagecache, but the semantics
> >>are weaker in one aspect.
> >>
> >>DIO read
> >>1. writeback page
> >>2. read from disk
> >>
> >>Now your read will pick up data no older than 1. And if a buffered
> >>write happens after 2, then there is no problem either.
> >>
> >>So if you are doing a buffered write and DIO read concurrently, you
> >>want synchronisation so the buffered write happens either before 1
> >>or after 2 -- the DIO read will see either all or none of the write.
> >>
> >>Supposing your pagecache isn't invalidated, then a buffered write
> >>(from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
> >>then the DIO read will find either none, some, or all of that write.
> >>
> >>So I guess what you are preventing is the "some" case. Am I right?
> >
> >
> >No. The only thing that will happen here is that the direct read
> >will see _none_ of the write because the mmap write occurred during
> >the DIO read to a different set of pages in memory. There is no
> >"some" or "all" case here.
> 
> But if the buffers get partially or completely written back in the
> meantime, then the DIO read could see that.

Only if you can dirty them and flush them to disk while the direct
read is waiting in the I/O queue (remember, the direct read flushes
dirty cached data before being issued). Given that we don't lock the
inode in the buffered I/O *writeback* path, we have to stop pages being
dirtied in the page cache up front so we don't have mmap writeback
over the top of the direct read.

Hence we have to prevent mmap from dirtying the same file offset we
are doing direct reads on until the direct read has been issued.

i.e. we need a barrier.

> >IOWs, at a single point in time we have 2 different views
> >of the one file which are both apparently valid and that is what
> >we are trying to avoid. We have a coherency problem here which is
> >solved by forcing the mmap write to reread the data off disk
> 
> I don't see why the mmap write needs to reread data off disk. The
> data on disk won't get changed by the DIO read.

No, but the data _in memory_ will, and now when the direct read
completes it will contain data that is different to what is in the page
cache. For direct I/O we define the correct data to be what is on
disk, not what is in memory, so any time we bypass what is in
memory, we need to ensure that we prevent the data being changed
again in memory before we issue the disk I/O.

> >Look at it this way - direct I/O in XFS implies an I/O barrier
> >(similar to a memory barrier). Writing back and tossing out of the
> >page cache at the start of the direct IO gives us an I/O coherency
> >barrier - everything before the direct IO is sync'd to disk before
> >the direct IO can proceed, and everything after the direct IO has
> >started must be fetched from disk again.
> >
> >Because mmap I/O doesn't necessarily need I/O to change the state
> >of a page (think of a read fault then a later write fault), to make
> >the I/O barrier work correctly with mmap() we need to ensure that
> >it will fault the page from disk again. We can only do that by
> >unmapping the pages before tossing them from the page cache.
> 
> OK, but direct IO *reads* do not conceptually invalidate pagecache
> sitting on top of those blocks. Pagecache becomes invalid when the
> page no longer represents the most uptodate copy of the data (eg.
> in the case of a direct IO write).

In theory, yes.

In practice, if you don't invalidate the page cache you have no
mechanism of synchronising mmap with direct I/O, and at that
point you have no coherency model that you can work with in
your filesystem. You have to be able to guarantee synchronisation
between the different methods that can dirty data before you
can give any guarantees about data coherency.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

David Chinner wrote:

On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote:


David Chinner wrote:


On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:



... so surely if you do a direct read followed by a buffered read,
you should *not* get the same data if there has been some activity
to modify that part of the file in the meantime (whether that be a
buffered or direct write).



Right. And that is what happens in XFS because it purges the
caches on direct I/O and forces data to be re-read from disk.


And that is critical for direct IO writes, of course.



Effectively, if you are mixing direct I/O with other types of I/O
(buffered or mmap) then the application really needs to be certain
it is doing the right thing because there are races that can occur
below the filesystem. All we care about in the filesystem is that
what we cache is the same as what is on disk, and that implies that
direct I/O needs to purge the cache regardless of the state it is in

Hence we need to unmap pages and use truncate semantics on them to
ensure they are removed from the page cache


OK, I understand that this does need to happen (at least for writes),
so you need to fix it regardless of the DIO read issue.

But I'm just interested about DIO reads. I think you can get pretty
reasonable semantics without discarding pagecache, but the semantics
are weaker in one aspect.

DIO read
1. writeback page
2. read from disk

Now your read will pick up data no older than 1. And if a buffered
write happens after 2, then there is no problem either.

So if you are doing a buffered write and DIO read concurrently, you
want synchronisation so the buffered write happens either before 1
or after 2 -- the DIO read will see either all or none of the write.

Supposing your pagecache isn't invalidated, then a buffered write
(from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
then the DIO read will find either none, some, or all of that write.

So I guess what you are preventing is the "some" case. Am I right?



No. The only thing that will happen here is that the direct read
will see _none_ of the write because the mmap write occurred during
the DIO read to a different set of pages in memory. There is no
"some" or "all" case here.


But if the buffers get partially or completely written back in the
meantime, then the DIO read could see that.


IOWs, at a single point in time we have 2 different views
of the one file which are both apparently valid and that is what
we are trying to avoid. We have a coherency problem here which is
solved by forcing the mmap write to reread the data off disk


I don't see why the mmap write needs to reread data off disk. The
data on disk won't get changed by the DIO read.


Look at it this way - direct I/O in XFS implies an I/O barrier
(similar to a memory barrier). Writing back and tossing out of the
page cache at the start of the direct IO gives us an I/O coherency
barrier - everything before the direct IO is sync'd to disk before
the direct IO can proceed, and everything after the direct IO has
started must be fetched from disk again.

Because mmap I/O doesn't necessarily need I/O to change the state
of a page (think of a read fault then a later write fault), to make
the I/O barrier work correctly with mmap() we need to ensure that
it will fault the page from disk again. We can only do that by
unmapping the pages before tossing them from the page cache.


OK, but direct IO *reads* do not conceptually invalidate pagecache
sitting on top of those blocks. Pagecache becomes invalid when the
page no longer represents the most uptodate copy of the data (eg.
in the case of a direct IO write).

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote:
> David Chinner wrote:
> >On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
> 
> >>... so surely if you do a direct read followed by a buffered read,
> >>you should *not* get the same data if there has been some activity
> >>to modify that part of the file in the meantime (whether that be a
> >>buffered or direct write).
> >
> >
> >Right. And that is what happens in XFS because it purges the
> >caches on direct I/O and forces data to be re-read from disk.
> 
> And that is critical for direct IO writes, of course.
> 
> >Effectively, if you are mixing direct I/O with other types of I/O
> >(buffered or mmap) then the application really needs to be certain
> >it is doing the right thing because there are races that can occur
> >below the filesystem. All we care about in the filesystem is that
> >what we cache is the same as what is on disk, and that implies that
> >direct I/O needs to purge the cache regardless of the state it is in
> >
> >Hence we need to unmap pages and use truncate semantics on them to
> >ensure they are removed from the page cache
> 
> OK, I understand that this does need to happen (at least for writes),
> so you need to fix it regardless of the DIO read issue.
> 
> But I'm just interested about DIO reads. I think you can get pretty
> reasonable semantics without discarding pagecache, but the semantics
> are weaker in one aspect.
> 
> DIO read
> 1. writeback page
> 2. read from disk
> 
> Now your read will pick up data no older than 1. And if a buffered
> write happens after 2, then there is no problem either.
> 
> So if you are doing a buffered write and DIO read concurrently, you
> want synchronisation so the buffered write happens either before 1
> or after 2 -- the DIO read will see either all or none of the write.
> 
> Supposing your pagecache isn't invalidated, then a buffered write
> (from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
> then the DIO read will find either none, some, or all of that write.
>
> So I guess what you are preventing is the "some" case. Am I right?

No. The only thing that will happen here is that the direct read
will see _none_ of the write because the mmap write occurred during
the DIO read to a different set of pages in memory. There is no
"some" or "all" case here.

IOWs, at a single point in time we have 2 different views
of the one file which are both apparently valid and that is what
we are trying to avoid. We have a coherency problem here which is
solved by forcing the mmap write to reread the data off disk

Look at it this way - direct I/O in XFS implies an I/O barrier
(similar to a memory barrier). Writing back and tossing out of the
page cache at the start of the direct IO gives us an I/O coherency
barrier - everything before the direct IO is sync'd to disk before
the direct IO can proceed, and everything after the direct IO has
started must be fetched from disk again.

Because mmap I/O doesn't necessarily need I/O to change the state
of a page (think of a read fault then a later write fault), to make
the I/O barrier work correctly with mmap() we need to ensure that
it will fault the page from disk again. We can only do that by
unmapping the pages before tossing them from the page cache.
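
(Sketch of that ordering with invented helper names -- not the real
XFS/VM entry points:)

struct file_range;                      /* stand-in for (inode, offset, length) */
void write_back_dirty_pages(struct file_range *r);
void unmap_range_from_all_mms(struct file_range *r);
void toss_pages_from_page_cache(struct file_range *r);
int  submit_direct_io(struct file_range *r);

static int dio_coherency_barrier(struct file_range *r)
{
        write_back_dirty_pages(r);      /* everything before the DIO is on disk      */
        unmap_range_from_all_mms(r);    /* later mmap access must take a fault again */
        toss_pages_from_page_cache(r);  /* ...and that fault re-reads from disk      */
        return submit_direct_io(r);     /* disk is now the single source of truth    */
}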

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

David Chinner wrote:

On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:



... so surely if you do a direct read followed by a buffered read,
you should *not* get the same data if there has been some activity
to modify that part of the file in the meantime (whether that be a
buffered or direct write).



Right. And that is what happens in XFS because it purges the
caches on direct I/O and forces data to be re-read from disk.


And that is critical for direct IO writes, of course.


Effectively, if you are mixing direct I/O with other types of I/O
(buffered or mmap) then the application really needs to be certain
it is doing the right thing because there are races that can occur
below the filesystem. All we care about in the filesystem is that
what we cache is the same as what is on disk, and that implies that
direct I/O needs to purge the cache regardless of the state it is in

Hence we need to unmap pages and use truncate semantics on them to
ensure they are removed from the page cache


OK, I understand that this does need to happen (at least for writes),
so you need to fix it regardless of the DIO read issue.

But I'm just interested about DIO reads. I think you can get pretty
reasonable semantics without discarding pagecache, but the semantics
are weaker in one aspect.

DIO read
1. writeback page
2. read from disk

Now your read will pick up data no older than 1. And if a buffered
write happens after 2, then there is no problem either.
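
(The two steps, seen from userspace -- a hedged sketch: the kernel's
generic DIO read path does the equivalent of step 1 itself, and the
buffer/offset/length must be suitably aligned for O_DIRECT:)

#include <fcntl.h>
#include <unistd.h>

/* 1. write back dirty page cache, 2. read from disk (fd opened O_DIRECT).
 * The result is no older than the data at the time of step 1. */
static ssize_t dio_read_no_invalidate(int fd, void *buf, size_t len, off_t off)
{
        if (fsync(fd) < 0)               /* step 1: writeback      */
                return -1;
        return pread(fd, buf, len, off); /* step 2: read from disk */
}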

So if you are doing a buffered write and DIO read concurrently, you
want synchronisation so the buffered write happens either before 1
or after 2 -- the DIO read will see either all or none of the write.

Supposing your pagecache isn't invalidated, then a buffered write
(from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
then the DIO read will find either none, some, or all of that write.

So I guess what you are preventing is the "some" case. Am I right?

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
> David Chinner wrote:
> >On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote:
> 
> >>And why not just leave it in the pagecache and be done with it?
> >
> >
> >because what is in cache is then not coherent with what is on disk,
> >and a direct read is supposed to read the data that is present
> >in the file at the time it is issued. 
> 
> So after a writeout it will be coherent of course, so the point in
> question is what happens when someone comes in and dirties it at the
> worst possible moment? That relates to the paragraph below...
> 
> >>All you need is to do a writeout before a direct IO read, which is
> >>what generic dio code does.
> >
> >
> >No, that's not good enough - after writeout but before the
> >direct I/O read is issued a process can fault the page and dirty
> >it. If you do a direct read, followed by a buffered read you should
> >get the same data. The only way to guarantee this is to chuck out
> >any cached pages across the range of the direct I/O so they are
> >fetched again from disk on the next buffered I/O. i.e. coherent
> >at the time the direct I/O is issued.
> 
> ... so surely if you do a direct read followed by a buffered read,
> you should *not* get the same data if there has been some activity
> to modify that part of the file in the meantime (whether that be a
> buffered or direct write).

Right. And that is what happens in XFS because it purges the
caches on direct I/O and forces data to be re-read from disk.

Effectively, if you are mixing direct I/O with other types of I/O
(buffered or mmap) then the application really needs to be certain
it is doing the right thing because there are races that can occur
below the filesystem. All we care about in the filesystem is that
what we cache is the same as what is on disk, and that implies that
direct I/O needs to purge the cache regardless of the state it is in

Hence we need to unmap pages and use truncate semantics on them to
ensure they are removed from the page cache

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

David Chinner wrote:

On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote:



And why not just leave it in the pagecache and be done with it?



because what is in cache is then not coherent with what is on disk,
and a direct read is supposed to read the data that is present
in the file at the time it is issued. 


So after a writeout it will be coherent of course, so the point in
question is what happens when someone comes in and dirties it at the
worst possible moment? That relates to the paragraph below...


All you need is to do a writeout before a direct IO read, which is
what generic dio code does.



No, that's not good enough - after writeout but before the
direct I/O read is issued a process can fault the page and dirty
it. If you do a direct read, followed by a buffered read you should
get the same data. The only way to guarantee this is to chuck out
any cached pages across the range of the direct I/O so they are
fetched again from disk on the next buffered I/O. i.e. coherent
at the time the direct I/O is issued.


... so surely if you do a direct read followed by a buffered read,
you should *not* get the same data if there has been some activity
to modify that part of the file in the meantime (whether that be a
buffered or direct write).


but in that case you'll either have to live with some racyness
(which is what the generic code does), or have a higher level
synchronisation to prevent buffered + direct IO writes I suppose?



The XFS inode iolock - direct I/O writes take it shared, buffered
writes takes it exclusive - so you can't do both at once. Buffered
reads take it shared, which is another reason why we need to purge
the cache on direct I/O writes - they can operate concurrently
(and coherently) with buffered reads.


Ah, I'm glad to see somebody cares about doing the right thing ;)
Maybe I'll use XFS for my filesystems in future.

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

Peter Zijlstra wrote:

On Thu, 2007-01-25 at 00:43 +1100, Nick Piggin wrote:



Have you seen the new launder_page() a_op? called from
invalidate_inode_pages2_range()


It would have been nice to make that one into a more potentially
useful generic callback.



That can still be done when the need arises, right?


Yeah I guess so.


But why was it introduced, exactly? I can't tell from the code or
the discussion why NFS couldn't start the IO, and signal the caller
to wait_on_page_writeback and retry? That seemed to me like the
conventional fix.



to quote a bit:

On Tue, 19 Dec 2006 18:19:38 -0500
Trond Myklebust <[EMAIL PROTECTED]> wrote:



   NFS: Fix race in nfs_release_page()
   
   invalidate_inode_pages2() may set the dirty bit on a page owing to the call
   to unmap_mapping_range() after the page was locked. In order to fix this,
   NFS has hooked the releasepage() method. This, however leads to deadlocks
   in other parts of the VM.



and:



Now, arguably the VM shouldn't be calling try_to_release_page() with
__GFP_FS when it's holding a lock on a page.

But otoh, NFS should never be running lock_page() within nfs_release_page()
against the page which was passed into nfs_release_page().  It'll deadlock
for sure.


The reason why it is happening is that the last dirty page from that
inode gets cleaned, resulting in a call to dput().


OK but what's the problem with just failing to release the page if it
is dirty, I wonder? In the worst case, page reclaim will just end up
doing a writeout to clean it.
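
(i.e. something of this shape -- a sketch of the suggestion with
stand-in types, not NFS's actual ->releasepage():)

struct page;                            /* stand-in for the kernel's struct page */
int page_is_dirty(struct page *page);   /* stand-in for PageDirty()              */
int free_private_state(struct page *page);

/* Refuse to release dirty pages; reclaim will write them back and try
 * again later, so no extra locking (and no deadlock) is needed here. */
static int sketch_releasepage(struct page *page)
{
        if (page_is_dirty(page))
                return 0;               /* "cannot release right now" */
        return free_private_state(page);
}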

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote:
> Peter Zijlstra wrote:
> >On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:
> >
> >>With the recent changes to cancel_dirty_pages(), XFS will
> >>dump warnings in the syslog because it can truncate_inode_pages()
> >>on dirty mapped pages.
> >>
> >>I've determined that this is indeed correct behaviour for XFS
> >>as this can happen in the case of races on mmap()d files with
> >>direct I/O. In this case when we do a direct I/O read, we
> >>flush the dirty pages to disk, then truncate them out of the
> >>page cache. Unfortunately, between the flush and the truncate
> >>the mmap could dirty the page again. At this point we toss a
> >>dirty page that is mapped.
> >
> >
> >This sounds iffy, why not just leave the page in the pagecache if it's
> >mapped anyway?
> 
> And why not just leave it in the pagecache and be done with it?

because what is in cache is then not coherent with what is on disk,
and a direct read is supposed to read the data that is present
in the file at the time it is issued. 

> All you need is to do a writeout before a direct IO read, which is
> what generic dio code does.

No, that's not good enough - after writeout but before the
direct I/O read is issued a process can fault the page and dirty
it. If you do a direct read, followed by a buffered read you should
get the same data. The only way to guarantee this is to chuck out
any cached pages across the range of the direct I/O so they are
fetched again from disk on the next buffered I/O. i.e. coherent
at the time the direct I/O is issued.
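
(The window in comment form -- helper names as in the earlier sketch,
purely illustrative:)

/*
 * With writeback alone there is a window:
 *
 *   direct reader                      mmap writer
 *   -------------                      -----------
 *   write_back_dirty_pages()
 *                                      write fault redirties a cached page
 *   submit_direct_io()   ------------> returns data older than what the
 *                                      page cache now holds
 *
 * Tossing (and unmapping) the cached pages over the DIO range closes it:
 * the writer's next access must refault, and a following buffered read
 * fetches the range from disk again.
 */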

> I guess you'll say that direct writes still need to remove pages,

Yup.

> but in that case you'll either have to live with some racyness
> (which is what the generic code does), or have a higher level
> synchronisation to prevent buffered + direct IO writes I suppose?

The XFS inode iolock - direct I/O writes take it shared, buffered
writes takes it exclusive - so you can't do both at once. Buffered
reads take it shared, which is another reason why we need to purge
the cache on direct I/O writes - they can operate concurrently
(and coherently) with buffered reads.
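
(The rules above as a sketch, with a stand-in reader/writer lock type --
illustrative only, not the XFS code:)

struct iolock;                          /* stand-in for the per-inode iolock */
void lock_shared(struct iolock *l);
void lock_excl(struct iolock *l);
void unlock(struct iolock *l);

static void direct_io_write(struct iolock *l)
{
        lock_shared(l);                 /* DIO writes can run concurrently */
        /* ... purge cache over the range, then write to disk ... */
        unlock(l);
}

static void buffered_write(struct iolock *l)
{
        lock_excl(l);                   /* excluded from any DIO write */
        /* ... copy into the page cache ... */
        unlock(l);
}

static void buffered_read(struct iolock *l)
{
        lock_shared(l);                 /* may run alongside a DIO write,   */
        /* ... read from the page cache; hence the cache purge above ...    */
        unlock(l);
}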

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Wed, Jan 24, 2007 at 01:13:55PM +0100, Peter Zijlstra wrote:
> On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:
> > With the recent changes to cancel_dirty_pages(), XFS will
> > dump warnings in the syslog because it can truncate_inode_pages()
> > on dirty mapped pages.
> > 
> > I've determined that this is indeed correct behaviour for XFS
> > as this can happen in the case of races on mmap()d files with
> > direct I/O. In this case when we do a direct I/O read, we
> > flush the dirty pages to disk, then truncate them out of the
> > page cache. Unfortunately, between the flush and the truncate
> > the mmap could dirty the page again. At this point we toss a
> > dirty page that is mapped.
> 
> This sounds iffy, why not just leave the page in the pagecache if it's
> mapped anyway?

Because then fsx fails.

> > None of the existing functions for truncating pages or invalidating
> > pages work in this situation. Invalidating a page only works for
> > non-dirty pages with non-dirty buffers, and they only work for
> > whole pages and XFS requires partial page truncation.
> > 
> > On top of that the page invalidation functions don't actually
> > call into the filesystem to invalidate the page and so the filesystem
> > can't actually invalidate the page properly (e.g. do stuff based on
> > private buffer head flags).
> 
> Have you seen the new launder_page() a_op? called from
> invalidate_inode_pages2_range()

No, but we can't use invalidate_inode_pages2_range() because it
doesn't handle partial pages. I tried that first and it left warnings
in the syslog and fsx failed.

> > So that leaves us needing to use truncate semantics and the problem
> > is that none of them unmap pages in a non-racy manner - if they
> > unmap pages they do it separately to the truncate of the page,
> > leading to races with mmap redirtying the page between the unmap and
> > the truncate of the page.
> 
> Isn't there still a race where the page fault path doesn't yet lock the
> page and can just reinsert it?

Yes, but it's a tiny race compared to the other mechanisms
available.

> Nick's pagefault rework should rid us of this by always locking the page
> in the fault path.

Yes, and that's what I'm relying on to fix the problem completely.
invalidate_inode_pages2_range() needs this fix as well to be race
free, so it's not like I'm introducing a new problem.
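
(The shape of the per-page operation being described -- a sketch with
invented helpers, not the patch itself:)

struct page;                            /* stand-ins mirroring kernel naming */
void lock_page(struct page *page);
void unlock_page(struct page *page);
void unmap_page_from_all_mms(struct page *page);
void remove_page_from_page_cache(struct page *page);

/* Unmap and remove in one locked step so mmap cannot redirty the page in
 * between; the remaining hole is a fault path that reinserts the page
 * without taking the page lock, which the pagefault rework closes. */
static void truncate_unmap_page(struct page *page)
{
        lock_page(page);
        unmap_page_from_all_mms(page);
        remove_page_from_page_cache(page);
        unlock_page(page);
}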

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Peter Zijlstra
On Thu, 2007-01-25 at 00:43 +1100, Nick Piggin wrote:

> > Have you seen the new launder_page() a_op? called from
> > invalidate_inode_pages2_range()
> 
> It would have been nice to make that one into a more potentially
> useful generic callback.

That can still be done when the need arises, right?

> But why was it introduced, exactly? I can't tell from the code or
> the discussion why NFS couldn't start the IO, and signal the caller
> to wait_on_page_writeback and retry? That seemed to me like the
> conventional fix.

to quote a bit:

On Tue, 19 Dec 2006 18:19:38 -0500
Trond Myklebust <[EMAIL PROTECTED]> wrote:

> NFS: Fix race in nfs_release_page()
> 
> invalidate_inode_pages2() may set the dirty bit on a page owing to the call
> to unmap_mapping_range() after the page was locked. In order to fix this,
> NFS has hooked the releasepage() method. This, however leads to deadlocks
> in other parts of the VM.

and:

> > Now, arguably the VM shouldn't be calling try_to_release_page() with
> > __GFP_FS when it's holding a lock on a page.
> > 
> > But otoh, NFS should never be running lock_page() within nfs_release_page()
> > against the page which was passed into nfs_release_page().  It'll deadlock
> > for sure.
> 
> The reason why it is happening is that the last dirty page from that
> inode gets cleaned, resulting in a call to dput().



Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

Peter Zijlstra wrote:

On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:


With the recent changes to cancel_dirty_pages(), XFS will
dump warnings in the syslog because it can truncate_inode_pages()
on dirty mapped pages.

I've determined that this is indeed correct behaviour for XFS
as this can happen in the case of races on mmap()d files with
direct I/O. In this case when we do a direct I/O read, we
flush the dirty pages to disk, then truncate them out of the
page cache. Unfortunately, between the flush and the truncate
the mmap could dirty the page again. At this point we toss a
dirty page that is mapped.



This sounds iffy, why not just leave the page in the pagecache if it's
mapped anyway?


And why not just leave it in the pagecache and be done with it?

All you need is to do a writeout before a direct IO read, which is
what generic dio code does.

I guess you'll say that direct writes still need to remove pages,
but in that case you'll either have to live with some racyness
(which is what the generic code does), or have a higher level
synchronisation to prevent buffered + direct IO writes I suppose?


None of the existing functions for truncating pages or invalidating
pages work in this situation. Invalidating a page only works for
non-dirty pages with non-dirty buffers, and they only work for
whole pages and XFS requires partial page truncation.

On top of that the page invalidation functions don't actually
call into the filesystem to invalidate the page and so the filesystem
can't actually invalidate the page properly (e.g. do stuff based on
private buffer head flags).



Have you seen the new launder_page() a_op? called from
invalidate_inode_pages2_range()


It would have been nice to make that one into a more potentially
useful generic callback.

But why was it introduced, exactly? I can't tell from the code or
the discussion why NFS couldn't start the IO, and signal the caller
to wait_on_page_writeback and retry? That seemed to me like the
conventional fix.
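
(i.e. roughly this shape -- a sketch of the alternative being asked
about, with stand-in helpers and a made-up return convention:)

struct page;
int  page_is_dirty(struct page *page);           /* stand-in for PageDirty()     */
void start_page_writeback_io(struct page *page); /* stand-in: kick off the write */

#define RETRY_AFTER_WRITEBACK 1                  /* made-up "wait and retry" code */

/* Start the I/O here and tell the caller to wait_on_page_writeback() and
 * come back, instead of adding a dedicated launder_page() hook. */
static int start_io_and_retry(struct page *page)
{
        if (!page_is_dirty(page))
                return 0;
        start_page_writeback_io(page);
        return RETRY_AFTER_WRITEBACK;
}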

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Peter Zijlstra
On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:
> With the recent changes to cancel_dirty_pages(), XFS will
> dump warnings in the syslog because it can truncate_inode_pages()
> on dirty mapped pages.
> 
> I've determined that this is indeed correct behaviour for XFS
> as this can happen in the case of races on mmap()d files with
> direct I/O. In this case when we do a direct I/O read, we
> flush the dirty pages to disk, then truncate them out of the
> page cache. Unfortunately, between the flush and the truncate
> the mmap could dirty the page again. At this point we toss a
> dirty page that is mapped.

This sounds iffy, why not just leave the page in the pagecache if it's
mapped anyway?

> None of the existing functions for truncating pages or invalidating
> pages work in this situation. Invalidating a page only works for
> non-dirty pages with non-dirty buffers, and they only work for
> whole pages and XFS requires partial page truncation.
> 
> On top of that the page invalidation functions don't actually
> call into the filesystem to invalidate the page and so the filesystem
> can't actually invalidate the page properly (e.g. do stuff based on
> private buffer head flags).

Have you seen the new launder_page() a_op? called from
invalidate_inode_pages2_range()

> So that leaves us needing to use truncate semantics and the problem
> is that none of them unmap pages in a non-racy manner - if they
> unmap pages they do it separately to the truncate of the page,
> leading to races with mmap redirtying the page between the unmap and
> the truncate of the page.

Isn't there still a race where the page fault path doesn't yet lock the
page and can just reinsert it?

Nick's pagefault rework should rid us of this by always locking the page
in the fault path.

> Hence we need a truncate function that unmaps the pages while they
> are locked for truncate in a similar fashion to
> invalidate_inode_pages2_range(). The following patch (unchanged from
> the last time it was sent) does this. The XFS changes are in a
> second patch.
> 
> The patch has been tested on ia64 and x86-64 via XFSQA and a lot
> of fsx testing mixing mmap and direct I/O operations.
> 
> Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Peter Zijlstra
On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:
 With the recent changes to cancel_dirty_pages(), XFS will
 dump warnings in the syslog because it can truncate_inode_pages()
 on dirty mapped pages.
 
 I've determined that this is indeed correct behaviour for XFS
 as this can happen in the case of races on mmap()d files with
 direct I/O. In this case when we do a direct I/O read, we
 flush the dirty pages to disk, then truncate them out of the
 page cache. Unfortunately, between the flush and the truncate
 the mmap could dirty the page again. At this point we toss a
 dirty page that is mapped.

This sounds iffy, why not just leave the page in the pagecache if its
mapped anyway?

 None of the existing functions for truncating pages or invalidating
 pages work in this situation. Invalidating a page only works for
 non-dirty pages with non-dirty buffers, and they only work for
 whole pages and XFS requires partial page truncation.
 
 On top of that the page invalidation functions don't actually
 call into the filesystem to invalidate the page and so the filesystem
 can't actually invalidate the page properly (e.g. do stuff based on
 private buffer head flags).

Have you seen the new launder_page() a_op? called from
invalidate_inode_pages2_range()

 So that leaves us needing to use truncate semantics and the problem
 is that none of them unmap pages in a non-racy manner - if they
 unmap pages they do it separately to the truncate of the page,
 leading to races with mmap redirtying the page between the unmap and
 the truncate ofthe page.

Isn't there still a race where the page fault path doesn't yet lock the
page and can just reinsert it?

Nick's pagefault rework should rid us of this by always locking the page
in the fault path.

 Hence we need a truncate function that unmaps the pages while they
 are locked for truncate in a similar fashion to
 invalidate_inode_pages2_range(). The following patch (unchanged from
 the last time it was sent) does this. The XFS changes are in a
 second patch.
 
 The patch has been test on ia64 and x86-64 via XFSQA and a lot
 of fsx mixing mmap and direct I/O operations.
 
 Signed-off-by: Dave Chinner [EMAIL PROTECTED]


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

Peter Zijlstra wrote:

On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:


With the recent changes to cancel_dirty_pages(), XFS will
dump warnings in the syslog because it can truncate_inode_pages()
on dirty mapped pages.

I've determined that this is indeed correct behaviour for XFS
as this can happen in the case of races on mmap()d files with
direct I/O. In this case when we do a direct I/O read, we
flush the dirty pages to disk, then truncate them out of the
page cache. Unfortunately, between the flush and the truncate
the mmap could dirty the page again. At this point we toss a
dirty page that is mapped.



This sounds iffy, why not just leave the page in the pagecache if its
mapped anyway?


And why not just leave it in the pagecache and be done with it?

All you need is to do a writeout before a direct IO read, which is
what generic dio code does.

I guess you'll say that direct writes still need to remove pages,
but in that case you'll either have to live with some racyness
(which is what the generic code does), or have a higher level
synchronisation to prevent buffered + direct IO writes I suppose?


None of the existing functions for truncating pages or invalidating
pages work in this situation. Invalidating a page only works for
non-dirty pages with non-dirty buffers, and they only work for
whole pages and XFS requires partial page truncation.

On top of that the page invalidation functions don't actually
call into the filesystem to invalidate the page and so the filesystem
can't actually invalidate the page properly (e.g. do stuff based on
private buffer head flags).



Have you seen the new launder_page() a_op? called from
invalidate_inode_pages2_range()


It would have been nice to make that one into a more potentially
useful generic callback.

But why was it introduced, exactly? I can't tell from the code or
the discussion why NFS couldn't start the IO, and signal the caller
to wait_on_page_writeback and retry? That seemed to me like the
convetional fix.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Peter Zijlstra
On Thu, 2007-01-25 at 00:43 +1100, Nick Piggin wrote:

  Have you seen the new launder_page() a_op? called from
  invalidate_inode_pages2_range()
 
 It would have been nice to make that one into a more potentially
 useful generic callback.

That can still be done when the need arises, right?

 But why was it introduced, exactly? I can't tell from the code or
 the discussion why NFS couldn't start the IO, and signal the caller
 to wait_on_page_writeback and retry? That seemed to me like the
 convetional fix.

to quote a bit:

On Tue, 19 Dec 2006 18:19:38 -0500
Trond Myklebust [EMAIL PROTECTED] wrote:

 NFS: Fix race in nfs_release_page()
 
 invalidate_inode_pages2() may set the dirty bit on a page owing to the 
 call
 to unmap_mapping_range() after the page was locked. In order to fix this,
 NFS has hooked the releasepage() method. This, however leads to deadlocks
 in other parts of the VM.

and:

  Now, arguably the VM shouldn't be calling try_to_release_page() with
  __GFP_FS when it's holding a lock on a page.
  
  But otoh, NFS should never be running lock_page() within nfs_release_page()
  against the page which was passed into nfs_release_page().  It'll deadlock
  for sure.
 
 The reason why it is happening is that the last dirty page from that
 inode gets cleaned, resulting in a call to dput().

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Wed, Jan 24, 2007 at 01:13:55PM +0100, Peter Zijlstra wrote:
 On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:
  With the recent changes to cancel_dirty_pages(), XFS will
  dump warnings in the syslog because it can truncate_inode_pages()
  on dirty mapped pages.
  
  I've determined that this is indeed correct behaviour for XFS
  as this can happen in the case of races on mmap()d files with
  direct I/O. In this case when we do a direct I/O read, we
  flush the dirty pages to disk, then truncate them out of the
  page cache. Unfortunately, between the flush and the truncate
  the mmap could dirty the page again. At this point we toss a
  dirty page that is mapped.
 
 This sounds iffy, why not just leave the page in the pagecache if its
 mapped anyway?

Because then fsx fails.

  None of the existing functions for truncating pages or invalidating
  pages work in this situation. Invalidating a page only works for
  non-dirty pages with non-dirty buffers, and they only work for
  whole pages and XFS requires partial page truncation.
  
  On top of that the page invalidation functions don't actually
  call into the filesystem to invalidate the page and so the filesystem
  can't actually invalidate the page properly (e.g. do stuff based on
  private buffer head flags).
 
 Have you seen the new launder_page() a_op? called from
 invalidate_inode_pages2_range()

No, but we can't use invalidate_inode_pages2_range() because it
doesn't handle partial pages. I tried that first and it left warnings
in the syslog and fsx failed.

  So that leaves us needing to use truncate semantics and the problem
  is that none of them unmap pages in a non-racy manner - if they
  unmap pages they do it separately to the truncate of the page,
  leading to races with mmap redirtying the page between the unmap and
  the truncate ofthe page.
 
 Isn't there still a race where the page fault path doesn't yet lock the
 page and can just reinsert it?

Yes, but it's a tiny race compared to the other mechanisms
available.

 Nick's pagefault rework should rid us of this by always locking the page
 in the fault path.

Yes, and that's what I'm relying on to fix the problem completely.
invalidate_inode_pages2_range() needs this fix as well to be race
free, so it's not like I'm introducing a new problem

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote:
 Peter Zijlstra wrote:
 On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:
 
 With the recent changes to cancel_dirty_pages(), XFS will
 dump warnings in the syslog because it can truncate_inode_pages()
 on dirty mapped pages.
 
 I've determined that this is indeed correct behaviour for XFS
 as this can happen in the case of races on mmap()d files with
 direct I/O. In this case when we do a direct I/O read, we
 flush the dirty pages to disk, then truncate them out of the
 page cache. Unfortunately, between the flush and the truncate
 the mmap could dirty the page again. At this point we toss a
 dirty page that is mapped.
 
 
 This sounds iffy, why not just leave the page in the pagecache if its
 mapped anyway?
 
 And why not just leave it in the pagecache and be done with it?

because what is in cache is then not coherent with what is on disk,
and a direct read is supposed to read the data that is present
in the file at the time it is issued. 

 All you need is to do a writeout before a direct IO read, which is
 what generic dio code does.

No, that's not good enough - after writeout but before the
direct I/O read is issued a process can fault the page and dirty
it. If you do a direct read, followed by a buffered read you should
get the same data. The only way to guarantee this is to chuck out
any cached pages across the range of the direct I/O so they are
fetched again from disk on the next buffered I/O. i.e. coherent
at the time the direct I/O is issued.

 I guess you'll say that direct writes still need to remove pages,

Yup.

 but in that case you'll either have to live with some racyness
 (which is what the generic code does), or have a higher level
 synchronisation to prevent buffered + direct IO writes I suppose?

The XFS inode iolock - direct I/O writes take it shared, buffered
writes takes it exclusive - so you can't do both at once. Buffered
reads take is shared, which is another reason why we need to purge
the cache on direct I/O writes - they can operate concurrently
(and coherently) with buffered reads.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

Peter Zijlstra wrote:

On Thu, 2007-01-25 at 00:43 +1100, Nick Piggin wrote:



Have you seen the new launder_page() a_op? called from
invalidate_inode_pages2_range()


It would have been nice to make that one into a more potentially
useful generic callback.



That can still be done when the need arises, right?


Yeah I guess so.


But why was it introduced, exactly? I can't tell from the code or
the discussion why NFS couldn't start the IO, and signal the caller
to wait_on_page_writeback and retry? That seemed to me like the
convetional fix.



to quote a bit:

On Tue, 19 Dec 2006 18:19:38 -0500
Trond Myklebust [EMAIL PROTECTED] wrote:



   NFS: Fix race in nfs_release_page()
   
   invalidate_inode_pages2() may set the dirty bit on a page owing to the call

   to unmap_mapping_range() after the page was locked. In order to fix this,
   NFS has hooked the releasepage() method. This, however leads to deadlocks
   in other parts of the VM.



and:



Now, arguably the VM shouldn't be calling try_to_release_page() with
__GFP_FS when it's holding a lock on a page.

But otoh, NFS should never be running lock_page() within nfs_release_page()
against the page which was passed into nfs_release_page().  It'll deadlock
for sure.


The reason why it is happening is that the last dirty page from that
inode gets cleaned, resulting in a call to dput().


OK but what's the problem with just failing to release the page if it
is dirty, I wonder? In the worst case, page reclaim will just end up
doing a writeout to clean it.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

David Chinner wrote:

On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote:



And why not just leave it in the pagecache and be done with it?



because what is in cache is then not coherent with what is on disk,
and a direct read is supposed to read the data that is present
in the file at the time it is issued. 


So after a writeout it will be coherent of course, so the point in
question is what happens when someone comes in and dirties it at the
worst possible moment? That relates to the paragraph below...


All you need is to do a writeout before a direct IO read, which is
what generic dio code does.



No, that's not good enough - after writeout but before the
direct I/O read is issued a process can fault the page and dirty
it. If you do a direct read, followed by a buffered read you should
get the same data. The only way to guarantee this is to chuck out
any cached pages across the range of the direct I/O so they are
fetched again from disk on the next buffered I/O. i.e. coherent
at the time the direct I/O is issued.


... so surely if you do a direct read followed by a buffered read,
you should *not* get the same data if there has been some activity
to modify that part of the file in the meantime (whether that be a
buffered or direct write).


but in that case you'll either have to live with some raciness
(which is what the generic code does), or have a higher level
synchronisation to prevent buffered + direct IO writes I suppose?



The XFS inode iolock - direct I/O writes take it shared, buffered
writes take it exclusive - so you can't do both at once. Buffered
reads take it shared, which is another reason why we need to purge
the cache on direct I/O writes - they can operate concurrently
(and coherently) with buffered reads.


Ah, I'm glad to see somebody cares about doing the right thing ;)
Maybe I'll use XFS for my filesystems in future.

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
 David Chinner wrote:
 On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote:
 
 And why not just leave it in the pagecache and be done with it?
 
 
 because what is in cache is then not coherent with what is on disk,
 and a direct read is supposed to read the data that is present
 in the file at the time it is issued. 
 
 So after a writeout it will be coherent of course, so the point in
 question is what happens when someone comes in and dirties it at the
 worst possible moment? That relates to the paragraph below...
 
 All you need is to do a writeout before a direct IO read, which is
 what generic dio code does.
 
 
 No, that's not good enough - after writeout but before the
 direct I/O read is issued a process can fault the page and dirty
 it. If you do a direct read, followed by a buffered read you should
 get the same data. The only way to guarantee this is to chuck out
 any cached pages across the range of the direct I/O so they are
 fetched again from disk on the next buffered I/O. i.e. coherent
 at the time the direct I/O is issued.
 
 ... so surely if you do a direct read followed by a buffered read,
 you should *not* get the same data if there has been some activity
 to modify that part of the file in the meantime (whether that be a
 buffered or direct write).

Right. And that is what happens in XFS because it purges the
caches on direct I/O and forces data to be re-read from disk.

Effectively, if you are mixing direct I/O with other types of I/O
(buffered or mmap) then the application really needs to be certain
it is doing the right thing because there are races that can occur
below the filesystem. All we care about in the filesystem is that
what we cache is the same as what is on disk, and that implies that
direct I/O needs to purge the cache regardless of the state it is in

Hence we need to unmap pages and use truncate semantics on them to
ensure they are removed from the page cache
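
The sequence being described - write dirty data back, unmap, then toss the
pages - looks roughly like the sketch below. The helper name is invented;
note that, done naively like this, there is still a window between the
unmap and the truncate in which mmap can redirty a page, which is the race
the patch at the end of this thread closes:

#include <linux/fs.h>
#include <linux/mm.h>

static int example_purge_range(struct address_space *mapping,
                               loff_t pos, loff_t len)
{
        int ret;

        /* push dirty cached data out to disk (whole mapping, for brevity) */
        ret = filemap_fdatawrite(mapping);
        if (ret == 0)
                ret = filemap_fdatawait(mapping);
        if (ret)
                return ret;

        /* tear down mappings so later accesses fault the data in from disk */
        unmap_mapping_range(mapping, pos, len, 0);

        /* then toss the now-clean pages out of the page cache */
        truncate_inode_pages_range(mapping, pos, pos + len - 1);
        return 0;
}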

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

David Chinner wrote:

On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:



... so surely if you do a direct read followed by a buffered read,
you should *not* get the same data if there has been some activity
to modify that part of the file in the meantime (whether that be a
buffered or direct write).



Right. And that is what happens in XFS because it purges the
caches on direct I/O and forces data to be re-read from disk.


And that is critical for direct IO writes, of course.


Effectively, if you are mixing direct I/O with other types of I/O
(buffered or mmap) then the application really needs to be certain
it is doing the right thing because there are races that can occur
below the filesystem. All we care about in the filesystem is that
what we cache is the same as what is on disk, and that implies that
direct I/O needs to purge the cache regardless of the state it is in

Hence we need to unmap pages and use truncate semantics on them to
ensure they are removed from the page cache


OK, I understand that this does need to happen (at least for writes),
so you need to fix it regardless of the DIO read issue.

But I'm just interested about DIO reads. I think you can get pretty
reasonable semantics without discarding pagecache, but the semantics
are weaker in one aspect.

DIO read
1. writeback page
2. read from disk

Now your read will pick up data no older than 1. And if a buffered
write happens after 2, then there is no problem either.

So if you are doing a buffered write and DIO read concurrently, you
want synchronisation so the buffered write happens either before 1
or after 2 -- the DIO read will see either all or none of the write.

Supposing your pagecache isn't invalidated, then a buffered write
(from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
then the DIO read will find either none, some, or all of that write.

So I guess what you are preventing is the "some" case. Am I right?
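
For what it's worth, the scenario being discussed can be poked at from
userspace with something like the sketch below: one mapping dirties the
file while the same range is read with O_DIRECT. Whether the direct read
observes "none", "some", or "all" of the store is exactly the question
above. Error handling is omitted, a 512-byte O_DIRECT alignment is
assumed, and in the interesting case the memset() runs in a second thread
or process concurrently with the pread():

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN 4096

int main(void)
{
        int dfd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
        int mfd = open("testfile", O_RDWR);
        void *buf;
        char *map;

        ftruncate(mfd, LEN);
        map = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, mfd, 0);
        posix_memalign(&buf, 512, LEN);

        memset(map, 'A', LEN);          /* dirty the range through mmap */
        pread(dfd, buf, LEN, 0);        /* direct read of the same range */

        munmap(map, LEN);
        free(buf);
        close(mfd);
        close(dfd);
        return 0;
}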

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote:
 David Chinner wrote:
 On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
 
 ... so surely if you do a direct read followed by a buffered read,
 you should *not* get the same data if there has been some activity
 to modify that part of the file in the meantime (whether that be a
 buffered or direct write).
 
 
 Right. And that is what happens in XFS because it purges the
 caches on direct I/O and forces data to be re-read from disk.
 
 And that is critical for direct IO writes, of course.
 
 Effectively, if you are mixing direct I/O with other types of I/O
 (buffered or mmap) then the application really needs to be certain
 it is doing the right thing because there are races that can occur
 below the filesystem. All we care about in the filesystem is that
 what we cache is the same as what is on disk, and that implies that
 direct I/O needs to purge the cache regardless of the state it is in
 
 Hence we need to unmap pages and use truncate semantics on them to
 ensure they are removed from the page cache
 
 OK, I understand that this does need to happen (at least for writes),
 so you need to fix it regardless of the DIO read issue.
 
 But I'm just interested about DIO reads. I think you can get pretty
 reasonable semantics without discarding pagecache, but the semantics
 are weaker in one aspect.
 
 DIO read
 1. writeback page
 2. read from disk
 
 Now your read will pick up data no older than 1. And if a buffered
 write happens after 2, then there is no problem either.
 
 So if you are doing a buffered write and DIO read concurrently, you
 want synchronisation so the buffered write happens either before 1
 or after 2 -- the DIO read will see either all or none of the write.
 
 Supposing your pagecache isn't invalidated, then a buffered write
 (from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
 then the DIO read will find either none, some, or all of that write.

 So I guess what you are preventing is the "some" case. Am I right?

No. The only thing that will happen here is that the direct read
will see _none_ of the write because the mmap write occurred during
the DIO read to a different set of pages in memory. There is no
"some" or "all" case here.

IOWs, at a single point in time we have 2 different views
of the one file which are both apparently valid and that is what
we are trying to avoid. We have a coherency problem here which is
solved by forcing the mmap write to reread the data off disk

Look at it this way - direct I/O in XFS implies an I/O barrier
(similar to a memory barrier). Writing back and tossing out of the
page cache at the start of the direct IO gives us an I/O coherency
barrier - everything before the direct IO is sync'd to disk before
the direct IO can proceed, and everything after the direct IO has
started must be fetched from disk again.

Because mmap I/O doesn't necessarily need I/O to change the state
of a page (think of a read fault then a later write fault), to make
the I/O barrier work correctly with mmap() we need to ensure that
it will fault the page from disk again. We can only do that by
unmapping the pages before tossing them from the page cache.
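
In code terms, the property needed is that each page is unmapped while it
is locked and removed from the page cache before it is unlocked. A
per-page sketch of that ordering (names invented; the real version is the
truncate helper in the patch that started this thread):

#include <linux/mm.h>
#include <linux/pagemap.h>

static void example_toss_page(struct address_space *mapping,
                              struct page *page)
{
        lock_page(page);

        /* unmap the page while it is locked for truncate ... */
        while (page_mapped(page))
                unmap_mapping_range(mapping,
                                    (loff_t)page->index << PAGE_CACHE_SHIFT,
                                    PAGE_CACHE_SIZE, 0);

        /* ... and remove it from the page cache before unlocking, so the
         * next access has to fault the data back in from disk */

        unlock_page(page);
}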

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

David Chinner wrote:

On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote:


David Chinner wrote:


On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:



... so surely if you do a direct read followed by a buffered read,
you should *not* get the same data if there has been some activity
to modify that part of the file in the meantime (whether that be a
buffered or direct write).



Right. And that is what happens in XFS because it purges the
caches on direct I/O and forces data to be re-read from disk.


And that is critical for direct IO writes, of course.



Effectively, if you are mixing direct I/O with other types of I/O
(buffered or mmap) then the application really needs to be certain
it is doing the right thing because there are races that can occur
below the filesystem. All we care about in the filesystem is that
what we cache is the same as what is on disk, and that implies that
direct I/O needs to purge the cache regardless of the state it is in

Hence we need to unmap pages and use truncate semantics on them to
ensure they are removed from the page cache


OK, I understand that this does need to happen (at least for writes),
so you need to fix it regardless of the DIO read issue.

But I'm just interested about DIO reads. I think you can get pretty
reasonable semantics without discarding pagecache, but the semantics
are weaker in one aspect.

DIO read
1. writeback page
2. read from disk

Now your read will pick up data no older than 1. And if a buffered
write happens after 2, then there is no problem either.

So if you are doing a buffered write and DIO read concurrently, you
want synchronisation so the buffered write happens either before 1
or after 2 -- the DIO read will see either all or none of the write.

Supposing your pagecache isn't invalidated, then a buffered write
(from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
then the DIO read will find either none, some, or all of that write.

So I guess what you are preventing is the "some" case. Am I right?



No. The only thing that will happen here is that the direct read
will see _none_ of the write because the mmap write occurred during
the DIO read to a different set of pages in memory. There is no
"some" or "all" case here.


But if the buffers get partially or completely written back in the
meantime, then the DIO read could see that.


IOWs, at a single point in time we have 2 different views
of the one file which are both apparently valid and that is what
we are trying to avoid. We have a coherency problem here which is
solved by forcing the mmap write to reread the data off disk


I don't see why the mmap write needs to reread data off disk. The
data on disk won't get changed by the DIO read.


Look at it this way - direct I/O in XFS implies an I/O barrier
(similar to a memory barrier). Writing back and tossing out of the
page cache at the start of the direct IO gives us an I/O coherency
barrier - everything before the direct IO is sync'd to disk before
the direct IO can proceed, and everything after the direct IO has
started must be fetched from disk again.

Because mmap I/O doesn't necessarily need I/O to change the state
of a page (think of a read fault then a later write fault), to make
the I/O barrier work correctly with mmap() we need to ensure that
it will fault the page from disk again. We can only do that by
unmapping the pages before tossing them from the page cache.


OK, but direct IO *reads* do not conceptually invalidate pagecache
sitting on top of those blocks. Pagecache becomes invalid when the
page no longer represents the most uptodate copy of the data (eg.
in the case of a direct IO write).

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:
 David Chinner wrote:
 On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote:
 
 David Chinner wrote:
 
 On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
 But I'm just interested about DIO reads. I think you can get pretty
 reasonable semantics without discarding pagecache, but the semantics
 are weaker in one aspect.
 
 DIO read
 1. writeback page
 2. read from disk
 
 Now your read will pick up data no older than 1. And if a buffered
 write happens after 2, then there is no problem either.
 
 So if you are doing a buffered write and DIO read concurrently, you
 want synchronisation so the buffered write happens either before 1
 or after 2 -- the DIO read will see either all or none of the write.
 
 Supposing your pagecache isn't invalidated, then a buffered write
 (from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
 then the DIO read will find either none, some, or all of that write.
 
 So I guess what you are preventing is the "some" case. Am I right?
 
 
 No. The only thing that will happen here is that the direct read
 will see _none_ of the write because the mmap write occurred during
 the DIO read to a different set of pages in memory. There is no
 "some" or "all" case here.
 
 But if the buffers get partially or completely written back in the
 meantime, then the DIO read could see that.

Only if you can dirty them and flush them to disk while the direct
read is waiting in the I/O queue (remember, the direct read flushes
dirty cached data before being issued). Given that we don't lock the
inode in the buffered I/O *writeback* path, we have to stop pages being
dirtied in the page cache up front so we don't have mmap writeback
over the top of the direct read.

Hence we have to prevent mmap from dirtying the same file offset we
are doing direct reads on until the direct read has been issued.

i.e. we need a barrier.

 IOWs, at a single point in time we have 2 different views
 of the one file which are both apparently valid and that is what
 we are trying to avoid. We have a coherency problem here which is
 solved by forcing the mmap write to reread the data off disk
 
 I don't see why the mmap write needs to reread data off disk. The
 data on disk won't get changed by the DIO read.

No, but the data _in memory_ will, and now when the direct read
completes it will return data that is different to what is in the page
cache. For direct I/O we define the correct data to be what is on
disk, not what is in memory, so any time we bypass what is in
memory, we need to ensure that we prevent the data being changed
again in memory before we issue the disk I/O.

 Look at it this way - direct I/O in XFS implies an I/O barrier
 (similar to a memory barrier). Writing back and tossing out of the
 page cache at the start of the direct IO gives us an I/O coherency
 barrier - everything before the direct IO is sync'd to disk before
 the direct IO can proceed, and everything after the direct IO has
 started must be fetched from disk again.
 
 Because mmap I/O doesn't necessarily need I/O to change the state
 of a page (think of a read fault then a later write fault), to make
 the I/O barrier work correctly with mmap() we need to ensure that
 it will fault the page from disk again. We can only do that by
 unmapping the pages before tossing them from the page cache.
 
 OK, but direct IO *reads* do not conceptually invalidate pagecache
 sitting on top of those blocks. Pagecache becomes invalid when the
 page no longer represents the most uptodate copy of the data (eg.
 in the case of a direct IO write).

In theory, yes.

In practice, if you don't invalidate the page cache you have no
mechanism of synchronising mmap with direct I/O, and at that
point you have no coherency model that you can work with in
your filesystem. You have to be able to guarantee synchronisation
between the different methods that can dirty data before you
can give any guarantees about data coherency.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread Nick Piggin

David Chinner wrote:

On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:


David Chinner wrote:



No. The only thing that will happen here is that the direct read
will see _none_ of the write because the mmap write occurred during
the DIO read to a different set of pages in memory. There is no
"some" or "all" case here.


But if the buffers get partially or completely written back in the
meantime, then the DIO read could see that.



Only if you can dirty them and flush them to disk while the direct
read is waiting in the I/O queue (remember, the direct read flushes
dirty cached data before being issued). Given that we don't lock the
inode in the buffered I/O *writeback* path, we have to stop pages being
dirtied in the page cache up front so we don't have mmap writeback
over the top of the direct read.


However unlikely it may be, that is what I'm talking about in my "some"
or "all" cases. Note that I'm not talking about a specific implementation
(eg. XFS I guess avoids "some"), but just the possible scenarios.


Hence we have to prevent mmap from dirtying the same file offset we
are doing direct reads on until the direct read has been issued.

i.e. we need a barrier.


So you need to eliminate the "some" case? Because of course "none" and
"all" are unavoidable.


IOWs, at a single point in time we have 2 different views
of the one file which are both apparently valid and that is what
we are trying to avoid. We have a coherency problem here which is
solved by forcing the mmap write to reread the data off disk


I don't see why the mmap write needs to reread data off disk. The
data on disk won't get changed by the DIO read.



No, but the data _in memory_ will, and now when the direct read
completes it will return data that is different to what is in the page
cache. For direct I/O we define the correct data to be what is on
disk, not what is in memory, so any time we bypass what is in
memory, we need to ensure that we prevent the data being changed
again in memory before we issue the disk I/O.


But when you drop your locks, before the direct IO read returns, some
guy can mmap and dirty the pagecache anyway. By the time the read
returns, the data is stale. This obviously must be synchronised in
userspace.

As I said, you can't avoid "none" or "all", and you can't say that
userspace will see the most uptodate copy of the data. All you can
say is that it will be no older than when the syscall is first made.
Which is what you get if you simply writeback but do not invalidate
pagecache for direct IO reads.

--
SUSE Labs, Novell Inc.


Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-24 Thread David Chinner
On Thu, Jan 25, 2007 at 03:25:29PM +1100, Nick Piggin wrote:
 David Chinner wrote:
 On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:
 
 David Chinner wrote:
 
 No. The only thing that will happen here is that the direct read
 will see _none_ of the write because the mmap write occurred during
 the DIO read to a different set of pages in memory. There is no
 "some" or "all" case here.
 
 But if the buffers get partially or completely written back in the
 meantime, then the DIO read could see that.
 
 
 Only if you can dirty them and flush them to disk while the direct
 read is waiting in the I/O queue (remember, the direct read flushes
 dirty cached data before being issued). Given that we don't lock the
 inode in the buffered I/O *writeback* path, we have to stop pages being
 dirtied in the page cache up front so we don't have mmap writeback
 over the top of the direct read.
 
 However unlikely it may be, that is what I'm talking about in my "some"
 or "all" cases. Note that I'm not talking about a specific implementation
 (eg. XFS I guess avoids "some"), but just the possible scenarios.

Unfortunately, behaviour is different for different filesystems. I
can only answer for XFS, which is different to most of the other
filesystems in both locking and the way it treats the page cache.
IOWs, if you want to talk about details, then we have to talk about
specific implementations.


 Hence we have to prevent mmap from dirtying the same file offset we
 are doing direct reads on until the direct read has been issued.
 
 i.e. we need a barrier.
 
 So you need to eliminate the "some" case? Because of course "none" and
 "all" are unavoidable.

"all" is avoidable, too, once you've kicked the pages out of
the page cache - you just have to block the buffered read that
refaulting the page will cause until the direct I/O completes.

 IOWs, at a single point in time we have 2 different views
 of the one file which are both apparently valid and that is what
 we are trying to avoid. We have a coherency problem here which is
 solved by forcing the mmap write to reread the data off disk
 
 I don't see why the mmap write needs to reread data off disk. The
 data on disk won't get changed by the DIO read.
 
 
 No, but the data _in memory_ will, and now when the direct read
 completes it will return data that is different to what is in the page
 cache. For direct I/O we define the correct data to be what is on
 disk, not what is in memory, so any time we bypass what is in
 memory, we need to ensure that we prevent the data being changed
 again in memory before we issue the disk I/O.
 
 But when you drop your locks, before the direct IO read returns, some
 guy can mmap and dirty the pagecache anyway.
 By the time the read returns, the data is stale.

Only if we leave the page in the page cache. If we toss the page,
the time it takes to do the I/O for the page fault is enough for
the direct I/O to complete. Sure it's not an absolute guarantee,
but if you want an absolute guarantee:

 This obviously must be synchronised in
 userspace.

As I said earlier.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


[PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS

2007-01-23 Thread David Chinner

With the recent changes to cancel_dirty_pages(), XFS will
dump warnings in the syslog because it can truncate_inode_pages()
on dirty mapped pages.

I've determined that this is indeed correct behaviour for XFS
as this can happen in the case of races on mmap()d files with
direct I/O. In this case when we do a direct I/O read, we
flush the dirty pages to disk, then truncate them out of the
page cache. Unfortunately, between the flush and the truncate
the mmap could dirty the page again. At this point we toss a
dirty page that is mapped.

None of the existing functions for truncating pages or invalidating
pages work in this situation. Invalidating a page only works for
non-dirty pages with non-dirty buffers, and they only work for
whole pages and XFS requires partial page truncation.

On top of that the page invalidation functions don't actually
call into the filesystem to invalidate the page and so the filesystem
can't actually invalidate the page properly (e.g. do stuff based on
private buffer head flags).

So that leaves us needing to use truncate semantics and the problem
is that none of them unmap pages in a non-racy manner - if they
unmap pages they do it separately to the truncate of the page,
leading to races with mmap redirtying the page between the unmap and
the truncate of the page.

Hence we need a truncate function that unmaps the pages while they
are locked for truncate in a similar fashion to
invalidate_inode_pages2_range(). The following patch (unchanged from
the last time it was sent) does this. The XFS changes are in a
second patch.

The patch has been tested on ia64 and x86-64 via XFSQA and a lot
of fsx mixing mmap and direct I/O operations.
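
For context, a hypothetical filesystem-side caller of the new helper might
look like the sketch below; this only shows the intended calling
convention and is not the actual XFS change, which is in the second patch:

#include <linux/fs.h>
#include <linux/mm.h>

static void example_flush_and_toss(struct address_space *mapping,
                                   loff_t pos, loff_t len)
{
        /* write any dirty cached data back to disk... */
        filemap_fdatawrite(mapping);
        filemap_fdatawait(mapping);

        /* ...then unmap and truncate the range; the helper unmaps each
         * page while it is locked so mmap cannot redirty it between the
         * unmap and its removal from the page cache */
        truncate_unmap_inode_pages_range(mapping, pos, pos + len - 1, 1);
}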

Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>

---
 
Index: 2.6.x-xfs-new/include/linux/mm.h
===
--- 2.6.x-xfs-new.orig/include/linux/mm.h   2007-01-15 15:09:57.0 
+1100
+++ 2.6.x-xfs-new/include/linux/mm.h2007-01-16 08:59:24.031897743 +1100
@@ -1060,6 +1060,8 @@ extern unsigned long page_unuse(struct p
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
   loff_t lstart, loff_t lend);
+extern void truncate_unmap_inode_pages_range(struct address_space *,
+  loff_t lstart, loff_t lend, int unmap);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int 
*);
Index: 2.6.x-xfs-new/mm/truncate.c
===
--- 2.6.x-xfs-new.orig/mm/truncate.c2007-01-16 08:59:23.947908876 +1100
+++ 2.6.x-xfs-new/mm/truncate.c 2007-01-16 09:35:53.102924243 +1100
@@ -59,7 +59,7 @@ void cancel_dirty_page(struct page *page
 
WARN_ON(++warncount < 5);
}
-   
+
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
@@ -122,16 +122,34 @@ invalidate_complete_page(struct address_
return ret;
 }
 
+/*
+ * This is a helper for truncate_unmap_inode_page. Unmap the page we
+ * are passed. Page must be locked by the caller.
+ */
+static void
+unmap_single_page(struct address_space *mapping, struct page *page)
+{
+   BUG_ON(!PageLocked(page));
+   while (page_mapped(page)) {
+   unmap_mapping_range(mapping,
+   (loff_t)page->index << PAGE_CACHE_SHIFT,
+   PAGE_CACHE_SIZE, 0);
+   }
+}
+
 /**
- * truncate_inode_pages - truncate range of pages specified by start and
+ * truncate_unmap_inode_pages_range - truncate range of pages specified by
+ * start and end byte offsets and optionally unmap them first.
  * end byte offsets
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
  * @lend: offset to which to truncate
+ * @unmap: unmap whole truncated pages if non-zero
  *
  * Truncate the page cache, removing the pages that are between
  * specified offsets (and zeroing out partial page
- * (if lstart is not page aligned)).
+ * (if lstart is not page aligned)). If specified, unmap the pages
+ * before they are removed.
  *
  * Truncate takes two passes - the first pass is nonblocking.  It will not
  * block on page locks and it will not block on writeback.  The second pass
@@ -146,8 +164,8 @@ invalidate_complete_page(struct address_
  * mapping is large, it is probably the case that the final pages are the most
  * recently touched, and freeing happens in ascending file offset order.
  */
-void truncate_inode_pages_range(struct address_space *mapping,
-   loff_t lstart, loff_t lend)
+void truncate_unmap_inode_pages_range(struct address_space *mapping,
+   loff_t lstart, loff_t lend, int unmap)
 {
const pgoff_t start =
