Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
David Chinner wrote: Only if we leave the page in the page cache. If we toss the page, the time it takes to do the I/O for the page fault is enough for the direct I/O to complete. Sure it's not an absolute guarantee, but if you want an absolute guarantee: So I guess you *could* relax it in theory... Anyway, don't take my pestering as advocacy for wanting XFS to do something more clever in such a corner case. I think you're quite right to be conservative and share codepaths between direct IO read and write. -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 03:25:29PM +1100, Nick Piggin wrote: > David Chinner wrote: > >On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote: > > > >>David Chinner wrote: > > >>>No. The only thing that will happen here is that the direct read > >>>will see _none_ of the write because the mmap write occurred during > >>>the DIO read to a different set of pages in memory. There is no > >>>"some" or "all" case here. > >> > >>But if the buffers get partially or completely written back in the > >>meantime, then the DIO read could see that. > > > > > >Only if you can dirty them and flush them to disk while the direct > >read is waiting in the I/O queue (remember, the direct read flushes > >dirty cached data before being issued). Given that we don't lock the > >inode in the buffered I/O *writeback* path, we have to stop pages being > >dirtied in the page cache up front so we don't have mmap writeback > >over the top of the direct read. > > However unlikely it may be, that is what I'm talking about in my "some" > or "all" cases. Note that I'm not talking about a specific implementation > (eg. XFS I guess avoids "some"), but just the possible scenarios. Unfortunately, behaviour is different for different filesystems. I can only answer for XFS, which is different to most of the other filesystems in both locking and the way it treats the page cache. IOWs, if you want to talk about details, then we have to talk about specific implementations. > >Hence we have to prevent mmap from dirtying the same file offset we > >are doing direct reads on until the direct read has been issued. > > > >i.e. we need a barrier. > > So you need to eliminate the "some" case? Because of course "none" and > "all" are unavoidable. "all" is avoidable, too, once you've kicked the pages out of the page cache - you just have to block the buffered read that refaulting the page triggers until the direct I/O completes. 
> >>>IOWs, at a single point in time we have 2 different views > >>>of the one file which are both apparently valid and that is what > >>>we are trying to avoid. We have a coherency problem here which is > >>>solved by forcing the mmap write to reread the data off disk > >> > >>I don't see why the mmap write needs to reread data off disk. The > >>data on disk won't get changed by the DIO read. > > > > > >No, but the data _in memory_ will, and now when the direct read > >completes it will contain data that is different to what is in the page > >cache. For direct I/O we define the correct data to be what is on > >disk, not what is in memory, so any time we bypass what is in > >memory, we need to ensure that we prevent the data being changed > >again in memory before we issue the disk I/O. > > But when you drop your locks, before the direct IO read returns, some > guy can mmap and dirty the pagecache anyway. > By the time the read returns, the data is stale. Only if we leave the page in the page cache. If we toss the page, the time it takes to do the I/O for the page fault is enough for the direct I/O to complete. Sure it's not an absolute guarantee, but if you want an absolute guarantee: > This obviously must be synchronised in > userspace. As I said earlier. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
David Chinner wrote: On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote: David Chinner wrote: No. The only thing that will happen here is that the direct read will see _none_ of the write because the mmap write occurred during the DIO read to a different set of pages in memory. There is no "some" or "all" case here. But if the buffers get partially or completely written back in the meantime, then the DIO read could see that. Only if you can dirty them and flush them to disk while the direct read is waiting in the I/O queue (remember, the direct read flushes dirty cached data before being issued). Given that we don't lock the inode in the buffered I/O *writeback* path, we have to stop pages being dirtied in the page cache up front so we don't have mmap writeback over the top of the direct read. However unlikely it may be, that is what I'm talking about in my "some" or "all" cases. Note that I'm not talking about a specific implementation (eg. XFS I guess avoids "some"), but just the possible scenarios. Hence we have to prevent mmap from dirtying the same file offset we are doing direct reads on until the direct read has been issued. i.e. we need a barrier. So you need to eliminate the "some" case? Because of course "none" and "all" are unavoidable. IOWs, at a single point in time we have 2 different views of the one file which are both apparently valid and that is what we are trying to avoid. We have a coherency problem here which is solved by forcing the mmap write to reread the data off disk I don't see why the mmap write needs to reread data off disk. The data on disk won't get changed by the DIO read. No, but the data _in memory_ will, and now when the direct read completes it will contain data that is different to what is in the page cache. 
For direct I/O we define the correct data to be what is on disk, not what is in memory, so any time we bypass what is in memory, we need to ensure that we prevent the data being changed again in memory before we issue the disk I/O. But when you drop your locks, before the direct IO read returns, some guy can mmap and dirty the pagecache anyway. By the time the read returns, the data is stale. This obviously must be synchronised in userspace. As I said, you can't avoid "none" or "all", and you can't say that userspace will see the most uptodate copy of the data. All you can say is that it will be no older than when the syscall is first made. Which is what you get if you simply writeback but do not invalidate pagecache for direct IO reads. -- SUSE Labs, Novell Inc.
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote: > David Chinner wrote: > >On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote: > > > >>David Chinner wrote: > >> > >>>On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote: > >>But I'm just interested about DIO reads. I think you can get pretty > >>reasonable semantics without discarding pagecache, but the semantics > >>are weaker in one aspect. > >> > >>DIO read > >>1. writeback page > >>2. read from disk > >> > >>Now your read will pick up data no older than 1. And if a buffered > >>write happens after 2, then there is no problem either. > >> > >>So if you are doing a buffered write and DIO read concurrently, you > >>want synchronisation so the buffered write happens either before 1 > >>or after 2 -- the DIO read will see either all or none of the write. > >> > >>Supposing your pagecache isn't invalidated, then a buffered write > >>(from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2, > >>then the DIO read will find either none, some, or all of that write. > >> > >>So I guess what you are preventing is the "some" case. Am I right? > > > > > >No. The only thing that will happen here is that the direct read > >will see _none_ of the write because the mmap write occurred during > >the DIO read to a different set of pages in memory. There is no > >"some" or "all" case here. > > But if the buffers get partially or completely written back in the > meantime, then the DIO read could see that. Only if you can dirty them and flush them to disk while the direct read is waiting in the I/O queue (remember, the direct read flushes dirty cached data before being issued). Given that we don't lock the inode in the buffered I/O *writeback* path, we have to stop pages being dirtied in the page cache up front so we don't have mmap writeback over the top of the direct read. 
Hence we have to prevent mmap from dirtying the same file offset we are doing direct reads on until the direct read has been issued. i.e. we need a barrier. > >IOWs, at a single point in time we have 2 different views > >of the one file which are both apparently valid and that is what > >we are trying to avoid. We have a coherency problem here which is > >solved by forcing the mmap write to reread the data off disk > > I don't see why the mmap write needs to reread data off disk. The > data on disk won't get changed by the DIO read. No, but the data _in memory_ will, and now when the direct read completes it will contain data that is different to what is in the page cache. For direct I/O we define the correct data to be what is on disk, not what is in memory, so any time we bypass what is in memory, we need to ensure that we prevent the data being changed again in memory before we issue the disk I/O. > >Look at it this way - direct I/O in XFS implies an I/O barrier > >(similar to a memory barrier). Writing back and tossing out of the > >page cache at the start of the direct IO gives us an I/O coherency > >barrier - everything before the direct IO is sync'd to disk before > >the direct IO can proceed, and everything after the direct IO has > >started must be fetched from disk again. > > > >Because mmap I/O doesn't necessarily need I/O to change the state > >of a page (think of a read fault then a later write fault), to make > >the I/O barrier work correctly with mmap() we need to ensure that > >it will fault the page from disk again. We can only do that by > >unmapping the pages before tossing them from the page cache. > > OK, but direct IO *reads* do not conceptually invalidate pagecache > sitting on top of those blocks. Pagecache becomes invalid when the > page no longer represents the most uptodate copy of the data (eg. > in the case of a direct IO write). In theory, yes. 
In practice, if you don't invalidate the page cache you have no mechanism of synchronising mmap with direct I/O, and at that point you have no coherency model that you can work with in your filesystem. You have to be able to guarantee synchronisation between the different methods that can dirty data before you can give any guarantees about data coherency. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
David Chinner wrote: On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote: David Chinner wrote: On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote: ... so surely if you do a direct read followed by a buffered read, you should *not* get the same data if there has been some activity to modify that part of the file in the meantime (whether that be a buffered or direct write). Right. And that is what happens in XFS because it purges the caches on direct I/O and forces data to be re-read from disk. And that is critical for direct IO writes, of course. Effectively, if you are mixing direct I/O with other types of I/O (buffered or mmap) then the application really needs to be certain it is doing the right thing because there are races that can occur below the filesystem. All we care about in the filesystem is that what we cache is the same as what is on disk, and that implies that direct I/O needs to purge the cache regardless of the state it is in. Hence we need to unmap pages and use truncate semantics on them to ensure they are removed from the page cache. OK, I understand that this does need to happen (at least for writes), so you need to fix it regardless of the DIO read issue. But I'm just interested about DIO reads. I think you can get pretty reasonable semantics without discarding pagecache, but the semantics are weaker in one aspect. DIO read 1. writeback page 2. read from disk Now your read will pick up data no older than 1. And if a buffered write happens after 2, then there is no problem either. So if you are doing a buffered write and DIO read concurrently, you want synchronisation so the buffered write happens either before 1 or after 2 -- the DIO read will see either all or none of the write. Supposing your pagecache isn't invalidated, then a buffered write (from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2, then the DIO read will find either none, some, or all of that write. 
So I guess what you are preventing is the "some" case. Am I right? No. The only thing that will happen here is that the direct read will see _none_ of the write because the mmap write occurred during the DIO read to a different set of pages in memory. There is no "some" or "all" case here. But if the buffers get partially or completely written back in the meantime, then the DIO read could see that. IOWs, at a single point in time we have 2 different views of the one file which are both apparently valid and that is what we are trying to avoid. We have a coherency problem here which is solved by forcing the mmap write to reread the data off disk. I don't see why the mmap write needs to reread data off disk. The data on disk won't get changed by the DIO read. Look at it this way - direct I/O in XFS implies an I/O barrier (similar to a memory barrier). Writing back and tossing out of the page cache at the start of the direct IO gives us an I/O coherency barrier - everything before the direct IO is sync'd to disk before the direct IO can proceed, and everything after the direct IO has started must be fetched from disk again. Because mmap I/O doesn't necessarily need I/O to change the state of a page (think of a read fault then a later write fault), to make the I/O barrier work correctly with mmap() we need to ensure that it will fault the page from disk again. We can only do that by unmapping the pages before tossing them from the page cache. OK, but direct IO *reads* do not conceptually invalidate pagecache sitting on top of those blocks. Pagecache becomes invalid when the page no longer represents the most uptodate copy of the data (eg. in the case of a direct IO write). -- SUSE Labs, Novell Inc. 
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote: > David Chinner wrote: > >On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote: > > >>... so surely if you do a direct read followed by a buffered read, > >>you should *not* get the same data if there has been some activity > >>to modify that part of the file in the meantime (whether that be a > >>buffered or direct write). > > > > > >Right. And that is what happens in XFS because it purges the > >caches on direct I/O and forces data to be re-read from disk. > > And that is critical for direct IO writes, of course. > > >Effectively, if you are mixing direct I/O with other types of I/O > >(buffered or mmap) then the application really needs to be certain > >it is doing the right thing because there are races that can occur > >below the filesystem. All we care about in the filesystem is that > >what we cache is the same as what is on disk, and that implies that > >direct I/O needs to purge the cache regardless of the state it is in > > > >Hence we need to unmap pages and use truncate semantics on them to > >ensure they are removed from the page cache > > OK, I understand that this does need to happen (at least for writes), > so you need to fix it regardless of the DIO read issue. > > But I'm just interested about DIO reads. I think you can get pretty > reasonable semantics without discarding pagecache, but the semantics > are weaker in one aspect. > > DIO read > 1. writeback page > 2. read from disk > > Now your read will pick up data no older than 1. And if a buffered > write happens after 2, then there is no problem either. > > So if you are doing a buffered write and DIO read concurrently, you > want synchronisation so the buffered write happens either before 1 > or after 2 -- the DIO read will see either all or none of the write. 
> > Supposing your pagecache isn't invalidated, then a buffered write > (from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2, > then the DIO read will find either none, some, or all of that write. > > So I guess what you are preventing is the "some" case. Am I right? No. The only thing that will happen here is that the direct read will see _none_ of the write because the mmap write occurred during the DIO read to a different set of pages in memory. There is no "some" or "all" case here. IOWs, at a single point in time we have 2 different views of the one file which are both apparently valid and that is what we are trying to avoid. We have a coherency problem here which is solved by forcing the mmap write to reread the data off disk. Look at it this way - direct I/O in XFS implies an I/O barrier (similar to a memory barrier). Writing back and tossing out of the page cache at the start of the direct IO gives us an I/O coherency barrier - everything before the direct IO is sync'd to disk before the direct IO can proceed, and everything after the direct IO has started must be fetched from disk again. Because mmap I/O doesn't necessarily need I/O to change the state of a page (think of a read fault then a later write fault), to make the I/O barrier work correctly with mmap() we need to ensure that it will fault the page from disk again. We can only do that by unmapping the pages before tossing them from the page cache. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
David Chinner wrote: On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote: ... so surely if you do a direct read followed by a buffered read, you should *not* get the same data if there has been some activity to modify that part of the file in the meantime (whether that be a buffered or direct write). Right. And that is what happens in XFS because it purges the caches on direct I/O and forces data to be re-read from disk. And that is critical for direct IO writes, of course. Effectively, if you are mixing direct I/O with other types of I/O (buffered or mmap) then the application really needs to be certain it is doing the right thing because there are races that can occur below the filesystem. All we care about in the filesystem is that what we cache is the same as what is on disk, and that implies that direct I/O needs to purge the cache regardless of the state it is in. Hence we need to unmap pages and use truncate semantics on them to ensure they are removed from the page cache. OK, I understand that this does need to happen (at least for writes), so you need to fix it regardless of the DIO read issue. But I'm just interested about DIO reads. I think you can get pretty reasonable semantics without discarding pagecache, but the semantics are weaker in one aspect. DIO read 1. writeback page 2. read from disk Now your read will pick up data no older than 1. And if a buffered write happens after 2, then there is no problem either. So if you are doing a buffered write and DIO read concurrently, you want synchronisation so the buffered write happens either before 1 or after 2 -- the DIO read will see either all or none of the write. Supposing your pagecache isn't invalidated, then a buffered write (from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2, then the DIO read will find either none, some, or all of that write. So I guess what you are preventing is the "some" case. Am I right? -- SUSE Labs, Novell Inc. 
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote: > David Chinner wrote: > >On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote: > > >>And why not just leave it in the pagecache and be done with it? > > > > > >because what is in cache is then not coherent with what is on disk, > >and a direct read is supposed to read the data that is present > >in the file at the time it is issued. > > So after a writeout it will be coherent of course, so the point in > question is what happens when someone comes in and dirties it at the > worst possible moment? That relates to the paragraph below... > > >>All you need is to do a writeout before a direct IO read, which is > >>what generic dio code does. > > > > > >No, that's not good enough - after writeout but before the > >direct I/O read is issued a process can fault the page and dirty > >it. If you do a direct read, followed by a buffered read you should > >get the same data. The only way to guarantee this is to chuck out > >any cached pages across the range of the direct I/O so they are > >fetched again from disk on the next buffered I/O. i.e. coherent > >at the time the direct I/O is issued. > > ... so surely if you do a direct read followed by a buffered read, > you should *not* get the same data if there has been some activity > to modify that part of the file in the meantime (whether that be a > buffered or direct write). Right. And that is what happens in XFS because it purges the caches on direct I/O and forces data to be re-read from disk. Effectively, if you are mixing direct I/O with other types of I/O (buffered or mmap) then the application really needs to be certain it is doing the right thing because there are races that can occur below the filesystem. 
All we care about in the filesystem is that what we cache is the same as what is on disk, and that implies that direct I/O needs to purge the cache regardless of the state it is in. Hence we need to unmap pages and use truncate semantics on them to ensure they are removed from the page cache. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
David Chinner wrote: On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote: And why not just leave it in the pagecache and be done with it? because what is in cache is then not coherent with what is on disk, and a direct read is supposed to read the data that is present in the file at the time it is issued. So after a writeout it will be coherent of course, so the point in question is what happens when someone comes in and dirties it at the worst possible moment? That relates to the paragraph below... All you need is to do a writeout before a direct IO read, which is what generic dio code does. No, that's not good enough - after writeout but before the direct I/O read is issued a process can fault the page and dirty it. If you do a direct read, followed by a buffered read you should get the same data. The only way to guarantee this is to chuck out any cached pages across the range of the direct I/O so they are fetched again from disk on the next buffered I/O. i.e. coherent at the time the direct I/O is issued. ... so surely if you do a direct read followed by a buffered read, you should *not* get the same data if there has been some activity to modify that part of the file in the meantime (whether that be a buffered or direct write). but in that case you'll either have to live with some racyness (which is what the generic code does), or have a higher level synchronisation to prevent buffered + direct IO writes I suppose? The XFS inode iolock - direct I/O writes take it shared, buffered writes take it exclusive - so you can't do both at once. Buffered reads take it shared, which is another reason why we need to purge the cache on direct I/O writes - they can operate concurrently (and coherently) with buffered reads. Ah, I'm glad to see somebody cares about doing the right thing ;) Maybe I'll use XFS for my filesystems in future. -- SUSE Labs, Novell Inc. 
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
Peter Zijlstra wrote: On Thu, 2007-01-25 at 00:43 +1100, Nick Piggin wrote: Have you seen the new launder_page() a_op? called from invalidate_inode_pages2_range() It would have been nice to make that one into a more potentially useful generic callback. That can still be done when the need arises, right? Yeah I guess so. But why was it introduced, exactly? I can't tell from the code or the discussion why NFS couldn't start the IO, and signal the caller to wait_on_page_writeback and retry? That seemed to me like the conventional fix. to quote a bit: On Tue, 19 Dec 2006 18:19:38 -0500 Trond Myklebust <[EMAIL PROTECTED]> wrote: NFS: Fix race in nfs_release_page() invalidate_inode_pages2() may set the dirty bit on a page owing to the call to unmap_mapping_range() after the page was locked. In order to fix this, NFS has hooked the releasepage() method. This, however, leads to deadlocks in other parts of the VM. and: Now, arguably the VM shouldn't be calling try_to_release_page() with __GFP_FS when it's holding a lock on a page. But otoh, NFS should never be running lock_page() within nfs_release_page() against the page which was passed into nfs_release_page(). It'll deadlock for sure. The reason why it is happening is that the last dirty page from that inode gets cleaned, resulting in a call to dput(). OK but what's the problem with just failing to release the page if it is dirty, I wonder? In the worst case, page reclaim will just end up doing a writeout to clean it. -- SUSE Labs, Novell Inc.
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote: > Peter Zijlstra wrote: > >On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote: > > > >>With the recent changes to cancel_dirty_pages(), XFS will > >>dump warnings in the syslog because it can truncate_inode_pages() > >>on dirty mapped pages. > >> > >>I've determined that this is indeed correct behaviour for XFS > >>as this can happen in the case of races on mmap()d files with > >>direct I/O. In this case when we do a direct I/O read, we > >>flush the dirty pages to disk, then truncate them out of the > >>page cache. Unfortunately, between the flush and the truncate > >>the mmap could dirty the page again. At this point we toss a > >>dirty page that is mapped. > > > > > >This sounds iffy, why not just leave the page in the pagecache if its > >mapped anyway? > > And why not just leave it in the pagecache and be done with it? because what is in cache is then not coherent with what is on disk, and a direct read is supposed to read the data that is present in the file at the time it is issued. > All you need is to do a writeout before a direct IO read, which is > what generic dio code does. No, that's not good enough - after writeout but before the direct I/O read is issued a process can fault the page and dirty it. If you do a direct read, followed by a buffered read you should get the same data. The only way to guarantee this is to chuck out any cached pages across the range of the direct I/O so they are fetched again from disk on the next buffered I/O. i.e. coherent at the time the direct I/O is issued. > I guess you'll say that direct writes still need to remove pages, Yup. > but in that case you'll either have to live with some racyness > (which is what the generic code does), or have a higher level > synchronisation to prevent buffered + direct IO writes I suppose? The XFS inode iolock - direct I/O writes take it shared, buffered writes take it exclusive - so you can't do both at once. 
Buffered reads take it shared, which is another reason why we need to purge the cache on direct I/O writes - they can operate concurrently (and coherently) with buffered reads. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Wed, Jan 24, 2007 at 01:13:55PM +0100, Peter Zijlstra wrote: > On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote: > > With the recent changes to cancel_dirty_pages(), XFS will > > dump warnings in the syslog because it can truncate_inode_pages() > > on dirty mapped pages. > > > > I've determined that this is indeed correct behaviour for XFS > > as this can happen in the case of races on mmap()d files with > > direct I/O. In this case when we do a direct I/O read, we > > flush the dirty pages to disk, then truncate them out of the > > page cache. Unfortunately, between the flush and the truncate > > the mmap could dirty the page again. At this point we toss a > > dirty page that is mapped. > > This sounds iffy, why not just leave the page in the pagecache if its > mapped anyway? Because then fsx fails. > > None of the existing functions for truncating pages or invalidating > > pages work in this situation. Invalidating a page only works for > > non-dirty pages with non-dirty buffers, and they only work for > > whole pages and XFS requires partial page truncation. > > > > On top of that the page invalidation functions don't actually > > call into the filesystem to invalidate the page and so the filesystem > > can't actually invalidate the page properly (e.g. do stuff based on > > private buffer head flags). > > Have you seen the new launder_page() a_op? called from > invalidate_inode_pages2_range() No, but we can't use invalidate_inode_pages2_range() because it doesn't handle partial pages. I tried that first and it left warnings in the syslog and fsx failed. > > So that leaves us needing to use truncate semantics and the problem > > is that none of them unmap pages in a non-racy manner - if they > > unmap pages they do it separately to the truncate of the page, > > leading to races with mmap redirtying the page between the unmap and > > the truncate of the page. 
> Isn't there still a race where the page fault path doesn't yet lock the
> page and can just reinsert it?

Yes, but it's a tiny race compared to the other mechanisms available.

> Nick's pagefault rework should rid us of this by always locking the page
> in the fault path.

Yes, and that's what I'm relying on to fix the problem completely.
invalidate_inode_pages2_range() needs this fix as well to be race free,
so it's not like I'm introducing a new problem.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
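The race being closed here - flush, mmap redirties the page through a still-live mapping, then the truncate tosses the dirty page - can be sketched as a toy model. This is illustrative Python only; every name below is invented for the example and none of it is kernel code:

```python
# Toy model of the flush -> redirty -> truncate race. One "page" of
# cache sits over one block of "disk"; an mmap() mapping is live.
disk = {"block0": "old"}
cache = {"block0": {"data": "old", "dirty": False, "mapped": True}}

def flush(blk):
    """Write a dirty cached page back to disk (the direct read does this first)."""
    disk[blk] = cache[blk]["data"]
    cache[blk]["dirty"] = False

def mmap_store(blk, data):
    """A write fault through the live mapping dirties the page directly."""
    cache[blk]["data"] = data
    cache[blk]["dirty"] = True

def truncate(blk):
    """Toss the page out of the cache, dirty or not."""
    del cache[blk]

mmap_store("block0", "new")
flush("block0")                 # direct I/O read flushes dirty pages...
mmap_store("block0", "newer")   # ...mapping still live: page redirtied...
truncate("block0")              # ...then tossed: "newer" is silently lost

assert disk["block0"] == "new"  # the second store never reached disk
```

Unmapping the page before (or atomically with) the truncate closes the window, because the second store would then have to fault the page back in first - which is exactly what the proposed truncate-that-unmaps-under-the-page-lock provides.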
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, 2007-01-25 at 00:43 +1100, Nick Piggin wrote:
> > Have you seen the new launder_page() a_op? called from
> > invalidate_inode_pages2_range()
>
> It would have been nice to make that one into a more potentially
> useful generic callback.

That can still be done when the need arises, right?

> But why was it introduced, exactly? I can't tell from the code or
> the discussion why NFS couldn't start the IO, and signal the caller
> to wait_on_page_writeback and retry? That seemed to me like the
> conventional fix.

to quote a bit:

On Tue, 19 Dec 2006 18:19:38 -0500 Trond Myklebust <[EMAIL PROTECTED]> wrote:
> NFS: Fix race in nfs_release_page()
>
> invalidate_inode_pages2() may set the dirty bit on a page owing to the call
> to unmap_mapping_range() after the page was locked. In order to fix this,
> NFS has hooked the releasepage() method. This, however leads to deadlocks
> in other parts of the VM.

and:

> > Now, arguably the VM shouldn't be calling try_to_release_page() with
> > __GFP_FS when it's holding a lock on a page.
> >
> > But otoh, NFS should never be running lock_page() within nfs_release_page()
> > against the page which was passed into nfs_release_page(). It'll deadlock
> > for sure.
>
> The reason why it is happening is that the last dirty page from that
> inode gets cleaned, resulting in a call to dput().
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
Peter Zijlstra wrote:
> On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:
> > With the recent changes to cancel_dirty_pages(), XFS will
> > dump warnings in the syslog because it can truncate_inode_pages()
> > on dirty mapped pages.
> >
> > I've determined that this is indeed correct behaviour for XFS
> > as this can happen in the case of races on mmap()d files with
> > direct I/O. In this case when we do a direct I/O read, we
> > flush the dirty pages to disk, then truncate them out of the
> > page cache. Unfortunately, between the flush and the truncate
> > the mmap could dirty the page again. At this point we toss a
> > dirty page that is mapped.
>
> This sounds iffy, why not just leave the page in the pagecache if its
> mapped anyway?

And why not just leave it in the pagecache and be done with it? All you
need is to do a writeout before a direct IO read, which is what generic
dio code does.

I guess you'll say that direct writes still need to remove pages, but in
that case you'll either have to live with some raciness (which is what
the generic code does), or have a higher level synchronisation to
prevent buffered + direct IO writes I suppose?

> > None of the existing functions for truncating pages or invalidating
> > pages work in this situation. Invalidating a page only works for
> > non-dirty pages with non-dirty buffers, and they only work for
> > whole pages and XFS requires partial page truncation.
> >
> > On top of that the page invalidation functions don't actually
> > call into the filesystem to invalidate the page and so the filesystem
> > can't actually invalidate the page properly (e.g. do stuff based on
> > private buffer head flags).
>
> Have you seen the new launder_page() a_op? called from
> invalidate_inode_pages2_range()

It would have been nice to make that one into a more potentially
useful generic callback.

But why was it introduced, exactly? I can't tell from the code or
the discussion why NFS couldn't start the IO, and signal the caller
to wait_on_page_writeback and retry? That seemed to me like the
conventional fix.
--
SUSE Labs, Novell Inc.
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:
> With the recent changes to cancel_dirty_pages(), XFS will
> dump warnings in the syslog because it can truncate_inode_pages()
> on dirty mapped pages.
>
> I've determined that this is indeed correct behaviour for XFS
> as this can happen in the case of races on mmap()d files with
> direct I/O. In this case when we do a direct I/O read, we
> flush the dirty pages to disk, then truncate them out of the
> page cache. Unfortunately, between the flush and the truncate
> the mmap could dirty the page again. At this point we toss a
> dirty page that is mapped.

This sounds iffy, why not just leave the page in the pagecache if its
mapped anyway?

> None of the existing functions for truncating pages or invalidating
> pages work in this situation. Invalidating a page only works for
> non-dirty pages with non-dirty buffers, and they only work for
> whole pages and XFS requires partial page truncation.
>
> On top of that the page invalidation functions don't actually
> call into the filesystem to invalidate the page and so the filesystem
> can't actually invalidate the page properly (e.g. do stuff based on
> private buffer head flags).

Have you seen the new launder_page() a_op? called from
invalidate_inode_pages2_range()

> So that leaves us needing to use truncate semantics and the problem
> is that none of them unmap pages in a non-racy manner - if they
> unmap pages they do it separately to the truncate of the page,
> leading to races with mmap redirtying the page between the unmap and
> the truncate of the page.

Isn't there still a race where the page fault path doesn't yet lock the
page and can just reinsert it?

Nick's pagefault rework should rid us of this by always locking the page
in the fault path.

> Hence we need a truncate function that unmaps the pages while they
> are locked for truncate in a similar fashion to
> invalidate_inode_pages2_range().
> The following patch (unchanged from
> the last time it was sent) does this. The XFS changes are in a
> second patch.
>
> The patch has been tested on ia64 and x86-64 via XFSQA and a lot
> of fsx mixing mmap and direct I/O operations.
>
> Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Wed, 2007-01-24 at 09:37 +1100, David Chinner wrote:
> > > With the recent changes to cancel_dirty_pages(), XFS will
> > > dump warnings in the syslog because it can truncate_inode_pages()
> > > on dirty mapped pages.
> > >
> > > I've determined that this is indeed correct behaviour for XFS
> > > as this can happen in the case of races on mmap()d files with
> > > direct I/O. In this case when we do a direct I/O read, we
> > > flush the dirty pages to disk, then truncate them out of the
> > > page cache. Unfortunately, between the flush and the truncate
> > > the mmap could dirty the page again. At this point we toss a
> > > dirty page that is mapped.
> >
> > This sounds iffy, why not just leave the page in the pagecache if its
> > mapped anyway?
>
> And why not just leave it in the pagecache and be done with it?

because what is in cache is then not coherent with what is on disk,
and a direct read is supposed to read the data that is present in
the file at the time it is issued.

> All you need is to do a writeout before a direct IO read, which is
> what generic dio code does.

No, that's not good enough - after writeout but before the direct I/O
read is issued a process can fault the page and dirty it. If you do a
direct read, followed by a buffered read you should get the same data.
The only way to guarantee this is to chuck out any cached pages across
the range of the direct I/O so they are fetched again from disk on the
next buffered I/O. i.e. coherent at the time the direct I/O is issued.

> I guess you'll say that direct writes still need to remove pages,

Yup.

> but in that case you'll either have to live with some raciness
> (which is what the generic code does), or have a higher level
> synchronisation to prevent buffered + direct IO writes I suppose?

The XFS inode iolock - direct I/O writes take it shared, buffered
writes take it exclusive - so you can't do both at once.
Buffered reads take it shared, which is another reason why we need
to purge the cache on direct I/O writes - they can operate
concurrently (and coherently) with buffered reads.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
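The iolock rule described above can be sketched as a simple compatibility check - a model of the locking rule as stated in the thread, not actual XFS code: buffered writes take the lock exclusive, while direct I/O writes and buffered reads take it shared.

```python
# Model of the XFS inode iolock sharing rules described above.
SHARED, EXCL = "shared", "exclusive"

IOLOCK_MODE = {
    "buffered write": EXCL,    # exclusive: excludes everything else
    "direct write":   SHARED,  # shared: may overlap other shared holders
    "buffered read":  SHARED,
}

def can_run_concurrently(op_a, op_b):
    """Two operations may overlap only if both take the iolock shared."""
    return IOLOCK_MODE[op_a] == SHARED and IOLOCK_MODE[op_b] == SHARED

# Buffered and direct writes exclude each other, as Dave says:
assert not can_run_concurrently("buffered write", "direct write")
# Direct writes and buffered reads can run concurrently - which is
# precisely why the direct write must purge the cache to stay coherent:
assert can_run_concurrently("direct write", "buffered read")
```

The second assertion is the point of the paragraph above: because a buffered read may be in flight while a direct write proceeds, the only way the read can return the new data is if the cached copy has been purged.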
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
Peter Zijlstra wrote:
> On Thu, 2007-01-25 at 00:43 +1100, Nick Piggin wrote:
> > > Have you seen the new launder_page() a_op? called from
> > > invalidate_inode_pages2_range()
> >
> > It would have been nice to make that one into a more potentially
> > useful generic callback.
>
> That can still be done when the need arises, right?

Yeah I guess so.

> > But why was it introduced, exactly? I can't tell from the code or
> > the discussion why NFS couldn't start the IO, and signal the caller
> > to wait_on_page_writeback and retry? That seemed to me like the
> > conventional fix.
>
> to quote a bit:
>
> On Tue, 19 Dec 2006 18:19:38 -0500 Trond Myklebust <[EMAIL PROTECTED]> wrote:
> > NFS: Fix race in nfs_release_page()
> >
> > invalidate_inode_pages2() may set the dirty bit on a page owing to the call
> > to unmap_mapping_range() after the page was locked. In order to fix this,
> > NFS has hooked the releasepage() method. This, however leads to deadlocks
> > in other parts of the VM.
>
> and:
>
> > > Now, arguably the VM shouldn't be calling try_to_release_page() with
> > > __GFP_FS when it's holding a lock on a page.
> > >
> > > But otoh, NFS should never be running lock_page() within nfs_release_page()
> > > against the page which was passed into nfs_release_page(). It'll deadlock
> > > for sure.
> >
> > The reason why it is happening is that the last dirty page from that
> > inode gets cleaned, resulting in a call to dput().

OK but what's the problem with just failing to release the page if it
is dirty, I wonder? In the worst case, page reclaim will just end up
doing a writeout to clean it.

--
SUSE Labs, Novell Inc.
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
David Chinner wrote:
> On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote:
> > And why not just leave it in the pagecache and be done with it?
>
> because what is in cache is then not coherent with what is on disk,
> and a direct read is supposed to read the data that is present in
> the file at the time it is issued.

So after a writeout it will be coherent of course, so the point in
question is what happens when someone comes in and dirties it at the
worst possible moment? That relates to the paragraph below...

> > All you need is to do a writeout before a direct IO read, which is
> > what generic dio code does.
>
> No, that's not good enough - after writeout but before the direct I/O
> read is issued a process can fault the page and dirty it. If you do a
> direct read, followed by a buffered read you should get the same data.
> The only way to guarantee this is to chuck out any cached pages across
> the range of the direct I/O so they are fetched again from disk on the
> next buffered I/O. i.e. coherent at the time the direct I/O is issued.

... so surely if you do a direct read followed by a buffered read, you
should *not* get the same data if there has been some activity to
modify that part of the file in the meantime (whether that be a
buffered or direct write).

> > but in that case you'll either have to live with some raciness
> > (which is what the generic code does), or have a higher level
> > synchronisation to prevent buffered + direct IO writes I suppose?
>
> The XFS inode iolock - direct I/O writes take it shared, buffered
> writes take it exclusive - so you can't do both at once.
>
> Buffered reads take it shared, which is another reason why we need
> to purge the cache on direct I/O writes - they can operate
> concurrently (and coherently) with buffered reads.

Ah, I'm glad to see somebody cares about doing the right thing ;)
Maybe I'll use XFS for my filesystems in future.

--
SUSE Labs, Novell Inc.
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
> David Chinner wrote:
> > On Thu, Jan 25, 2007 at 12:43:23AM +1100, Nick Piggin wrote:
> > > And why not just leave it in the pagecache and be done with it?
> >
> > because what is in cache is then not coherent with what is on disk,
> > and a direct read is supposed to read the data that is present in
> > the file at the time it is issued.
>
> So after a writeout it will be coherent of course, so the point in
> question is what happens when someone comes in and dirties it at the
> worst possible moment? That relates to the paragraph below...
>
> > > All you need is to do a writeout before a direct IO read, which is
> > > what generic dio code does.
> >
> > No, that's not good enough - after writeout but before the direct
> > I/O read is issued a process can fault the page and dirty it. If you
> > do a direct read, followed by a buffered read you should get the
> > same data. The only way to guarantee this is to chuck out any cached
> > pages across the range of the direct I/O so they are fetched again
> > from disk on the next buffered I/O. i.e. coherent at the time the
> > direct I/O is issued.
>
> ... so surely if you do a direct read followed by a buffered read, you
> should *not* get the same data if there has been some activity to
> modify that part of the file in the meantime (whether that be a
> buffered or direct write).

Right. And that is what happens in XFS because it purges the caches
on direct I/O and forces data to be re-read from disk.

Effectively, if you are mixing direct I/O with other types of I/O
(buffered or mmap) then the application really needs to be certain it
is doing the right thing because there are races that can occur below
the filesystem. All we care about in the filesystem is that what we
cache is the same as what is on disk, and that implies that direct I/O
needs to purge the cache regardless of the state it is in.

Hence we need to unmap pages and use truncate semantics on them to
ensure they are removed from the page cache.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
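The purge-then-refetch behaviour described above can be sketched as a toy cache model - illustrative Python only, with invented names, standing in for the real page cache and disk:

```python
# Toy model: a direct write purges cached copies and goes straight to
# "disk", so any later buffered access must refetch from disk.
disk = {"b0": "v1"}
cache = {}

def buffered_read(blk):
    """Serve from cache, filling it from disk on a miss."""
    if blk not in cache:
        cache[blk] = disk[blk]
    return cache[blk]

def direct_write(blk, data):
    """The I/O barrier: toss cached pages, then write straight to disk."""
    cache.pop(blk, None)   # unmap + remove from cache (the patch's job)
    disk[blk] = data

buffered_read("b0")                  # warm the cache with "v1"
direct_write("b0", "v2")             # bypasses the cache entirely
assert buffered_read("b0") == "v2"   # cache was purged, so we refetch
```

Drop the `cache.pop()` line and the final read returns the stale "v1" - the two-views-of-one-file incoherency the thread is about.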
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
David Chinner wrote:
> On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
> > ... so surely if you do a direct read followed by a buffered read,
> > you should *not* get the same data if there has been some activity
> > to modify that part of the file in the meantime (whether that be a
> > buffered or direct write).
>
> Right. And that is what happens in XFS because it purges the caches
> on direct I/O and forces data to be re-read from disk.

And that is critical for direct IO writes, of course.

> Effectively, if you are mixing direct I/O with other types of I/O
> (buffered or mmap) then the application really needs to be certain it
> is doing the right thing because there are races that can occur below
> the filesystem. All we care about in the filesystem is that what we
> cache is the same as what is on disk, and that implies that direct
> I/O needs to purge the cache regardless of the state it is in.
>
> Hence we need to unmap pages and use truncate semantics on them to
> ensure they are removed from the page cache.

OK, I understand that this does need to happen (at least for writes),
so you need to fix it regardless of the DIO read issue. But I'm just
interested about DIO reads.

I think you can get pretty reasonable semantics without discarding
pagecache, but the semantics are weaker in one aspect.

DIO read:
1. writeback page
2. read from disk

Now your read will pick up data no older than 1. And if a buffered
write happens after 2, then there is no problem either. So if you are
doing a buffered write and DIO read concurrently, you want
synchronisation so the buffered write happens either before 1 or after
2 -- the DIO read will see either all or none of the write.

Supposing your pagecache isn't invalidated, then a buffered write
(from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
then the DIO read will find either none, some, or all of that write.
So I guess what you are preventing is the "some" case. Am I right?

--
SUSE Labs, Novell Inc.
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote:
> David Chinner wrote:
> > On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
> > > ... so surely if you do a direct read followed by a buffered read,
> > > you should *not* get the same data if there has been some activity
> > > to modify that part of the file in the meantime (whether that be a
> > > buffered or direct write).
> >
> > Right. And that is what happens in XFS because it purges the caches
> > on direct I/O and forces data to be re-read from disk.
>
> And that is critical for direct IO writes, of course.
>
> > Effectively, if you are mixing direct I/O with other types of I/O
> > (buffered or mmap) then the application really needs to be certain
> > it is doing the right thing because there are races that can occur
> > below the filesystem. All we care about in the filesystem is that
> > what we cache is the same as what is on disk, and that implies that
> > direct I/O needs to purge the cache regardless of the state it is in.
> >
> > Hence we need to unmap pages and use truncate semantics on them to
> > ensure they are removed from the page cache.
>
> OK, I understand that this does need to happen (at least for writes),
> so you need to fix it regardless of the DIO read issue. But I'm just
> interested about DIO reads.
>
> I think you can get pretty reasonable semantics without discarding
> pagecache, but the semantics are weaker in one aspect.
>
> DIO read:
> 1. writeback page
> 2. read from disk
>
> Now your read will pick up data no older than 1. And if a buffered
> write happens after 2, then there is no problem either. So if you are
> doing a buffered write and DIO read concurrently, you want
> synchronisation so the buffered write happens either before 1 or
> after 2 -- the DIO read will see either all or none of the write.
>
> Supposing your pagecache isn't invalidated, then a buffered write
> (from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
> then the DIO read will find either none, some, or all of that write.
> So I guess what you are preventing is the "some" case. Am I right?

No.
The only thing that will happen here is that the direct read will see
_none_ of the write because the mmap write occurred during the DIO
read to a different set of pages in memory. There is no "some" or
"all" case here.

IOWs, at a single point in time we have 2 different views of the one
file which are both apparently valid, and that is what we are trying
to avoid. We have a coherency problem here which is solved by forcing
the mmap write to reread the data off disk.

Look at it this way - direct I/O in XFS implies an I/O barrier
(similar to a memory barrier). Writing back and tossing out of the
page cache at the start of the direct I/O gives us an I/O coherency
barrier - everything before the direct I/O is sync'd to disk before
the direct I/O can proceed, and everything after the direct I/O has
started must be fetched from disk again.

Because mmap I/O doesn't necessarily need I/O to change the state of a
page (think of a read fault then a later write fault), to make the I/O
barrier work correctly with mmap() we need to ensure that it will
fault the page from disk again. We can only do that by unmapping the
pages before tossing them from the page cache.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
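Why the direct read sees _none_ of a concurrent mmap write can be sketched with the same kind of toy model - illustrative Python with invented names, not kernel code: the read snapshots the disk into its own buffer, while the mmap store only changes the (separate) cached page.

```python
# Toy model: a direct read and an mmap write touch different memory,
# so the read observes none of the write.
disk = {"b0": "before"}
cache = {}

def dio_read(blk):
    """Direct read: data comes from disk into the caller's own buffer."""
    return disk[blk]

def mmap_store(blk, data):
    """mmap write fault: dirties the cached page only; no I/O yet."""
    cache[blk] = data

snapshot = dio_read("b0")      # direct read issued...
mmap_store("b0", "after")      # ...mmap write lands in the page cache only

assert snapshot == "before"    # the read saw none of the write
assert cache["b0"] == "after"  # two views of one file until purge/writeback
```

The final two assertions together are exactly the "2 different views of the one file" coherency problem described above, which purging and unmapping the pages is meant to resolve.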
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
David Chinner wrote:
> On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote:
> > Supposing your pagecache isn't invalidated, then a buffered write
> > (from mmap, if XFS doesn't allow write(2)) comes in between 1 and 2,
> > then the DIO read will find either none, some, or all of that write.
> > So I guess what you are preventing is the "some" case. Am I right?
>
> No. The only thing that will happen here is that the direct read will
> see _none_ of the write because the mmap write occurred during the
> DIO read to a different set of pages in memory. There is no "some" or
> "all" case here.

But if the buffers get partially or completely written back in the
meantime, then the DIO read could see that.

> IOWs, at a single point in time we have 2 different views of the one
> file which are both apparently valid, and that is what we are trying
> to avoid. We have a coherency problem here which is solved by forcing
> the mmap write to reread the data off disk.

I don't see why the mmap write needs to reread data off disk. The data
on disk won't get changed by the DIO read.

> Look at it this way - direct I/O in XFS implies an I/O barrier
> (similar to a memory barrier). Writing back and tossing out of the
> page cache at the start of the direct I/O gives us an I/O coherency
> barrier - everything before the direct I/O is sync'd to disk before
> the direct I/O can proceed, and everything after the direct I/O has
> started must be fetched from disk again.
>
> Because mmap I/O doesn't necessarily need I/O to change the state of
> a page (think of a read fault then a later write fault), to make the
> I/O barrier work correctly with mmap() we need to ensure that it will
> fault the page from disk again. We can only do that by unmapping the
> pages before tossing them from the page cache.

OK, but direct IO *reads* do not conceptually invalidate pagecache
sitting on top of those blocks. Pagecache becomes invalid when the
page no longer represents the most uptodate copy of the data (eg. in
the case of a direct IO write).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:
> David Chinner wrote:
>> On Thu, Jan 25, 2007 at 11:47:24AM +1100, Nick Piggin wrote:
>>> David Chinner wrote:
>>>> On Thu, Jan 25, 2007 at 11:12:41AM +1100, Nick Piggin wrote:
>>>
>>> But I'm just interested about DIO reads. I think you can get pretty
>>> reasonable semantics without discarding pagecache, but the
>>> semantics are weaker in one aspect.
>>>
>>> DIO read
>>> 1. writeback page
>>> 2. read from disk
>>>
>>> Now your read will pick up data no older than 1. And if a buffered
>>> write happens after 2, then there is no problem either. So if you
>>> are doing a buffered write and DIO read concurrently, you want
>>> synchronisation so the buffered write happens either before 1 or
>>> after 2 -- the DIO read will see either all or none of the write.
>>>
>>> Supposing your pagecache isn't invalidated, then if a buffered
>>> write (from mmap, if XFS doesn't allow write(2)) comes in between
>>> 1 and 2, the DIO read will find either none, some, or all of that
>>> write.
>>>
>>> So I guess what you are preventing is the "some" case. Am I right?
>>
>> No. The only thing that will happen here is that the direct read
>> will see _none_ of the write because the mmap write occurred during
>> the DIO read to a different set of pages in memory. There is no
>> "some" or "all" case here.
>
> But if the buffers get partially or completely written back in the
> meantime, then the DIO read could see that.

Only if you can dirty them and flush them to disk while the direct
read is waiting in the I/O queue (remember, the direct read flushes
dirty cached data before being issued). Given that we don't lock the
inode in the buffered I/O *writeback* path, we have to stop pages
being dirtied in the page cache up front so we don't have mmap
writeback over the top of the direct read.

Hence we have to prevent mmap from dirtying the same file offset we
are doing direct reads on until the direct read has been issued.

i.e. we need a barrier.

>> IOWs, at a single point in time we have 2 different views of the
>> one file which are both apparently valid, and that is what we are
>> trying to avoid. We have a coherency problem here which is solved
>> by forcing the mmap write to reread the data off disk.
>
> I don't see why the mmap write needs to reread data off disk. The
> data on disk won't get changed by the DIO read.

No, but the data _in memory_ will, and now when the direct read
completes it will contain data that is different to what is in the
page cache. For direct I/O we define the correct data to be what is
on disk, not what is in memory, so any time we bypass what is in
memory, we need to ensure that we prevent the data being changed
again in memory before we issue the disk I/O.

>> Look at it this way - direct I/O in XFS implies an I/O barrier
>> (similar to a memory barrier). Writing back and tossing out of the
>> page cache at the start of the direct IO gives us an I/O coherency
>> barrier - everything before the direct IO is sync'd to disk before
>> the direct IO can proceed, and everything after the direct IO has
>> started must be fetched from disk again.
>>
>> Because mmap I/O doesn't necessarily need I/O to change the state
>> of a page (think of a read fault then a later write fault), to make
>> the I/O barrier work correctly with mmap() we need to ensure that
>> it will fault the page from disk again. We can only do that by
>> unmapping the pages before tossing them from the page cache.
>
> OK, but direct IO *reads* do not conceptually invalidate pagecache
> sitting on top of those blocks. Pagecache becomes invalid when the
> page no longer represents the most uptodate copy of the data (eg. in
> the case of a direct IO write).

In theory, yes. In practice, if you don't invalidate the page cache
you have no mechanism for synchronising mmap with direct I/O, and at
that point you have no coherency model that you can work with in your
filesystem.

You have to be able to guarantee synchronisation between the different
methods that can dirty data before you can give any guarantees about
data coherency.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
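Dave's "I/O barrier" argument above can be condensed into a sketch of the
direct-read entry path. This is pseudocode with illustrative helper names,
not XFS's actual call chain:

```
dio_read(inode, off, len)
{
	lock(inode);                       /* exclude buffered writers     */
	write_back_dirty_pages(off, len);  /* memory -> disk: step 1       */
	unmap_pages(off, len);             /* mmap must write-fault again  */
	toss_pages(off, len);              /* cache must be re-read        */
	unlock(inode);
	issue_disk_read(off, len);         /* step 2: disk is the truth    */
}
```

The unmap step is the part that makes the barrier work for mmap: without it,
an already-established writable pte can redirty the page with no I/O and no
fault, silently crossing the barrier.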
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
David Chinner wrote:
> On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:
>> David Chinner wrote:
>>> No. The only thing that will happen here is that the direct read
>>> will see _none_ of the write because the mmap write occurred during
>>> the DIO read to a different set of pages in memory. There is no
>>> "some" or "all" case here.
>>
>> But if the buffers get partially or completely written back in the
>> meantime, then the DIO read could see that.
>
> Only if you can dirty them and flush them to disk while the direct
> read is waiting in the I/O queue (remember, the direct read flushes
> dirty cached data before being issued). Given that we don't lock the
> inode in the buffered I/O *writeback* path, we have to stop pages
> being dirtied in the page cache up front so we don't have mmap
> writeback over the top of the direct read.

However unlikely it may be, that is what I'm talking about in my "some"
or "all" cases. Note that I'm not talking about a specific
implementation (eg. XFS I guess avoids "some"), but just the possible
scenarios.

> Hence we have to prevent mmap from dirtying the same file offset we
> are doing direct reads on until the direct read has been issued.
>
> i.e. we need a barrier.

So you need to eliminate the "some" case? Because of course "none" and
"all" are unavoidable.

>>> IOWs, at a single point in time we have 2 different views of the
>>> one file which are both apparently valid, and that is what we are
>>> trying to avoid. We have a coherency problem here which is solved
>>> by forcing the mmap write to reread the data off disk.
>>
>> I don't see why the mmap write needs to reread data off disk. The
>> data on disk won't get changed by the DIO read.
>
> No, but the data _in memory_ will, and now when the direct read
> completes it will contain data that is different to what is in the
> page cache. For direct I/O we define the correct data to be what is
> on disk, not what is in memory, so any time we bypass what is in
> memory, we need to ensure that we prevent the data being changed
> again in memory before we issue the disk I/O.

But when you drop your locks, before the direct IO read returns, some
guy can mmap and dirty the pagecache anyway. By the time the read
returns, the data is stale. This obviously must be synchronised in
userspace.

As I said, you can't avoid "none" or "all", and you can't say that
userspace will see the most uptodate copy of the data. All you can say
is that it will be no older than when the syscall is first made. Which
is what you get if you simply writeback but do not invalidate pagecache
for direct IO reads.

--
SUSE Labs, Novell Inc.
Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
On Thu, Jan 25, 2007 at 03:25:29PM +1100, Nick Piggin wrote:
> David Chinner wrote:
>> On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:
>>> David Chinner wrote:
>>>> No. The only thing that will happen here is that the direct read
>>>> will see _none_ of the write because the mmap write occurred
>>>> during the DIO read to a different set of pages in memory. There
>>>> is no "some" or "all" case here.
>>>
>>> But if the buffers get partially or completely written back in the
>>> meantime, then the DIO read could see that.
>>
>> Only if you can dirty them and flush them to disk while the direct
>> read is waiting in the I/O queue (remember, the direct read flushes
>> dirty cached data before being issued). Given that we don't lock
>> the inode in the buffered I/O *writeback* path, we have to stop
>> pages being dirtied in the page cache up front so we don't have
>> mmap writeback over the top of the direct read.
>
> However unlikely it may be, that is what I'm talking about in my
> "some" or "all" cases. Note that I'm not talking about a specific
> implementation (eg. XFS I guess avoids "some"), but just the
> possible scenarios.

Unfortunately, behaviour is different for different filesystems. I can
only answer for XFS, which is different to most of the other
filesystems in both locking and the way it treats the page cache.
IOWs, if you want to talk about details, then we have to talk about
specific implementations.

>> Hence we have to prevent mmap from dirtying the same file offset we
>> are doing direct reads on until the direct read has been issued.
>>
>> i.e. we need a barrier.
>
> So you need to eliminate the "some" case? Because of course "none"
> and "all" are unavoidable.

"all" is avoidable, too, once you've kicked the pages out of the page
cache - you just have to block the buffered read that refaulting the
page will cause until the direct I/O completes.

>>>> IOWs, at a single point in time we have 2 different views of the
>>>> one file which are both apparently valid, and that is what we are
>>>> trying to avoid. We have a coherency problem here which is solved
>>>> by forcing the mmap write to reread the data off disk.
>>>
>>> I don't see why the mmap write needs to reread data off disk. The
>>> data on disk won't get changed by the DIO read.
>>
>> No, but the data _in memory_ will, and now when the direct read
>> completes it will contain data that is different to what is in the
>> page cache. For direct I/O we define the correct data to be what is
>> on disk, not what is in memory, so any time we bypass what is in
>> memory, we need to ensure that we prevent the data being changed
>> again in memory before we issue the disk I/O.
>
> But when you drop your locks, before the direct IO read returns,
> some guy can mmap and dirty the pagecache anyway. By the time the
> read returns, the data is stale.

Only if we leave the page in the page cache. If we toss the page, the
time it takes to do the I/O for the page fault is enough for the
direct I/O to complete. Sure, it's not an absolute guarantee, but if
you want an absolute guarantee:

> This obviously must be synchronised in userspace.

As I said earlier.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
[PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS
With the recent changes to cancel_dirty_pages(), XFS will dump warnings
in the syslog because it can truncate_inode_pages() on dirty mapped
pages. I've determined that this is indeed correct behaviour for XFS,
as it can happen in the case of races on mmap()d files with direct
I/O. In this case, when we do a direct I/O read, we flush the dirty
pages to disk, then truncate them out of the page cache. Unfortunately,
between the flush and the truncate the mmap could dirty the page again.
At this point we toss a dirty page that is mapped.

None of the existing functions for truncating pages or invalidating
pages work in this situation. Invalidating a page only works for
non-dirty pages with non-dirty buffers, and it only works on whole
pages, whereas XFS requires partial page truncation. On top of that,
the page invalidation functions don't actually call into the
filesystem to invalidate the page, so the filesystem can't invalidate
the page properly (e.g. do stuff based on private buffer head flags).

So that leaves us needing to use truncate semantics, and the problem is
that none of the truncate functions unmap pages in a non-racy manner -
if they unmap pages at all, they do it separately to the truncate of
the page, leading to races with mmap redirtying the page between the
unmap and the truncate of the page. Hence we need a truncate function
that unmaps the pages while they are locked for truncate, in a similar
fashion to invalidate_inode_pages2_range().

The following patch (unchanged from the last time it was sent) does
this. The XFS changes are in a second patch. The patch has been tested
on ia64 and x86-64 via XFSQA and a lot of fsx mixing mmap and direct
I/O operations.
Signed-off-by: Dave Chinner <[EMAIL PROTECTED]>

---
Index: 2.6.x-xfs-new/include/linux/mm.h
===================================================================
--- 2.6.x-xfs-new.orig/include/linux/mm.h	2007-01-15 15:09:57.000000000 +1100
+++ 2.6.x-xfs-new/include/linux/mm.h	2007-01-16 08:59:24.031897743 +1100
@@ -1060,6 +1060,8 @@ extern unsigned long page_unuse(struct p
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
+extern void truncate_unmap_inode_pages_range(struct address_space *,
+				       loff_t lstart, loff_t lend, int unmap);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int *);
Index: 2.6.x-xfs-new/mm/truncate.c
===================================================================
--- 2.6.x-xfs-new.orig/mm/truncate.c	2007-01-16 08:59:23.947908876 +1100
+++ 2.6.x-xfs-new/mm/truncate.c	2007-01-16 09:35:53.102924243 +1100
@@ -59,7 +59,7 @@ void cancel_dirty_page(struct page *page
 		WARN_ON(++warncount < 5);
 	}
-	
+
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
@@ -122,16 +122,34 @@ invalidate_complete_page(struct address_
 	return ret;
 }
 
+/*
+ * This is a helper for truncate_unmap_inode_page. Unmap the page we
+ * are passed. Page must be locked by the caller.
+ */
+static void
+unmap_single_page(struct address_space *mapping, struct page *page)
+{
+	BUG_ON(!PageLocked(page));
+	while (page_mapped(page)) {
+		unmap_mapping_range(mapping,
+				    (loff_t)page->index << PAGE_CACHE_SHIFT,
+				    PAGE_CACHE_SIZE, 0);
+	}
+}
+
 /**
- * truncate_inode_pages - truncate range of pages specified by start and
- * end byte offsets
+ * truncate_unmap_inode_pages_range - truncate range of pages specified by
+ *	start and end byte offsets and optionally unmap them first.
  * @mapping: mapping to truncate
  * @lstart: offset from which to truncate
  * @lend: offset to which to truncate
+ * @unmap: unmap whole truncated pages if non-zero
  *
  * Truncate the page cache, removing the pages that are between
  * specified offsets (and zeroing out partial page
- * (if lstart is not page aligned)).
+ * (if lstart is not page aligned)). If specified, unmap the pages
+ * before they are removed.
  *
  * Truncate takes two passes - the first pass is nonblocking. It will not
  * block on page locks and it will not block on writeback. The second pass
@@ -146,8 +164,8 @@ invalidate_complete_page(struct address_
  * mapping is large, it is probably the case that the final pages are the most
  * recently touched, and freeing happens in ascending file offset order.
  */
-void truncate_inode_pages_range(struct address_space *mapping,
-		loff_t lstart, loff_t lend)
+void truncate_unmap_inode_pages_range(struct address_space *mapping,
+		loff_t lstart, loff_t lend, int unmap)
 {
 	const pgoff_t