Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> On Tue, 13 Nov 2007, Erez Zadok wrote:
[...]
> I'm glad to report that this unionfs, not the one in 2.6.24-rc2-mm1
> but the one including those 9 patches you posted, now gets through
> my testing with tmpfs without a problem.  I do still get occasional
>   "unionfs: new lower inode mtime (bindex=0, name=<directory>)"
> messages, but nothing worse seen yet: a big improvement.

Excellent.

> I did think you could clean up the doubled set_page_dirtys,
> but it's of no consequence.

Yes, looks good.  I'll send that as a patch.  Thanks.

> Hugh
>
> --- 2.6.24-rc2-mm1+9/fs/unionfs/mmap.c	2007-11-17 12:23:30.0 +
> +++ linux/fs/unionfs/mmap.c	2007-11-17 20:22:29.0 +
> @@ -56,6 +56,7 @@ static int unionfs_writepage(struct page
>  	copy_highpage(lower_page, page);
>  	flush_dcache_page(lower_page);
>  	SetPageUptodate(lower_page);
> +	set_page_dirty(lower_page);
>
>  	/*
>  	 * Call lower writepage (expects locked page).  However, if we are
> @@ -66,12 +67,11 @@ static int unionfs_writepage(struct page
>  	 * success.
>  	 */
>  	if (wbc->for_reclaim) {
> -		set_page_dirty(lower_page);
>  		unlock_page(lower_page);
>  		goto out_release;
>  	}
> +
>  	BUG_ON(!lower_mapping->a_ops->writepage);
> -	set_page_dirty(lower_page);
>  	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
>  	err = lower_mapping->a_ops->writepage(lower_page, wbc);
>  	if (err < 0)

Erez.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
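[Editorial note: the control flow of the patched unionfs_writepage above can be sketched as a small userspace simulation.  The sim_* names and struct are hypothetical stand-ins; the real kernel helpers do far more (atomic page flags, radix-tree tagging, per-BDI accounting).  This is only a sketch of the ordering the diff establishes: one unconditional set_page_dirty() after the copy, instead of two conditional ones.]

```c
#include <assert.h>

/* Minimal stand-in for the lower page's state; the real struct page
 * keeps these as atomic bit flags. */
struct sim_page {
	int uptodate;
	int dirty;
	int locked;
	int written;	/* how many times the "disk" saw this page */
};

static void sim_set_page_dirty(struct sim_page *p) { p->dirty = 1; }
static void sim_unlock_page(struct sim_page *p)    { p->locked = 0; }

/* Like clear_page_dirty_for_io(): report whether the page was dirty,
 * clearing the flag so the writeback path now owns the page. */
static int sim_clear_page_dirty_for_io(struct sim_page *p)
{
	int was_dirty = p->dirty;
	p->dirty = 0;
	return was_dirty;
}

static int sim_lower_writepage(struct sim_page *p)
{
	p->written++;
	sim_unlock_page(p);	/* ->writepage unlocks the page */
	return 0;
}

/* Mirrors the patched flow: dirty the lower page once after the copy,
 * then either leave it dirty for later flushing (for_reclaim) or
 * emulate the VFS by clearing the dirty bit and writing it out now. */
static int sim_unionfs_writepage(struct sim_page *lower_page, int for_reclaim)
{
	lower_page->locked = 1;		/* find_or_create_page returns it locked */
	lower_page->uptodate = 1;	/* after copy_highpage + flush_dcache_page */
	sim_set_page_dirty(lower_page);

	if (for_reclaim) {
		sim_unlock_page(lower_page);
		return 0;		/* still dirty: the lower fs flushes it later */
	}
	sim_clear_page_dirty_for_io(lower_page);
	return sim_lower_writepage(lower_page);
}
```

In the for_reclaim case the lower page ends up unlocked but still dirty, deferring the real write; otherwise the dirty bit is handed to the writeback path exactly once, which is why the two earlier set_page_dirty() calls could collapse into the single hoisted one.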
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Tue, 13 Nov 2007, Erez Zadok wrote:
>
> I posted all of these patches just now.  You're CC'ed.  Hopefully Andrew
> can pull from my unionfs.git branch soon.
>
> You also reported in your previous emails some hangs/oopses while doing
> make -j 20 in unionfs on top of a single tmpfs, using -mm.  After several
> days, I've not been able to reproduce this w/ my latest set of patches.
> If you can send me your .config and the specs on the h/w you're using
> (cpus, mem, etc.), I'll see if I can find something similar to it on my
> end and run the same tests.

I'm glad to report that this unionfs, not the one in 2.6.24-rc2-mm1
but the one including those 9 patches you posted, now gets through
my testing with tmpfs without a problem.  I do still get occasional
  "unionfs: new lower inode mtime (bindex=0, name=<directory>)"
messages, but nothing worse seen yet: a big improvement.

I deceived myself for a while that the danger of shmem_writepage
hitting its BUG_ON(entry->val) was dealt with too; but that's wrong,
I must go back to working out an escape from that one (despite never
seeing it).

I did think you could clean up the doubled set_page_dirtys,
but it's of no consequence.

Hugh

--- 2.6.24-rc2-mm1+9/fs/unionfs/mmap.c	2007-11-17 12:23:30.0 +
+++ linux/fs/unionfs/mmap.c	2007-11-17 20:22:29.0 +
@@ -56,6 +56,7 @@ static int unionfs_writepage(struct page
 	copy_highpage(lower_page, page);
 	flush_dcache_page(lower_page);
 	SetPageUptodate(lower_page);
+	set_page_dirty(lower_page);
 
 	/*
 	 * Call lower writepage (expects locked page).  However, if we are
@@ -66,12 +67,11 @@ static int unionfs_writepage(struct page
 	 * success.
 	 */
 	if (wbc->for_reclaim) {
-		set_page_dirty(lower_page);
 		unlock_page(lower_page);
 		goto out_release;
 	}
+
 	BUG_ON(!lower_mapping->a_ops->writepage);
-	set_page_dirty(lower_page);
 	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
 	err = lower_mapping->a_ops->writepage(lower_page, wbc);
 	if (err < 0)
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> On Fri, 9 Nov 2007, Erez Zadok wrote:
> > In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> >
> > > Three, I believe you need to add a flush_dcache_page(lower_page)
> > > after the copy_highpage(lower_page): some architectures will need
> > > that to see the new data if they have lower_page mapped (though I
> > > expect it's anyway shaky ground to be accessing through the lower
> > > mount at the same time as modifying through the upper).
> >
> > OK.
>
> While looking into something else entirely, I realize that _here_
> you are missing a SetPageUptodate(lower_page): should go in after
> the flush_dcache_page(lower_page) I'm suggesting.  (Nick would argue
> for some kind of barrier there too, but I don't think unionfs has a
> special need to be ahead of the pack on that issue.)
>
> Think about it:
> when find_or_create_page has created a fresh page in the cache,
> and you've just done copy_highpage to put the data into it, you
> now need to mark it as Uptodate: otherwise a subsequent vfs_read
> or whatever on the lower level will find that page !Uptodate and
> read stale data back from disk instead of what you just copied in,
> unless its dirtiness has got it written back to disk meanwhile.

Hehe.  Funny you mention this...  A few days ago, while I was doing your
other recommended pageuptodate cleanups, I also added the same call to
SetPageUptodate(lower_page) as you suggested.  I tested that change along
w/ the other changes you suggested, and they all seem to work great all
the way from my 2.6.9 backport to 2.6.24-rc2 and -mm (modulo the fact
that I had to work around or fix more non-unionfs bugs in -mm than
unionfs ones to get it to work :-)

I posted all of these patches just now.  You're CC'ed.  Hopefully Andrew
can pull from my unionfs.git branch soon.

You also reported in your previous emails some hangs/oopses while doing
make -j 20 in unionfs on top of a single tmpfs, using -mm.  After several
days, I've not been able to reproduce this w/ my latest set of patches.
If you can send me your .config and the specs on the h/w you're using
(cpus, mem, etc.), I'll see if I can find something similar to it on my
end and run the same tests.

Cheers,
Erez.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Fri, 9 Nov 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
>
> > Three, I believe you need to add a flush_dcache_page(lower_page)
> > after the copy_highpage(lower_page): some architectures will need
> > that to see the new data if they have lower_page mapped (though I
> > expect it's anyway shaky ground to be accessing through the lower
> > mount at the same time as modifying through the upper).
>
> OK.

While looking into something else entirely, I realize that _here_
you are missing a SetPageUptodate(lower_page): should go in after
the flush_dcache_page(lower_page) I'm suggesting.  (Nick would argue
for some kind of barrier there too, but I don't think unionfs has a
special need to be ahead of the pack on that issue.)

Think about it: when find_or_create_page has created a fresh page in
the cache, and you've just done copy_highpage to put the data into it,
you now need to mark it as Uptodate: otherwise a subsequent vfs_read
or whatever on the lower level will find that page !Uptodate and read
stale data back from disk instead of what you just copied in, unless
its dirtiness has got it written back to disk meanwhile.

Odd that that hasn't been noticed at all: I guess it may be hard to
get testing to reclaim lower/upper pages in such a way as to test out
such paths thoroughly.

Hugh
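[Editorial note: Hugh's stale-data scenario can be made concrete with a toy page cache in userspace C.  The toy_* names are hypothetical; the real logic lives in mm/filemap.c.  A freshly created cache page starts !Uptodate, so a later read refills it from backing store unless whoever copied data in also marked it Uptodate.]

```c
#include <assert.h>
#include <string.h>

struct toy_page {
	char data[16];
	int uptodate;
};

/* What a read path does with a cache page: if it is not Uptodate,
 * re-read it from backing store (`disk`), clobbering the cached copy. */
static const char *toy_read(struct toy_page *p, const char *disk)
{
	if (!p->uptodate) {
		strcpy(p->data, disk);	/* stale on-disk contents win */
		p->uptodate = 1;
	}
	return p->data;
}

/* Copy new data into a fresh (find_or_create_page-style) cache page.
 * `mark_uptodate` models whether SetPageUptodate() is called after the
 * copy_highpage()/flush_dcache_page() pair. */
static void toy_copy_in(struct toy_page *p, const char *src, int mark_uptodate)
{
	strcpy(p->data, src);
	if (mark_uptodate)
		p->uptodate = 1;
}
```

Without the SetPageUptodate, the copied-in data silently loses to whatever was on disk the next time anyone reads through the lower mapping, which is exactly the failure mode Hugh describes.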
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Fri, 9 Nov 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> >
> > One, I think you would be safer to do a set_page_dirty(lower_page)
> > before your clear_page_dirty_for_io(lower_page).  I know that sounds
> > silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
> > there's a lot of subtlety hereabouts, and I think you'd be mimicking the
> > usual path closer if you set_page_dirty first - there's nothing else
> > doing it on that lower_page, is there?  I'm not certain that you need
> > to, but I think you'd do well to look into it and make up your own mind.
>
> Hugh, my code looks like:
>
> 	if (wbc->for_reclaim) {
> 		set_page_dirty(lower_page);
> 		unlock_page(lower_page);
> 		goto out_release;
> 	}
> 	BUG_ON(!lower_mapping->a_ops->writepage);
> 	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
> 	err = lower_mapping->a_ops->writepage(lower_page, wbc);
>
> Do you mean I should set_page_dirty(lower_page) unconditionally before
> clear_page_dirty_for_io?  (I already do that in the 'if' statement
> above it.)

Yes.  Whether you're wrong not to be doing that already, I've not
checked; but I think doing so will make unionfs safer against any future
changes in the relationship between set_page_dirty and
clear_page_dirty_for_io.

For example, look at clear_page_dirty_for_io: it's decrementing some
statistics which __set_page_dirty_nobuffers increments.  Does use of
unionfs (over some filesystems) make those numbers wrap increasingly
negative?  Does adding this set_page_dirty(lower_page) correct that?
I suspect so, but may be wrong.

> > Two, I'm unsure of the way you're clearing or setting PageUptodate on
> > the upper page there.  The rules for PageUptodate are fairly obvious
> > when reading, but when a write fails, it's not so obvious.  Again, I'm
> > not saying what you've got is wrong (it may be unavoidable, to keep
> > synch between lower and upper), but it deserves a second thought.
>
> I looked at all mainline filesystems' ->writepage to see what, if any,
> they do with their page's uptodate flag.  Most f/s don't touch the
> flag one way or another.

I'm not going to try and guess what assorted filesystems are up to, and
not all of them will be bugfree.  The crucial point of PageUptodate is
that we insert a filesystem page into the page cache before it's had any
data read in: it needs to be !PageUptodate until the data is there, and
then marked PageUptodate to say the data is good and others can start
using it.  See mm/filemap.c.

> And finally, unionfs clears the uptodate flag on error from the lower
> ->writepage, and otherwise sets the flag on success from the lower
> ->writepage.  My gut feeling is that unionfs shouldn't change the page
> uptodate flag at all: if the VFS passes unionfs_writepage a page which
> isn't uptodate, then the VFS has a serious problem b/c it'd be asking a
> f/s to write out a page which isn't up-to-date, right?  Otherwise,
> whether unionfs_writepage manages to write the lower page or not, why
> should that invalidate the state of the unionfs page itself?  Come to
> think of it, I think clearing pageuptodate on error from
> ->writepage(lower_page) may be bad.  Imagine if after such a failed
> unionfs_writepage, I get a unionfs_readpage: that ->readpage will get
> data from the lower f/s page and copy it *over* the unionfs page, even
> if the upper page's data was more recent prior to the failed call to
> unionfs_writepage.  IOW, we could be reverting a user-visible mmap'ed
> page to a previous on-disk version.  What do you think: could this
> happen?  Anyway, I'll run some exhaustive testing next and see what
> happens if I don't set/clear the uptodate flag in unionfs_writepage.

That was my point, and I don't really have more to add.  It's unusual
to do anything with PageUptodate when writing.  By clearing it when the
lower level has an error, you're throwing away the changes already made
at the upper level.  You might have some good reason for that, but it's
worth questioning.  If you don't know why you're Set/Clear'ing it there,
better to just take that out.

Hugh
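[Editorial note: Hugh's accounting worry can be illustrated with a toy counter.  This is a hypothetical userspace sketch; the real code adjusts NR_FILE_DIRTY and per-BDI statistics inside __set_page_dirty_nobuffers and clear_page_dirty_for_io.  The point is that clear_page_dirty_for_io decrements a statistic that only an accounted set_page_dirty increments, so a page dirtied behind the accounting's back drives the counter negative.]

```c
#include <assert.h>

/* Global dirty-page statistic, maintained by the set/clear pair below. */
static long nr_file_dirty;

struct acct_page { int dirty; };

/* Raw flag set with no accounting (a TestSetPageDirty-style shortcut). */
static void raw_set_page_dirty_flag(struct acct_page *p) { p->dirty = 1; }

/* __set_page_dirty_nobuffers-style: only a 0->1 transition bumps the stat. */
static void acct_set_page_dirty(struct acct_page *p)
{
	if (!p->dirty) {
		p->dirty = 1;
		nr_file_dirty++;
	}
}

/* clear_page_dirty_for_io-style: a 1->0 transition drops the stat.  If
 * the flag was set without the matching increment, this goes negative. */
static int acct_clear_page_dirty_for_io(struct acct_page *p)
{
	if (p->dirty) {
		p->dirty = 0;
		nr_file_dirty--;
		return 1;
	}
	return 0;
}
```

An explicit set_page_dirty(lower_page) before clear_page_dirty_for_io(lower_page) keeps the increment/decrement pair balanced regardless of how the lower page came to be dirty, which is the safety margin Hugh is recommending.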
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> [Dave, I've Cc'ed you re handle_write_count_underflow, see below.]
>
> On Wed, 31 Oct 2007, Erez Zadok wrote:
> >
> > Hi Hugh, I've addressed all of your concerns and am happy to report
> > that the newly revised unionfs_writepage works even better, including
> > under my memory-pressure conditions.  To summarize my changes since
> > the last time:
> >
> > - I'm only masking __GFP_FS, not __GFP_IO
> > - using find_or_create_page to avoid locking issues around mapping mask
> > - handle for_reclaim case more efficiently
> > - using copy_highpage so we handle KM_USER*
> > - un/locking upper/lower page as/when needed
> > - updated comments to clarify what/why
> > - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs
> >   used to have it)
> >
> > Below is the newest version of unionfs_writepage.  Let me know what
> > you think.
> >
> > I have to say that with these changes, unionfs appears visibly faster
> > under memory pressure.  I suspect the for_reclaim handling is probably
> > the largest contributor to this speedup.
>
> That's good news, and that unionfs_writepage looks good to me -
> with three reservations I've not observed before.
>
> One, I think you would be safer to do a set_page_dirty(lower_page)
> before your clear_page_dirty_for_io(lower_page).  I know that sounds
> silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
> there's a lot of subtlety hereabouts, and I think you'd be mimicking the
> usual path closer if you set_page_dirty first - there's nothing else
> doing it on that lower_page, is there?  I'm not certain that you need
> to, but I think you'd do well to look into it and make up your own mind.

Hugh, my code looks like:

	if (wbc->for_reclaim) {
		set_page_dirty(lower_page);
		unlock_page(lower_page);
		goto out_release;
	}
	BUG_ON(!lower_mapping->a_ops->writepage);
	clear_page_dirty_for_io(lower_page);	/* emulate VFS behavior */
	err = lower_mapping->a_ops->writepage(lower_page, wbc);

Do you mean I should set_page_dirty(lower_page) unconditionally before
clear_page_dirty_for_io?  (I already do that in the 'if' statement above
it.)

> Two, I'm unsure of the way you're clearing or setting PageUptodate on
> the upper page there.  The rules for PageUptodate are fairly obvious
> when reading, but when a write fails, it's not so obvious.  Again, I'm
> not saying what you've got is wrong (it may be unavoidable, to keep
> synch between lower and upper), but it deserves a second thought.

I looked at all mainline filesystems' ->writepage to see what, if any,
they do with their page's uptodate flag.  Most f/s don't touch the flag
one way or another.

cifs_writepage sets the uptodate flag unconditionally: why?

ecryptfs_writepage has a legit reason: if encrypting the page failed, it
doesn't want anyone to use it, so it clears its page's uptodate flag
(else it sets it as uptodate).

hostfs_writepage clears pageuptodate if it failed to write_file(), which
I'm not sure if it makes sense or not.

ntfs_writepage goes as far as doing BUG_ON(!PageUptodate(page)), which
indicates to me that the page passed to ->writepage should always be
uptodate.  Is that a fair statement?

smb_writepage pretty much unconditionally calls SetPageUptodate(page).
Why?  Is there a reason smbfs and cifs both do this unconditionally?  If
so, then why is ntfs calling BUG_ON if the page isn't uptodate?  Either
that BUG_ON in ntfs is redundant, or cifs/smbfs's SetPageUptodate is
redundant, but they can't both be right.

And finally, unionfs clears the uptodate flag on error from the lower
->writepage, and otherwise sets the flag on success from the lower
->writepage.  My gut feeling is that unionfs shouldn't change the page
uptodate flag at all: if the VFS passes unionfs_writepage a page which
isn't uptodate, then the VFS has a serious problem b/c it'd be asking a
f/s to write out a page which isn't up-to-date, right?  Otherwise,
whether unionfs_writepage manages to write the lower page or not, why
should that invalidate the state of the unionfs page itself?  Come to
think of it, I think clearing pageuptodate on error from
->writepage(lower_page) may be bad.  Imagine if after such a failed
unionfs_writepage, I get a unionfs_readpage: that ->readpage will get
data from the lower f/s page and copy it *over* the unionfs page, even
if the upper page's data was more recent prior to the failed call to
unionfs_writepage.  IOW, we could be reverting a user-visible mmap'ed
page to a previous on-disk version.  What do you think: could this
happen?  Anyway, I'll run some exhaustive testing next and see what
happens if I don't set/clear the uptodate flag in unionfs_writepage.

> Three, I believe you need to add a flush_dcache_page(lower_page)
> after the copy_highpage(lower_page): some architectures will need
> that to see the new data if they have lower_page mapped
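[Editorial note: Erez's reversion worry above can be played out in a toy model.  The uf_* names are hypothetical userspace stand-ins for the unionfs upper/lower pair: if a failed lower writepage clears the upper page's uptodate flag, the next readpage re-copies the older lower contents over the newer upper ones.]

```c
#include <assert.h>
#include <string.h>

struct uf_page {
	char data[16];
	int uptodate;
};

/* Toy unionfs_writepage where the lower write fails; `clear_on_error`
 * models the questionable ClearPageUptodate(upper) on that error path. */
static int uf_writepage_fails(struct uf_page *upper, int clear_on_error)
{
	if (clear_on_error)
		upper->uptodate = 0;
	return -5;	/* -EIO-style failure from the lower ->writepage */
}

/* Toy unionfs_readpage: only refills the upper page from the lower one
 * when the upper page is not Uptodate. */
static const char *uf_readpage(struct uf_page *upper, const char *lower)
{
	if (!upper->uptodate) {
		strcpy(upper->data, lower);
		upper->uptodate = 1;
	}
	return upper->data;
}
```

With the clear, a user-visible mmap'ed page silently reverts to the previous on-disk version after one failed writeback; without it, the newer upper data survives the error, which supports the instinct to leave the upper page's uptodate flag alone in ->writepage.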
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Dave Hansen writes:
> On Mon, 2007-11-05 at 15:40 +, Hugh Dickins wrote:
[...]
> I have a decent guess what the bug is, too.  In the unionfs code:
>
> > int init_lower_nd(struct nameidata *nd, unsigned int flags)
> > {
> > ...
> > #ifdef ALLOC_LOWER_ND_FILE
> > 	file = kzalloc(sizeof(struct file), GFP_KERNEL);
> > 	if (unlikely(!file)) {
> > 		err = -ENOMEM;
> > 		break; /* exit switch statement and thus return */
> > 	}
> > 	nd->intent.open.file = file;
> > #endif /* ALLOC_LOWER_ND_FILE */
>
> The r/o bind mount code will mnt_drop_write() on that file's f_vfsmnt at
> __fput() time.  Since that code never got a write on the mount, we'll
> see an imbalance if the file was opened for a write.  I don't see this
> file's mnt set anywhere, so I'm not completely sure that this is it.  In
> any case, rolling your own 'struct file' without using alloc_file() and
> friends is a no-no.
[...]

This #ifdef'd code in unionfs is actually not enabled.  I left it there
as a reminder of possible future things to come (esp. if nameidata gets
split).  There's a related comment earlier in
fs/unionfs/lookup.c:init_lower_nd() that says:

#ifdef ALLOC_LOWER_ND_FILE
	/*
	 * XXX: one day we may need to have the lower return an open file
	 * for us.  It is not needed in 2.6.23-rc1 for nfs2/nfs3, but may
	 * very well be needed for nfs4.
	 */
	struct file *file;
#endif /* ALLOC_LOWER_ND_FILE */

In the interest of keeping unionfs as simple as I can, when I implemented
the whole "pass a lower nd" stuff, I left those bits of semi-experimental
#ifdef code for this lower file upon open-intent.  It's not enabled and
up until now, it didn't seem to be needed.  Do you think unionfs has to
start using this nd->intent.open.file stuff?

Thanks,
Erez.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes: [Dave, I've Cc'ed you re handle_write_count_underflow, see below.] On Wed, 31 Oct 2007, Erez Zadok wrote: Hi Hugh, I've addressed all of your concerns and am happy to report that the newly revised unionfs_writepage works even better, including under my memory-pressure conditions. To summarize my changes since the last time: - I'm only masking __GFP_FS, not __GFP_IO - using find_or_create_page to avoid locking issues around mapping mask - handle for_reclaim case more efficiently - using copy_highpage so we handle KM_USER* - un/locking upper/lower page as/when needed - updated comments to clarify what/why - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used to have it) Below is the newest version of unionfs_writepage. Let me know what you think. I have to say that with these changes, unionfs appears visibly faster under memory pressure. I suspect the for_reclaim handling is probably the largest contributor to this speedup. That's good news, and that unionfs_writepage looks good to me - with three reservations I've not observed before. One, I think you would be safer to do a set_page_dirty(lower_page) before your clear_page_dirty_for_io(lower_page). I know that sounds silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io: there's a lot of subtlety hereabouts, and I think you'd be mimicking the usual path closer if you set_page_dirty first - there's nothing else doing it on that lower_page, is there? I'm not certain that you need to, but I think you'd do well to look into it and make up your own mind. Hugh, my code looks like: if (wbc->for_reclaim) { set_page_dirty(lower_page); unlock_page(lower_page); goto out_release; } BUG_ON(!lower_mapping->a_ops->writepage); clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */ err = lower_mapping->a_ops->writepage(lower_page, wbc); Do you mean I should set_page_dirty(lower_page) unconditionally before clear_page_dirty_for_io? 
(I already do that in the 'if' statement above it.) Two, I'm unsure of the way you're clearing or setting PageUptodate on the upper page there. The rules for PageUptodate are fairly obvious when reading, but when a write fails, it's not so obvious. Again, I'm not saying what you've got is wrong (it may be unavoidable, to keep synch between lower and upper), but it deserves a second thought. I looked at all mainline filesystems' ->writepage to see what, if any, they do with their page's uptodate flag. Most f/s don't touch the flag one way or another. cifs_writepage sets the uptodate flag unconditionally: why? ecryptfs_writepage has a legit reason: if encrypting the page failed, it doesn't want anyone to use it, so it clears its page's uptodate flag (else it sets it as uptodate). hostfs_writepage clears PageUptodate if it failed to write_file(), which I'm not sure makes sense or not. ntfs_writepage goes as far as doing BUG_ON(!PageUptodate(page)) which indicates to me that the page passed to ->writepage should always be uptodate. Is that a fair statement? smb_writepage pretty much unconditionally calls SetPageUptodate(page). Why? Is there a reason smbfs and cifs both do this unconditionally? If so, then why is ntfs calling BUG_ON if the page isn't uptodate? Either that BUG_ON in ntfs is redundant, or cifs/smbfs's SetPageUptodate is redundant, but they can't both be right. And finally, unionfs clears the uptodate flag on error from the lower ->writepage, and otherwise sets the flag on success from the lower ->writepage. My gut feeling is that unionfs shouldn't change the page uptodate flag at all: if the VFS passes unionfs_writepage a page which isn't uptodate, then the VFS has a serious problem b/c it'd be asking a f/s to write out a page which isn't up-to-date, right? Otherwise, whether unionfs_writepage manages to write the lower page or not, why should that invalidate the state of the unionfs page itself? 
Come to think of it, I think clearing PageUptodate on error from ->writepage(lower_page) may be bad. Imagine if after such a failed unionfs_writepage, I get a unionfs_readpage: that ->readpage will get data from the lower f/s page and copy it *over* the unionfs page, even if the upper page's data was more recent prior to the failed call to unionfs_writepage. IOW, we could be reverting a user-visible mmap'ed page to a previous on-disk version. What do you think: could this happen? Anyway, I'll run some exhaustive testing next and see what happens if I don't set/clear the uptodate flag in unionfs_writepage. Three, I believe you need to add a flush_dcache_page(lower_page) after the copy_highpage(lower_page): some architectures will need that to see the new data if they have lower_page mapped (though I expect it's anyway shaky ground to be accessing through the lower mount at the same time as modifying through the upper).
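The data-reversion worry above can be sketched as a toy model (illustrative names and an int dirty payload, not the kernel API): if a failed lower ->writepage clears the upper page's uptodate flag, a later ->readpage refills the upper page from the stale lower copy, silently reverting newer mmap'ed data.

```c
#include <assert.h>

/* Toy page: just an uptodate flag and a data payload. */
struct toy_page { int uptodate; int data; };

/* The questionable ClearPageUptodate(page) on lower-writepage failure. */
static void failed_writepage(struct toy_page *upper)
{
    upper->uptodate = 0;
}

/* The VFS only (re)reads pages that are not uptodate. */
static void readpage(struct toy_page *upper, int lower_data)
{
    if (!upper->uptodate) {
        upper->data = lower_data;   /* older on-disk version wins */
        upper->uptodate = 1;
    }
}

/* Upper page holds newer data (42); the write of it fails; a reread
 * pulls in the stale lower copy (7).  Returns the final upper data. */
static int revert_scenario(void)
{
    struct toy_page upper = { 1, 42 };
    failed_writepage(&upper);
    readpage(&upper, 7);
    return upper.data;
}
```

The scenario ends with the stale value, which is exactly the user-visible reversion Erez describes.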
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 5 Nov 2007, Dave Hansen wrote: > > Actually, I think your s/while/if/ change is probably a decent fix. Any resemblance to a decent fix is purely coincidental. > Barring any other races, that loop should always have made progress on > mnt->__mnt_writers the way it is written. If we get to: > > > lock_and_coalesce_cpu_mnt_writer_counts(); > ->HERE > > mnt_unlock_cpus(); > > and don't have a positive mnt->__mnt_writers, we know something is going > badly. We WARN_ON() there, which should at least give an earlier > warning that the system is not doing well. But it doesn't fix the > inevitable. Could you try the attached patch and see if it at least > warns you earlier? Thanks, Dave, yes, that gives me a nice warning: leak detected on mount(c25ebd80) writers count: -65537 WARNING: at fs/namespace.c:249 handle_write_count_underflow() [<c0103486>] show_trace_log_lvl+0x1b/0x2e [<c01034b6>] show_trace+0x16/0x1b [<c0103589>] dump_stack+0x19/0x1e [<c0171906>] handle_write_count_underflow+0x4c/0x60 [<c0171983>] mnt_drop_write+0x69/0x8e [<c0160211>] __fput+0xff/0x162 [<c016010d>] fput+0x2e/0x33 [<c01b8f63>] unionfs_file_release+0xc2/0x1c5 [<c01601a1>] __fput+0x8f/0x162 [<c016010d>] fput+0x2e/0x33 [<c015ec9d>] filp_close+0x50/0x5d [<c015ed1e>] sys_close+0x74/0xb4 [<c01026ce>] sysenter_past_esp+0x5f/0x85 and the test then goes quietly on its way instead of hanging. Though I imagine, with your patch or mine, that it's then making an unfortunate frequency of calls to lock_and_coalesce_longer_name_than_I_care_to_type thereafter. But it's hardly your responsibility to optimize for bugs elsewhere. The 2.6.23-mm1 tree has MNT_USER at 0x200, so I adjusted your flag to #define MNT_IMBALANCED_WRITE_COUNT 0x400 /* just for debugging */ > > I have a decent guess what the bug is, too. In the unionfs code: I'll let Erez take it from there... Hugh
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 2007-11-05 at 15:40 +0000, Hugh Dickins wrote: > The second problem was a hang: all cpus in > handle_write_count_underflow > doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave > Hansen. At first I thought that was a locking problem in Dave's code, > but I now suspect it's that your unionfs reference counting is wrong > somewhere, and the error accumulates until __mnt_writers drops below > MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help > and we're stuck in that loop. I've never actually seen this happen in practice, but I do know exactly what you're talking about. > but I hope Dave can > also make handle_write_count_underflow more robust, it's unfortunate > if refcount errors elsewhere first show up as a hang there. Actually, I think your s/while/if/ change is probably a decent fix. Barring any other races, that loop should always have made progress on mnt->__mnt_writers the way it is written. If we get to: > lock_and_coalesce_cpu_mnt_writer_counts(); ->HERE > mnt_unlock_cpus(); and don't have a positive mnt->__mnt_writers, we know something is going badly. We WARN_ON() there, which should at least give an earlier warning that the system is not doing well. But it doesn't fix the inevitable. Could you try the attached patch and see if it at least warns you earlier? I have a decent guess what the bug is, too. In the unionfs code: > int init_lower_nd(struct nameidata *nd, unsigned int flags) > { > ... > #ifdef ALLOC_LOWER_ND_FILE > file = kzalloc(sizeof(struct file), GFP_KERNEL); > if (unlikely(!file)) { > err = -ENOMEM; > break; /* exit switch statement and thus return */ > } > nd->intent.open.file = file; > #endif /* ALLOC_LOWER_ND_FILE */ The r/o bind mount code will mnt_drop_write() on that file's f_vfsmnt at __fput() time. Since that code never got a write on the mount, we'll see an imbalance if the file was opened for a write. I don't see this file's mnt set anywhere, so I'm not completely sure that this is it. 
In any case, rolling your own 'struct file' without using alloc_file() and friends is a no-no. BTW, I have some "debugging" code in my latest set of patches that I think should fix this kind of imbalance with the mnt->__mnt_writers(). It ensures that before we do that mnt_drop_write() at __fput() that we absolutely did a mnt_want_write() at some point in the 'struct file's life. -- Dave linux-2.6.git-dave/fs/namespace.c| 31 ++- linux-2.6.git-dave/include/linux/mount.h |1 + 2 files changed, 23 insertions(+), 9 deletions(-) diff -puN fs/namei.c~fix-naughty-loop fs/namei.c diff -puN fs/namespace.c~fix-naughty-loop fs/namespace.c --- linux-2.6.git/fs/namespace.c~fix-naughty-loop 2007-11-05 08:03:59.0 -0800 +++ linux-2.6.git-dave/fs/namespace.c 2007-11-05 08:35:06.0 -0800 @@ -225,16 +225,29 @@ static void lock_and_coalesce_cpu_mnt_wr */ static void handle_write_count_underflow(struct vfsmount *mnt) { - while (atomic_read(&mnt->__mnt_writers) < - MNT_WRITER_UNDERFLOW_LIMIT) { - /* -* It isn't necessary to hold all of the locks -* at the same time, but doing it this way makes -* us share a lot more code. -*/ - lock_and_coalesce_cpu_mnt_writer_counts(); - mnt_unlock_cpus(); + if (atomic_read(&mnt->__mnt_writers) >= + MNT_WRITER_UNDERFLOW_LIMIT) + return; + /* +* It isn't necessary to hold all of the locks +* at the same time, but doing it this way makes +* us share a lot more code. +*/ + lock_and_coalesce_cpu_mnt_writer_counts(); + /* +* If coalescing the per-cpu writer counts did not +* get us back to a positive writer count, we have +* a bug. 
+*/ + if ((atomic_read(&mnt->__mnt_writers) < 0) && + !(mnt->mnt_flags & MNT_IMBALANCED_WRITE_COUNT)) { + printk("leak detected on mount(%p) writers count: %d\n", + mnt, atomic_read(&mnt->__mnt_writers)); + WARN_ON(1); + /* use the flag to keep the dmesg spam down */ + mnt->mnt_flags |= MNT_IMBALANCED_WRITE_COUNT; } + mnt_unlock_cpus(); } /** diff -puN include/linux/mount.h~fix-naughty-loop include/linux/mount.h --- linux-2.6.git/include/linux/mount.h~fix-naughty-loop2007-11-05 08:22:21.0 -0800 +++ linux-2.6.git-dave/include/linux/mount.h2007-11-05 08:28:20.0 -0800 @@ -32,6 +32,7 @@ struct mnt_namespace; #define MNT_READONLY 0x40/* does the user want this to be r/o? */ #define MNT_SHRINKABLE 0x100 +#define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */ #define MNT_SHARED 0x1000
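Why the s/while/if/ matters can be modeled in userspace (a sketch with hypothetical names; plain ints stand in for the atomic counter and the per-cpu deltas): coalescing folds the per-cpu writer deltas into the main count once, and if the count is still negative afterwards there is a genuine leak that re-running the same coalesce step can never repair — so warn once and return instead of spinning.

```c
#include <assert.h>

#define TOY_UNDERFLOW_LIMIT (-65536)

/* Fold per-cpu deltas into the main count, zeroing them (one pass). */
static int coalesce(int count, int *percpu, int n)
{
    for (int i = 0; i < n; i++) {
        count += percpu[i];
        percpu[i] = 0;
    }
    return count;
}

/* The if-variant: one coalesce pass; returns 1 if a leak remains
 * (where the kernel patch would printk and WARN_ON once). */
static int check_underflow(int count, int *percpu, int n)
{
    if (count >= TOY_UNDERFLOW_LIMIT)
        return 0;                       /* nothing to do */
    return coalesce(count, percpu, n) < 0;
}
```

With balanced per-cpu deltas the pass repairs the count; with a real leak it cannot, and a while-loop around the same pass would never exit.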
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
[Dave, I've Cc'ed you re handle_write_count_underflow, see below.] On Wed, 31 Oct 2007, Erez Zadok wrote: > > Hi Hugh, I've addressed all of your concerns and am happy to report that the > newly revised unionfs_writepage works even better, including under my > memory-pressure conditions. To summarize my changes since the last time: > > - I'm only masking __GFP_FS, not __GFP_IO > - using find_or_create_page to avoid locking issues around mapping mask > - handle for_reclaim case more efficiently > - using copy_highpage so we handle KM_USER* > - un/locking upper/lower page as/when needed > - updated comments to clarify what/why > - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used > to have it) > > Below is the newest version of unionfs_writepage. Let me know what you > think. > > I have to say that with these changes, unionfs appears visibly faster under > memory pressure. I suspect the for_reclaim handling is probably the largest > contributor to this speedup. That's good news, and that unionfs_writepage looks good to me - with three reservations I've not observed before. One, I think you would be safer to do a set_page_dirty(lower_page) before your clear_page_dirty_for_io(lower_page). I know that sounds silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io: there's a lot of subtlety hereabouts, and I think you'd be mimicking the usual path closer if you set_page_dirty first - there's nothing else doing it on that lower_page, is there? I'm not certain that you need to, but I think you'd do well to look into it and make up your own mind. Two, I'm unsure of the way you're clearing or setting PageUptodate on the upper page there. The rules for PageUptodate are fairly obvious when reading, but when a write fails, it's not so obvious. Again, I'm not saying what you've got is wrong (it may be unavoidable, to keep synch between lower and upper), but it deserves a second thought. 
Three, I believe you need to add a flush_dcache_page(lower_page) after the copy_highpage(lower_page): some architectures will need that to see the new data if they have lower_page mapped (though I expect it's anyway shaky ground to be accessing through the lower mount at the same time as modifying through the upper). I've been trying this out on 2.6.23-mm1 with your 21 Oct 1-9/9 and your 2 Nov 1-8/8 patches applied (rejects being patches which were already in 2.6.23-mm1). I was hoping to reproduce the BUG_ON(entry->val) that I fear from shmem_writepage(), before fixing it; but not seen that at all yet - that might be good news, but it's more likely I just haven't tried hard enough yet. For now I'm doing repeated make -j20 kernel builds, pushing into swap, in a unionfs mount of just a single dir on tmpfs. This has shown up several problems, two of which I've had to hack around to get further. The first: I very quickly hit "BUG: atomic counter underflow" from -mm's i386 atomic_dec_and_test: from filp_close calling unionfs_flush. I did a little test fork()ing while holding a file open on unionfs, and indeed it appears that your totalopens code is broken, being unaware of how fork() bumps up a file count without an open. That's rather basic, I'm puzzled that this has remained undiscovered until now - or perhaps it's just a recent addition. It looked to me as if the totalopens count was about avoiding some d_deleted processing in unionfs_flush, which actually should be left until unionfs_release (and that your unionfs_flush ought to be calling the lower level flush in all cases). To get going, I've been running with the quick hack patch below: but I've spent very little time thinking about it, plus it's a long time since I've done VFS stuff; so that patch may be nothing but an embarrassment that reflects neither your intentions nor the VFS rules! And it may itself be responsible for the further problems I've seen. 
The second problem was a hang: all cpus in handle_write_count_underflow doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave Hansen. At first I thought that was a locking problem in Dave's code, but I now suspect it's that your unionfs reference counting is wrong somewhere, and the error accumulates until __mnt_writers drops below MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help and we're stuck in that loop. My even greater hack to solve that one was to change Dave's "while" to "if"! Then indeed tests can run for some while. As I say, my suspicion is that the actual error is within unionfs (perhaps introduced by my first hack); but I hope Dave can also make handle_write_count_underflow more robust, it's unfortunate if refcount errors elsewhere first show up as a hang there. I've had CONFIG_UNION_FS_DEBUG=y but will probably turn it off when I come back to this, since it's rather noisy at present. I've not checked whether its reports are peculiar to having tmpfs below or not. I get lots of "unionfs: new lower inode mtime"
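Hugh's first reservation — set_page_dirty before clear_page_dirty_for_io — comes down to a hand-over of the dirty bit, which a toy model makes concrete (illustrative toy_* names, not the kernel API): the normal writeback path operates on a page something already dirtied, and clear_page_dirty_for_io transfers that dirtiness to the I/O about to start. Since nothing else dirties the freshly copied lower_page, unionfs mimics the usual path only if it sets the page dirty itself first.

```c
#include <assert.h>

/* Toy page with just a dirty bit. */
struct toy_page { int dirty; };

static void toy_set_page_dirty(struct toy_page *p) { p->dirty = 1; }

/* Returns the old dirty state; the caller should submit the write
 * only if this returns 1 (the real helper has the same contract). */
static int toy_clear_page_dirty_for_io(struct toy_page *p)
{
    int was_dirty = p->dirty;
    p->dirty = 0;
    return was_dirty;
}

/* The ordering Hugh suggests for the freshly copied lower page:
 * dirty it, then hand the dirtiness to the I/O path.  Returns 1
 * when the page is correctly seen as needing the write. */
static int prepare_lower_page(struct toy_page *lower)
{
    toy_set_page_dirty(lower);
    return toy_clear_page_dirty_for_io(lower);
}

/* Run the sequence on a clean page and check both halves of the
 * contract: the write is wanted, and the page ends up clean. */
static int ordering_scenario(void)
{
    struct toy_page lower = { 0 };
    int write_wanted = prepare_lower_page(&lower);
    return write_wanted == 1 && lower.dirty == 0;
}
```

Skipping the set_page_dirty step would make the clear return 0, i.e. the usual path's "only write what was dirty" bookkeeping would not see the copy as needing writeback.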
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, I've addressed all of your concerns and am happy to report that the newly revised unionfs_writepage works even better, including under my memory-pressure conditions. To summarize my changes since the last time: - I'm only masking __GFP_FS, not __GFP_IO - using find_or_create_page to avoid locking issues around mapping mask - handle for_reclaim case more efficiently - using copy_highpage so we handle KM_USER* - un/locking upper/lower page as/when needed - updated comments to clarify what/why - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used to have it) Below is the newest version of unionfs_writepage. Let me know what you think. I have to say that with these changes, unionfs appears visibly faster under memory pressure. I suspect the for_reclaim handling is probably the largest contributor to this speedup. Many thanks, Erez. // static int unionfs_writepage(struct page *page, struct writeback_control *wbc) { int err = -EIO; struct inode *inode; struct inode *lower_inode; struct page *lower_page; struct address_space *lower_mapping; /* lower inode mapping */ gfp_t mask; inode = page->mapping->host; lower_inode = unionfs_lower_inode(inode); lower_mapping = lower_inode->i_mapping; /* * find lower page (returns a locked page) * * We turn off __GFP_FS while we look for or create a new lower * page. This prevents a recursion into the file system code, which * under memory pressure conditions could lead to a deadlock. This * is similar to how the loop driver behaves (see loop_set_fd in * drivers/block/loop.c). If we can't find the lower page, we * redirty our page and return "success" so that the VM will call us * again in the (hopefully near) future. 
*/ mask = mapping_gfp_mask(lower_mapping) & ~(__GFP_FS); lower_page = find_or_create_page(lower_mapping, page->index, mask); if (!lower_page) { err = 0; set_page_dirty(page); goto out; } /* copy page data from our upper page to the lower page */ copy_highpage(lower_page, page); /* * Call lower writepage (expects locked page). However, if we are * called with wbc->for_reclaim, then the VFS/VM just wants to * reclaim our page. Therefore, we don't need to call the lower * ->writepage: just copy our data to the lower page (already done * above), then mark the lower page dirty and unlock it, and return * success. */ if (wbc->for_reclaim) { set_page_dirty(lower_page); unlock_page(lower_page); goto out_release; } BUG_ON(!lower_mapping->a_ops->writepage); clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */ err = lower_mapping->a_ops->writepage(lower_page, wbc); if (err < 0) { ClearPageUptodate(page); goto out_release; } /* * Lower file systems such as ramfs and tmpfs, may return * AOP_WRITEPAGE_ACTIVATE so that the VM won't try to (pointlessly) * write the page again for a while. But those lower file systems * also set the page dirty bit back again. Since we successfully * copied our page data to the lower page, then the VM will come * back to the lower page (directly) and try to flush it. So we can * save the VM the hassle of coming back to our page and trying to * flush too. Therefore, we don't re-dirty our own page, and we * never return AOP_WRITEPAGE_ACTIVATE back to the VM (we consider * this a success). * * We also unlock the lower page if the lower ->writepage returned * AOP_WRITEPAGE_ACTIVATE. (This "anomalous" behaviour may be * addressed in future shmem/VM code.) 
*/ if (err == AOP_WRITEPAGE_ACTIVATE) { err = 0; unlock_page(lower_page); } /* all is well */ SetPageUptodate(page); /* lower mtimes have changed: update ours */ unionfs_copy_attr_times(inode); out_release: /* b/c find_or_create_page increased refcnt */ page_cache_release(lower_page); out: /* * We unlock our page unconditionally, because we never return * AOP_WRITEPAGE_ACTIVATE. */ unlock_page(page); return err; } //
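The thread's original symptom was AOP_WRITEPAGE_ACTIVATE leaking out of msync(2) to userland; the unionfs_writepage above sidesteps it by folding that status into plain success once the data is safely copied to the (re-dirtied) lower page. A minimal sketch of just that mapping (the constant's value mirrors the kernel's definition of the era; the function name is illustrative):

```c
#include <assert.h>

/* ->writepage may return 0, a negative errno, or this special value
 * asking the VM to keep the page active rather than retry it soon. */
#define AOP_WRITEPAGE_ACTIVATE 0x80000

/* Never let ACTIVATE propagate upward: the lower fs has re-dirtied
 * its own page, so from the upper layer's view the write succeeded. */
static int fold_writepage_status(int err)
{
    if (err == AOP_WRITEPAGE_ACTIVATE)
        return 0;
    return err;     /* 0 or -errno passes through unchanged */
}
```

This is the invariant the comment block in unionfs_writepage states: callers of the upper ->writepage, including the msync path, only ever see 0 or a negative errno.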
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 28 Oct 2007, Erez Zadok wrote:
>
> I took your advice regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
> and such.  I revised my unionfs_writepage and unionfs_sync_page, and tested
> it under memory pressure: I have a couple of live CDs that use tmpfs and can
> deterministically reproduce the conditions resulting in A_W_A.  I also went
> back to using grab_cache_page but with the gfp_mask suggestions you made.
>
> I'm happy to report that it all works great now!

That's very encouraging...

> Below is the entirety of
> the new unionfs_mmap and unionfs_sync_page code.  I'd appreciate if you and
> others can look it over and see if you find any problems.

... but still a few problems, I'm afraid.  The greatest problem is a
tmpfs one, that would be for me to solve.  But first...

> static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
> {
> 	int err = -EIO;
> 	struct inode *inode;
> 	struct inode *lower_inode;
> 	struct page *lower_page;
> 	char *kaddr, *lower_kaddr;
> 	struct address_space *mapping; /* lower inode mapping */
> 	gfp_t old_gfp_mask;
>
> 	inode = page->mapping->host;
> 	lower_inode = unionfs_lower_inode(inode);
> 	mapping = lower_inode->i_mapping;
>
> 	/*
> 	 * find lower page (returns a locked page)
> 	 *
> 	 * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under

On reflection, I think I went too far in asking you to mask off
__GFP_IO.  Loop has to do so because it's a block device, down towards
the IO layer; but unionfs is a filesystem, so masking off __GFP_FS is
enough to prevent recursion into the FS layer with danger of deadlock,
and leaving __GFP_IO on gives a better chance of success.

> 	 * memory pressure conditions.  This is similar to how the loop
> 	 * driver behaves (see loop_set_fd in drivers/block/loop.c).
> 	 * If we can't find the lower page, we redirty our page and return
> 	 * "success" so that the VM will call us again in the (hopefully
> 	 * near) future.
> 	 */
> 	old_gfp_mask = mapping_gfp_mask(mapping);
> 	mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));
>
> 	lower_page = grab_cache_page(mapping, page->index);
> 	mapping_set_gfp_mask(mapping, old_gfp_mask);

Hmm, several points on that.

When suggesting something like this, I did remark "what locking
needed?".  You've got none: which is problematic if two stacked mounts
are playing with the same underlying file concurrently (yes, userspace
would have a data coherency problem in such a case, but the kernel
still needs to worry about its own internal integrity) - you'd be in
danger of masking (__GFP_IO|__GFP_FS) permanently off the underlying
file; and furthermore, losing error flags (AS_EIO, AS_ENOSPC) which
share the same unsigned long.  Neither likely but both wrong.  See the
comment on mapping_set_gfp_mask() in include/linux/pagemap.h:
 * This is non-atomic.  Only to be used before the mapping is activated.

Strictly speaking, I guess loop was a little bit guilty even when just
loop_set_fd() did it: the underlying mapping might already be active.
It appears to be just as guilty as you in its do_loop_switch() case
(done at BIO completion time), but that's for a LOOP_CHANGE_FD ioctl
which would only be expected to be called once, during installation;
whereas you're using mapping_set_gfp_mask here with great frequency.

Another point on this is: loop masks __GFP_IO|__GFP_FS off the file for
the whole duration while it is looped, whereas you're flipping it just
in this preliminary section of unionfs_writepage.  I think you're
probably okay to be doing it only here within ->writepage: I think loop
covered every operation because it's at the block device level, perhaps
both reads and writes needed to serve reclaim at the higher FS level;
and also easier to do it once for all.  Are you wrong to be doing it
only around the grab_cache_page, leaving the lower level ->writepage
further down unprotected?

Certainly doing it around the grab_cache_page is likely to be way more
important than around the ->writepage (but rather depends on the
filesystem).  And on reflection, I think that the lower filesystem's
writepage should already be using GFP_NOFS to avoid deadlocks in any of
its allocations when wbc->for_reclaim, so you should be okay just
masking off around the grab_cache_page.

(Actually, in the wbc->for_reclaim case, I think you don't really need
to call the lower level writepage at all.  Just set_page_dirty on the
lower page, unlock it and return.  In due course that memory pressure
which has called unionfs_writepage, will come around to the lower level
page and do writepage upon it.  Whether that's a better strategy or
not, I do not know.)

There's an attractively simple answer to the mapping_set_gfp_mask
locking problem, if we're confident that it's only needed around the
grab_cache_page.  Look at the declaration of grab_cache_page in
linux/pagemap.h: it immediately extracts the gfp_mask from the mapping.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hugh,

I took your advice regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
and such.  I revised my unionfs_writepage and unionfs_sync_page, and tested
it under memory pressure: I have a couple of live CDs that use tmpfs and can
deterministically reproduce the conditions resulting in A_W_A.  I also went
back to using grab_cache_page but with the gfp_mask suggestions you made.

I'm happy to report that it all works great now!  Below is the entirety of
the new unionfs_mmap and unionfs_sync_page code.  I'd appreciate if you and
others can look it over and see if you find any problems.

Thanks,
Erez.

static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
{
	int err = -EIO;
	struct inode *inode;
	struct inode *lower_inode;
	struct page *lower_page;
	char *kaddr, *lower_kaddr;
	struct address_space *mapping; /* lower inode mapping */
	gfp_t old_gfp_mask;

	inode = page->mapping->host;
	lower_inode = unionfs_lower_inode(inode);
	mapping = lower_inode->i_mapping;

	/*
	 * find lower page (returns a locked page)
	 *
	 * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under
	 * memory pressure conditions.  This is similar to how the loop
	 * driver behaves (see loop_set_fd in drivers/block/loop.c).
	 * If we can't find the lower page, we redirty our page and return
	 * "success" so that the VM will call us again in the (hopefully
	 * near) future.
	 */
	old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));

	lower_page = grab_cache_page(mapping, page->index);

	mapping_set_gfp_mask(mapping, old_gfp_mask);

	if (!lower_page) {
		err = 0;
		set_page_dirty(page);
		goto out;
	}

	/* get page address, and encode it */
	kaddr = kmap(page);
	lower_kaddr = kmap(lower_page);

	memcpy(lower_kaddr, kaddr, PAGE_CACHE_SIZE);

	kunmap(page);
	kunmap(lower_page);

	BUG_ON(!mapping->a_ops->writepage);

	/* call lower writepage (expects locked page) */
	clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
	err = mapping->a_ops->writepage(lower_page, wbc);

	/* b/c grab_cache_page locked it and ->writepage unlocks on success */
	if (err)
		unlock_page(lower_page);

	/* b/c grab_cache_page increased refcnt */
	page_cache_release(lower_page);

	if (err < 0) {
		ClearPageUptodate(page);
		goto out;
	}

	/*
	 * Lower file systems such as ramfs and tmpfs, may return
	 * AOP_WRITEPAGE_ACTIVATE so that the VM won't try to (pointlessly)
	 * write the page again for a while.  But those lower file systems
	 * also set the page dirty bit back again.  Since we successfully
	 * copied our page data to the lower page, then the VM will come
	 * back to the lower page (directly) and try to flush it.  So we can
	 * save the VM the hassle of coming back to our page and trying to
	 * flush too.  Therefore, we don't re-dirty our own page, and we
	 * don't return AOP_WRITEPAGE_ACTIVATE back to the VM (we consider
	 * this a success).
	 */
	if (err == AOP_WRITEPAGE_ACTIVATE)
		err = 0;

	/* all is well */
	SetPageUptodate(page);
	/* lower mtimes have changed: update ours */
	unionfs_copy_attr_times(inode);

	unlock_page(page);

out:
	return err;
}

static void unionfs_sync_page(struct page *page)
{
	struct inode *inode;
	struct inode *lower_inode;
	struct page *lower_page;
	struct address_space *mapping; /* lower inode mapping */
	gfp_t old_gfp_mask;

	inode = page->mapping->host;
	lower_inode = unionfs_lower_inode(inode);
	mapping = lower_inode->i_mapping;

	/*
	 * Find lower page (returns a locked page).
	 *
	 * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under
	 * memory pressure conditions.  This is similar to how the loop
	 * driver behaves (see loop_set_fd in drivers/block/loop.c).
	 * If we can't find the lower page, we redirty our page and return
	 * "success" so that the VM will call us again in the (hopefully
	 * near) future.
	 */
	old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));

	lower_page = grab_cache_page(mapping, page->index);

	mapping_set_gfp_mask(mapping, old_gfp_mask);

	if (!lower_page) {
		printk(KERN_ERR "unionfs: grab_cache_page failed\n");
		goto out;
	}

	/* do the actual sync */
	/*
	 * XXX: can we optimize ala RAIF and set the lower page
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> On Mon, 22 Oct 2007, Erez Zadok wrote:
[...]
> > If you've got suggestions how I can handle unionfs_write more cleanly, or
> > comments on the above possibilities, I'd love to hear them.
>
> For now I think you should pursue the ~(__GFP_FS|__GFP_IO) idea somehow.
>
> Hugh

Hugh, thanks for the great explanations and suggestions (in multiple
emails).  I'm going to test all of those soon.

Erez.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Fri, 26 Oct 2007, Neil Brown wrote:
> On Thursday October 25, [EMAIL PROTECTED] wrote:
>
> The patch looks like it makes perfect sense to me.

Great, thanks a lot for looking at it, Neil and Pekka.

> Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE
> without unlocking the page, and this has precisely the effect of:
>    ClearPageReclaim();  (if the call path was through pageout)
>    SetPageActive();     (if the call was through shrink_page_list)
>    unlock_page();
>
> With the patch, the ->writepage method does the SetPageActive and the
> unlock_page, which on the whole seems cleaner.
>
> We seem to have lost a call to ClearPageReclaim - I don't know if that
> is significant.

It doesn't show up in the diff at all, but pageout() already has

	if (!PageWriteback(page)) {
		/* synchronous write or broken a_ops? */
		ClearPageReclaim(page);
	}

which will clear it since we've never set PageWriteback.  I think no
harm would come from leaving it set there, since it only takes effect
in end_page_writeback (its effect being to let the just written page be
moved to the end of the LRU, so that it will then be soon reclaimed),
and clear_page_dirty_for_io clears it before coming down this way.  But
I'd never argue for that: I hate having leftover flags hanging around
outside the scope of their relevance.

> > Special, hidden, undocumented, secret hack!  Then in 2.6.7 Andrew
> > stole his own secret and used it when concocting ramdisk_writepage.
> > Oh, and NFS made some kind of use of it in 2.6.6 only.  Then Neil
> > revealed the secret to the uninitiated in 2.6.17: now, what's the
> > appropriate punishment for that?
>
> Surely the punishment should be for writing hidden undocumented hacks
> in the first place!  I vote we just make him maintainer for the whole
> kernel - that will keep him so busy that he will never have a chance
> to do it again :-)

That is a splendid retort, which has won you absolution.  But it makes
me a little sad: that smiley should be a weepy.

> > --- 2.6.24-rc1/Documentation/filesystems/vfs.txt	2007-10-24 07:15:11.000000000 +0100
> > +++ linux/Documentation/filesystems/vfs.txt	2007-10-24 08:42:07.000000000 +0100
> > @@ -567,9 +567,7 @@ struct address_space_operations {
> >     If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
> >     try too hard if there are problems, and may choose to write out
> >     other pages from the mapping if that is easier (e.g. due to
> > -   internal dependencies). If it chooses not to start writeout, it
> > -   should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> > -   calling ->writepage on that page.
> > +   internal dependencies).
>
> It seems that the new requirement is that if the address_space
> chooses not to write out the page, it should now call SetPageActive().
> If that is the case, I think it should be explicit in the
> documentation - please?

No, it's not the case; but you're right that I should add something
there, to put an end to the idea.  It'll be something along the lines of
"You may notice shmem setting PageActive there, but please don't do
that; or if you insist, be sure never to do so in the !wbc->for_reclaim
case".

The PageActive thing is for when a filesystem regrets that it even had
a ->writepage (it replicates the behaviour of the writepage == NULL
case or the VM_LOCKED SWAP_FAIL case or the !add_to_swap case, delaying
the return of this page to writepage for as long as it can).  It's done
in shmem_writepage because shm_lock (equivalent to VM_LOCKED) is only
discovered within that writepage, and no-swap is discovered there too.
ramdisk does it too: I've not tried to understand ramdisk as Nick and
Eric have, but it used to have no writepage, and would prefer to have
no writepage, but appears to need one for some PageUptodate reasons.

It's fairly normal for a filesystem to find that for some reason it
cannot carry out a writepage on this page right now (in the reclaim
case: the sync case demands action, IIRC); so it then simply does
set_page_dirty and unlock_page and returns "success".

I'll try to condense this down for the Doc when finalizing the patch;
which I've still not yet tested properly - thanks for the eyes, but I
can't submit it until I've checked in detail that it really gets to do
what we think it does.

Hugh
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi,

On 10/26/07, Neil Brown <[EMAIL PROTECTED]> wrote:
> It seems that the new requirement is that if the address_space
> chooses not to write out the page, it should now call SetPageActive().
> If that is the case, I think it should be explicit in the
> documentation - please?

Agreed.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh,

On 10/25/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> @@ -349,10 +349,6 @@ static pageout_t pageout(struct page *pa
> 		res = mapping->a_ops->writepage(page, &wbc);
> 		if (res < 0)
> 			handle_write_error(mapping, page, res);
> -		if (res == AOP_WRITEPAGE_ACTIVATE) {
> -			ClearPageReclaim(page);
> -			return PAGE_ACTIVATE;
> -		}

I don't see ClearPageReclaim added anywhere.  Is that done on purpose?

Other than that, the patch looks good to me and I think we should stick
it into -mm to punish Andrew for his secret hack ;-).

Pekka
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi, On 10/26/07, Neil Brown [EMAIL PROTECTED] wrote: It seems that the new requirement is that if the address_space chooses not to write out the page, it should now call SetPageActive(). If that is the case, I think it should be explicit in the documentation - please? Agreed. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On 10/25/07, Hugh Dickins [EMAIL PROTECTED] wrote: @@ -349,10 +349,6 @@ static pageout_t pageout(struct page *pa res = mapping-a_ops-writepage(page, wbc); if (res 0) handle_write_error(mapping, page, res); - if (res == AOP_WRITEPAGE_ACTIVATE) { - ClearPageReclaim(page); - return PAGE_ACTIVATE; - } I don't see ClearPageReclaim added anywhere. Is that done on purpose? Other than that, the patch looks good to me and I think we should stick it into -mm to punish Andrew for his secret hack ;-). Pekka - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Fri, 26 Oct 2007, Neil Brown wrote: On Thursday October 25, [EMAIL PROTECTED] wrote: The patch looks like it makes perfect sense to me. Great, thanks a lot for looking at it, Neil and Pekka. Before the change, -writepage could return AOP_WRITEPAGE_ACTIVATE without unlocking the page, and this has precisely the effect of: ClearPageReclaim(); (if the call path was through pageout) SetPageActive(); (if the call was through shrink_page_list) unlock_page(); With the patch, the -writepage method does the SetPageActive and the unlock_page, which on the whole seems cleaner. We seem to have lost a call to ClearPageReclaim - I don't know if that is significant. It doesn't show up in the diff at all, but pageout() already has if (!PageWriteback(page)) { /* synchronous write or broken a_ops? */ ClearPageReclaim(page); } which will clear it since we've never set PageWriteback. I think no harm would come from leaving it set there, since it only takes effect in end_page_writeback (its effect being to let the just written page be moved to the end of the LRU, so that it will then be soon reclaimed), and clear_page_dirty_for_io clears it before coming down this way. But I'd never argue for that: I hate having leftover flags hanging around outside the scope of their relevance. Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew stole his own secret and used it when concocting ramdisk_writepage. Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil revealed the secret to the uninitiated in 2.6.17: now, what's the appropriate punishment for that? Surely the punishment should be for writing hidden undocumented hacks in the first place! I vote we just make him maintainer for the whole kernel - that will keep him so busy that he will never have a chance to do it again :-) That is a splendid retort, which has won you absolution. But it makes me a little sad: that smiley should be a weepy. 
--- 2.6.24-rc1/Documentation/filesystems/vfs.txt2007-10-24 07:15:11.0 +0100 +++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.0 +0100 @@ -567,9 +567,7 @@ struct address_space_operations { If wbc-sync_mode is WB_SYNC_NONE, -writepage doesn't have to try too hard if there are problems, and may choose to write out other pages from the mapping if that is easier (e.g. due to - internal dependencies). If it chooses not to start writeout, it - should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep - calling -writepage on that page. + internal dependencies). It seems that the new requirement is that if the address_space chooses not to write out the page, it should now call SetPageActive(). If that is the case, I think it should be explicit in the documentation - please? No, it's not the case; but you're right that I should add something there, to put an end to the idea. It'll be something along the lines of You may notice shmem setting PageActive there, but please don't do that; or if you insist, be sure never to do so in the !wbc-for_reclaim case. The PageActive thing is for when a filesystem regrets that it even had a -writepage (it replicates the behaviour of the writepage == NULL case or the VM_LOCKED SWAP_FAIL case or the !add_to_swap case, delaying the return of this page to writepage for as long as it can). It's done in shmem_writepage because shm_lock (equivalent to VM_LOCKED) is only discovered within that writepage, and no-swap is discovered there too. ramdisk does it too: I've not tried to understand ramdisk as Nick and Eric have, but it used to have no writepage, and would prefer to have no writepage, but appears to need one for some PageUptodate reasons. It's fairly normal for a filesystem to find that for some reason it cannot carry out a writepage on this page right now (in the reclaim case: the sync case demands action, IIRC); so it then simply does set_page_dirty and unlock_page and returns success. 
I'll try to condense this down for the Doc when finalizing the patch; which I've still not yet tested properly - thanks for the eyes, but I can't submit it until I've checked in detail that it really gets to do what we think it does.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thursday October 25, [EMAIL PROTECTED] wrote:
> On Mon, 22 Oct 2007, Pekka Enberg wrote:
> > On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > > a special code to get the LRU rotation right, not useful elsewhere.
> > > Though Documentation/filesystems/vfs.txt does imply wider use.
> > >
> > > I think this is where people use the phrase "go figure" ;)
> >
> > Heh. As far as I can tell, the implication of "wider use" was added by
> > Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
> > VFS documentation", so perhaps he might know? Neil?

I just read the code, tried to understand it, translated that understanding into English, and put that in vfs.txt. It is very possible that what I wrote didn't match the intention of the author, but it seemed to match the behaviour of the code.

The patch looks like it makes perfect sense to me.

Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE without unlocking the page, and this has precisely the effect of:

   ClearPageReclaim();  (if the call path was through pageout)
   SetPageActive();     (if the call was through shrink_page_list)
   unlock_page();

With the patch, the ->writepage method does the SetPageActive and the unlock_page, which on the whole seems cleaner.

We seem to have lost a call to ClearPageReclaim - I don't know if that is significant.

> Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew
> stole his own secret and used it when concocting ramdisk_writepage.
> Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil
> revealed the secret to the uninitiated in 2.6.17: now, what's the
> appropriate punishment for that?

Surely the punishment should be for writing hidden undocumented hacks in the first place!
I vote we just make him maintainer for the whole kernel - that will keep him so busy that he will never have a chance to do it again :-)

> --- 2.6.24-rc1/Documentation/filesystems/vfs.txt	2007-10-24 07:15:11.0 +0100
> +++ linux/Documentation/filesystems/vfs.txt	2007-10-24 08:42:07.0 +0100
> @@ -567,9 +567,7 @@ struct address_space_operations {
>        If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
>        try too hard if there are problems, and may choose to write out
>        other pages from the mapping if that is easier (e.g. due to
> -      internal dependencies). If it chooses not to start writeout, it
> -      should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> -      calling ->writepage on that page.
> +      internal dependencies).

It seems that the new requirement is that if the address_space chooses not to write out the page, it should now call SetPageActive(). If that is the case, I think it should be explicit in the documentation - please?

NeilBrown
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thu, 25 Oct 2007, Erez Zadok wrote:
>
> On a related note, I would just love to get rid of calling the lower
> ->writepage in unionfs b/c I can't even tell if I have a lower page to use
> all the time. I'd prefer to call vfs_write() if I can, but I'll need a
> struct file, or at least a dentry.

Why do you want to do that? You gave a good reason why it's easier for ecryptfs, but I doubt it's robust. The higher the level you choose to use, the harder to guarantee it won't deadlock. Or that's my gut feeling anyway.

It's several years since I've thought about such issues: just because I came into this knowing about shmem_writepage, is perhaps not a good reason to choose me as advisor!

Hugh
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Erez Zadok wrote:
>
> What's the precise semantics of AOP_WRITEPAGE_ACTIVATE?

Sigh - not at you, at it! It's a secret that couldn't be kept secret, a hack for tmpfs reclaim, let's just look forward to it going away.

> Is it considered an error or not?

No, it's definitely not an error. It's a private note from tmpfs (or ramdisk) to vmscan, saying "don't waste your time coming back to me with this page until you have to, please move on to another more likely to be freeable".

> If it's an error, then I usually feel that it's important for
> a stacked f/s to return that error indication upwards.

Indeed, but this is not an error. Remember, neither ramdisk nor tmpfs is stable storage: okay, tmpfs can go out to disk by using swap, but that's not stable storage - it's not reconstituted after reboot. (If there's an error in writing to swap, well, that's a different issue; and there's few filesystems where such an I/O error would be reported from ->writepage.)

> The unionfs page and the lower page are somewhat tied together, at least
> logically. For unionfs's page to be considered to have been written
> successfully, the lower page has to be written successfully. So again, if
> the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs
> page to have been written successfully or not?

Consider it written successfully. (What does written mean with tmpfs? it means a page can be freed, it doesn't mean the data is forever safe.)

> If I don't return
> AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may
> never get flushed out?

Things should work better if you don't return AOP_WRITEPAGE_ACTIVATE. If you mark your page as clean and successfully written, vmscan will be able to free it. If needed, we can get the data back from the lower page on demand, but meanwhile a page has been freed, which is what vmscan reclaim is all about.
(But of course, in the case where you couldn't get hold of a page for the lower, you must redirty yours before returning.)

> > unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
> > find_lock_page: that case may be appropriate. Though I don't really
> > understand it: seems dangerous to be relying upon the lower level page
> > just happening to be there already. Isn't memory pressure then likely
> > to clog up with lots of upper level dirty pages which cannot get
> > written out to the lower level?
>
> Based on vfs.txt (which perhaps should be revised :-), I was trying to do
> the best I can to ensure that no data is lost if the current page cannot be
> written out to the lower f/s.
>
> I used to do grab_cache_page() before, but that caused problems: writepage
> is not the right place to _increase_ memory pressure by allocating a new
> page...

Yes, but just hoping the lower page will be there, and doing nothing to encourage it to become there, sounds an even poorer strategy to me. It's not easy, I know.

Your position reminds me of the loop driver (drivers/block/loop.c), which has long handled this situation (with great success, though I doubt an absolute guarantee) by taking __GFP_IO|__GFP_FS off the mapping_gfp_mask of the underlying file: look for gfp_mask in loop_set_fd() (and I think ignore do_loop_switch(), that's new to me and seems to be for a very special case).

I grepped for gfp in unionfs, and there seems to be nothing: I doubt you can be robust under memory pressure without doing something about that. If you can take __GFP_IO|__GFP_FS off the lower mapping (just while in unionfs_writepage, or longer term? what locking needed?), then you should be able to go back to using grab_cache_page().

> One solution I thought of is do what ecryptfs does: keep an open struct file
> in my inode and call vfs_write(), but I don't see that as a significantly
> cleaner/better solution.

I agree with you.
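The loop-driver trick Hugh points at amounts to saving the lower mapping's gfp mask and clearing __GFP_IO|__GFP_FS in it while bound, then restoring the saved mask on unbind. Below is a hedged user-space model of that idea, with made-up flag values and invented `model_*` names standing in for the real kernel structures:

```c
#include <assert.h>

/* Stand-in gfp bits: hypothetical values, not the kernel's. */
#define MODEL_GFP_IO  (1u << 0)
#define MODEL_GFP_FS  (1u << 1)

struct model_mapping {
	unsigned int gfp_mask;		/* plays the role of mapping_gfp_mask() */
};

struct model_stack {
	struct model_mapping *lower;
	unsigned int old_gfp_mask;	/* saved mask, restored on unbind */
};

/*
 * Bind to a lower mapping: strip __GFP_IO|__GFP_FS so allocations
 * against it can never recurse into I/O or filesystem reclaim while
 * the stacked layer is writing through it.
 */
static void model_bind(struct model_stack *s, struct model_mapping *m)
{
	s->lower = m;
	s->old_gfp_mask = m->gfp_mask;
	m->gfp_mask = s->old_gfp_mask & ~(MODEL_GFP_IO | MODEL_GFP_FS);
}

/* Unbind: restore the original mask. */
static void model_unbind(struct model_stack *s)
{
	s->lower->gfp_mask = s->old_gfp_mask;
}
```

This is a sketch of the technique, not loop.c itself; whether unionfs would hold the mask stripped only inside unionfs_writepage or for the lifetime of the branch is exactly the open question Hugh raises.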
> (BTW, ecrypfts kinda had to go for vfs_write b/c
> it changes the data size and content of what it writes below; unionfs is
> simpler in that manner b/c it needs to write the same data to the lower file
> at the same offset.)

Ah, yes.

> Another idea we've experimented with before is "page pointer flipping." In
> writepage, we temporarily set the page->mapping->host to the lower_inode;
> then we call the lower writepage with OUR page; then fix back the
> page->mapping->host to the upper inode. This had two benefits: first we can
> guarantee that we always have a page to write below, and second we don't
> need to keep both upper and lower pages (reduces memory pressure). Before
> we did this page pointer flipping, we verified that the page is locked so no
> other user could be written the page->mapping->host in this transient state,
> and we ensured that no lower f/s was somehow caching the temporarily changed
> value of page->mapping->host for later use. But, mucking with the pointers
> in this manner is kinda ugly, to say the least. Still, I'd love to find a
> clean and simple way that two layers can share the same struct page and
> cleanly pass the
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, Hugh Dickins writes:
> >
> > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > a special code to get the LRU rotation right, not useful elsewhere.
> > Though Documentation/filesystems/vfs.txt does imply wider use.
>
> Yes, based on vfs.txt I figured unionfs should return
> AOP_WRITEPAGE_ACTIVATE.

unionfs_writepage returns it in two different cases: when it can't find the underlying page; and when the underlying writepage returns it. I'd say it's wrong to return it in both cases.

In the first case, you don't really want your page put back to the head of the active list, you want to come back to try it again quite soon (I think): so you should just redirty and unlock and pretend success.

ramdisk uses A_W_A because none of its pages will ever become freeable (and comment points out it'd be better if they weren't even on the LRUs - I think several people have recently been putting forward patches to keep such timewasters off the LRUs).

shmem uses A_W_A when there's no swap (left), or when the underlying shm is marked as locked in memory: in each case, best to move on to look for other pages to swap out. (But I'm not quite convincing myself that the temporarily out-of-swap case is different from yours above.) It's about fixing some horrid busy loops where vmscan kept going over the same hopeless pages repeatedly, instead of moving on to better candidates. Oh, there's a third case, when move_to_swap_cache fails: that's rare, and I think I was just too lazy to separate them.
In your second case, I fail to see why the unionfs level should mimic the lower level: you've successfully copied data and marked the lower level pages as dirty, vmscan will come back to those in due course, but it's just a waste of time for it to come back to the unionfs pages again - isn't it?

> But, now that unionfs has ->writepages which won't
> even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I
> no longer need unionfs_writepage to bother checking for
> AOP_WRITEPAGE_ACTIVATE, or even return it up?

unionfs_writepages handles the sync/msync/fsync leaking of A_W_A to userspace issue, as does Pekka & Andrew's patch to write_cache_pages, as does my patch to shmem_writepage. And I'm contending that unionfs_writepage should in no case return A_W_A up.

But so long as A_W_A is still defined, unionfs_writepage does still need to check for it after calling the lower level ->writepage (because it needs to do the missing unlock_page): unionfs_writepages prevents unionfs_writepage being called on the normal writeout path, but it's still getting called by vmscan under memory pressure.

(I'm in the habit of saying "vmscan" rather than naming the functions in question, because every few months someone restructures that file and changes their names. I exaggerate, but it's happened often enough.)

> But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting
> BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to
> handle AOP_WRITEPAGE_ACTIVATE in ->writepage, right?

For so long as AOP_WRITEPAGE_ACTIVATE exists, unionfs_writepage needs to check for it coming from the lower level ->writepage, as I said above. But your/Pekka's unionfs_writepages doesn't need to worry about it at all, because Andrew/Pekka's write_cache_pages fix prevents it leaking up in the !reclaim case (as does my shmem_writepage fix): please remove that AOP_WRITEPAGE_ACTIVATE comment from unionfs_writepages.
Hugh
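Hugh's advice to the stacked level above condenses into a small decision table. The sketch below is a hedged model, not unionfs code: the names are invented, and AOP_WRITEPAGE_ACTIVATE is given a stand-in value rather than the one in that era's include/linux/fs.h.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_AOP_WRITEPAGE_ACTIVATE 0x80000	/* stand-in value */

struct wp_decision {
	int ret;		/* what the stacked ->writepage returns */
	bool redirty_upper;	/* no lower page: redirty, unlock, retry later */
	bool unlock_lower;	/* lower returned A_W_A with its page still locked */
};

/*
 * Models the rules argued above: whatever the lower ->writepage says,
 * the stacked level never lets AOP_WRITEPAGE_ACTIVATE escape upward.
 */
static struct wp_decision stacked_writepage(bool have_lower_page, int lower_ret)
{
	struct wp_decision d = { 0, false, false };

	if (!have_lower_page) {
		/* first case: redirty and unlock, pretend success */
		d.redirty_upper = true;
		return d;
	}
	if (lower_ret == MODEL_AOP_WRITEPAGE_ACTIVATE) {
		/* A_W_A skipped the unlock, so the stacked level must do it */
		d.unlock_lower = true;
	} else if (lower_ret < 0) {
		/* genuine I/O errors do propagate */
		d.ret = lower_ret;
	}
	return d;
}
```

Either way the stacked ->writepage reports success for the A_W_A case, leaving vmscan free to reclaim the upper page while the lower page stays dirty and retried in due course.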
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
That's a nice historical review, Hugh, of how we got into this mess we're in now -- it all starts with good intentions. :-)

On a related note, I would just love to get rid of calling the lower ->writepage in unionfs b/c I can't even tell if I have a lower page to use all the time. I'd prefer to call vfs_write() if I can, but I'll need a struct file, or at least a dentry. What ecryptfs does is store a struct file inside its inode, so it can use it later in ->writepage to call vfs_write on the lower f/s. And Unionfs may have to go in that direction too, but this trick is not terribly clean -- storing a file inside an inode.

I realize that the calling path to ->writepage doesn't have a file/dentry any more, but if we're considering larger changes to the writepage related code, can we perhaps consider passing a file or dentry to ->writepage (same as commit_write, perhaps).

Thanks,
Erez.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Pekka Enberg wrote:
> On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > a special code to get the LRU rotation right, not useful elsewhere.
> > Though Documentation/filesystems/vfs.txt does imply wider use.
> >
> > I think this is where people use the phrase "go figure" ;)
>
> Heh. As far as I can tell, the implication of "wider use" was added by
> Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
> VFS documentation", so perhaps he might know? Neil?

I take as gospel this extract from Andrew's original 2.5.52 comment:

    So the locking rules for writepage() are unchanged. They are:
    - Called with the page locked
    - Returns with the page unlocked
    - Must redirty the page itself if it wasn't all written.

    But there is a new, special, hidden, undocumented, secret hack
    for tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM
    to move the page to the active list. The page must be kept locked
    in this one case.

Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew stole his own secret and used it when concocting ramdisk_writepage. Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil revealed the secret to the uninitiated in 2.6.17: now, what's the appropriate punishment for that?

In the full 2.5.52 comment, Andrew explains how prior to this secret code, we used fail_writepage, which in the memory pressure case did an activate_page, with the intention of moving the page to the active list - but that didn't actually work, because the page is off the LRUs at this point, being passed around between pagevecs. I've always preferred the way it was originally trying to do it, which seems clearer and less error-prone than having a special code which people then have to worry about.
Here's the patch I'd like to present in due course (against 2.6.24-rc1, so unionfs absent): tmpfs and ramdisk simply SetPageActive for this case (and go back to obeying the usual unlocking rule for writepage), vmscan.c observe and act accordingly. But I've not tested it at all (well, I've run with it in, but not actually going down the paths in question): it may suffer from something silly like the original fail_writepage.

Plus I might be persuaded into making inc_zone_page_state(page, NR_VMSCAN_WRITE) conditional on !PageActive(page), just to produce the same stats as before (though they don't make a lot of sense, counting other non-writes as writes). And would it need a deprecation phase?

Hugh

 Documentation/filesystems/Locking |    6 +-
 Documentation/filesystems/vfs.txt |    4 +---
 drivers/block/rd.c                |    5 ++---
 include/linux/fs.h                |   10 --
 mm/migrate.c                      |    5 ++---
 mm/page-writeback.c               |    4
 mm/shmem.c                        |   11 ---
 mm/vmscan.c                       |   17 ++---
 8 files changed, 20 insertions(+), 42 deletions(-)

--- 2.6.24-rc1/Documentation/filesystems/Locking	2007-10-24 07:15:11.0 +0100
+++ linux/Documentation/filesystems/Locking	2007-10-24 08:42:07.0 +0100
@@ -228,11 +228,7 @@ If the filesystem is called for sync the
 in-progress I/O and then start new I/O.
 
 The filesystem should unlock the page synchronously, before returning to the
-caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
-value. WRITEPAGE_ACTIVATE means that page cannot really be written out
-currently, and VM should stop calling ->writepage() on this page for some
-time. VM does this by moving page to the head of the active list, hence the
-name.
+caller.
 
 Unless the filesystem is going to redirty_page_for_writepage(), unlock
 the page and return zero, writepage *must* run set_page_writeback()
 against the page,

--- 2.6.24-rc1/Documentation/filesystems/vfs.txt	2007-10-24 07:15:11.0 +0100
+++ linux/Documentation/filesystems/vfs.txt	2007-10-24 08:42:07.0 +0100
@@ -567,9 +567,7 @@ struct address_space_operations {
        If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
        try too hard if there are problems, and may choose to write out
        other pages from the mapping if that is easier (e.g. due to
-       internal dependencies). If it chooses not to start writeout, it
-       should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
-       calling ->writepage on that page.
+       internal dependencies).
 
        See the file "Locking" for more details.

--- 2.6.24-rc1/drivers/block/rd.c	2007-10-24 07:15:23.0 +0100
+++ linux/drivers/block/rd.c	2007-10-24 08:42:07.0 +0100
@@ -152,8 +152,7 @@ static int ramdisk_commit_write(struct f
 
 /*
  * ->writepage to the blockdev's mapping has to redirty the page so that the
- * VM doesn't go and steal
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Pekka Enberg wrote: On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote: Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. I think this is where people use the phrase go figure ;) Heh. As far as I can tell, the implication of wider use was added by Neil in commit 341546f5ad6fce584531f744853a5807a140f2a9 Update some VFS documentation, so perhaps he might know? Neil? I take as gospel this extract from Andrew's original 2.5.52 comment: So the locking rules for writepage() are unchanged. They are: - Called with the page locked - Returns with the page unlocked - Must redirty the page itself if it wasn't all written. But there is a new, special, hidden, undocumented, secret hack for tmpfs: writepage may return WRITEPAGE_ACTIVATE to tell the VM to move the page to the active list. The page must be kept locked in this one case. Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew stole his own secret and used it when concocting ramdisk_writepage. Oh, and NFS made some kind of use of it in 2.6.6 only. Then Neil revealed the secret to the uninitiated in 2.6.17: now, what's the appropriate punishment for that? In the full 2.5.52 comment, Andrew explains how prior to this secret code, we used fail_writepage, which in the memory pressure case did an activate_page, with the intention of moving the page to the active list - but that didn't actually work, because the page is off the LRUs at this point, being passed around between pagevecs. I've always preferred the way it was originally trying to do it, which seems clearer and less error-prone than having a special code which people then have to worry about. 
Here's the patch I'd like to present in due course (against 2.6.24-rc1, so unionfs absent): tmpfs and ramdisk simply SetPageActive for this case (and go back to obeying the usual unlocking rule for writepage), vmscan.c observe and act accordingly. But I've not tested it at all (well, I've run with it in, but not actually going down the paths in question): it may suffer from something silly like the original fail_writepage. Plus I might be persuaded into making inc_zone_page_state(page, NR_VMSCAN_WRITE) conditional on !PageActive(page), just to produce the same stats as before (though they don't make a lot of sense, counting other non-writes as writes). And would it need a deprecation phase? Hugh Documentation/filesystems/Locking |6 +- Documentation/filesystems/vfs.txt |4 +--- drivers/block/rd.c|5 ++--- include/linux/fs.h| 10 -- mm/migrate.c |5 ++--- mm/page-writeback.c |4 mm/shmem.c| 11 --- mm/vmscan.c | 17 ++--- 8 files changed, 20 insertions(+), 42 deletions(-) --- 2.6.24-rc1/Documentation/filesystems/Locking2007-10-24 07:15:11.0 +0100 +++ linux/Documentation/filesystems/Locking 2007-10-24 08:42:07.0 +0100 @@ -228,11 +228,7 @@ If the filesystem is called for sync the in-progress I/O and then start new I/O. The filesystem should unlock the page synchronously, before returning to the -caller, unless -writepage() returns special WRITEPAGE_ACTIVATE -value. WRITEPAGE_ACTIVATE means that page cannot really be written out -currently, and VM should stop calling -writepage() on this page for some -time. VM does this by moving page to the head of the active list, hence the -name. +caller. 
Unless the filesystem is going to redirty_page_for_writepage(), unlock the page and return zero, writepage *must* run set_page_writeback() against the page, --- 2.6.24-rc1/Documentation/filesystems/vfs.txt2007-10-24 07:15:11.0 +0100 +++ linux/Documentation/filesystems/vfs.txt 2007-10-24 08:42:07.0 +0100 @@ -567,9 +567,7 @@ struct address_space_operations { If wbc-sync_mode is WB_SYNC_NONE, -writepage doesn't have to try too hard if there are problems, and may choose to write out other pages from the mapping if that is easier (e.g. due to - internal dependencies). If it chooses not to start writeout, it - should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep - calling -writepage on that page. + internal dependencies). See the file Locking for more details. --- 2.6.24-rc1/drivers/block/rd.c 2007-10-24 07:15:23.0 +0100 +++ linux/drivers/block/rd.c2007-10-24 08:42:07.0 +0100 @@ -152,8 +152,7 @@ static int ramdisk_commit_write(struct f /* * -writepage to the blockdev's mapping has to redirty the page so that the - * VM doesn't go and steal
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Erez Zadok wrote: In message [EMAIL PROTECTED], Hugh Dickins writes: Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. Yes, based on vfs.txt I figured unionfs should return AOP_WRITEPAGE_ACTIVATE. unionfs_writepage returns it in two different cases: when it can't find the underlying page; and when the underlying writepage returns it. I'd say it's wrong to return it in both cases. In the first case, you don't really want your page put back to the head of the active list, you want to come back to try it again quite soon (I think): so you should just redirty and unlock and pretend success. ramdisk uses A_W_A because none of its pages will ever become freeable (and comment points out it'd be better if they weren't even on the LRUs - I think several people have recently been putting forward patches to keep such timewasters off the LRUs). shmem uses A_W_A when there's no swap (left), or when the underlying shm is marked as locked in memory: in each case, best to move on to look for other pages to swap out. (But I'm not quite convincing myself that the temporarily out-of-swap case is different from yours above.) It's about fixing some horrid busy loops where vmscan kept going over the same hopeless pages repeatedly, instead of moving on to better candidates. Oh, there's a third case, when move_to_swap_cache fails: that's rare, and I think I was just too lazy to separate them. In your second case, I fail to see why the unionfs level should mimic the lower level: you've successfully copied data and marked the lower level pages as dirty, vmscan will come back to those in due course, but it's just a waste of time for it to come back to the unionfs pages again - isn't it? 
But, now that unionfs has -writepages which won't even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I no longer need unionfs_writepage to bother checking for AOP_WRITEPAGE_ACTIVATE, or even return it up? unionfs_writepages handles the sync/msync/fsync leaking of A_W_A to userspace issue, as does Pekka Andrew's patch to write_cache_pages, as does my patch to shmem_writepage. And I'm contending that unionfs_writepage should in no case return A_W_A up. But so long as A_W_A is still defined, unionfs_writepage does still need to check for it after calling the lower level -writepage (because it needs to do the missing unlock_page): unionfs_writepages prevents unionfs_writepage being called on the normal writeout path, but it's still getting called by vmscan under memory pressure. (I'm in the habit of saying vmscan rather than naming the functions in question, because every few months someone restructures that file and changes their names. I exaggerate, but it's happened often enough.) But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to handle AOP_WRITEPAGE_ACTIVATE in -writepage, right? For so long as AOP_WRITEPAGE_ACTIVATE exists, unionfs_writepage needs to check for it coming from the lower level -writepage, as I said above. But your/Pekka's unionfs_writepages doesn't need to worry about it at all, because Andrew/Pekka's write_cache_pages fix prevents it leaking up in the !reclaim case (as does my shmem_writepage fix): please remove that AOP_WRITEPAGE_ACTIVATE comment from unionfs_writepages. Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
That's a nice historical review, Huge, of how got into these mess we're in now -- it all starts with good intentions. :-) On a related note, I would just love to get rid of calling the lower -writepage in unionfs b/c I can't even tell if I have a lower page to use all the time. I'd prefer to call vfs_write() if I can, but I'll need a struct file, or at least a dentry. What ecryptfs does is store a struct file inside it's inode, so it can use it later in -writepage to call vfs_write on the lower f/s. And Unionfs may have to go in that direction too, but this trick is not terribly clean -- storing a file inside an inode. I realize that the calling path to -writepage doesn't have a file/dentry any more, but if we're considering larger changes to the writepage related code, can we perhaps consider passing a file or dentry to writepage (same as commit_write, perhaps). Thanks, Erez. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 22 Oct 2007, Erez Zadok wrote:
> What's the precise semantics of AOP_WRITEPAGE_ACTIVATE?

Sigh - not at you, at it! It's a secret that couldn't be kept secret, a hack for tmpfs reclaim, let's just look forward to it going away.

> Is it considered an error or not?

No, it's definitely not an error. It's a private note from tmpfs (or ramdisk) to vmscan, saying "don't waste your time coming back to me with this page until you have to, please move on to another more likely to be freeable".

> If it's an error, then I usually feel that it's important for a stacked
> f/s to return that error indication upwards.

Indeed, but this is not an error. Remember, neither ramdisk nor tmpfs is stable storage: okay, tmpfs can go out to disk by using swap, but that's not stable storage - it's not reconstituted after reboot. (If there's an error in writing to swap, well, that's a different issue; and there are few filesystems where such an I/O error would be reported from ->writepage.)

> The unionfs page and the lower page are somewhat tied together, at least
> logically. For unionfs's page to be considered to have been written
> successfully, the lower page has to be written successfully. So again,
> if the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my
> unionfs page to have been written successfully or not?

Consider it written successfully. (What does "written" mean with tmpfs? it means a page can be freed, it doesn't mean the data is forever safe.)

> If I don't return AOP_WRITEPAGE_ACTIVATE up, can there be any chance
> that some vital data may never get flushed out?

Things should work better if you don't return AOP_WRITEPAGE_ACTIVATE. If you mark your page as clean and successfully written, vmscan will be able to free it. If needed, we can get the data back from the lower page on demand, but meanwhile a page has been freed, which is what vmscan reclaim is all about. (But of course, in the case where you couldn't get hold of a page for the lower, you must redirty yours before returning.)

> > unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
> > find_lock_page: that case may be appropriate. Though I don't really
> > understand it: seems dangerous to be relying upon the lower level page
> > just happening to be there already. Isn't memory pressure then likely
> > to clog up with lots of upper level dirty pages which cannot get
> > written out to the lower level?
>
> Based on vfs.txt (which perhaps should be revised :-), I was trying to
> do the best I can to ensure that no data is lost if the current page
> cannot be written out to the lower f/s. I used to do grab_cache_page()
> before, but that caused problems: writepage is not the right place to
> _increase_ memory pressure by allocating a new page...

Yes, but just hoping the lower page will be there, and doing nothing to encourage it to become there, sounds an even poorer strategy to me. It's not easy, I know. Your position reminds me of the loop driver (drivers/block/loop.c), which has long handled this situation (with great success, though I doubt an absolute guarantee) by taking __GFP_IO|__GFP_FS off the mapping_gfp_mask of the underlying file: look for gfp_mask in loop_set_fd() (and I think ignore do_loop_switch(), that's new to me and seems to be for a very special case).

I grepped for gfp in unionfs, and there seems to be nothing: I doubt you can be robust under memory pressure without doing something about that. If you can take __GFP_IO|__GFP_FS off the lower mapping (just while in unionfs_writepage, or longer term? what locking needed?), then you should be able to go back to using grab_cache_page().

> One solution I thought of is to do what ecryptfs does: keep an open
> struct file in my inode and call vfs_write(), but I don't see that as a
> significantly cleaner/better solution.

I agree with you.

> (BTW, ecryptfs kinda had to go for vfs_write b/c it changes the data
> size and content of what it writes below; unionfs is simpler in that
> manner b/c it needs to write the same data to the lower file at the
> same offset.)

Ah, yes.

> Another idea we've experimented with before is "page pointer flipping."
> In writepage, we temporarily set the page->mapping->host to the
> lower_inode; then we call the lower writepage with OUR page; then fix
> back the page->mapping->host to the upper inode. This had two benefits:
> first we can guarantee that we always have a page to write below, and
> second we don't need to keep both upper and lower pages (reduces memory
> pressure). Before we did this page pointer flipping, we verified that
> the page is locked so no other user could have written the
> page->mapping->host in this transient state, and we ensured that no
> lower f/s was somehow caching the temporarily changed value of
> page->mapping->host for later use. But, mucking with the pointers in
> this manner is kinda ugly, to say the least. Still, I'd love to find a
> clean and simple way that two layers can share the same struct page and
> cleanly pass the upper page to a lower f/s.
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thu, 25 Oct 2007, Erez Zadok wrote:
> On a related note, I would just love to get rid of calling the lower
> ->writepage in unionfs b/c I can't even tell if I have a lower page to
> use all the time. I'd prefer to call vfs_write() if I can, but I'll
> need a struct file, or at least a dentry.

Why do you want to do that? You gave a good reason why it's easier for ecryptfs, but I doubt it's robust. The higher the level you choose to use, the harder to guarantee it won't deadlock. Or that's my gut feeling anyway. It's several years since I've thought about such issues: just because I came into this knowing about shmem_writepage, is perhaps not a good reason to choose me as advisor!

Hugh
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thursday October 25, [EMAIL PROTECTED] wrote:
> On Mon, 22 Oct 2007, Pekka Enberg wrote:
> > On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
> > > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE.
> > > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it
> > > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's
> > > a special code to get the LRU rotation right, not useful elsewhere.
> > > Though Documentation/filesystems/vfs.txt does imply wider use.
> > >
> > > I think this is where people use the phrase "go figure" ;)
> >
> > Heh. As far as I can tell, the implication of "wider use" was added by
> > Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some
> > VFS documentation", so perhaps he might know? Neil?

I just read the code, tried to understand it, translated that understanding into English, and put that in vfs.txt. It is very possible that what I wrote didn't match the intention of the author, but it seemed to match the behaviour of the code.

The patch looks like it makes perfect sense to me. Before the change, ->writepage could return AOP_WRITEPAGE_ACTIVATE without unlocking the page, and this has precisely the effect of:

    ClearPageReclaim();  (if the call path was through pageout)
    SetPageActive();     (if the call was through shrink_page_list)
    unlock_page();

With the patch, the ->writepage method does the SetPageActive and the unlock_page, which on the whole seems cleaner. We seem to have lost a call to ClearPageReclaim - I don't know if that is significant.

> Special, hidden, undocumented, secret hack! Then in 2.6.7 Andrew stole
> his own secret and used it when concocting ramdisk_writepage. Oh, and
> NFS made some kind of use of it in 2.6.6 only. Then Neil revealed the
> secret to the uninitiated in 2.6.17: now, what's the appropriate
> punishment for that?

Surely the punishment should be for writing hidden undocumented hacks in the first place! I vote we just make him maintainer for the whole kernel - that will keep him so busy that he will never have a chance to do it again :-)

> --- 2.6.24-rc1/Documentation/filesystems/vfs.txt	2007-10-24 07:15:11.0 +0100
> +++ linux/Documentation/filesystems/vfs.txt	2007-10-24 08:42:07.0 +0100
> @@ -567,9 +567,7 @@ struct address_space_operations {
>        If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
>        try too hard if there are problems, and may choose to write out
>        other pages from the mapping if that is easier (e.g. due to
> -      internal dependencies). If it chooses not to start writeout, it
> -      should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
> -      calling ->writepage on that page.
> +      internal dependencies).

It seems that the new requirement is that if the address_space chooses not to write out the page, it should now call SetPageActive(). If that is the case, I think it should be explicit in the documentation - please?

NeilBrown
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes:

> Sorry for my delay, here are a few replies.
>
> > In unionfs_writepage() I tried to emulate as best possible what the lower
> > f/s will have returned to the VFS. Since tmpfs's ->writepage can return
> > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in
> > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE.
>
> I think that's inappropriate. Why should unionfs_writepage re-mark its
> page as dirty when the lower level does so? Unionfs has successfully
> done its write to the lower level, what the lower level then gets up to
> (writing then or not) is its own business: needn't be propagated upwards.

What's the precise semantics of AOP_WRITEPAGE_ACTIVATE? Is it considered an error or not? If it's an error, then I usually feel that it's important for a stacked f/s to return that error indication upwards.

The unionfs page and the lower page are somewhat tied together, at least logically. For unionfs's page to be considered to have been written successfully, the lower page has to be written successfully. So again, if the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs page to have been written successfully or not? If I don't return AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may never get flushed out?

Anyway, now that unionfs has ->writepages that won't bother calling the lower ->writepage for file systems with BDI_CAP_NO_WRITEBACK, the issue of AOP_WRITEPAGE_ACTIVATE in ->writepage may be less important.

> unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot
> find_lock_page: that case may be appropriate. Though I don't really
> understand it: seems dangerous to be relying upon the lower level page
> just happening to be there already. Isn't memory pressure then likely
> to clog up with lots of upper level dirty pages which cannot get
> written out to the lower level?
Based on vfs.txt (which perhaps should be revised :-), I was trying to do the best I can to ensure that no data is lost if the current page cannot be written out to the lower f/s. I used to do grab_cache_page() before, but that caused problems: writepage is not the right place to _increase_ memory pressure by allocating a new page...

One solution I thought of is to do what ecryptfs does: keep an open struct file in my inode and call vfs_write(), but I don't see that as a significantly cleaner/better solution. (BTW, ecryptfs kinda had to go for vfs_write b/c it changes the data size and content of what it writes below; unionfs is simpler in that manner b/c it needs to write the same data to the lower file at the same offset.)

Another idea we've experimented with before is "page pointer flipping." In writepage, we temporarily set the page->mapping->host to the lower_inode; then we call the lower writepage with OUR page; then fix back the page->mapping->host to the upper inode. This had two benefits: first we can guarantee that we always have a page to write below, and second we don't need to keep both upper and lower pages (reduces memory pressure). Before we did this page pointer flipping, we verified that the page is locked so no other user could have written the page->mapping->host in this transient state, and we ensured that no lower f/s was somehow caching the temporarily changed value of page->mapping->host for later use. But, mucking with the pointers in this manner is kinda ugly, to say the least. Still, I'd love to find a clean and simple way that two layers can share the same struct page and cleanly pass the upper page to a lower f/s.

If you've got suggestions on how I can handle unionfs_write more cleanly, or comments on the above possibilities, I'd love to hear them.

> > Should I be doing something different when unionfs stacks on top of tmpfs?
>
> I think not.
>
> > (BTW, this is probably also relevant to ecryptfs.)
> You're both agreed on that, but I don't see how: ecryptfs writes the
> lower level via vfs_write, it's not using the lower level's writepage,
> is it?

Yup. ecryptfs no longer does that: it recently changed things and now it stores an open struct file in its inode, so it can always pass the file to vfs_write. This nicely avoids calling the lower writepage, but one has to keep an open file for every inode. Neither of the solutions currently employed by unionfs and ecryptfs seems really satisfactory (clean and efficient).

> Hugh

Thanks,
Erez.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Hugh Dickins writes: > On Mon, 15 Oct 2007, Pekka Enberg wrote: > > > > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that > > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for > > !wbc->for_reclaim case which would explain why we haven't hit this bug > > before. Hugh, Andrew? > > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's > a special code to get the LRU rotation right, not useful elsewhere. > Though Documentation/filesystems/vfs.txt does imply wider use. Yes, based on vfs.txt I figured unionfs should return AOP_WRITEPAGE_ACTIVATE. But, now that unionfs has ->writepages which won't even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I no longer need unionfs_writepage to bother checking for AOP_WRITEPAGE_ACTIVATE, or even return it up? But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to handle AOP_WRITEPAGE_ACTIVATE in ->writepage, right? > I think this is where people use the phrase "go figure" ;) > > Hugh Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On Mon, 15 Oct 2007, Pekka Enberg wrote: > > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that > > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for > > !wbc->for_reclaim case which would explain why we haven't hit this bug > > before. Hugh, Andrew? On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. > Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it > if !wbc->for_reclaim. I contend that shmem shouldn't either: it's > a special code to get the LRU rotation right, not useful elsewhere. > Though Documentation/filesystems/vfs.txt does imply wider use. > > I think this is where people use the phrase "go figure" ;) Heh. As far as I can tell, the implication of "wider use" was added by Neil in commit "341546f5ad6fce584531f744853a5807a140f2a9 Update some VFS documentation", so perhaps he might know? Neil? Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > I don't disagree with your unionfs_writepages patch, Pekka, but I think > it should be viewed as an optimization (don't waste time trying to write > a group of pages when we know that nothing will be done) rather than as > essential. Ok, so tmpfs needs your fix still. On 10/22/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE > > from being returned to userland. I guess we still need it, b/c even with > > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to > > the VFS and we need to ensure that doesn't "leak" outside the kernel. > > Can it now? Current git has a patch from Andrew which bears a striking > resemblance to that from Pekka, stopping the leak from write_cache_pages. I don't think it can, it looks ok now. Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 15 Oct 2007, Pekka Enberg wrote: > > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for > !wbc->for_reclaim case which would explain why we haven't hit this bug > before. Hugh, Andrew? Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc->for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. I think this is where people use the phrase "go figure" ;) Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 14 Oct 2007, Erez Zadok wrote: > In message <[EMAIL PROTECTED]>, Pekka J Enberg writes: > > > > Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be > > calling unionfs_writepage() _at all_ if the lower mapping has > > BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally > > untested patch below? ... I don't disagree with your unionfs_writepages patch, Pekka, but I think it should be viewed as an optimization (don't waste time trying to write a group of pages when we know that nothing will be done) rather than as essential. Prior to unionfs's own use of AOP_WRITEPAGE_ACTIVATE, there have only been ramdisk and shmem generating it. ramdisk is careful only to return it in the wbc->for_reclaim case: I think (as in the patch I sent out before) shmem now ought to do so too for safety. Back in 2.4 days it was reasonable to assume that ->writepage would only get called from certain places, but things move faster nowadays, and the unionfs example shows others are liable to start ab/using it. I'll send Andrew that patch tomorrow (it's simple enough, but I'd like at least to try to reproduce the page_mapped bug first). > > Pekka, with a small change to your patch (to handle time-based cache > coherency), your patch worked well and passed all my tests. Thanks. > > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE > from being returned to userland. I guess we still need it, b/c even with > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to > the VFS and we need to ensure that doesn't "leak" outside the kernel. Can it now? Current git has a patch from Andrew which bears a striking resemblance to that from Pekka, stopping the leak from write_cache_pages. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Sorry for my delay, here are a few replies. On Sun, 14 Oct 2007, Erez Zadok wrote: > In message <[EMAIL PROTECTED]>, "Pekka Enberg" writes: > > > > However, I don't think the mapping_cap_writeback_dirty() check in > > __filemap_fdatawrite_range() works as expected when tmpfs is a lower > > mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability > > for unionfs mappings so do_fsync() will call write_cache_pages() that > > unconditionally invokes shmem_writepage() via unionfs_writepage(). > > Unless, of course, there's some other unionfs magic I am missing. Thanks, Pekka, yes that made a lot of sense. > > In unionfs_writepage() I tried to emulate as best possible what the lower > f/s will have returned to the VFS. Since tmpfs's ->writepage can return > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. I think that's inappropriate. Why should unionfs_writepage re-mark its page as dirty when the lower level does so? Unionfs has successfully done its write to the lower level, what the lower level then gets up to (writing then or not) is its own business: needn't be propagated upwards. The fewer places that supply AOP_WRITEPAGE_ACTIVATE the better. What I'd like most of all is to eliminate it, in favour of vmscan.c working out the condition for itself: but I've given that no thought, it may not be reasonable. unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot find_lock_page: that case may be appropriate. Though I don't really understand it: seems dangerous to be relying upon the lower level page just happening to be there already. Isn't memory pressure then likely to clog up with lots of upper level dirty pages which cannot get written out to the lower level? > > Should I be doing something different when unionfs stacks on top of tmpfs? I think not. > (BTW, this is probably also relevant to ecryptfs.) 
You're both agreed on that, but I don't see how: ecryptfs writes the lower level via vfs_write, it's not using the lower level's writepage, is it? Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Sorry for my delay, here are a few replies. On Sun, 14 Oct 2007, Erez Zadok wrote: In message [EMAIL PROTECTED], Pekka Enberg writes: However, I don't think the mapping_cap_writeback_dirty() check in __filemap_fdatawrite_range() works as expected when tmpfs is a lower mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability for unionfs mappings so do_fsync() will call write_cache_pages() that unconditionally invokes shmem_writepage() via unionfs_writepage(). Unless, of course, there's some other unionfs magic I am missing. Thanks, Pekka, yes that made a lot of sense. In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's -writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. I think that's inappropriate. Why should unionfs_writepage re-mark its page as dirty when the lower level does so? Unionfs has successfully done its write to the lower level, what the lower level then gets up to (writing then or not) is its own business: needn't be propagated upwards. The fewer places that supply AOP_WRITEPAGE_ACTIVATE the better. What I'd like most of all is to eliminate it, in favour of vmscan.c working out the condition for itself: but I've given that no thought, it may not be reasonable. unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot find_lock_page: that case may be appropriate. Though I don't really understand it: seems dangerous to be relying upon the lower level page just happening to be there already. Isn't memory pressure then likely to clog up with lots of upper level dirty pages which cannot get written out to the lower level? Should I be doing something different when unionfs stacks on top of tmpfs? I think not. (BTW, this is probably also relevant to ecryptfs.) 
You're both agreed on that, but I don't see how: ecryptfs writes the lower level via vfs_write, it's not using the lower level's writepage, is it? Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 14 Oct 2007, Erez Zadok wrote: In message [EMAIL PROTECTED], Pekka J Enberg writes: Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be calling unionfs_writepage() _at all_ if the lower mapping has BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally untested patch below? ... I don't disagree with your unionfs_writepages patch, Pekka, but I think it should be viewed as an optimization (don't waste time trying to write a group of pages when we know that nothing will be done) rather than as essential. Prior to unionfs's own use of AOP_WRITEPAGE_ACTIVATE, there have only been ramdisk and shmem generating it. ramdisk is careful only to return it in the wbc-for_reclaim case: I think (as in the patch I sent out before) shmem now ought to do so too for safety. Back in 2.4 days it was reasonable to assume that -writepage would only get called from certain places, but things move faster nowadays, and the unionfs example shows others are liable to start ab/using it. I'll send Andrew that patch tomorrow (it's simple enough, but I'd like at least to try to reproduce the page_mapped bug first). Pekka, with a small change to your patch (to handle time-based cache coherency), your patch worked well and passed all my tests. Thanks. So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE from being returned to userland. I guess we still need it, b/c even with your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to the VFS and we need to ensure that doesn't leak outside the kernel. Can it now? Current git has a patch from Andrew which bears a striking resemblance to that from Pekka, stopping the leak from write_cache_pages. Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Mon, 15 Oct 2007, Pekka Enberg wrote: I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that -writepage() will never return AOP_WRITEPAGE_ACTIVATE for !wbc-for_reclaim case which would explain why we haven't hit this bug before. Hugh, Andrew? Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. I think this is where people use the phrase go figure ;) Hugh - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote: I don't disagree with your unionfs_writepages patch, Pekka, but I think it should be viewed as an optimization (don't waste time trying to write a group of pages when we know that nothing will be done) rather than as essential. Ok, so tmpfs needs your fix still. On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote: So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE from being returned to userland. I guess we still need it, b/c even with your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to the VFS and we need to ensure that doesn't leak outside the kernel. Can it now? Current git has a patch from Andrew which bears a striking resemblance to that from Pekka, stopping the leak from write_cache_pages. I don't think it can, it looks ok now. Pekka - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On Mon, 15 Oct 2007, Pekka Enberg wrote: I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that -writepage() will never return AOP_WRITEPAGE_ACTIVATE for !wbc-for_reclaim case which would explain why we haven't hit this bug before. Hugh, Andrew? On 10/22/07, Hugh Dickins [EMAIL PROTECTED] wrote: Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. I think this is where people use the phrase go figure ;) Heh. As far as I can tell, the implication of wider use was added by Neil in commit 341546f5ad6fce584531f744853a5807a140f2a9 Update some VFS documentation, so perhaps he might know? Neil? Pekka - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message [EMAIL PROTECTED], Hugh Dickins writes: On Mon, 15 Oct 2007, Pekka Enberg wrote: I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that -writepage() will never return AOP_WRITEPAGE_ACTIVATE for !wbc-for_reclaim case which would explain why we haven't hit this bug before. Hugh, Andrew? Only ramdisk and shmem have been returning AOP_WRITEPAGE_ACTIVATE. Both of those set BDI_CAP_NO_WRITEBACK. ramdisk never returned it if !wbc-for_reclaim. I contend that shmem shouldn't either: it's a special code to get the LRU rotation right, not useful elsewhere. Though Documentation/filesystems/vfs.txt does imply wider use. Yes, based on vfs.txt I figured unionfs should return AOP_WRITEPAGE_ACTIVATE. But, now that unionfs has -writepages which won't even call the lower writepage if BDI_CAP_NO_WRITEBACK is on, then perhaps I no longer need unionfs_writepage to bother checking for AOP_WRITEPAGE_ACTIVATE, or even return it up? But, a future file system _could_ return AOP_WRITEPAGE_ACTIVATE w/o setting BDI_CAP_NO_WRITEBACK, right? In that case, unionfs will still need to handle AOP_WRITEPAGE_ACTIVATE in -writepage, right? I think this is where people use the phrase go figure ;) Hugh Erez. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message [EMAIL PROTECTED], Hugh Dickins writes: Sorry for my delay, here are a few replies. In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's -writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. I think that's inappropriate. Why should unionfs_writepage re-mark its page as dirty when the lower level does so? Unionfs has successfully done its write to the lower level, what the lower level then gets up to (writing then or not) is its own business: needn't be propagated upwards. What's the precise semantics of AOP_WRITEPAGE_ACTIVATE? Is it considered an error or not? If it's an error, then I usually feel that it's important for a stacked f/s to return that error indication upwards. The unionfs page and the lower page are somewhat tied together, at least logically. For unionfs's page to be considered to have been written successfully, the lower page has to be written successfully. So again, if the lower f/s returns AOP_WRITEPAGE_ACTIVATE, should I consider my unionfs page to have been written successfully or not? If I don't return AOP_WRITEPAGE_ACTIVATE up, can there be any chance that some vital data may never get flushed out? Anyway, now that unionfs has -writepages that won't bother calling -write for file systems with BDI_CAP_NO_WRITEBACK, the issue of AOP_WRITEPAGE_ACTIVATE in -writepage may be less important. unionfs_writepage also sets AOP_WRITEPAGE_ACTIVATE when it cannot find_lock_page: that case may be appropriate. Though I don't really understand it: seems dangerous to be relying upon the lower level page just happening to be there already. Isn't memory pressure then likely to clog up with lots of upper level dirty pages which cannot get written out to the lower level? 
Based on vfs.txt (which perhaps should be revised :-), I was trying to do the best I can to ensure that no data is lost if the current page cannot be written out to the lower f/s. I used to do grab_cache_page() before, but that caused problems: writepage is not the right place to _increase_ memory pressure by allocating a new page... One solution I thought of is do what ecryptfs does: keep an open struct file in my inode and call vfs_write(), but I don't see that as a significantly cleaner/better solution. (BTW, ecrypfts kinda had to go for vfs_write b/c it changes the data size and content of what it writes below; unionfs is simpler in that manner b/c it needs to write the same data to the lower file at the same offset.) Another idea we've experimented with before is page pointer flipping. In writepage, we temporarily set the page-mapping-host to the lower_inode; then we call the lower writepage with OUR page; then fix back the page-mapping-host to the upper inode. This had two benefits: first we can guarantee that we always have a page to write below, and second we don't need to keep both upper and lower pages (reduces memory pressure). Before we did this page pointer flipping, we verified that the page is locked so no other user could be written the page-mapping-host in this transient state, and we ensured that no lower f/s was somehow caching the temporarily changed value of page-mapping-host for later use. But, mucking with the pointers in this manner is kinda ugly, to say the least. Still, I'd love to find a clean and simple way that two layers can share the same struct page and cleanly pass the upper page to a lower f/s. If you've got suggestions how I can handle unionfs_write more cleanly, or comments on the above possibilities, I'd love to hear them. Should I be doing something different when unionfs stacks on top of tmpfs? I think not. (BTW, this is probably also relevant to ecryptfs.) 
> You're both agreed on that, but I don't see how: ecryptfs writes the lower > level via vfs_write, it's not using the lower level's writepage, is it? Yup. ecryptfs no longer does that: it recently changed things and now it stores an open struct file in its inode, so it can always pass the file to vfs_write. This nicely avoids calling the lower writepage, but one has to keep an open file for every inode. Neither of the solutions currently employed by unionfs and ecryptfs seems really satisfactory (clean and efficient). > Hugh Thanks, Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, "Pekka Enberg" writes: > Hi, > > On 10/15/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > > Pekka, with a small change to your patch (to handle time-based cache > > coherency), your patch worked well and passed all my tests. Thanks. > > > > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE > > from being returned to userland. I guess we still need it, b/c even with > > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to > > the VFS and we need to ensure that doesn't "leak" outside the kernel. > > I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that > ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for > !wbc->for_reclaim case which would explain why we haven't hit this bug > before. Hugh, Andrew? > > And btw, I think we need to fix ecryptfs too. Yes, ecryptfs needs this fix too (and probably a couple of other mmap fixes I've made to unionfs recently -- Mike Halcrow already knows :-) Of course, running ecryptfs on top of tmpfs is somewhat odd and uncommon; but with unionfs, users use tmpfs as the copyup branch very often. >Pekka Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi, On 10/15/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > Pekka, with a small change to your patch (to handle time-based cache > coherency), your patch worked well and passed all my tests. Thanks. > > So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE > from being returned to userland. I guess we still need it, b/c even with > your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to > the VFS and we need to ensure that doesn't "leak" outside the kernel. I wonder whether _not setting_ BDI_CAP_NO_WRITEBACK implies that ->writepage() will never return AOP_WRITEPAGE_ACTIVATE for !wbc->for_reclaim case which would explain why we haven't hit this bug before. Hugh, Andrew? And btw, I think we need to fix ecryptfs too. Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, Pekka J Enberg writes: > Hi Erez, > > On Sun, 14 Oct 2007, Erez Zadok wrote: > > In unionfs_writepage() I tried to emulate as best possible what the lower > > f/s will have returned to the VFS. Since tmpfs's ->writepage can return > > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in > > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. > > > > Should I be doing something different when unionfs stacks on top of tmpfs? > > (BTW, this is probably also relevant to ecryptfs.) > > Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be > calling unionfs_writepage() _at all_ if the lower mapping has > BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally > untested patch below? > > Pekka [...] Pekka, with a small change to your patch (to handle time-based cache coherency), your patch worked well and passed all my tests. Thanks. So now I wonder if we still need the patch to prevent AOP_WRITEPAGE_ACTIVATE from being returned to userland. I guess we still need it, b/c even with your patch, generic_writepages() can return AOP_WRITEPAGE_ACTIVATE back to the VFS and we need to ensure that doesn't "leak" outside the kernel. Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Erez, On Sun, 14 Oct 2007, Erez Zadok wrote: > In unionfs_writepage() I tried to emulate as best possible what the lower > f/s will have returned to the VFS. Since tmpfs's ->writepage can return > AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in > unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. > > Should I be doing something different when unionfs stacks on top of tmpfs? > (BTW, this is probably also relevant to ecryptfs.) Look at mm/filemap.c:__filemap_fdatawrite_range(). You shouldn't be calling unionfs_writepage() _at all_ if the lower mapping has BDI_CAP_NO_WRITEBACK capability set. Perhaps something like the totally untested patch below? Pekka --- fs/unionfs/mmap.c | 17 + 1 file changed, 17 insertions(+) Index: linux-2.6.23-rc8/fs/unionfs/mmap.c === --- linux-2.6.23-rc8.orig/fs/unionfs/mmap.c +++ linux-2.6.23-rc8/fs/unionfs/mmap.c @@ -17,6 +17,7 @@ * published by the Free Software Foundation. */ +#include #include "union.h" /* @@ -144,6 +145,21 @@ out: return err; } +static int unionfs_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *lower_inode; + struct inode *inode; + + inode = mapping->host; + lower_inode = unionfs_lower_inode(inode); + + if (!mapping_cap_writeback_dirty(lower_inode->i_mapping)) + return 0; + + return generic_writepages(mapping, wbc); +} + /* * readpage is called from generic_page_read and the fault handler. 
* If your file system uses generic_page_read for the read op, it @@ -371,6 +387,7 @@ out: struct address_space_operations unionfs_aops = { .writepage = unionfs_writepage, + .writepages = unionfs_writepages, .readpage = unionfs_readpage, .prepare_write = unionfs_prepare_write, .commit_write = unionfs_commit_write, - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
In message <[EMAIL PROTECTED]>, "Pekka Enberg" writes: > Hi Hugh, > > On Sat, 13 Oct 2007, Pekka Enberg wrote: > > Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() > > without unionfs even? > > On 10/14/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > > I believe not. Please do double-check my assertions, I've always found > > the _writepages paths rather twisty; but my belief (supported by the > > fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) > > in five years) is that tmpfs/shmem opts out of all of that with its > > .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, > > in shmem_backing_dev_info, which avoids all those _writepages avenues > > (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is > > just a subfunction of the _writepages. > > Thanks for the explanation, you're obviously correct. > > However, I don't think the mapping_cap_writeback_dirty() check in > __filemap_fdatawrite_range() works as expected when tmpfs is a lower > mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability > for unionfs mappings so do_fsync() will call write_cache_pages() that > unconditionally invokes shmem_writepage() via unionfs_writepage(). > Unless, of course, there's some other unionfs magic I am missing. > >Pekka In unionfs_writepage() I tried to emulate as best possible what the lower f/s will have returned to the VFS. Since tmpfs's ->writepage can return AOP_WRITEPAGE_ACTIVATE and re-mark its page as dirty, I did the same in unionfs: mark again my page as dirty, and return AOP_WRITEPAGE_ACTIVATE. Should I be doing something different when unionfs stacks on top of tmpfs? (BTW, this is probably also relevant to ecryptfs.) Thanks, Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On Sat, 13 Oct 2007, Pekka Enberg wrote: > Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() > without unionfs even? On 10/14/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > I believe not. Please do double-check my assertions, I've always found > the _writepages paths rather twisty; but my belief (supported by the > fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) > in five years) is that tmpfs/shmem opts out of all of that with its > .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, > in shmem_backing_dev_info, which avoids all those _writepages avenues > (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is > just a subfunction of the _writepages. Thanks for the explanation, you're obviously correct. However, I don't think the mapping_cap_writeback_dirty() check in __filemap_fdatawrite_range() works as expected when tmpfs is a lower mount for an unionfs mount. There's no BDI_CAP_NO_WRITEBACK capability for unionfs mappings so do_fsync() will call write_cache_pages() that unconditionally invokes shmem_writepage() via unionfs_writepage(). Unless, of course, there's some other unionfs magic I am missing. Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sat, 13 Oct 2007, Pekka Enberg wrote: > On 10/12/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > > But I keep suspecting that the answer might be the patch below (which > > rather follows what drivers/block/rd.c is doing). I'm especially > > worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned > > to userspace, bad enough in itself, you might be liable to hit that > > BUG_ON(page_mapped(page)). shmem_writepage does not expect to be > > called by anyone outside mm/vmscan.c, but unionfs can now get to it? > > Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() > without unionfs even? I believe not. Please do double-check my assertions, I've always found the _writepages paths rather twisty; but my belief (supported by the fact that we've not hit shmem_writepage's BUG_ON(page_mapped(page)) in five years) is that tmpfs/shmem opts out of all of that with its .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, in shmem_backing_dev_info, which avoids all those _writepages avenues (e.g. through bdi_cap_writeback_dirty tests), and write_cache_pages is just a subfunction of the _writepages. So, while I don't disagree with your patch to write_cache_pages (though it wasn't clear to me whether it should break from or continue the loop if it ever does meet an AOP_WRITEPAGE_ACTIVATE), I don't think that's really the root of the problem. Hugh - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Hugh, On 10/12/07, Hugh Dickins <[EMAIL PROTECTED]> wrote: > But I keep suspecting that the answer might be the patch below (which > rather follows what drivers/block/rd.c is doing). I'm especially > worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned > to userspace, bad enough in itself, you might be liable to hit that > BUG_ON(page_mapped(page)). shmem_writepage does not expect to be > called by anyone outside mm/vmscan.c, but unionfs can now get to it? Doesn't msync(2) get to it via mm/page-writeback.c:write_cache_pages() without unionfs even? Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thu, 11 Oct 2007, Ryan Finnie wrote: > On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > > shit. That's a nasty bug. Really userspace should be testing for -1, but > > the msync() library function should only ever return 0 or -1. > > > > Does this fix it? > > > > --- a/mm/page-writeback.c~a > > +++ a/mm/page-writeback.c > > @@ -850,8 +850,10 @@ retry: > > > > ret = (*writepage)(page, wbc, data); > > > > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > > unlock_page(page); > > + ret = 0; > > + } > > if (ret || (--(wbc->nr_to_write) <= 0)) > > done = 1; > > if (wbc->nonblocking && bdi_write_congested(bdi)) { > > _ > > > > Pekka Enberg replied with an identical patch a few days ago, but for > some reason the same condition flows up to msync as -1 EIO instead of > AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the > thread is below. Thanks. Each time I sit down to follow what's going on with writepage and unionfs and msync, I get distracted: I really haven't researched this properly. But I keep suspecting that the answer might be the patch below (which rather follows what drivers/block/rd.c is doing). I'm especially worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned to userspace, bad enough in itself, you might be liable to hit that BUG_ON(page_mapped(page)). shmem_writepage does not expect to be called by anyone outside mm/vmscan.c, but unionfs can now get to it? Please let us know if this patch does fix it: then I'll try harder to work out what goes on. 
Thanks, Hugh --- 2.6.23/mm/shmem.c 2007-10-09 21:31:38.0 +0100 +++ linux/mm/shmem.c2007-10-12 01:25:46.0 +0100 @@ -916,6 +916,11 @@ static int shmem_writepage(struct page * struct inode *inode; BUG_ON(!PageLocked(page)); + if (!wbc->for_reclaim) { + set_page_dirty(page); + unlock_page(page); + return 0; + } BUG_ON(page_mapped(page)); mapping = page->mapping; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > shit. That's a nasty bug. Really userspace should be testing for -1, but > the msync() library function should only ever return 0 or -1. > > Does this fix it? > > --- a/mm/page-writeback.c~a > +++ a/mm/page-writeback.c > @@ -850,8 +850,10 @@ retry: > > ret = (*writepage)(page, wbc, data); > > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > unlock_page(page); > + ret = 0; > + } > if (ret || (--(wbc->nr_to_write) <= 0)) > done = 1; > if (wbc->nonblocking && bdi_write_congested(bdi)) { > _ > Pekka Enberg replied with an identical patch a few days ago, but for some reason the same condition flows up to msync as -1 EIO instead of AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the thread is below. Thanks. Ryan On 10/7/07, Ryan Finnie <[EMAIL PROTECTED]> wrote: > On 10/7/07, Pekka J Enberg <[EMAIL PROTECTED]> wrote: > > On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > > > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > > > Therefore, some user programs fail, esp. if they're written such as > > > this: > > > ... > > It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid > > writeback of the page in the near future. I wonder if it's enough that we > > change the return value to zero from > > mm/page-writeback.c:write_cache_pages() in case we hit > > AOP_WRITEPAGE_ACTIVE... > > Doesn't appear to be enough. I can't figure out why (since it appears > write_cache_pages bubbles up directly to sys_msync), but with that > patch applied, in my test case[1], msync returns -1 EIO. However, > with the exact same kernel without that patch applied, msync returns > 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips > 524288 to 0, I can't figure out how it eventually returns -1 EIO. 
> > Ryan > > [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom, > standard livecd setup > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 7 Oct 2007 15:20:19 -0400 Erez Zadok <[EMAIL PROTECTED]> wrote: > According to vfs.txt, ->writepage() may return AOP_WRITEPAGE_ACTIVATE back > to the VFS/VM. Indeed some filesystems such as tmpfs can return > AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also > return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it. > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > Therefore, some user programs fail, esp. if they're written such as this: > > err = msync(...); > if (err != 0) > // fail > > They temporarily fixed the specific program in question (apt-get) to check > > if (err < 0) > // fail > > Is this a bug indeed, or are user programs supposed to handle > AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, what > should the kernel return: a zero, or an -errno (and which one)? > shit. That's a nasty bug. Really userspace should be testing for -1, but the msync() library function should only ever return 0 or -1. Does this fix it? --- a/mm/page-writeback.c~a +++ a/mm/page-writeback.c @@ -850,8 +850,10 @@ retry: ret = (*writepage)(page, wbc, data); - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { unlock_page(page); + ret = 0; + } if (ret || (--(wbc->nr_to_write) <= 0)) done = 1; if (wbc->nonblocking && bdi_write_congested(bdi)) { _ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Sun, 7 Oct 2007 15:20:19 -0400 Erez Zadok [EMAIL PROTECTED] wrote: According to vfs.txt, -writepage() may return AOP_WRITEPAGE_ACTIVATE back to the VFS/VM. Indeed some filesystems such as tmpfs can return AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it. Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. Therefore, some user programs fail, esp. if they're written such as this: err = msync(...); if (err != 0) // fail They temporarily fixed the specific program in question (apt-get) to check if (err 0) // fail Is this a bug indeed, or are user programs supposed to handle AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, what should the kernel return: a zero, or an -errno (and which one)? shit. That's a nasty bug. Really userspace should be testing for -1, but the msync() library function should only ever return 0 or -1. Does this fix it? --- a/mm/page-writeback.c~a +++ a/mm/page-writeback.c @@ -850,8 +850,10 @@ retry: ret = (*writepage)(page, wbc, data); - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { unlock_page(page); + ret = 0; + } if (ret || (--(wbc-nr_to_write) = 0)) done = 1; if (wbc-nonblocking bdi_write_congested(bdi)) { _ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > shit. That's a nasty bug. > > Really userspace should be testing for -1, but the msync() library function > should only ever return 0 or -1. > > Does this fix it? > > --- a/mm/page-writeback.c~a > +++ a/mm/page-writeback.c > @@ -850,8 +850,10 @@ retry: > ret = (*writepage)(page, wbc, data); > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > unlock_page(page); > + ret = 0; > + } > if (ret || (--(wbc->nr_to_write) <= 0)) > done = 1; > if (wbc->nonblocking && bdi_write_congested(bdi)) { > _ Pekka Enberg replied with an identical patch a few days ago, but for some reason the same condition flows up to msync as -1 EIO instead of AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the thread is below. Thanks. Ryan On 10/7/07, Ryan Finnie <[EMAIL PROTECTED]> wrote: > On 10/7/07, Pekka J Enberg <[EMAIL PROTECTED]> wrote: > > On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > > > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > > > Therefore, some user programs fail, esp. if they're written such as this: > > ... > > It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid > > writeback of the page in the near future. I wonder if it's enough that we > > change the return value to zero from > > mm/page-writeback.c:write_cache_pages() in case we hit AOP_WRITEPAGE_ACTIVE... > Doesn't appear to be enough. I can't figure out why (since it appears > write_cache_pages bubbles up directly to sys_msync), but with that > patch applied, in my test case[1], msync returns -1 EIO. However, > with the exact same kernel without that patch applied, msync returns > 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips > 524288 to 0, I can't figure out how it eventually returns -1 EIO.
> Ryan > [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom, > standard livecd setup - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On Thu, 11 Oct 2007, Ryan Finnie wrote: > On 10/11/07, Andrew Morton <[EMAIL PROTECTED]> wrote: > > shit. That's a nasty bug. > > > > Really userspace should be testing for -1, but the msync() library function > > should only ever return 0 or -1. > > > > Does this fix it? > > > > --- a/mm/page-writeback.c~a > > +++ a/mm/page-writeback.c > > @@ -850,8 +850,10 @@ retry: > > ret = (*writepage)(page, wbc, data); > > - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) > > + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { > > unlock_page(page); > > + ret = 0; > > + } > > if (ret || (--(wbc->nr_to_write) <= 0)) > > done = 1; > > if (wbc->nonblocking && bdi_write_congested(bdi)) { > > _ > Pekka Enberg replied with an identical patch a few days ago, but for > some reason the same condition flows up to msync as -1 EIO instead of > AOP_WRITEPAGE_ACTIVATE with that patch applied. The last part of the > thread is below. Thanks. Each time I sit down to follow what's going on with writepage and unionfs and msync, I get distracted: I really haven't researched this properly. But I keep suspecting that the answer might be the patch below (which rather follows what drivers/block/rd.c is doing). I'm especially worried that, rather than just AOP_WRITEPAGE_ACTIVATE being returned to userspace, bad enough in itself, you might be liable to hit that BUG_ON(page_mapped(page)). shmem_writepage does not expect to be called by anyone outside mm/vmscan.c, but unionfs can now get to it? Please let us know if this patch does fix it: then I'll try harder to work out what goes on.
Thanks, Hugh --- 2.6.23/mm/shmem.c 2007-10-09 21:31:38.0 +0100 +++ linux/mm/shmem.c2007-10-12 01:25:46.0 +0100 @@ -916,6 +916,11 @@ static int shmem_writepage(struct page * struct inode *inode; BUG_ON(!PageLocked(page)); + if (!wbc->for_reclaim) { + set_page_dirty(page); + unlock_page(page); + return 0; + } BUG_ON(page_mapped(page)); mapping = page->mapping; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Ryan, On 10/8/07, Ryan Finnie <[EMAIL PROTECTED]> wrote: > Doesn't appear to be enough. I can't figure out why (since it appears > write_cache_pages bubbles up directly to sys_msync), but with that > patch applied, in my test case[1], msync returns -1 EIO. However, > with the exact same kernel without that patch applied, msync returns > 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips > 524288 to 0, I can't figure out how it eventually returns -1 EIO. > > [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom, > standard livecd setup You have swap device disabled, right? If so, I can't see any reason why msync(2) on tmpfs would return -EIO. Can you please send a strace log for your test case? Pekka - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
On 10/7/07, Pekka J Enberg <[EMAIL PROTECTED]> wrote: > On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > > Therefore, some user programs fail, esp. if they're written such as > > this: > ... > It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid > writeback of the page in the near future. I wonder if it's enough that we > change the return value to zero from > mm/page-writeback.c:write_cache_pages() in case we hit AOP_WRITEPAGE_ACTIVE... Doesn't appear to be enough. I can't figure out why (since it appears write_cache_pages bubbles up directly to sys_msync), but with that patch applied, in my test case[1], msync returns -1 EIO. However, with the exact same kernel without that patch applied, msync returns 524288 (AOP_WRITEPAGE_ACTIVATE). But as your patch specifically flips 524288 to 0, I can't figure out how it eventually returns -1 EIO. Ryan [1] "apt-get check" on a unionfs2 mount backed by tmpfs over cdrom, standard livecd setup - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
Hi Erez, On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes > returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. > Therefore, some user programs fail, esp. if they're written such as > this: [snip] On 10/7/07, Erez Zadok <[EMAIL PROTECTED]> wrote: > Is this a bug indeed, or are user programs supposed to handle > AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, > what should the kernel return: a zero, or an -errno (and which one)? It's a kernel bug. AOP_WRITEPAGE_ACTIVATE is a hint to the VM to avoid writeback of the page in the near future. I wonder if it's enough that we change the return value to zero from mm/page-writeback.c:write_cache_pages() in case we hit AOP_WRITEPAGE_ACTIVE... Pekka diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 63512a9..717f341 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -672,8 +672,10 @@ retry: ret = (*writepage)(page, wbc, data); - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) + if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { unlock_page(page); + ret = 0; + } if (ret || (--(wbc->nr_to_write) <= 0)) done = 1; if (wbc->nonblocking && bdi_write_congested(bdi)) {
msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland
According to vfs.txt, ->writepage() may return AOP_WRITEPAGE_ACTIVATE back to the VFS/VM. Indeed some filesystems such as tmpfs can return AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it. Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland. Therefore, some user programs fail, esp. if they're written such as this: err = msync(...); if (err != 0) // fail They temporarily fixed the specific program in question (apt-get) to check if (err < 0) // fail Is this a bug indeed, or are user programs supposed to handle AOP_WRITEPAGE_ACTIVATE (I hope not the latter). If it's a kernel bug, what should the kernel return: a zero, or an -errno (and which one)? Thanks, Erez. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/