Re: [rfc][patch 3/5] afs: new aops
Nick Piggin [EMAIL PROTECTED] wrote:

>> The problem is that the code called assumes that the struct page * argument points to a single page, not an array of pages as would presumably be the case if PAGE_CACHE_SIZE > PAGE_SIZE.
>
> Incorrect. Christoph's patch for example does this by using compound pages. Now I personally don't like the patch or see the point in the PAGE_CACHE_SIZE / PAGE_SIZE distinction, but I'm just telling you what the convention is. There is no point you arguing against it; that's simply how it is.

No! You are wrong. I wrote the AFS code. I *know* it can only deal with single pages. It has no knowledge of compound pages and does not handle page arrays. This may be a flaw in my code, but it's there nonetheless. The assertion is a guard against that. *That* was the point of my statement. Perhaps I should've said 'my code' rather than 'the code'. If Christoph has a patch to deal with that, it's either not upstream yet or it hasn't altered AFS.

>> So: you may not change the assertion unless you also fix the lower functions.
>
> I won't change the assertion, because you haven't been following said convention, so just changing it in one place is stupider than not changing it at all, but not for the reason cited.

The convention is not precisely clear. Just grep for PAGE_CACHE_SIZE in Documentation/. It's only mentioned twice, and in neither case does it give any information about what PAGE_CACHE_SIZE is, what it represents, or where it applies. Therefore it's an ill-defined concept.

If you look in Documentation/filesystems/vfs.txt, you'll see that it almost always talks about 'pages'. It only mentions 'pagecache pages' once - in the description of write_begin() - but it's not clear whether that means anything.

However, I've now noted that I need to fix my code, so just keep the assertion for now and I'll fix my code to handle multipage blocks.
David
Re: Beagle and logging inotify events
On Nov 13, 2007, at 7:04 PM, Jon Smirl wrote:

> Is it feasible to do something like this in the linux file system architecture? Beagle beats on my disk for an hour when I reboot. Of course I don't like that and I shut Beagle off.

Leopard, by the way, does exactly this: it has a daemon that starts at boot time, taps FSEvents, and journals file system changes to a well-known file on local disk. I don't see why this couldn't be done on Linux as well.

> ---------- Forwarded message ----------
> From: Jon Smirl [EMAIL PROTECTED]
> Date: Nov 13, 2007 4:44 PM
> Subject: Re: Strange beagle interaction..
> To: Linus Torvalds [EMAIL PROTECTED]
> Cc: J. Bruce Fields [EMAIL PROTECTED], Junio C Hamano [EMAIL PROTECTED], Git Mailing List [EMAIL PROTECTED], Johannes Schindelin [EMAIL PROTECTED]
>
> On 11/13/07, Linus Torvalds [EMAIL PROTECTED] wrote:
>> On Tue, 13 Nov 2007, J. Bruce Fields wrote:
>>> Last I ran across this, I believe I found it was adding extended attributes to the file.
>>
>> Yeah, I just straced it and found the same thing. It's saving fingerprints and mtimes to files in the extended attributes.
>
> Things like Beagle need a guaranteed log of global inotify events. That would let them efficiently find changes made since the last time they updated their index. Right now, every time Beagle starts, it hasn't got a clue what has changed in the file system since it was last run. This forces Beagle to rescan the entire filesystem every time it is started. The xattrs are used as a cache to reduce this load somewhat.
>
> A better solution would be for the kernel to log inotify events to disk in a manner that survives reboots. When Beagle starts, it would locate its last checkpoint and then process the logged inotify events from that time forward. This inotify logging needs to be bullet proof or it will mess up your Beagle index. Logging file systems already contain the logged inotify data (in their own internal form). There's just no universal API for retrieving it in a file system independent manner.
> Yeah, I just turned off beagle. It looked to me like it was doing something wrongheaded.
>
> Gaah. The problem is, setting xattrs does actually change ctime. Which means that if we want to make git play nice with beagle, I guess we have to just remove the comparison of ctime. Oh, well. Git doesn't *require* it, but I like the notion of checking the inode really really carefully. But it looks like it may not be an option, because of file indexers hiding stuff behind our backs.
>
> Or we could just tell people not to run beagle on their git trees, but I suspect some people will actually *want* to. Even if it flushes their disk caches.
>
> Linus

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Re: Beagle and logging inotify events
On 11/14/07, Chuck Lever [EMAIL PROTECTED] wrote:

> Leopard, by the way, does exactly this: it has a daemon that starts at boot time, taps FSEvents, and journals file system changes to a well-known file on local disk.

Logging file systems have all of the needed info. Plus they know what is going on with rollback/replay after a crash.

How about a fs API where Beagle has a token for a checkpoint, and then it can ask for a recreation of inotify events from that point forward? It's always possible for the file system to say "I can't do that" and trigger a full rebuild from Beagle. Daemons that aren't coordinated with the file system have a window during crash/reboot where they can get confused.

Without low level support like this, Beagle is forced to do a rescan on every boot. Since I crash my machine all of the time, the disk load from rebooting is intolerable and I turn Beagle off. Even just turning the machine on in the morning generates an annoyingly large load on the disk.

> I don't see why this couldn't be done on Linux as well.

--
Jon Smirl
[EMAIL PROTECTED]
Should PAGE_CACHE_SIZE be discarded?
Are we ever going to have PAGE_CACHE_SIZE != PAGE_SIZE? If not, why not discard PAGE_CACHE_SIZE, as it's then redundant?

PAGE_CACHE_SIZE is also an ill-defined concept. Just grep in Documentation/: it only comes up in a couple of places, neither particularly informative. If there's a possibility that it will be used, then someone who knows how it's supposed to work needs to sit down and document what it is, what it represents, where it applies, how it interacts with PG_compound, and how the page flags distribute over a page cache slot.

One further thing to consider: actually making PAGE_CACHE_SIZE != PAGE_SIZE work will be an interesting problem, as I'm certain most filesystems will break horribly without a lot of work (ramfs might be the only exception).

David
Re: Beagle and logging inotify events
Jon Smirl wrote:
> Logging file systems have all of the needed info. Plus they know what is going on with rollback/replay after a crash.

True, but not all file systems have a journal. Consider ext2 or FAT32, both of which are still common.

> How about a fs API where Beagle has a token for a checkpoint, and then it can ask for a recreation of inotify events from that point forward? It's always possible for the file system to say "I can't do that" and trigger a full rebuild from Beagle. Daemons that aren't coordinated with the file system have a window during crash/reboot where they can get confused.

A reasonably effective solution can be implemented in user space without changes to the file system APIs or implementations. IOW, we already have the tools to make something useful.

For example, you don't need to record every file system event to make this useful. Listing only directory-level changes (i.e. "some file in this directory has changed") is enough to prune most of Beagle's work when it starts up.

> Without low level support like this, Beagle is forced to do a rescan on every boot. Since I crash my machine all of the time, the disk load from rebooting is intolerable and I turn Beagle off. Even just turning the machine on in the morning generates an annoyingly large load on the disk.

Understood. The need is clear. My Dad's WinXP system takes 10 minutes after every start-up before it's usable, simply because the virus scanner has to check every file in the system. Same problem!

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Re: Beagle and logging inotify events
On 11/14/07, Chuck Lever [EMAIL PROTECTED] wrote:
> True, but not all file systems have a journal. Consider ext2 or FAT32, both of which are still common.

ext2/FAT32 can use the daemon approach you describe below, which also works as a short term solution. The Beagle people do have a daemon, but it can be turned off. Holes where you don't record the inotify events and update the index are really bad, because they can make files that you know are on the disk disappear from the index.

I don't believe Beagle distinguishes between someone turning it off for a day and then turning it back on, vs a reboot. In both cases it says there was a window where untracked changes could have happened, and it triggers a full rescan.

The root problem here is needing a bullet proof inotify log with no windows. The only place that is going to happen is inside the file system logs. We just need an API to say "recreate the inotify stream from this checkpoint forward". Things like FAT/ext2 will always return a "no data available" error from this API.

> A reasonably effective solution can be implemented in user space without changes to the file system APIs or implementations. IOW, we already have the tools to make something useful. For example, you don't need to record every file system event to make this useful. Listing only directory-level changes (i.e. "some file in this directory has changed") is enough to prune most of Beagle's work when it starts up.

--
Jon Smirl
[EMAIL PROTECTED]
Re: [rfc][patch 3/5] afs: new aops
On Wed, Nov 14, 2007 at 12:18:43PM +0000, David Howells wrote:
> Nick Piggin [EMAIL PROTECTED] wrote:
>
>>> The problem is that the code called assumes that the struct page * argument points to a single page, not an array of pages as would presumably be the case if PAGE_CACHE_SIZE > PAGE_SIZE.
>>
>> Incorrect. Christoph's patch for example does this by using compound pages. Now I personally don't like the patch or see the point in the PAGE_CACHE_SIZE / PAGE_SIZE distinction, but I'm just telling you what the convention is. There is no point you arguing against it; that's simply how it is.
>
> No! You are wrong. I wrote the AFS code. I *know* it can only deal with

No, I'm talking about core code. In core code, the PAGE_CACHE_SIZE is for page cache struct pages. Single struct pages (not page arrays). Take a look at generic mapping read or something. There is nothing to deal with page arrays there either, but that's simply the convention.

>>> So: you may not change the assertion unless you also fix the lower functions.
>>
>> I won't change the assertion, because you haven't been following said convention, so just changing it in one place is stupider than not changing it at all, but not for the reason cited.
>
> The convention is not precisely clear. Just grep for PAGE_CACHE_SIZE in Documentation/. It's only mentioned twice, and in neither case does it give any information about what PAGE_CACHE_SIZE is, what it represents, or where it applies. Therefore it's an ill-defined concept.
>
> If you look in Documentation/filesystems/vfs.txt, you'll see that it almost always talks about 'pages'. It only mentions 'pagecache pages' once - in the description of write_begin() - but it's not clear whether that means anything.

Documentation is the opposite of convention ;) Look in mm/.

> However, I've now noted that I need to fix my code, so just keep the assertion for now and I'll fix my code to handle multipage blocks.

I'm not saying you need to do that. Leave it at PAGE_SIZE; really, it doesn't matter that much at present.
This has just blown out of proportion.
Re: Beagle and logging inotify events
Jon Smirl [EMAIL PROTECTED] writes:
> Logging file systems have all of the needed info.

Actually, most journaling file systems in Linux use block logging, and it would probably be hard to get specific file names out of a random collection of logged blocks. And even if you could, they would hit a lot of false positives, since everything is rounded up to block level.

With intent logging, like in XFS/JFS, it would be easier, but even then costly: e.g. they might log changes to the inode, but there is no back pointer to the file name short of searching the whole directory tree.

-Andi
Re: Should PAGE_CACHE_SIZE be discarded?
On Wed, Nov 14, 2007 at 01:56:53PM +0000, David Howells wrote:
> Are we ever going to have PAGE_CACHE_SIZE != PAGE_SIZE? If not, why not discard PAGE_CACHE_SIZE, as it's then redundant?

Christoph Lameter has patches exactly to make PAGE_CACHE_SIZE larger than PAGE_SIZE, and they seem to work without much effort. I happen to hate the patches ;) but that doesn't change the fact that PAGE_CACHE_SIZE is relatively useful, and it is not at all an ill-defined concept.

> PAGE_CACHE_SIZE is also an ill-defined concept. Just grep in Documentation/: it only comes up in a couple of places, neither particularly informative. If there's a possibility that it will be used, then someone who knows how it's supposed to work needs to sit down and document what it is, what it represents, where it applies, how it interacts with PG_compound, and how the page flags distribute over a page cache slot.

No, this was an *example*. It has nothing to do with PG_compound upstream; that was just an example. Basically, anything that goes in the page cache is in units of PAGE_CACHE_SIZE, and nothing else. For filesystems it should be pretty easy...

> One further thing to consider: actually making PAGE_CACHE_SIZE != PAGE_SIZE work will be an interesting problem, as I'm certain most filesystems will break horribly without a lot of work (ramfs might be the only exception).

I think most filesystems actually don't have much problem with it. mm code has bigger problems, e.g. due to ptes and pagecache no longer being of equal size - but why would filesystems care?
Re: Beagle and logging inotify events
Jon Smirl wrote:
> The root problem here is needing a bullet proof inotify log with no windows.

I disagree: we don't need a bullet-proof log. We can get a significant performance improvement even with a permanent dnotify log implemented in user-space. We already have well-defined fallback behavior if such a log is missing or incomplete.

The problem with a permanent inotify log is that it can become unmanageably enormous, and a performance problem to boot. Recording at that level of detail makes it more likely that the logger won't be able to keep up with file system activity.

A lightweight solution gets us most of the way there, is simple to implement, and doesn't introduce many new issues. As long as it can tell us precisely where the holes are, it shouldn't be a problem.

> The only place that is going to happen is inside the file system logs.

As Andi points out, existing block-based journaling implementations won't easily provide this. And most fs journals are actually pretty limited in size. Alternately, you could insert a stackable file system layer between the VFS and the on-disk fs to provide more seamless information about updates.
Re: Should PAGE_CACHE_SIZE be discarded?
Nick Piggin [EMAIL PROTECTED] wrote:

> Christoph Lameter has patches exactly to make PAGE_CACHE_SIZE larger than PAGE_SIZE, and they seem to work without much effort. I happen to hate the patches ;) but that doesn't change the fact that PAGE_CACHE_SIZE is relatively useful, and it is not at all an ill-defined concept.

Where, please? -mm kernels?

> Basically, anything that goes in the page cache is in units of PAGE_CACHE_SIZE, and nothing else. For filesystems it should be pretty easy...

That depends on what the coverage of struct page is. I don't actually know whether this is PAGE_SIZE or PAGE_CACHE_SIZE; I assumed it to be the former, but from what you've said, I'm not actually sure.

David
Re: Beagle and logging inotify events
On 11/14/07, Chuck Lever <[EMAIL PROTECTED]> wrote:
> Jon Smirl wrote:
> > On 11/14/07, Chuck Lever <[EMAIL PROTECTED]> wrote:
> > > Jon Smirl wrote:
> > > > On 11/14/07, Chuck Lever <[EMAIL PROTECTED]> wrote:
> > > > > On Nov 13, 2007, at 7:04 PM, Jon Smirl wrote:
> > > > > > Is it feasible to do something like this in the linux file
> > > > > > system architecture? Beagle beats on my disk for an hour when I
> > > > > > reboot. Of course I don't like that and I shut Beagle off.
> > > > > Leopard, by the way, does exactly this: it has a daemon that
> > > > > starts at boot time and taps FSEvents then journals file system
> > > > > changes to a well-known file on local disk.
> > > > Logging file systems have all of the needed info. Plus they know
> > > > what is going on with rollback/replay after a crash.
> > > True, but not all file systems have a journal. Consider ext2 or
> > > FAT32, both of which are still common.
> > ext2/FAT32 can use the daemon approach you describe below, which also
> > works as a short-term solution. The Beagle people do have a daemon but
> > it can be turned off. Holes where you don't record the inotify events
> > and update the index are really bad because they can make files that
> > you know are on the disk disappear from the index.
> >
> > I don't believe Beagle distinguishes between someone turning it off
> > for a day and then turning it back on, vs a reboot. In both cases it
> > says there was a window where untracked changes could have happened
> > and it triggers a full rescan.
> >
> > The root problem here is needing a bullet-proof inotify log with no
> > windows.
> I disagree: we don't need a bullet-proof log. We can get a significant
> performance improvement even with a permanent dnotify log implemented in
> user-space. We already have well-defined fallback behavior if such a
> log is missing or incomplete.
>
> The problem with a permanent inotify log is that it can become
> unmanageably enormous, and a performance problem to boot. Recording at
> that level of detail makes it more likely that the logger won't be able
> to keep up with file system activity.

It doesn't have to become enormous; if the checkpoint request is too old just return no-data and trigger a full scan in Beagle. 50K of log data would probably be enough. The main thing you need to cover is the reboot process and files that get touched after Beagle shuts down or before it gets started.

For example the log could checkpoint once a minute; in that case you wouldn't need more than two minutes' worth of log. Beagle would just remember the last checkpoint it processed and reapply changes after it. If someone turns Beagle off for a couple of days it should be expected that they will need a full scan when they turn it back on.

> A lightweight solution gets us most of the way there, is simple to
> implement, and doesn't introduce many new issues. As long as it can
> tell us precisely where the holes are, it shouldn't be a problem.

The only place that is going to happen is inside the file system logs.

> As Andi points out, existing block-based journaling implementations
> won't easily provide this. And most fs journals are actually pretty
> limited in size.
>
> Alternately, you could insert a stackable file system layer between the
> VFS and the on-disk fs to provide more seamless information about
> updates.

We just need an API to say recreate the inotify stream from this checkpoint forward. Things like FAT/ext2 will always return a "no data available" error from this API.

> > How about a fs API where Beagle has a token for a checkpoint, and then
> > it can ask for a recreation of inotify events from that point forward.
> > It's always possible for the file system to say "I can't do that" and
> > trigger a full rebuild from Beagle. Daemons that aren't coordinated
> > with the file system have a window during crash/reboot where they can
> > get confused.
> A reasonably effective solution can be implemented in user space without
> changes to the file system APIs or implementations. IOW we already have
> the tools to make something useful.
>
> For example, you don't need to record every file system event to make
> this useful. Listing only directory-level changes (ie some file in this
> directory has changed) is enough to prune most of Beagle's work when it
> starts up.
> > Without low level support like this Beagle is forced to do a rescan on
> > every boot. Since I crash my machine all of the time the disk load
> > from rebooting is intolerable and I turn Beagle off. Even just turning
> > the machine on in the morning generates an annoyingly large load on
> > the disk.
> Understood. The need is clear. My Dad's WinXP system takes 10 minutes
> after every start-up before it's usable, simply because the virus
> scanner has to check every file in the system. Same problem!

I don't see why this couldn't be done on Linux as well.

-- Forwarded message --
From: Jon Smirl <[EMAIL PROTECTED]>
Date: Nov 13, 2007 4:44 PM
Subject: Re: Strange beagle interaction..
To: Linus Torvalds <[EMAIL PROTECTED]>
Cc: J.
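Jon's checkpoint scheme — a small, bounded change log where an expired checkpoint just means "no data, do a full scan" — can be sketched in userspace C. Everything here (the names, the 64-slot ring, the sequence number used as the checkpoint token) is invented for illustration; it is not a proposed kernel API:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bounded change log: a ring of the most recent events,
 * with a monotonically increasing sequence number doubling as the
 * checkpoint token Beagle would remember. */
#define LOG_SLOTS 64

struct chlog {
	unsigned long next_seq;		/* seq of the next event to be logged */
	char path[LOG_SLOTS][64];	/* the last LOG_SLOTS changed paths */
};

void chlog_record(struct chlog *log, const char *path)
{
	strncpy(log->path[log->next_seq % LOG_SLOTS], path, 63);
	log->path[log->next_seq % LOG_SLOTS][63] = '\0';
	log->next_seq++;
}

/* Token meaning "everything before this point has been indexed". */
unsigned long chlog_checkpoint(const struct chlog *log)
{
	return log->next_seq;
}

/* Replay events since 'ckpt' through 'fn' (may be NULL). Returns the
 * number of events replayed, or -1 if the checkpoint is too old and
 * events were overwritten: the caller must fall back to a full rescan. */
long chlog_replay(const struct chlog *log, unsigned long ckpt,
		  void (*fn)(const char *path))
{
	unsigned long s;

	if (log->next_seq - ckpt > LOG_SLOTS)
		return -1;		/* hole in the log: full rescan */
	for (s = ckpt; s < log->next_seq; s++)
		if (fn)
			fn(log->path[s % LOG_SLOTS]);
	return (long)(log->next_seq - ckpt);
}
```

The once-a-minute checkpoint with two minutes of history suggested in the thread corresponds to sizing LOG_SLOTS to the expected event rate.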
Re: Beagle and logging inotify events
On Wed, Nov 14, 2007 at 02:22:51PM -0500, Jon Smirl wrote:
> On 11/14/07, J. Bruce Fields <[EMAIL PROTECTED]> wrote:
> > On Wed, Nov 14, 2007 at 04:30:16PM +0100, Andi Kleen wrote:
> > > Jon Smirl <[EMAIL PROTECTED]> writes:
> > > > On 11/14/07, Chuck Lever <[EMAIL PROTECTED]> wrote:
> > > > > On Nov 13, 2007, at 7:04 PM, Jon Smirl wrote:
> > > > > > Is it feasible to do something like this in the linux file
> > > > > > system architecture? Beagle beats on my disk for an hour when I
> > > > > > reboot. Of course I don't like that and I shut Beagle off.
> > > > > Leopard, by the way, does exactly this: it has a daemon that
> > > > > starts at boot time and taps FSEvents then journals file system
> > > > > changes to a well-known file on local disk.
> > > > Logging file systems have all of the needed info.
> > > Actually most journaling file systems in Linux use block logging and
> > > it would be probably hard to get specific file names out of a random
> > > collection of logged blocks. And even if you could they would hit a
> > > lot of false positives since everything is rounded up to block level.
> > >
> > > With intent logging like in XFS/JFS it would be easier, but even then
> > > costly :- e.g. they might log changes to the inode but there is no
> > > back pointer to the file name short of searching the whole directory
> > > tree.
> > >
> > > So it seems the best approach given the current api's would be just
> > > to cache all the stat data, and stat every file on reboot.
> > I don't understand why beagle is reading the entire filesystem data.
> > I understand why even just doing the stat's could be prohibitive,
> > though.
> I believe Beagle is looking at the mtimes on the files. It uses xattrs
> to store the last mtime it checked and then compares it to the current
> mtime. It also stores a hash of the file in an xattr. So even if the

You meant "only if", not "even if"?

> mtimes don't match it recomputes the hash and only if the hashes differ
> does it update its free text search index.

OK, that makes a little more sense. (Though it seems unfortunate to use xattrs instead of caching the data elsewhere. Git and nfs e.g. both use the ctime to decide when a file changes, so you're invalidating their caches unnecessarily.)

--b.
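The two-level filter described above — stat everything, rehash only when the mtime changed, reindex only when the hash changed — looks roughly like this. The record stands in for Beagle's xattrs, and the hash is a toy stand-in for whatever digest Beagle really uses; none of the names are Beagle's:

```c
#include <assert.h>

/* What Beagle is said to cache per file (really stored in xattrs). */
struct index_rec {
	long cached_mtime;
	unsigned long cached_hash;
};

/* Stand-in content hash; a real indexer would use a proper digest. */
unsigned long toy_hash(const char *data)
{
	unsigned long h = 5381;
	while (*data)
		h = h * 33 + (unsigned char)*data++;
	return h;
}

enum action { SKIP, TOUCH_ONLY, REINDEX };

/* Decide what to do with one file: skip if the mtime matches; if the
 * mtime changed, rehash; reindex only if the content really changed. */
enum action check_file(struct index_rec *rec, long mtime,
		       const char *contents)
{
	unsigned long h;

	if (mtime == rec->cached_mtime)
		return SKIP;		/* cheap path: stat only */
	h = toy_hash(contents);
	rec->cached_mtime = mtime;
	if (h == rec->cached_hash)
		return TOUCH_ONLY;	/* touched but unchanged: no reindex */
	rec->cached_hash = h;
	return REINDEX;			/* content changed: update the index */
}
```

This also makes Bruce's ctime complaint concrete: the expensive REINDEX path is rare, but writing the xattrs back bumps the ctime that other ctime-based caches rely on.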
Re: Beagle and logging inotify events
On Nov 14, 2007 11:32 -0500, Chuck Lever wrote:
> I disagree: we don't need a bullet-proof log. We can get a significant
> performance improvement even with a permanent dnotify log implemented in
> user-space. We already have well-defined fallback behavior if such a
> log is missing or incomplete.
>
> The problem with a permanent inotify log is that it can become
> unmanageably enormous, and a performance problem to boot. Recording at
> that level of detail makes it more likely that the logger won't be able
> to keep up with file system activity.
>
> A lightweight solution gets us most of the way there, is simple to
> implement, and doesn't introduce many new issues. As long as it can
> tell us precisely where the holes are, it shouldn't be a problem.

Jan Kara is working on a patch for ext4 which would store a recursive timestamp for each directory that gives the latest time that a file in that directory was modified. ZFS has a similar mechanism by virtue of doing full-tree updates during COW of all the metadata blocks and storing the most recent transaction number in each block. I suspect btrfs could do the same thing easily.

That would allow recursive-descent filesystem traversal to be much more efficient because whole chunks of the filesystem tree can be ignored during scans.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
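The payoff of such a recursive timestamp is that an indexer can prune whole subtrees: if nothing under a directory changed since the last index pass, it never descends into it. A toy in-memory model (the `rmtime` field and `scan()` are invented for illustration and are not Jan Kara's patch):

```c
#include <assert.h>

/* Toy directory tree; rmtime is the hypothetical recursive timestamp:
 * the latest mtime of anything anywhere in this subtree. */
struct dir {
	long rmtime;
	struct dir *child[4];
	int nchild;
};

/* Count directories actually visited when rescanning everything
 * modified after 'last_indexed'. Subtrees whose recursive timestamp
 * is not newer are skipped without descending into them. */
int scan(const struct dir *d, long last_indexed)
{
	int visited = 1;
	int i;

	if (d->rmtime <= last_indexed)
		return 0;		/* whole subtree unchanged: prune */
	for (i = 0; i < d->nchild; i++)
		visited += scan(d->child[i], last_indexed);
	return visited;
}
```

With the timestamp propagated up on every modification, a post-reboot scan touches only the paths leading to changed files instead of the whole tree.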
Re: Beagle and logging inotify events
On Wed, Nov 14, 2007 at 04:30:16PM +0100, Andi Kleen wrote:
> Jon Smirl <[EMAIL PROTECTED]> writes:
> > On 11/14/07, Chuck Lever <[EMAIL PROTECTED]> wrote:
> > > On Nov 13, 2007, at 7:04 PM, Jon Smirl wrote:
> > > > Is it feasible to do something like this in the linux file system
> > > > architecture? Beagle beats on my disk for an hour when I reboot.
> > > > Of course I don't like that and I shut Beagle off.
> > > Leopard, by the way, does exactly this: it has a daemon that starts
> > > at boot time and taps FSEvents then journals file system changes to
> > > a well-known file on local disk.
> > Logging file systems have all of the needed info.
> Actually most journaling file systems in Linux use block logging and it
> would be probably hard to get specific file names out of a random
> collection of logged blocks. And even if you could they would hit a lot
> of false positives since everything is rounded up to block level.
>
> With intent logging like in XFS/JFS it would be easier, but even then
> costly :- e.g. they might log changes to the inode but there is no back
> pointer to the file name short of searching the whole directory tree.
>
> So it seems the best approach given the current api's would be just to
> cache all the stat data, and stat every file on reboot.

I don't understand why beagle is reading the entire filesystem data. I understand why even just doing the stat's could be prohibitive, though.

--b.
[PATCH 3/3] nfs: use -mmap_prepare() to avoid an AB-BA deadlock
Normal locking order is:

  i_mutex
    mmap_sem

However NFS's ->mmap hook, which is called under mmap_sem, can take i_mutex. Avoid this potential deadlock by doing the work that requires i_mutex from the new ->mmap_prepare().

[ Is this sufficient, or does it introduce a race? ]

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 fs/nfs/file.c |   25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -41,6 +41,9 @@
 static int nfs_file_open(struct inode *, struct file *);
 static int nfs_file_release(struct inode *, struct file *);
 static loff_t nfs_file_llseek(struct file *file, loff_t offset, int origin);
+static int
+nfs_file_mmap_prepare(struct file * file, unsigned long len,
+		unsigned long prot, unsigned long flags, unsigned long pgoff);
 static int nfs_file_mmap(struct file *, struct vm_area_struct *);
 static ssize_t nfs_file_splice_read(struct file *filp, loff_t *ppos,
 					struct pipe_inode_info *pipe,
@@ -64,6 +67,7 @@ const struct file_operations nfs_file_op
 	.write		= do_sync_write,
 	.aio_read	= nfs_file_read,
 	.aio_write	= nfs_file_write,
+	.mmap_prepare	= nfs_file_mmap_prepare,
 	.mmap		= nfs_file_mmap,
 	.open		= nfs_file_open,
 	.flush		= nfs_file_flush,
@@ -270,7 +274,8 @@ nfs_file_splice_read(struct file *filp,
 }
 
 static int
-nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
+nfs_file_mmap_prepare(struct file * file, unsigned long len,
+		unsigned long prot, unsigned long flags, unsigned long pgoff)
 {
 	struct dentry *dentry = file->f_path.dentry;
 	struct inode *inode = dentry->d_inode;
@@ -279,13 +284,17 @@ nfs_file_mmap(struct file
 	dfprintk(VFS, "nfs: mmap(%s/%s)\n",
 			dentry->d_parent->d_name.name, dentry->d_name.name);
 
-	status = nfs_revalidate_mapping(inode, file->f_mapping);
-	if (!status) {
-		vma->vm_ops = &nfs_file_vm_ops;
-		vma->vm_flags |= VM_CAN_NONLINEAR;
-		file_accessed(file);
-	}
-	return status;
+	return nfs_revalidate_mapping(inode, file->f_mapping);
+}
+
+static int
+nfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma->vm_ops = &nfs_file_vm_ops;
+	vma->vm_flags |= VM_CAN_NONLINEAR;
+	file_accessed(file);
+
+	return 0;
+}
 
 /*
[PATCH 0/3] mmap vs NFS
Currently there is an AB-BA deadlock in NFS mmap. nfs_file_mmap() can take i_mutex, while holding mmap_sem, whereas the regular locking order is the other way around.

This patch-set attempts to solve this issue.
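The rule being violated is the usual one: if one path takes i_mutex then mmap_sem while another takes mmap_sem then i_mutex, each can hold one lock and wait forever on the other. A lockdep-flavoured toy that checks acquisition sequences against a fixed ranking (the ranks and names here are invented for illustration, not kernel code):

```c
#include <assert.h>

/* Global lock ranking: a lower rank must always be taken first.
 * i_mutex outranks mmap_sem, per the ordering described above. */
enum lock { I_MUTEX = 0, MMAP_SEM = 1 };

/* Returns 1 if the acquisition sequence respects the ranking, 0 if a
 * lock is taken while a higher-ranked one is already held -- the AB-BA
 * pattern that can deadlock against a path using the opposite order. */
int lock_order_ok(const enum lock *seq, int n)
{
	int i;

	for (i = 1; i < n; i++)
		if (seq[i] <= seq[i - 1])
			return 0;
	return 1;
}
```

The old nfs_file_mmap() path is the { MMAP_SEM, I_MUTEX } sequence; moving the i_mutex work into a hook that runs before mmap_sem is taken turns it into the legal { I_MUTEX, MMAP_SEM }.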
[PATCH 2/3] vfs: -mmap_prepare()
Some filesystems (NFS) need i_mutex in ->mmap(); this violates the normal locking order. Provide a hook before we take mmap_sem.

This leaves a window between ->mmap_prepare() and ->mmap(); if that's a problem (Trond?) we could also provide ->mmap_finish() and guarantee it being called if ->mmap_prepare() returned success. This would allow holding state and thereby close the window.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 Documentation/filesystems/Locking |   11 ++++++++++-
 Documentation/filesystems/vfs.txt |    3 +++
 include/linux/fs.h                |    1 +
 ipc/shm.c                         |   13 +++++++++++++
 mm/mmap.c                         |   12 ++++++++++++
 mm/nommu.c                        |   12 ++++++++++++
 6 files changed, 51 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -1172,6 +1172,7 @@ struct file_operations {
 	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+	int (*mmap_prepare) (struct file *, unsigned long len, unsigned long prot, unsigned long flags, unsigned long pgoff);
 	int (*mmap) (struct file *, struct vm_area_struct *);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -1035,6 +1035,12 @@ unsigned long do_mmap_pgoff(struct file
 	struct mm_struct *mm = current->mm;
 	unsigned long ret;
 
+	if (file && file->f_op && file->f_op->mmap_prepare) {
+		ret = file->f_op->mmap_prepare(file, len, prot, flags, pgoff);
+		if (ret)
+			return ret;
+	}
+
 	down_write(&mm->mmap_sem);
 	ret = ___do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
 	up_write(&mm->mmap_sem);
@@ -1054,6 +1060,12 @@ unsigned long do_mmap(struct file *file,
 	if ((offset + PAGE_ALIGN(len)) < offset || (offset & ~PAGE_MASK))
 		return ret;
 
+	if (file && file->f_op && file->f_op->mmap_prepare) {
+		ret = file->f_op->mmap_prepare(file, len, prot, flags, pgoff);
+		if (ret)
+			return ret;
+	}
+
 	down_write(&mm->mmap_sem);
 	ret = ___do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
 	up_write(&mm->mmap_sem);
Index: linux-2.6/mm/nommu.c
===================================================================
--- linux-2.6.orig/mm/nommu.c
+++ linux-2.6/mm/nommu.c
@@ -1025,6 +1025,12 @@ unsigned long do_mmap_pgoff(struct file
 	struct mm_struct *mm = current->mm;
 	unsigned long ret;
 
+	if (file && file->f_op && file->f_op->mmap_prepare) {
+		ret = file->f_op->mmap_prepare(file, len, prot, flags, pgoff);
+		if (ret)
+			return ret;
+	}
+
 	down_write(&mm->mmap_sem);
 	ret = ___do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
 	up_write(&mm->mmap_sem);
@@ -1044,6 +1050,12 @@ unsigned long do_mmap(struct file *file,
 	if ((offset + PAGE_ALIGN(len)) < offset || (offset & ~PAGE_MASK))
 		return ret;
 
+	if (file && file->f_op && file->f_op->mmap_prepare) {
+		ret = file->f_op->mmap_prepare(file, len, prot, flags, pgoff);
+		if (ret)
+			return ret;
+	}
+
 	down_write(&mm->mmap_sem);
 	ret = ___do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
 	up_write(&mm->mmap_sem);
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c
+++ linux-2.6/ipc/shm.c
@@ -300,6 +300,12 @@ static int shm_mmap(struct file * file,
 	struct shm_file_data *sfd = shm_file_data(file);
 	int ret;
 
+	/*
+	 * SHM backing filesystems may not have mmap_prepare!
+	 * See do_shmat().
+	 */
+	WARN_ON(sfd->file->f_op->mmap_prepare);
+
 	ret = sfd->file->f_op->mmap(sfd->file, vma);
 	if (ret != 0)
 		return ret;
@@ -1012,6 +1018,13 @@ long do_shmat(int shmid, char __user *sh
 		goto invalid;
 	}
 
+	/*
+	 * The usage of ___do_mmap_pgoff() is needed because we must already
+	 * hold the mmap_sem here due to find_vma_intersection vs mmap races.
+	 *
+	 * This prohibits SHM backing filesystems from using
+	 * f_op->mmap_prepare().
+	 */
 	user_addr = ___do_mmap_pgoff (file, addr, size, prot, flags, 0);
 	*raddr = user_addr;
 	err = 0;
Index: linux-2.6/Documentation/filesystems/Locking
[PATCH 1/3] mm: pull mmap_sem into do_mmap{,_pgoff}
It appears that some filesystems (NFS) require i_mutex in ->mmap(). This violates the normal locking order of:

  i_mutex
    mmap_sem

In order to provide an mmap hook that is outside of mmap_sem, pull this lock into do_mmap{,_pgoff}.

  ___do_mmap_pgoff() - base function, requires mmap_sem
  __do_mmap_anon()   - full do_mmap() for anonymous, requires mmap_sem
  do_mmap_pgoff()    - full do_mmap_pgoff()
  do_mmap()          - full do_mmap()

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 arch/alpha/kernel/osf_sys.c         |    2 -
 arch/arm/kernel/sys_arm.c           |    2 -
 arch/avr32/kernel/sys_avr32.c       |    2 -
 arch/blackfin/kernel/sys_bfin.c     |    2 -
 arch/cris/kernel/sys_cris.c         |    2 -
 arch/frv/kernel/sys_frv.c           |    4 ---
 arch/h8300/kernel/sys_h8300.c       |    4 ---
 arch/ia64/ia32/sys_ia32.c           |   30 -
 arch/ia64/kernel/sys_ia64.c         |    2 -
 arch/m32r/kernel/sys_m32r.c         |    2 -
 arch/m68k/kernel/sys_m68k.c         |    4 ---
 arch/m68knommu/kernel/sys_m68k.c    |    2 -
 arch/mips/kernel/irixelf.c          |   12 --
 arch/mips/kernel/linux32.c          |    2 -
 arch/mips/kernel/syscall.c          |    2 -
 arch/mips/kernel/sysirix.c          |    5
 arch/parisc/kernel/sys_parisc.c     |    2 -
 arch/powerpc/kernel/syscalls.c      |    2 -
 arch/s390/kernel/compat_linux.c     |    7 --
 arch/s390/kernel/sys_s390.c         |    2 -
 arch/sh/kernel/sys_sh.c             |    2 -
 arch/sh64/kernel/sys_sh64.c         |    2 -
 arch/sparc/kernel/sys_sparc.c       |    2 -
 arch/sparc/kernel/sys_sunos.c       |    2 -
 arch/sparc64/kernel/binfmt_aout32.c |    6 -
 arch/sparc64/kernel/sys_sparc.c     |    2 -
 arch/sparc64/kernel/sys_sunos32.c   |    2 -
 arch/sparc64/solaris/misc.c         |    2 -
 arch/um/kernel/syscall.c            |    2 -
 arch/v850/kernel/syscalls.c         |    2 -
 arch/x86/ia32/ia32_aout.c           |    6 -
 arch/x86/ia32/sys_ia32.c            |   10
 arch/x86/kernel/sys_i386_32.c       |    4 ---
 arch/x86/kernel/sys_x86_64.c        |    2 -
 arch/xtensa/kernel/syscall.c        |    2 -
 drivers/char/drm/drm_bufs.c         |    4 ---
 drivers/char/drm/i810_dma.c         |    2 -
 fs/aio.c                            |    6 ++---
 fs/binfmt_aout.c                    |    6 -
 fs/binfmt_elf.c                     |    6 -
 fs/binfmt_elf_fdpic.c               |   11 +
 fs/binfmt_flat.c                    |    6 +
 fs/binfmt_som.c                     |    6 -
 include/linux/mm.h                  |   22 +-
 ipc/shm.c                           |    2 -
 mm/mmap.c                           |   42
 mm/nommu.c                          |   39 -
 47 files changed, 112 insertions(+), 180 deletions(-)

Index: linux-2.6/arch/alpha/kernel/osf_sys.c
===================================================================
--- linux-2.6.orig/arch/alpha/kernel/osf_sys.c
+++ linux-2.6/arch/alpha/kernel/osf_sys.c
@@ -193,9 +193,7 @@ osf_mmap(unsigned long addr, unsigned lo
 		goto out;
 	}
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
-	down_write(&current->mm->mmap_sem);
 	ret = do_mmap(file, addr, len, prot, flags, off);
-	up_write(&current->mm->mmap_sem);
 	if (file)
 		fput(file);
 out:
Index: linux-2.6/arch/arm/kernel/sys_arm.c
===================================================================
--- linux-2.6.orig/arch/arm/kernel/sys_arm.c
+++ linux-2.6/arch/arm/kernel/sys_arm.c
@@ -72,9 +72,7 @@ inline long do_mmap2(
 		goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
 
 	if (file)
 		fput(file);
Index: linux-2.6/arch/avr32/kernel/sys_avr32.c
===================================================================
--- linux-2.6.orig/arch/avr32/kernel/sys_avr32.c
+++ linux-2.6/arch/avr32/kernel/sys_avr32.c
@@ -41,9 +41,7 @@ asmlinkage long sys_mmap2(unsigned long
 		return error;
 	}
 
-	down_write(&current->mm->mmap_sem);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, offset);
-	up_write(&current->mm->mmap_sem);
 
 	if (file)
 		fput(file);
Index: linux-2.6/arch/blackfin/kernel/sys_bfin.c
===================================================================
--- linux-2.6.orig/arch/blackfin/kernel/sys_bfin.c
+++ linux-2.6/arch/blackfin/kernel/sys_bfin.c
@@ -78,9 +78,7 @@ do_mmap2(unsigned long addr, unsigned lo
 		goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
 
 	if (file)
 		fput(file);
Index: linux-2.6/arch/cris/kernel/sys_cris.c
Re: [PATCH 0/3] mmap vs NFS
Seems I forgot to mention this is an RFC :-)

On Wed, 2007-11-14 at 21:01 +0100, Peter Zijlstra wrote:
> Currently there is an AB-BA deadlock in NFS mmap. nfs_file_mmap() can
> take i_mutex, while holding mmap_sem, whereas the regular locking order
> is the other way around.
>
> This patch-set attempts to solve this issue.
Re: [PATCH 0/3] mmap vs NFS
On Wed, 2007-11-14 at 21:01 +0100, Peter Zijlstra wrote:
> Currently there is an AB-BA deadlock in NFS mmap. nfs_file_mmap() can
> take i_mutex, while holding mmap_sem, whereas the regular locking order
> is the other way around.
>
> This patch-set attempts to solve this issue.

Looks good from the NFS point of view.

Acked-by: Trond Myklebust <[EMAIL PROTECTED]>
Re: [rfc][patch 3/5] afs: new aops
On Wed, Nov 14, 2007 at 03:57:46PM +0000, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > In core code, the PAGE_CACHE_SIZE is for page cache struct pages.
> > Single struct pages (not page arrays). Take a look at generic mapping
> > read or something.
> So you're saying a struct page controls an area of PAGE_CACHE_SIZE, not
> an area of PAGE_SIZE?

No, a pagecache page is PAGE_CACHE_SIZE. And not all struct pages control the same amount of data anyway, with compound pages.

> Is a struct page then a purely pagecache concept?
>
> > Documentation is the opposite of convention ;)
> If it's not Documented, then it's irrelevant.

But you can't just decide yourself that it is irrelevant because you don't grep hard enough ;)

include/linux/mm.h:

 * A page may belong to an inode's memory mapping. In this case, page->mapping
 * is the pointer to the inode, and page->index is the file offset of the page,
 * in units of PAGE_CACHE_SIZE.

include/linux/mm_types.h:

	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units, *not* PAGE_CACHE_SIZE */

Looks like documentation to me.
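Taken together, those two comments fix the units: page->index counts in PAGE_CACHE_SIZE, vm_pgoff in PAGE_SIZE, so converting between them is a shift by the difference of the two orders. A sketch with deliberately different, hypothetical sizes (in the kernels under discussion the two constants were in fact equal, making the shift zero; the TOY_* names are invented):

```c
#include <assert.h>

/* Hypothetical split: 4K base pages, 16K pagecache units, as a
 * PAGE_CACHE_SIZE > PAGE_SIZE build would have. */
#define TOY_PAGE_SHIFT		12
#define TOY_PAGE_CACHE_SHIFT	14
#define TOY_PAGE_SIZE		(1UL << TOY_PAGE_SHIFT)
#define TOY_PAGE_CACHE_SIZE	(1UL << TOY_PAGE_CACHE_SHIFT)

/* Byte offset in the file of pagecache page 'index': the index is in
 * PAGE_CACHE_SIZE units, per the mm.h comment quoted above. */
unsigned long page_file_offset(unsigned long index)
{
	return index << TOY_PAGE_CACHE_SHIFT;
}

/* The same position expressed as a vm_pgoff, which is explicitly in
 * PAGE_SIZE units, *not* PAGE_CACHE_SIZE, per the mm_types.h comment. */
unsigned long index_to_vm_pgoff(unsigned long index)
{
	return index << (TOY_PAGE_CACHE_SHIFT - TOY_PAGE_SHIFT);
}
```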
Re: [PATCH 3/3] nfs: use -mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 22:22 +0100, Nick Piggin wrote:
> On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote:
> > Normal locking order is:
> >
> >   i_mutex
> >     mmap_sem
> >
> > However NFS's ->mmap hook, which is called under mmap_sem, can take
> > i_mutex. Avoid this potential deadlock by doing the work that requires
> > i_mutex from the new ->mmap_prepare().
> >
> > [ Is this sufficient, or does it introduce a race? ]
> Seems like an OK patchset in my opinion. I don't know much about NFS
> unfortunately, but I wonder what prevents the condition fixed by
> nfs_revalidate_mapping from happening again while the mmap is
> active...?

As the changelog might have suggested, I'm not overly sure of the nfs requirements myself. I think it just does a best effort at getting the pages coherent with other clients, and then hopes for the best.

I'll let Trond enlighten us further before I make an utter fool of myself :-)
Re: Should PAGE_CACHE_SIZE be discarded?
On Wed, Nov 14, 2007 at 03:59:39PM +0000, David Howells wrote:
> Nick Piggin <[EMAIL PROTECTED]> wrote:
> > Christoph Lameter has patches exactly to make PAGE_CACHE_SIZE larger
> > than PAGE_SIZE, and they seem to work without much effort. I happen to
> > hate the patches ;) but that doesn't change the fact that
> > PAGE_CACHE_SIZE is relatively useful and it is not at all an
> > ill-defined concept.
> Where, please? mm kernels?

Floating around. I'm not saying it will even get upstream. It's just an example.

> > Basically, anything that goes in the page cache is in units of
> > PAGE_CACHE_SIZE, and nothing else. For filesystems it should be pretty
> > easy...
> That depends on what the coverage of struct page is. I don't actually
> know whether this is PAGE_SIZE or PAGE_CACHE_SIZE; I assumed it to be
> the former, but from what you've said, I'm not actually sure.

It can be pretty well any power of 2 from PAGE_SIZE upwards, with compound pages. None of the filesystems should really care at all. It's not even a new concept, hugetlbfs uses HPAGE_SIZE...
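The compound-page arithmetic behind Nick's point: an order-n compound page covers 2^n base pages, so any power of two from PAGE_SIZE upwards is expressible, hugetlbfs's HPAGE_SIZE being the long-standing example (order 9 with 4K base pages on x86). The values and names below are illustrative only:

```c
#include <assert.h>

#define TOY_PAGE_SHIFT	12
#define TOY_PAGE_SIZE	(1UL << TOY_PAGE_SHIFT)

/* Bytes covered by a compound page of the given order. */
unsigned long compound_bytes(unsigned int order)
{
	return TOY_PAGE_SIZE << order;
}

/* Number of base struct pages backing it -- the "page array" a
 * filesystem written for single pages never looks at individually. */
unsigned long compound_nr_pages(unsigned int order)
{
	return 1UL << order;
}
```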
Re: [PATCH 3/3] nfs: use -mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 16:41 -0500, Trond Myklebust wrote:
> On Wed, 2007-11-14 at 22:31 +0100, Peter Zijlstra wrote:
> > On Wed, 2007-11-14 at 22:22 +0100, Nick Piggin wrote:
> > > On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote:
> > > > Normal locking order is:
> > > >
> > > >   i_mutex
> > > >     mmap_sem
> > > >
> > > > However NFS's ->mmap hook, which is called under mmap_sem, can
> > > > take i_mutex. Avoid this potential deadlock by doing the work that
> > > > requires i_mutex from the new ->mmap_prepare().
> > > >
> > > > [ Is this sufficient, or does it introduce a race? ]
> > > Seems like an OK patchset in my opinion. I don't know much about NFS
> > > unfortunately, but I wonder what prevents the condition fixed by
> > > nfs_revalidate_mapping from happening again while the mmap is
> > > active...?
> > As the changelog might have suggested, I'm not overly sure of the nfs
> > requirements myself. I think it just does a best effort at getting the
> > pages coherent with other clients, and then hopes for the best.
> >
> > I'll let Trond enlighten us further before I make an utter fool of
> > myself :-)
> The NFS client needs to check the validity of already cached data
> before it allows those pages to be mmapped. If it finds out that the
> cache is stale, then we need to call invalidate_inode_pages2() to clear
> out the cache and refresh it from the server. The inode->i_mutex is
> necessary in order to prevent races between the new writes and the
> cache invalidation code.

Right, but I guess what Nick asked is, if pages could be stale to start with, how is that avoided in the future. The way I understand it, this re-validate is just a best effort at getting a coherent image.
Re: [PATCH 3/3] nfs: use -mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 22:31 +0100, Peter Zijlstra wrote:
> On Wed, 2007-11-14 at 22:22 +0100, Nick Piggin wrote:
> > On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote:
> > > Normal locking order is:
> > >
> > >   i_mutex
> > >     mmap_sem
> > >
> > > However NFS's ->mmap hook, which is called under mmap_sem, can take
> > > i_mutex. Avoid this potential deadlock by doing the work that
> > > requires i_mutex from the new ->mmap_prepare().
> > >
> > > [ Is this sufficient, or does it introduce a race? ]
> > Seems like an OK patchset in my opinion. I don't know much about NFS
> > unfortunately, but I wonder what prevents the condition fixed by
> > nfs_revalidate_mapping from happening again while the mmap is
> > active...?
> As the changelog might have suggested, I'm not overly sure of the nfs
> requirements myself. I think it just does a best effort at getting the
> pages coherent with other clients, and then hopes for the best.
>
> I'll let Trond enlighten us further before I make an utter fool of
> myself :-)

The NFS client needs to check the validity of already cached data before it allows those pages to be mmapped. If it finds out that the cache is stale, then we need to call invalidate_inode_pages2() to clear out the cache and refresh it from the server. The inode->i_mutex is necessary in order to prevent races between the new writes and the cache invalidation code.

Cheers
  Trond
Re: [PATCH 3/3] nfs: use -mmap_prepare() to avoid an AB-BA deadlock
On Wed, Nov 14, 2007 at 09:01:39PM +0100, Peter Zijlstra wrote:
> Normal locking order is:
>
>   i_mutex
>     mmap_sem
>
> However NFS's ->mmap hook, which is called under mmap_sem, can take
> i_mutex. Avoid this potential deadlock by doing the work that requires
> i_mutex from the new ->mmap_prepare().
>
> [ Is this sufficient, or does it introduce a race? ]

Seems like an OK patchset in my opinion. I don't know much about NFS unfortunately, but I wonder what prevents the condition fixed by nfs_revalidate_mapping from happening again while the mmap is active...?
Re: [PATCH 3/3] nfs: use -mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 22:50 +0100, Peter Zijlstra wrote:
> Right, but I guess what Nick asked is, if pages could be stale to start
> with, how is that avoided in the future. The way I understand it, this
> re-validate is just a best effort at getting a coherent image.

The normal convention for NFS is to use a close-to-open cache consistency model. In that model, applications must agree never to open the file for reading or writing if an application on a different NFS client already holds it open for writing. However there is no standard locking model for _enforcing_ such an agreement, so some setups do violate it.

One obvious model that we try to support is that where the applications are using POSIX locking in order to ensure exclusive access to the data when required. Another model is to rely rather on synchronous writes and heavy attribute revalidation to detect when a competing application has written to the file (the 'noac' mount option).

While such a model is obviously deficient in that it can never guarantee cache coherency, we do attempt to ensure that it works on a per-operation basis (IOW: we check cache coherency before each call to read(), to mmap(), etc) since it is by far the easiest model to apply if you have applications that cannot be rewritten and that satisfy the requirement that they rarely conflict.

Cheers
  Trond
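The per-operation revalidation Trond describes boils down to: on each open (and mmap), fetch fresh attributes and toss the cache if they show the file changed behind our back. A schematic sketch, not the NFS client code — the struct, the change test, and all names are simplified stand-ins:

```c
#include <assert.h>

/* Simplified stand-ins for the attributes an NFS client compares. */
struct attrs { long ctime; long size; };

struct client_cache {
	struct attrs seen;	/* attributes the cache was built against */
	int pages_cached;	/* toy stand-in for cached pagecache data */
	int invalidations;
};

/* On open()/mmap(): compare fresh server attributes with what the
 * cache was built against, and drop the cache if the file changed on
 * the server -- the analogue of nfs_revalidate_mapping() calling
 * invalidate_inode_pages2(). */
void revalidate(struct client_cache *c, struct attrs server)
{
	if (server.ctime != c->seen.ctime || server.size != c->seen.size) {
		c->pages_cached = 0;	/* stale: drop and refetch */
		c->invalidations++;
		c->seen = server;
	}
}
```

This runs per operation, which is exactly why it is best-effort: nothing stops the server-side file from changing again the moment the check completes — the window that is unavoidable for mmap.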
Re: [PATCH 3/3] nfs: use -mmap_prepare() to avoid an AB-BA deadlock
On Wed, Nov 14, 2007 at 05:18:50PM -0500, Trond Myklebust wrote:
> On Wed, 2007-11-14 at 22:50 +0100, Peter Zijlstra wrote:
> > Right, but I guess what Nick asked is, if pages could be stale to
> > start with, how is that avoided in the future. The way I understand
> > it, this re-validate is just a best effort at getting a coherent
> > image.
> The normal convention for NFS is to use a close-to-open cache
> consistency model. In that model, applications must agree never to open
> the file for reading or writing if an application on a different NFS
> client already holds it open for writing. However there is no standard
> locking model for _enforcing_ such an agreement, so some setups do
> violate it.
>
> One obvious model that we try to support is that where the applications
> are using POSIX locking in order to ensure exclusive access to the data
> when required. Another model is to rely rather on synchronous writes
> and heavy attribute revalidation to detect when a competing application
> has written to the file (the 'noac' mount option).
>
> While such a model is obviously deficient in that it can never
> guarantee cache coherency, we do attempt to ensure that it works on a
> per-operation basis (IOW: we check cache coherency before each call to
> read(), to mmap(), etc) since it is by far the easiest model to apply
> if you have applications that cannot be rewritten and that satisfy the
> requirement that they rarely conflict.

mmap()s can be different from read in that the syscall may have little relation to when the data gets used. But I guess it's still a best effort thing. Fair enough.

Thanks,
Nick
Re: [PATCH 3/3] nfs: use ->mmap_prepare() to avoid an AB-BA deadlock
On Wed, 2007-11-14 at 23:24 +0100, Nick Piggin wrote:
> mmap() can be different from read() in that the syscall may have little
> relation to when the data gets used. But I guess it's still a
> best-effort thing. Fair enough.

Agreed that mmap() is special and very problematic on NFS. However, I
can't see how we can improve on the existing models short of some
significant protocol modifications, and so far nobody has presented the
IETF with a good case for why they need this level of cache consistency.

Cheers
  Trond
[patch 02/13] locks: fix possible infinite loop in posix deadlock detection
-stable review patch. If anyone has any objections, please let us know.

------------------
From: J. Bruce Fields [EMAIL PROTECTED]

patch 97855b49b6bac0bd25f16b017883634d13591d00 in mainline.

It's currently possible to send posix_locks_deadlock() into an infinite
loop (under the BKL).

For now, fix this just by bailing out after a few iterations. We may want
to fix this in a way that better clarifies the semantics of deadlock
detection. But that will take more time, and this minimal fix is probably
adequate for any realistic scenario, and is simple enough to be
appropriate for applying to stable kernels now.

Thanks to George Davis for reporting the problem.

Cc: George G. Davis [EMAIL PROTECTED]
Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
Acked-by: Alan Cox [EMAIL PROTECTED]
Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
Signed-off-by: Greg Kroah-Hartman [EMAIL PROTECTED]

---
 fs/locks.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/fs/locks.c
+++ b/fs/locks.c
@@ -694,11 +694,20 @@ EXPORT_SYMBOL(posix_test_lock);
  * Note: the above assumption may not be true when handling lock requests
  * from a broken NFS client. But broken NFS clients have a lot more to
  * worry about than proper deadlock detection anyway... --okir
+ *
+ * However, the failure of this assumption (also possible in the case of
+ * multiple tasks sharing the same open file table) also means there's no
+ * guarantee that the loop below will terminate. As a hack, we give up
+ * after a few iterations.
  */
+
+#define MAX_DEADLK_ITERATIONS 10
+
 static int posix_locks_deadlock(struct file_lock *caller_fl,
 				struct file_lock *block_fl)
 {
 	struct list_head *tmp;
+	int i = 0;
 
 next_task:
 	if (posix_same_owner(caller_fl, block_fl))
@@ -706,6 +715,8 @@ next_task:
 	list_for_each(tmp, &blocked_list) {
 		struct file_lock *fl = list_entry(tmp, struct file_lock, fl_link);
 		if (posix_same_owner(fl, block_fl)) {
+			if (i++ > MAX_DEADLK_ITERATIONS)
+				return 0;
 			fl = fl->fl_next;
 			block_fl = fl;
 			goto next_task;
--