Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

>  What if someone has written to multiple, non-contiguous regions
>  of a page?

This has been foreseen, and hence we only allow 1 contiguous region per
page. If somebody tries to schedule a write to a second region that is
not contiguous with the first, then we flush out the already scheduled
write first, and wait on it to complete.

We also only allow 1 process at a time to schedule writes to a given
page. This is done in case the file access permissions turn out to be
untrustworthy (due to e.g. root squashing and the like): it means that
we don't end up clobbering the data from another process' pending
write.

>  So, assume a page is 4096 bytes long, and there are 2048
>  processes that each have write locked the even numbered bytes
>  of the first page of a file. Once all the locks have been
>  acquired, the page has been flushed to disk.

>  Now we have a clean page. Each process then writes only the one
>  byte region that it has locked.

>  Now they unlock the region.

>  What happens?

>  (btw, on another nfs client, 2048 processes have locked the odd
>  numbered bytes of the first page of the same file).

>  If you claim that the first client doesn't overwrite the updates
>  from the 2nd client, then I'll shut up.

Blech... It'll be slow, but as I said, only 1 contiguous region is
allowed to have pending writes on any given page at any given point in
time. Basically, you'll end up with 2048 byte-sized regions getting
written out from each machine. Each time a new write gets scheduled, it
will flush out the previous write and wait on its completion. In your
example, we will therefore in practice be doing completely synchronous
writes...

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> > " " == Michael Eisler <[EMAIL PROTECTED]> writes:
>
> >> I'm not clear on why you want to enforce page alignedness
> >> though? As long as writes respect the lock boundaries (and not
> >> page boundaries) why would use of a page cache change matters?
>
> > For the reason that was pointed out earlier by someone else as to
> > why your fix is inadequate. Since the I/O is page-based, if the
> > locks are not, then two threads on two different clients will
> > step over each other's locked regions.
>
> No they don't.
>
> As I've repeatedly stated, our cache does not require us to respect
> page boundaries when writing. We do make sure that all writes pending
> on the entire file are flushed to disk before we lock/unlock a
> region. If somebody has held a lock on 2 bytes lying across a page
> boundary, and has only written within that 2 byte region, a write is
> sent for 2 bytes.

What if someone has written to multiple, non-contiguous regions of a
page?

> There is no difference here between cached and uncached operation. The
> only difference is that we trust the lock to prevent other machines
> writing within the locked region.

So, assume a page is 4096 bytes long, and there are 2048 processes that
each have write locked the even numbered bytes of the first page of a
file. Once all the locks have been acquired, the page has been flushed
to disk.

Now we have a clean page. Each process then writes only the one byte
region that it has locked.

Now they unlock the region.

What happens?

(btw, on another nfs client, 2048 processes have locked the odd
numbered bytes of the first page of the same file).

If you claim that the first client doesn't overwrite the updates from
the 2nd client, then I'll shut up.

-mre
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

>> I'm not clear on why you want to enforce page alignedness
>> though? As long as writes respect the lock boundaries (and not
>> page boundaries) why would use of a page cache change matters?

> For the reason that was pointed out earlier by someone else as to
> why your fix is inadequate. Since the I/O is page-based, if the
> locks are not, then two threads on two different clients will
> step over each other's locked regions.

No they don't.

As I've repeatedly stated, our cache does not require us to respect
page boundaries when writing. We do make sure that all writes pending
on the entire file are flushed to disk before we lock/unlock a region.
If somebody has held a lock on 2 bytes lying across a page boundary,
and has only written within that 2 byte region, a write is sent for 2
bytes.

There is no difference here between cached and uncached operation. The
only difference is that we trust the lock to prevent other machines
writing within the locked region.

When reading, we have now (assuming the patch I just sent in to Linus
was accepted) ensured that all pages that were cached prior to grabbing
the lock are thrown out. Any new access to that file will cause the
relevant pages to be reread from the server.

Even if we're extending a lock within a page boundary, or locking a new
portion of the file, all existing writes are flushed to disk first, and
the entire page cache is invalidated. Again, any access to the newly
locked region will result in a reread from the server.

Cheers,
  Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> Yes. fs/read_write calls the NFS subsystem. The problem then is that
> NFS uses the generic_file_{read,write,mmap}() interfaces. These are
> what enforce use of the page cache.

So, don't use generic*() when locking is active. It's what most other
UNIX-based NFS clients do. Even if it is "stupid beyond belief", it
works.

> You could drop these functions, but that would mean designing an
> entire VFS for NFS's use alone. Such a decision would have to be very
> well motivated in order to convince Linus.

Avoiding corruption.

> >> As far as I can see, the current use of the page cache should
> >> be safe as long as applications respect the locking boundaries,
> >> and don't expect consistency outside locked areas.
>
> > Then the code ought to enforce page aligned locks. Of course,
> > while that will produce correctness, it will violate the
> > principle of least surprise. It might be better to simply
>
> AFAICS that would be a bad idea, since it will lead to programs having
> to know about the hardware granularity. You could easily imagine
> deadlock situations that could arise if one program is unwittingly
> locking an area that is meant to be claimed by another.

I can't imagine any deadlock scenarios. If the app locks on a page
boundary, then accept it; otherwise return an error. But it does
violate least surprise, so I think bypassing the page cache when
locking is active is better.

> I'm not clear on why you want to enforce page alignedness though? As
> long as writes respect the lock boundaries (and not page boundaries)
> why would use of a page cache change matters?

For the reason that was pointed out earlier by someone else as to why
your fix is inadequate. Since the I/O is page-based, if the locks are
not, then two threads on two different clients will step over each
other's locked regions.

Folks might think that NLM locking is brain dead, and they wouldn't get
an argument from me. But if you are going to document that you support
it, then please get it right.
-mre
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On 16 Sep 2000, Trond Myklebust wrote:
>
> Yes. fs/read_write calls the NFS subsystem. The problem then is that
> NFS uses the generic_file_{read,write,mmap}() interfaces. These are
> what enforce use of the page cache.
>
> You could drop these functions, but that would mean designing an
> entire VFS for NFS's use alone. Such a decision would have to be very
> well motivated in order to convince Linus.

Ehh.. I would say that such a decision would be stupid beyond belief,
and impossible to motivate.

NFS certainly doesn't _have_ to use the page cache. However, not using
the page cache would basically in the end be equivalent to

 (a) not having coherent mmap's over NFS
 (b) probably having much weaker caching

neither of which is really an option at all, I suspect.

You could do caching on your own, but I dare anybody to come up with a
better cache that is able to handle mmap and read/write coherency. The
page granularity issues come from mmap, not from the page cache per se.
ANYTHING that solves the coherency issue is pretty much bound by the
limitations of a page-cache-like thing - they are not limitations of
the implementation, but limitations pretty much inherent to the problem
itself.

		Linus
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> I'm not a Linux kernel literate. However, I found your
> assertion surprising. Does procfs do page i/o as well?

No. An fs isn't required to use the page cache at all. It makes life a
lot saner to do so.
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

> I'm not a Linux kernel literate. However, I found your
> assertion surprising. Does procfs do page i/o as well?

No. It has its own setup.

> file.c in fs/nfs suggests that the Linux VFS has non-page
> interfaces in addition to page interfaces. fs/read_write.c
> suggests that the read and write system calls use the non-page
> interface.

Yes. fs/read_write calls the NFS subsystem. The problem then is that
NFS uses the generic_file_{read,write,mmap}() interfaces. These are
what enforce use of the page cache.

You could drop these functions, but that would mean designing an entire
VFS for NFS's use alone. Such a decision would have to be very well
motivated in order to convince Linus.

>> As far as I can see, the current use of the page cache should
>> be safe as long as applications respect the locking boundaries,
>> and don't expect consistency outside locked areas.

> Then the code ought to enforce page aligned locks. Of course,
> while that will produce correctness, it will violate the
> principle of least surprise. It might be better to simply

AFAICS that would be a bad idea, since it would lead to programs having
to know about the hardware granularity. You could easily imagine
deadlock situations that could arise if one program is unwittingly
locking an area that is meant to be claimed by another.

I'm not clear on why you want to enforce page alignedness, though. As
long as writes respect the lock boundaries (and not page boundaries),
why would use of a page cache change matters?

Cheers,
  Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> > " " == Michael Eisler <[EMAIL PROTECTED]> writes:
>
> > Focus on correctness and do the expedient thing first, which
> > is:
> > - The first time a file is locked, flush dirty pages
> >   to the server, and then invalidate the page cache
>
> This would be implemented with the last patch I proposed.
>
> > - While the file is locked, bypass the page cache for all
> >   I/O.
>
> This is not possible given the current design of the Linux VFS. The
> design is such that all reads/writes go through the page cache. I'm

I'm not a Linux kernel literate. However, I found your assertion
surprising. Does procfs do page i/o as well?

file.c in fs/nfs suggests that the Linux VFS has non-page interfaces in
addition to page interfaces. fs/read_write.c suggests that the read and
write system calls use the non-page interface.

I cannot speak for Linux, but System V Release 4 derived systems use
the page cache primarily as a tool for each file system, yet still hide
the page interface from the code path leading from the read/write
system calls to the VFS.

> not sure that it is possible to get round this without some major
> changes in VFS philosophy. Hacks such as invalidating the cache after
> each read/write would definitely give rise to races.
>
> As far as I can see, the current use of the page cache should be safe
> as long as applications respect the locking boundaries, and don't
> expect consistency outside locked areas.

Then the code ought to enforce page aligned locks. Of course, while
that will produce correctness, it will violate the principle of least
surprise. It might be better to simply return an error when a lock
operation is attempted on an NFS file. That assumes, of course, that
the Linux kernel isn't capable of honoring a read() or write() system
call whenever the file system doesn't support page-based i/o, which,
again, I'd be surprised by.
-mre
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

> Focus on correctness and do the expedient thing first, which
> is:
> - The first time a file is locked, flush dirty pages
>   to the server, and then invalidate the page cache

This would be implemented with the last patch I proposed.

> - While the file is locked, bypass the page cache for all
>   I/O.

This is not possible given the current design of the Linux VFS. The
design is such that all reads/writes go through the page cache. I'm not
sure that it is possible to get round this without some major changes
in VFS philosophy. Hacks such as invalidating the cache after each
read/write would definitely give rise to races.

As far as I can see, the current use of the page cache should be safe
as long as applications respect the locking boundaries, and don't
expect consistency outside locked areas. The exception here is mmap(),
for which no sane semantics can be found unless the application
explicitly unmaps and remaps after having acquired the file lock.

Cheers,
  Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> > " " == James Yarbrough <[EMAIL PROTECTED]> writes:
>
> > What is done for bypassing the cache when the size of a file
> > lock held by the reading/writing process is not a multiple of
> > the caching granularity? Consider two different clients with
> > processes sharing a file and locking 2k byte regions of the
> > file and possibly updating these regions. Suppose that each
> > system caches 4k byte blocks. If system A locks the first 2k
> > of a block and system B locks the second 2k, the updates from
> > one of the systems may be lost if these systems cache the
> > writes. This is because each system will write back the 4k
> > block it cached, not the 2k block that was locked.
>
> Under Linux writebacks have byte-sized granularity. If a page has been
> partially dirtied, we save that information, and only write back the
> dirty areas. As long as each system has restricted its updates to
> within the 2k block that it has locked, there should be no conflict.
>
> If however one system has been writing over the full 4k block, then
> there will indeed be a race.

Using a page cache when locking is turned on is tricky at best. The
only working optimization I know of in this area is allowing the use of
the page cache when the entire file is locked.

My two cents ...

Focus on correctness and do the expedient thing first, which is:

- The first time a file is locked, flush dirty pages to the server,
  and then invalidate the page cache
- While the file is locked, bypass the page cache for all I/O.

Once that works, the gaping wound in the Linux NFS/NLM client will be
closed. This will give you the breathing room to experiment with
something that works more optimally (yet still correctly) in some
conditions. E.g., one possible optimization is to allow page I/O as
long as the locks are page aligned or whole file aligned.
-mre
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == James Yarbrough <[EMAIL PROTECTED]> writes:

> What is done for bypassing the cache when the size of a file
> lock held by the reading/writing process is not a multiple of
> the caching granularity? Consider two different clients with
> processes sharing a file and locking 2k byte regions of the
> file and possibly updating these regions. Suppose that each
> system caches 4k byte blocks. If system A locks the first 2k
> of a block and system B locks the second 2k, the updates from
> one of the systems may be lost if these systems cache the
> writes. This is because each system will write back the 4k
> block it cached, not the 2k block that was locked.

Under Linux writebacks have byte-sized granularity. If a page has been
partially dirtied, we save that information, and only write back the
dirty areas. As long as each system has restricted its updates to
within the 2k block that it has locked, there should be no conflict.

If however one system has been writing over the full 4k block, then
there will indeed be a race.

Cheers,
  Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> This will always invalidate the page cache whenever we try to obtain
> the lock, hence you are guaranteed that the cache will be reread after
> the lock was grabbed.

> After unlocking however one needs no guarantees other than ensuring
> that any modifications were committed while we held the lock, so we
> flush the writebacks, drop the lock, and leave it at that.

What is done for bypassing the cache when the size of a file lock held
by the reading/writing process is not a multiple of the caching
granularity? Consider two different clients with processes sharing a
file and locking 2k byte regions of the file and possibly updating
these regions. Suppose that each system caches 4k byte blocks. If
system A locks the first 2k of a block and system B locks the second
2k, the updates from one of the systems may be lost if these systems
cache the writes. This is because each system will write back the 4k
block it cached, not the 2k block that was locked.
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

> The fix still does not provide coherency guarantees in all
> situations, and at minimum, there ought to be a way to force
> the client to provide a coherency guarantee.

Yes. I came to the same conclusion after having sent it off. I hope the
following patch is more acceptable.

This will always invalidate the page cache whenever we try to obtain
the lock, hence you are guaranteed that the cache will be reread after
the lock was grabbed.

After unlocking, however, one needs no guarantees other than ensuring
that any modifications were committed while we held the lock, so we
flush the writebacks, drop the lock, and leave it at that.

Cheers,
  Trond

--- linux-2.4.0-test8/fs/nfs/file.c	Fri Jun 30 01:02:40 2000
+++ linux-2.4.0-test8-fix_lock/fs/nfs/file.c	Fri Sep 15 13:11:39 2000
@@ -291,10 +291,13 @@
 	status = 0;
 
 	/*
-	 * Make sure we re-validate anything we've got cached.
+	 * Make sure we clear the cache whenever we try to get the lock.
 	 * This makes locking act as a cache coherency point.
 	 */
  out_ok:
-	NFS_CACHEINV(inode);
+	if ((cmd == F_SETLK || cmd == F_SETLKW) && fl->fl_type != F_UNLCK) {
+		nfs_wb_all(inode);	/* we may have slept */
+		nfs_zap_caches(inode);
+	}
 	return status;
 }
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
Jeff Epler <[EMAIL PROTECTED]> writes:
> > > Is there a solution that would allow the kind of guarantee our
> > > software wants with non-linux nfsds without the cache-blowing
> > > that the change I'm suggesting causes?

Trond:
> > As you can see, the idea is to look at whether or not the file has
> > changed recently (I arbitrarily chose a full minute as a concession
> > towards clusters with lousy clock synchronization). If it has, then
> > the page cache is zapped. If not, we force ordinary attribute cache
> > consistency checking.

On Thu, Sep 14, 2000 at 09:48:50AM -0700, Michael Eisler wrote:
> The fix still does not provide coherency guarantees in all situations,
> and at minimum, there ought to be a way to force the client to provide
> a coherency guarantee.

I agree. I wish that the way to force the guarantee was through fcntl(F_???LK), but we'll take whatever way we can get.

Jeff
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> > " " == Jeff Epler <[EMAIL PROTECTED]> writes: > > > Is there a solution that would allow the kind of guarantee our > > software wants with non-linux nfsds without the cache-blowing > > that the change I'm suggesting causes? > > How about something like the following compromise? > > I haven't tried it out yet (and I've no idea whether or not Linus > would accept this) but it compiles, and it should definitely be better > behaved with respect to slowly-changing files. > > As you can see, the idea is to look at whether or not the file has > changed recently (I arbitrarily chose a full minute as a concession > towards clusters with lousy clock synchronization). If it has, then > the page cache is zapped. If not, we force ordinary attribute cache > consistency checking. The fix still does not provide coherency guarantees in all situations, and at minimum, there ought to be a way to force the client provide a coherency guarantee. > Cheers, > Trond > > --- linux-2.4.0-test8/fs/nfs/file.c Fri Jun 30 01:02:40 2000 > +++ linux-2.4.0-test8-fix_lock/fs/nfs/file.c Thu Sep 14 09:18:50 2000 > @@ -240,6 +240,20 @@ > } > > /* > + * Ensure more conservative data cache consistency than NFS_CACHEINV() > + * for files that change frequently. Avoids problems with sub-second > + * changes that don't register on i_mtime. > + */ > +static inline void > +nfs_lock_cacheinv(struct inode *inode) > +{ > + if ((long)CURRENT_TIME - (long)(inode->i_mtime + 60) < 0) > + nfs_zap_caches(inode); > + else > + NFS_CACHEINV(inode); > +} > + > +/* > * Lock a (portion of) a file > */ > int > @@ -295,6 +309,6 @@ >* This makes locking act as a cache coherency point. >*/ > out_ok: > - NFS_CACHEINV(inode); > + nfs_lock_cacheinv(inode); > return status; > } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
Theodore Y. Ts'o writes:
> From: "Albert D. Cahalan" <[EMAIL PROTECTED]>
>> The ext2 inode has 6 obviously free bytes, 6 that are only used
>> on filesystems marked as Hurd-type, and 8 that seem to be claimed
>> by competing security and EA projects. So, being wasteful, it would
>> be possible to have picoseconds on all 4 time stamps. Doing mere
>> milliseconds on 3 timestamps would leave us 16 bytes for security.
>
> Obviously free to you, perhaps, but you're not the one who allocates
> bits out of the ext2 inode. There are various projects that are all
> trying to claim bits of ext2, including folks who want to implement
> fragment support, which you've conveniently edited out of your
> un-official ext2 inode structure which you sent out.

First of all, I give you fragments:

	frag_data = 0;
	file_size = i_size;
	if (i_size_high >> 31) {	/* high bit marks fragment data */
		frag_data = i_size_high;
	} else {
		file_size |= (__u64)i_size_high << 32;	/* frag is insignificant */
	}

Second of all, you had been planning on extents. With extents, there should be no need to play around with fragments. You just use a small block size. Third, if somebody wants UFS, they can use the UFS driver! More likely, these people should be using Reiserfs.

> There has been some talk of doubling the size of the ext2 inode, which
> will of course cause some backwards compatibility problems and would
> mean that you would only be able to use certain advanced features on new
> or converted ext2 filesystems. However, there are enough downsides with
> this that it's something of a last resort.

Oh well. People used to gripe about ext2 being bloated compared to the early Linux filesystems. Now we are getting JFS, with its huge quad-size inodes.

> This is why I'd much prefer a solution which uses a few bits out of
> i_generation. True, it's somewhat filesystem specific; but I expect
> that any solution that's going to work over multiple filesystems is
> going to have at least some filesystem specific bits. And given that we
> need i_generation anyway for other purposes, we might as well allow one
> of the solutions to simply borrow some number of bits from i_generation.

Would you give up 6, 17, 21, or 30 bits? This would chop away at the protection offered by i_generation. Since accurate timestamps are generally useful for NFS and "make", they ought to get at least 30 bits, if not 60 (for the three).
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
Date: Thu, 14 Sep 2000 17:03:11 +0200 (CEST) From: Trond Myklebust <[EMAIL PROTECTED]> For the timestamps, yes, but inode caching will take most of that hit. After all, the only time stat() reads from disk is when the inode has completely fallen out of the cache. For commonly used inodes, sure, but there are lots of applications where the inode cache doesn't buy you that much. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Theodore Y Ts'o <[EMAIL PROTECTED]> writes: >Would it perhaps make sense to use one of these last 'free' >fields as a pointer to an 'inode entension'? If you still >want ext2fs to be able to accommodate new projects and >ideas, then it seems that being able to extend the inode is >a desirable feature, but perhaps this overlaps with the >apparent plans for adding resource forks? > For stuff that's not commonly used, perhaps. The problem is > that you take a speed hit when you have to seek somewhere else > to get at the inode extension. So for something which is going > to have to be referenced for every stat() or getattr() > operation, there are a real performance issues with doing > something like that. For the timestamps, yes, but inode caching will take most of that hit. After all, the only time stat() reads from disk is when the inode has completely fallen out of the cache. I agree that doing something like this is less than ideal, but given that the ext2fs inode space hunt seems to be a frequently recurring issue it should be considered as an alternative that gives us whatever space we need for 3 full-sized timestamps and possibly for any future ext2fs projects while preserving backward compatibility. Of course, as I said: when suggesting this, I'm moving completely out of my depth. I'm not ready to argue seriously for or against anything whatsoever, but I'd appreciate an informed opinion. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
Date: Thu, 14 Sep 2000 15:09:35 +0200 (CEST)
From: Trond Myklebust <[EMAIL PROTECTED]>

> Would it perhaps make sense to use one of these last 'free' fields as
> a pointer to an 'inode extension'? If you still want ext2fs to be able
> to accommodate new projects and ideas, then it seems that being able
> to extend the inode is a desirable feature, but perhaps this overlaps
> with the apparent plans for adding resource forks?

For stuff that's not commonly used, perhaps. The problem is that you take a speed hit when you have to seek somewhere else to get at the inode extension. So for something which is going to have to be referenced for every stat() or getattr() operation, there are real performance issues with doing something like that.

- Ted
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Theodore Y Ts'o <[EMAIL PROTECTED]> writes: > There has been some talk of doubling the size of the ext2 > inode, which will of course cause some backwards compatibility > problems and would mean that you would only be able to use > certain advanced features on new or converted ext2 filesystems. > However, there are enough downsides with this that it's > something of a last resort. It would make life a lot easier > for those various people doing new ext2 features from muscling > each other over space all the time. I'm sure the idea has been raised before, but given the above paragraph I can't resist poking my nose into where it doesn't really belong: Would it perhaps make sense to use one of these last 'free' fields as a pointer to an 'inode entension'? If you still want ext2fs to be able to accommodate new projects and ideas, then it seems that being able to extend the inode is a desirable feature, but perhaps this overlaps with the apparent plans for adding resource forks? Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
From: "Albert D. Cahalan" <[EMAIL PROTECTED]> Date:Wed, 13 Sep 2000 19:20:42 -0400 (EDT) The ext2 inode has 6 obviously free bytes, 6 that are only used on filesystems marked as Hurd-type, and 8 that seem to be claimed by competing security and EA projects. So, being wasteful, it would be possible to have picoseconds on all 4 time stamps. Doing mere milliseconds on 3 timestamps would leave us 16 bytes for security. Obviously free to you, perhaps, but you're not the one who allocates bits out of the ext2 inode. There are various projects that are all trying claim bits of ext2, including folks who want to implement fragment support, which you've conveniently edited out of your un-official ext2 inode structure which you sent out. There has been some talk of doubling the size of the ext2 inode, which will of course cause some backwards compatibility problems and would mean that you would only be able to use certain advanced features on new or converted ext2 filesystems. However, there are enough downsides with this that it's something of a last resort. It would make life a lot easier for those various people doing new ext2 features from muscling each other over space all the time. This is why I'd much prefer a solution which uses a few bits out of i_generation. True, it's somewhat filesystem specific; but I expect that any solution that's going to work over multiple filesystems is going to have at least some filesystem specific bits. And given that we need i_generation anyway for other purposes, we might as well allow one of the solutions to simply borrow some number of bits from i_generation. - Ted - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Thu, Sep 14, 2000 at 11:51:20AM +0200, Trond Myklebust wrote:
> The reason I'm sceptical of this and other in-core type solutions is
> that knfsd doesn't keep files open for long: basically it opens a file
> and closes it upon processing of each and every request from a
> client. You can therefore easily risk losing the inode between two
> requests while the file is still open on the client.
> If this means that you also lose part of the info in the mtime field,
> then you will trigger a data cache invalidation on the client.

knfsd only keeps the file open for a short time, but the page cache references the inode as long as it has any data pages for the file. As long as you're not thrashing extensively, it should stay in memory for a longer time.

-Andi
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Andi Kleen <[EMAIL PROTECTED]> writes: > On Thu, Sep 14, 2000 at 10:49:59AM +0200, Trond Myklebust > wrote: >> > " " == Albert D Cahalan <[EMAIL PROTECTED]> writes: >> >> > The ext2 inode has 6 obviously free bytes, 6 that are only >> > used on filesystems marked as Hurd-type, and 8 that seem to >> > be claimed by competing security and EA projects. So, being >> > wasteful, it would be possible to have picoseconds on all 4 >> > time stamps. Doing mere milliseconds on 3 timestamps would >> > leave us 16 bytes for security. >> >> It's probably more realistic given the time resolution on most >> PCs. One could perhaps allow for greater resolution on those >> filesystems that have the storage for it at the VFS level. > You could also just keep the more finegrained mtime in core and > when you flush the inode round it up to the next second. That would work for ordinary (stateful) filesystems where you open it, then read/write whatever you need then close the file. The reason I'm sceptical to this and other in-core type solutions is that knfsd doesn't keep files open for long: basically it opens a file and close it upon processing of each and every request from a client. You can therefore easily risk losing the inode between two requests while the file is still open on the client. If this means that you also lose part of the info in the mtime field, then you will trigger a data cache invalidation on the client. It's very important to avoid excessive cache invalidation if you want stuff like mmap() to work on the client. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Thu, Sep 14, 2000 at 10:49:59AM +0200, Trond Myklebust wrote:
> > " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:
>
> > The ext2 inode has 6 obviously free bytes, 6 that are only used
> > on filesystems marked as Hurd-type, and 8 that seem to be
> > claimed by competing security and EA projects. So, being
> > wasteful, it would be possible to have picoseconds on all 4
> > time stamps. Doing mere milliseconds on 3 timestamps would
> > leave us 16 bytes for security.
>
> It's probably more realistic given the time resolution on most PCs.
> One could perhaps allow for greater resolution on those filesystems
> that have the storage for it at the VFS level.

You could also just keep the more fine-grained mtime in core and round it up to the next second when you flush the inode.

-Andi
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Albert D Cahalan <[EMAIL PROTECTED]> writes: > The ext2 inode has 6 obviously free bytes, 6 that are only used > on filesystems marked as Hurd-type, and 8 that seem to be > claimed by competing security and EA projects. So, being > wasteful, it would be possible to have picoseconds on all 4 > time stamps. Doing mere milliseconds on 3 timestamps would > leave us 16 bytes for security. It's probably more realistic given the time resolution on most PCs. One could perhaps allow for greater resolution on those filesystems that have the storage for it at the VFS level. What are the 'i_reserved'[12] fields all about? Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Jeff Epler <[EMAIL PROTECTED]> writes: > Is there a solution that would allow the kind of guarantee our > software wants with non-linux nfsds without the cache-blowing > that the change I'm suggesting causes? How about something like the following compromise? I haven't tried it out yet (and I've no idea whether or not Linus would accept this) but it compiles, and it should definitely be better behaved with respect to slowly-changing files. As you can see, the idea is to look at whether or not the file has changed recently (I arbitrarily chose a full minute as a concession towards clusters with lousy clock synchronization). If it has, then the page cache is zapped. If not, we force ordinary attribute cache consistency checking. Cheers, Trond --- linux-2.4.0-test8/fs/nfs/file.c Fri Jun 30 01:02:40 2000 +++ linux-2.4.0-test8-fix_lock/fs/nfs/file.cThu Sep 14 09:18:50 2000 @@ -240,6 +240,20 @@ } /* + * Ensure more conservative data cache consistency than NFS_CACHEINV() + * for files that change frequently. Avoids problems with sub-second + * changes that don't register on i_mtime. + */ +static inline void +nfs_lock_cacheinv(struct inode *inode) +{ + if ((long)CURRENT_TIME - (long)(inode->i_mtime + 60) < 0) + nfs_zap_caches(inode); + else + NFS_CACHEINV(inode); +} + +/* * Lock a (portion of) a file */ int @@ -295,6 +309,6 @@ * This makes locking act as a cache coherency point. */ out_ok: - NFS_CACHEINV(inode); + nfs_lock_cacheinv(inode); return status; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Wed, Sep 13, 2000 at 11:56:56PM +0200, Trond Myklebust wrote: > After consultation with the appropriate authorities, it turns out that > the Sun-code based clients do in fact turn off data caching entirely > when using NLM file locking. Entirely? That's interesting, because for our consistency requirements we only need the data cache to be flushed for a particular inode at the time of a locking call. At least if we port to Solaris (hah!) we won't have this particular problem in our clients. > Attribute cache consistency checking on > NFSv4 differs from v2/v3 in that you have a new attribute that is > guaranteed to change every time the file changes. That would be very welcome, but it's doubtful we could get an nfsv4 server to run on our old bsdi 1.x-based servers :) Thank you very much for continuing to pursue this. I wish I were an NFS expert, but as I may have said elsewhere, this task falls to me as the Linux evangelist at the office. Jeff - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
I think that both the NFS server changes that Trond is suggesting and the NFS client changes that I am suggesting have their place. The fact that the tuple (mtime, size) is used to test for unchangedness of a file indicates that the people who designed NFS understood that just using mtime wasn't enough. The problem is that using (mtime, size) is not enough either, and I don't understand why anyone ever thought it would be. Trivial example:

	write(fd, "a", 1);
	llseek(fd, 0, 0);
	write(fd, "b", 1);

My solution will enable the Linux client (and other clients implementing the same fix) to achieve the coherency promised by the comment in nfs_lock(), at the expense of slightly lower performance, no matter what NFS server they use (even a hypothetical one with an mtime resolution of a year). Trond's NFS server changes would enable any client communicating with it to achieve the coherency promised in nfs_lock(), without the performance loss, but unfortunately those changes can't be made today and don't solve anything for non-Linux NFS servers. Thus, the only thing missing from my proposal is a way to fall back on the old invalidation code, for that time in the future when mtime changes are always sufficient to distinguish changes in a file. I'm not going to sweat this, since it sounds like this isn't implemented anywhere else and won't be in Linux for some time. When it comes along, we can upgrade mount and the NFS client with a new flag.

Jeff
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
Trond Myklebust writes:
> > " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:
> >> 20 bits * 3 timestamps == 60 bits; 60 bits <= 8 bytes
> >>
> >> So you do need 8 bytes.
>
> Yes. Assuming that you want to implement the microseconds field on
> all 3 timestamps. For NFS I would only need that precision on mtime.
> Does anybody else see a need for it on ctime and/or atime?

If you don't implement all 3, you get an ugly asymmetry that messes up
timestamp comparisons. You'd get ctime

> >> I thought of this, but I don't think it is safe.
> >>
> >>     write to file X   us=1
> >>     write to file Y   us=1
> >>     write to file Y   us=2
> >>     write to file Y   us=3
> >>     write to file X   us=2
> >>
> >> Oops, "make" on some clients will think Y is newer than X.
> >>
> >> Using the wrong unit would be OK, as long as it is a real
> >> sub-second time unit that doesn't overflow... but then you
> >> might as well convert it to microseconds.
>
> Sure, but the alternative might be to reserve some bits as an
> 'operations counter' that is incremented every time an update is
> performed. That way you might have millisecond resolution (which is
> the sort of resolution that 'jiffies' & friends offers) + a few extra
> bits that would be there for make/NFS/... update consistency.

No way. You still get the same errors seen on clients that don't know you
are abusing the time stamps.

The ext2 inode has 6 obviously free bytes, 6 that are only used on
filesystems marked as Hurd-type, and 8 that seem to be claimed by
competing security and EA projects. So, being wasteful, it would be
possible to have picoseconds on all 4 time stamps. Doing mere
milliseconds on 3 timestamps would leave us 16 bytes for security.
This is what we have today:

struct ext2_inode {
        __u16   i_mode;         /* mode (type & permission bits) */
        __u16   i_uid;          /* owner's UID */
        __u32   i_size;         /* size (low 32 bits) */
        __u32   i_atime;        /* access time */
        __u32   i_ctime;        /* inode change time */
        __u32   i_mtime;        /* modification time */
        __u32   i_dtime;        /* deletion time */
        __u16   i_gid;          /* group ID */
        __u16   i_links_count;  /* link count */
        __u32   i_blocks;       /* block count */
        __u32   i_flags;        /* ext2 flags */
        union {
                __u32   i_translator;   /* translator (Hurd only) */
                __u32   i_res_fork;     /* resource fork (Linux privs) */
        } u1;
        __u32   i_block[15];    /* block pointers */
        __u32   i_version;      /* inode generation (for NFS) */
        __u32   i_file_acl;     /* file ACL */
        union {
                __u32   i_dir_acl;      /* directory ACL */
                __u32   i_size_high;    /* high 32 bits of 64-bit size */
        } u2;
        __u32   i_reserved1;
        __u16   i_reserved2;
        __u16   i_mode_high;    /* extra mode bits (Hurd only) */
        __u16   i_uid_high;     /* high 16 bits of 32-bit UID */
        __u16   i_gid_high;     /* high 16 bits of 32-bit GID */
        __u32   i_author;       /* author (Hurd only) */
};
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Andi Kleen <[EMAIL PROTECTED]> writes:
> Steal a couple of bytes from what ?

From the 32-bit storage area currently allocated to i_generation on the
on-disk ext2fs inode (as per Ted's suggestion). With current
disk/computing speeds, you probably don't need the full 32 bits to offer
a reasonable guarantee of inode number uniqueness.

Cheers,
Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Jeff Epler <[EMAIL PROTECTED]> writes:
> But "ctime and file size are the same" does not prove that the file is
> unchanged. That's the root of this problem, and why NFS_CACHEINV(inode)
> is not enough to ensure coherency.
> Furthermore, according to NFSv4, what I am suggesting is something
> that a client "may choose" to do anyhow---i.e., it's at least
> permitted by the spec, even if it's not required.

It appears I owe you an apology, and I'd better make it in public.

After consultation with the appropriate authorities, it turns out that
the Sun-code based clients do in fact turn off data caching entirely when
using NLM file locking. Attribute cache consistency checking on NFSv4
differs from v2/v3 in that you have a new attribute that is guaranteed to
change every time the file changes.

I'm still not happy with that solution, as it means that performance is
sub-optimal for a number of interesting cases, and it screws up mmap()
royally. However it does shoot to pieces my argument that we currently do
conform to standard behaviour.

Cheers and sorry,
Trond

BTW: the microsecond mtime resolution on the NFS server remains an issue,
but now only for the ordinary case of attribute cache consistency
checking.
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Theodore Y Ts'o <[EMAIL PROTECTED]> writes:
> Why? i_generation is a field which is only used by the NFS server. As
> far as ext2 is concerned, it's a black box value. Currently, as I
> understand things (and you're much more the expert on the NFS code
> than I am) all 32 bits of i_generation are used to enforce uniqueness
> of the NFS file handle across the recycling of the inode (i.e., the
> inode is deleted, and then reused for something else). Is that right?

Yes. In effect you have a 32-bit counter that serves as a guarantee that
an inode is not reused 'within a reasonable period of time'.

> If we were to steal, say, 8 bits of i_generation to provide a
> subsecond indicator of freshness of the attributes by shoving those 8
> bits into the NFS microseconds field, this shouldn't cause any
> incompatibilities, should it? (Of course, you would mask out those 8
> bits for the NFS file handle generation algorithm, which would then
> only have 24 bits, which still should be enough for its purposes.)
>
> My understanding of the games played by the i_generation field and the
> NFS file handle is that it's not visible to NFS clients. True, it
> would mean a certain amount of confusion for those clients that kept
> the filesystem mounted across a transition between an old 2.2.16
> kernel and a reboot to a new kernel that interpreted the i_generation
> field differently, but that should be the only compatibility problem
> I can think of. Am I missing something?

We'd want to make an implementation that is generic and fits all
'commonly used' UNIX filesystems. That means that we don't want to apply
an ext2-specific hack at the VFS or nfsd levels. The inode-generation
field is a feature on several filesystems which already support
sub-second timestamps.

Cheers,
Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Wed, Sep 13, 2000 at 10:35:10PM +0200, Trond Myklebust wrote:
> > No, it does not help. i_generation is only in the NFS file handle,
> > but not in the fattr/file id that is used for cache checks. The NFS
> > file handle has to stay identical anyways, as long as the inode is
> > not deleted/reused.
>
> You might be able to steal a couple of bytes and then rewrite ext2fs
> to mask those out from the 'i_generation' field, but it would mean
> that you could no longer boot your old 2.2.16 kernel without trouble
> on existing clients.

Steal a couple of bytes from what ?

-Andi
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Wed, Sep 13, 2000 at 12:42:15PM +0200, Trond Myklebust wrote:
> Is there any 'standard' function for reading the microseconds fields
> in userland? I couldn't find anything with a brief search in the
> X/Open docs.

Both Digital OSF/1 and Solaris use 3 undocumented spare fields in the
struct stat, filled with a microsecond/nanosecond fraction for atime,
mtime and ctime.

--
Frank
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
   Date: Wed, 13 Sep 2000 22:35:10 +0200 (CEST)
   From: Trond Myklebust <[EMAIL PROTECTED]>

   You might be able to steal a couple of bytes and then rewrite ext2fs
   to mask those out from the 'i_generation' field, but it would mean
   that you could no longer boot your old 2.2.16 kernel without trouble
   on existing clients.

Why? i_generation is a field which is only used by the NFS server. As far
as ext2 is concerned, it's a black box value. Currently, as I understand
things (and you're much more the expert on the NFS code than I am) all 32
bits of i_generation are used to enforce uniqueness of the NFS file
handle across the recycling of the inode (i.e., the inode is deleted, and
then reused for something else). Is that right?

If we were to steal, say, 8 bits of i_generation to provide a subsecond
indicator of freshness of the attributes by shoving those 8 bits into the
NFS microseconds field, this shouldn't cause any incompatibilities,
should it? (Of course, you would mask out those 8 bits for the NFS file
handle generation algorithm, which would then only have 24 bits, which
still should be enough for its purposes.)

My understanding of the games played by the i_generation field and the
NFS file handle is that it's not visible to NFS clients. True, it would
mean a certain amount of confusion for those clients that kept the
filesystem mounted across a transition between an old 2.2.16 kernel and a
reboot to a new kernel that interpreted the i_generation field
differently, but that should be the only compatibility problem I can
think of. Am I missing something?

- Ted
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Andi Kleen <[EMAIL PROTECTED]> writes:
> On Wed, Sep 13, 2000 at 11:51:45AM -0400, Theodore Y. Ts'o wrote:
>> So this is a really stupid question, but I'll ask it anyway. If you
>> just need a cookie, is there any way that you might be able to steal
>> a few bits from the i_generation field for that purpose? (This
>> assumes that we only worry about solving the problem for NFS.)
>
> No, it does not help. i_generation is only in the NFS file handle,
> but not in the fattr/file id that is used for cache checks. The NFS
> file handle has to stay identical anyways, as long as the inode is
> not deleted/reused.

You might be able to steal a couple of bytes and then rewrite ext2fs to
mask those out from the 'i_generation' field, but it would mean that you
could no longer boot your old 2.2.16 kernel without trouble on existing
clients.

I'm no expert on the consequences of such a change on ext2fs and what it
might do to other OSes, but my gut feeling is that this is perhaps one
solution that we should try to avoid.

Cheers,
Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:
> 20 bits * 3 timestamps == 60 bits; 60 bits <= 8 bytes
>
> So you do need 8 bytes.

Yes. Assuming that you want to implement the microseconds field on all 3
timestamps. For NFS I would only need that precision on mtime. Does
anybody else see a need for it on ctime and/or atime?

> I thought of this, but I don't think it is safe.
>
>     write to file X   us=1
>     write to file Y   us=1
>     write to file Y   us=2
>     write to file Y   us=3
>     write to file X   us=2
>
> Oops, "make" on some clients will think Y is newer than X.
>
> Using the wrong unit would be OK, as long as it is a real sub-second
> time unit that doesn't overflow... but then you might as well convert
> it to microseconds.

Sure, but the alternative might be to reserve some bits as an 'operations
counter' that is incremented every time an update is performed. That way
you might have millisecond resolution (which is the sort of resolution
that 'jiffies' & friends offers) + a few extra bits that would be there
for make/NFS/... update consistency.

Cheers,
Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Wed, Sep 13, 2000 at 11:51:45AM -0400, Theodore Y. Ts'o wrote:
> So this is a really stupid question, but I'll ask it anyway. If you
> just need a cookie, is there any way that you might be able to steal a
> few bits from the i_generation field for that purpose? (This assumes
> that we only worry about solving the problem for NFS.)

No, it does not help. i_generation is only in the NFS file handle, but
not in the fattr/file id that is used for cache checks. The NFS file
handle has to stay identical anyways, as long as the inode is not
deleted/reused.

-Andi
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
   Date: Wed, 13 Sep 2000 12:54:49 +0200 (CEST)
   From: Trond Myklebust <[EMAIL PROTECTED]>

   Don't forget that 2^20 > 10^6, hence if you really want units of
   microseconds, you actually only need to save 3 bytes worth of data
   per timestamp.

   For the purposes of NFS, however, the 'microseconds' field doesn't
   actually have to mean anything. It's just used as a cookie.
   Furthermore, only the mtime field, and to a limited extent the ctime
   field, are used for attribute consistency checking.

So this is a really stupid question, but I'll ask it anyway. If you just
need a cookie, is there any way that you might be able to steal a few
bits from the i_generation field for that purpose? (This assumes that we
only worry about solving the problem for NFS.)

- Ted
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
Trond Myklebust writes:
> > " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:
> >> It would not be reasonable to try extending ext2 to 64-bit times,
> >> but milliseconds might be doable. You'd need 4 bytes to support the
> >> 3 standard time stamps.
> >>
> >> Going to microseconds would require 8 free bytes, which I don't
> >> think we have. (but we do have more than one might think, due to
> >> the unimplemented junk)
>
> Don't forget that 2^20 > 10^6, hence if you really want units of
> microseconds, you actually only need to save 3 bytes worth of data
> per timestamp.

20 bits * 3 timestamps == 60 bits; 60 bits <= 8 bytes

So you do need 8 bytes.

> For the purposes of NFS, however, the 'microseconds' field doesn't
> actually have to mean anything. It's just used as a cookie.

I thought of this, but I don't think it is safe.

    write to file X   us=1
    write to file Y   us=1
    write to file Y   us=2
    write to file Y   us=3
    write to file X   us=2

Oops, "make" on some clients will think Y is newer than X.

Using the wrong unit would be OK, as long as it is a real sub-second time
unit that doesn't overflow... but then you might as well convert it to
microseconds.
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Jeff Epler <[EMAIL PROTECTED]> writes:
> On Wed, Sep 13, 2000 at 02:53:02PM +0200, Trond Myklebust wrote:
>> No. Things fall in and out of the inode cache all the time. That's a
>> vicious circle that's going to lead to a lot of unnecessary traffic.
>
> The traffic is not "unnecessary" if it's needed to work with today's
> NFS servers and get correct results. We need an absolute guarantee in
> nfs_lock() that we won't read stale data, we need it when working
> with non-Linux NFS servers, and we need it today.

That's the joy of free software: you're free to patch in any hack you and
your clients need. The goal of the official 2.4.x kernel, however, should
IMO be to follow the specs, which is what we do now. If current Linux
servers are broken w.r.t. the NFS specs, then a hack to the Linux client
won't fix them, and they will continue to break on non-Linux clients.

NFS has a mechanism for solving this problem, and it is a mechanism that
most (if not all) non-Linux servers implement. That is what we need to
implement too if we're serious about conforming to the standard.

I'm quite willing to help fix the server (something I indeed think should
be done), but to further obscure the code in the client with yet another
bandaid for broken servers is not something I personally will help do.
Perhaps Linus will indeed accept such a patch, in which case you (or
anybody else) are quite free to construct it and pass it on to him.

Cheers,
Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Wed, Sep 13, 2000 at 02:53:02PM +0200, Trond Myklebust wrote:
> No. Things fall in and out of the inode cache all the time. That's a
> vicious circle that's going to lead to a lot of unnecessary traffic.

The traffic is not "unnecessary" if it's needed to work with today's NFS
servers and get correct results. We need an absolute guarantee in
nfs_lock() that we won't read stale data, we need it when working with
non-Linux NFS servers, and we need it today.

Jeff
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Jamie Lokier <[EMAIL PROTECTED]> writes:
> If we're getting serious about adding finer-grained mtimes to ext2,
> could we please consider using those bits as a way to know, for sure,
> whether a file has changed?

Agreed.

> Btw, the whole cache coherency thing would probably work very well
> just updating times in in-core inodes.

No. Things fall in and out of the inode cache all the time. That's a
vicious circle that's going to lead to a lot of unnecessary traffic.

Cheers,
Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
Trond Myklebust wrote:
> > It would not be reasonable to try extending ext2 to 64-bit times,
> > but milliseconds might be doable. You'd need 4 bytes to support the
> > 3 standard time stamps.
> >
> > Going to microseconds would require 8 free bytes, which I don't
> > think we have. (but we do have more than one might think, due to
> > the unimplemented junk)
>
> Don't forget that 2^20 > 10^6, hence if you really want units of
> microseconds, you actually only need to save 3 bytes worth of data per
> timestamp.
>
> For the purposes of NFS, however, the 'microseconds' field doesn't
> actually have to mean anything. It's just used as a cookie.
> Furthermore, only the mtime field, and to a limited extent the ctime
> field, are used for attribute consistency checking.

If we're getting serious about adding finer-grained mtimes to ext2, could
we please consider using those bits as a way to know, for sure, whether a
file has changed?

Btw, the whole cache coherency thing would probably work very well just
updating times in in-core inodes.

thanks,
-- Jamie
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Wed, Sep 13, 2000 at 12:42:15PM +0200, Trond Myklebust wrote:
> > " " == Andi Kleen <[EMAIL PROTECTED]> writes:
> > On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote:
> >> Making mtime a true 64-bit cookie on Linux servers would be a
> >> solution that works on all clients.
> >
> > Making mtime 64bit would also be useful for local parallel make
> > runs, the current second resolution leads to race conditions on
> > bigger SMP boxes. Unfortunately the only FS that supports µs
> > on-disk timestamps currently is XFS afaik, and there is no VFS
> > interface for it.
>
> You'd have to fix make as well: 'stat' & friends always return the
> timestamp in seconds.

struct stat64 leaves enough padding for micro or nanosecond fields.

> Is there any 'standard' function for reading the microseconds fields
> in userland? I couldn't find anything with a brief search in the
> X/Open docs.

Solaris seems to support it in newer releases, and GNU make 3.79 already
does on Solaris. They seem to use a st_mtime_nsec.

-Andi
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:
> It would not be reasonable to try extending ext2 to 64-bit times, but
> milliseconds might be doable. You'd need 4 bytes to support the 3
> standard time stamps.
>
> Going to microseconds would require 8 free bytes, which I don't think
> we have. (but we do have more than one might think, due to the
> unimplemented junk)

Don't forget that 2^20 > 10^6, hence if you really want units of
microseconds, you actually only need to save 3 bytes worth of data per
timestamp.

For the purposes of NFS, however, the 'microseconds' field doesn't
actually have to mean anything. It's just used as a cookie. Furthermore,
only the mtime field, and to a limited extent the ctime field, are used
for attribute consistency checking. If NFS was the only concern,
therefore, you would get far on just saving one extra 24-bit field.

Cheers,
Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Andi Kleen <[EMAIL PROTECTED]> writes:
> On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote:
>> Making mtime a true 64-bit cookie on Linux servers would be a
>> solution that works on all clients.
>
> Making mtime 64bit would also be useful for local parallel make runs,
> the current second resolution leads to race conditions on bigger SMP
> boxes. Unfortunately the only FS that supports µs on-disk timestamps
> currently is XFS afaik, and there is no VFS interface for it.

You'd have to fix make as well: 'stat' & friends always return the
timestamp in seconds.

Is there any 'standard' function for reading the microseconds fields in
userland? I couldn't find anything with a brief search in the X/Open
docs.

Cheers,
Trond
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
" " == Andi Kleen [EMAIL PROTECTED] writes: On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote: Making mtime a true 64-bit cookie on Linux servers would be a solution that works on all clients. Making mtime 64bit would also be useful for local parallel make runs, the current second resolution leads to race conditions on bigger SMP boxes. Unfortunately the only FS that supports s on disk timestamps currently is XFS afaik, and there is no VFS interface for it. You'd have to fix make as well: 'stat' friends alway return the timestamp in seconds. Is there any 'standard' function for reading the microseconds fields in userland? I couldn't find anything with a brief search in the X/Open docs. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Wed, Sep 13, 2000 at 12:42:15PM +0200, Trond Myklebust wrote: " " == Andi Kleen [EMAIL PROTECTED] writes: On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote: Making mtime a true 64-bit cookie on Linux servers would be a solution that works on all clients. Making mtime 64bit would also be useful for local parallel make runs, the current second resolution leads to race conditions on bigger SMP boxes. Unfortunately the only FS that supports s on disk timestamps currently is XFS afaik, and there is no VFS interface for it. You'd have to fix make as well: 'stat' friends alway return the timestamp in seconds. struct stat64 leaves enough padding for micro or nanosecond fields. Is there any 'standard' function for reading the microseconds fields in userland? I couldn't find anything with a brief search in the X/Open docs. Solaris seems to support it in newer releases, and GNU make 3.79 already does on Solaris. They seem to use a st_mtime_nsec. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
" " == Jamie Lokier [EMAIL PROTECTED] writes: If we're getting serious about adding finer-grained mtimes to ext2, could we please consider using those bits as a way to know, for sure, whether a file has changed? Agreed. Btw, the whole cache coherency thing would probably work very well just updating times in in-core inodes. No. Things fall in and out of the inode cache all the time. That's a vicious circle that's going to lead to a lot of unnecessary traffic. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
" " == Jeff Epler [EMAIL PROTECTED] writes: On Wed, Sep 13, 2000 at 02:53:02PM +0200, Trond Myklebust wrote: No. Things fall in and out of the inode cache all the time. That's a vicious circle that's going to lead to a lot of unnecessary traffic. The traffic is not "unnecessary" if it's needed to work with today's NFS servers and get correct results. We need an absolute guarantee in nfs_lock() that we won't read stale data, we need it when working with non-linux NFS servers, and we need it today. That's the joy of free software: you're free to patch in any hack you and your clients need. The goal of the official 2.4.x kernel however should IMO be to follow the specs which is what we do now. If current Linux servers are broken w.r.t. NFS specs then a hack to the Linux client won't fix them, and they will continue to break on non-Linux clients. NFS has a mechanism for solving this problem, and it is a mechanism that most (if not all) non-linux servers implement. That is what we need to implement too if we're serious about conforming to the standard. I'm quite willing to help fix the server (something I indeed think should be done), but to further obscure the code in the client with yet another bandaid for broken servers is not something I personally will help do. Perhaps Linus will indeed accept such a patch, in which case you (or anybody else) are quite free to construct it and pass it on to him. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
Date:Wed, 13 Sep 2000 12:54:49 +0200 (CEST) From: Trond Myklebust [EMAIL PROTECTED] Don't forget that 2^20 10^6, hence if you really want units of microseconds, you actually only need to save 3 bytes worth of data per timestamp. For the purposes of NFS, however the 'microseconds' field doesn't actually have to mean anything. It's just used as a cookie. Furthermore, only the mtime field, and to a limited extent the ctime field, are used for attribute consistency checking. So this is a really stupid question, but I'll ask it anyway. If you just need a cookie, is there any way that you might be able to steal a few bits from i_generation field for that purpose? (This assumes that we only worry about solving the problem for NFS.) - Ted - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
" " == Albert D Cahalan [EMAIL PROTECTED] writes: 20 bits * 3 timestamps == 60 bits 60 bits = 8 bytes So you do need 8 bytes. Yes. Assuming that you want to implement the microseconds field on all 3 timestamps. For NFS I would only need that precision on mtime. Does anybody else see a need for it on ctime and/or atime? I thought of this, but I don't think it is safe. write to file X us=1 write to file Y us=1 write to file Y us=2 write to file Y us=3 write to file X us=2 Oops, "make" on some clients will think Y is newer than X. Using the wrong unit would be OK, as long as it is a real sub-second time unit that doesn't overflow... but then you might as well convert it to microseconds. Sure, but the alternative might be to reserve some bits as an 'operations counter' that is incremented every time an update is performed. That way you might have milisecond resolution (which is the sort of resolution that 'jiffies' friends offers) + a few extra bits that would be there for make/NFS/... update consistency. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
" " == Andi Kleen [EMAIL PROTECTED] writes: On Wed, Sep 13, 2000 at 11:51:45AM -0400, Theodore Y. Ts'o wrote: So this is a really stupid question, but I'll ask it anyway. If you just need a cookie, is there any way that you might be able to steal a few bits from i_generation field for that purpose? (This assumes that we only worry about solving the problem for NFS.) No, it does not help. i_generation is only in the NFS file handle, but not in the fattr/file id this is used for cache checks. The NFS file handle has to stay identical anyways, as long as the inode is not deleted/reused. You might be able to steal a couple of bytes and then rewrite ext2fs to mask those out from the 'i_generation' field, but it would mean that you could no longer boot your old 2.2.16 kernel without trouble on existing clients. I'm no expert on the consequences of such a change on ext2fs and what it might do to other OSes, but my gut feeling is that this is perhaps one solution that we should try to avoid. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
>  Date: Wed, 13 Sep 2000 22:35:10 +0200 (CEST)
>  From: Trond Myklebust <[EMAIL PROTECTED]>
>
>  You might be able to steal a couple of bytes and then rewrite ext2fs
>  to mask those out from the 'i_generation' field, but it would mean
>  that you could no longer boot your old 2.2.16 kernel without trouble
>  on existing clients.

Why? i_generation is a field which is only used by the NFS server. As
far as ext2 is concerned, it's a black box value.

Currently, as I understand things (and you're much more the expert on
the NFS code than I am) all 32 bits of i_generation are used to
enforce uniqueness of the NFS file handle across the recycling of the
inode (i.e., the inode is deleted, and then reused for something
else). Is that right?

If we were to steal, say, 8 bits of i_generation to provide a
subsecond indicator of freshness of the attributes by shoving those 8
bits into the NFS microseconds field, this shouldn't cause any
incompatibilities, should it? (Of course, you would mask out those 8
bits for the NFS file handle generation algorithm, so you would only
have 24 bits, which still should be enough for its purposes.)

My understanding of the games played by the i_generation field and the
NFS file handle is that it's not visible to NFS clients. True, it
would mean a certain amount of confusion for those clients that kept
the filesystem mounted across a transition between an old 2.2.16
kernel and a reboot to a new kernel that interpreted the i_generation
field differently, but that should be the only compatibility problem I
can think of.

Am I missing something?

						- Ted
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Wed, Sep 13, 2000 at 12:42:15PM +0200, Trond Myklebust wrote:
>  Is there any 'standard' function for reading the microseconds
>  fields in userland? I couldn't find anything with a brief search in
>  the X/Open docs.

Both Digital OSF/1 and Solaris use 3 undocumented spare fields in
struct stat and fill them with a microsecond/nanosecond fraction for
atime, mtime and ctime.

--
Frank
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Wed, Sep 13, 2000 at 10:35:10PM +0200, Trond Myklebust wrote:
>>  No, it does not help. i_generation is only in the NFS file handle,
>>  but not in the fattr/file id that is used for cache checks. The
>>  NFS file handle has to stay identical anyways, as long as the
>>  inode is not deleted/reused.
>
>  You might be able to steal a couple of bytes and then rewrite
>  ext2fs to mask those out from the 'i_generation' field, but it
>  would mean that you could no longer boot your old 2.2.16 kernel
>  without trouble on existing clients.

Steal a couple of bytes from what ?

-Andi
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
" " == Theodore Y Ts'o [EMAIL PROTECTED] writes: Why? i_generation is a field which is only used by the NFS server. As far as ext2 is concerned, it's a black box value. Currently, as I understand things (and you're much more the export on the NFS code than I am) all 32 bits of i_generation is used to enforce uniqueness of the NFS file handler across the recycling of the inode (i.e., the inode is deleted, and then reused for something else). Is that right? Yes. In effect you have a 32-bit counter that serves as a guarantees that an inode is not reused 'within a reasonable period of time'. If we were to steal, say, 8 bits of i_generation to provide a subsecond indicator of freshness of the attributes by shoving those 8 bits into the NFS microseconds field, this shouldn't cause any incompatibilities, should it? (Of couse, you would mask out those 8 bits for the NFS file handle generation algorithm, which you would only have 24 bits, which still should be enough for its purposes.) My understanding of the games played by the i_generation field and the NFS file handle is that it's not visiable to NFS clients. True, it would mean a certain amount of confusion for those clients that kept the filesystem mounted across a transition between an old 2.2.16 kernel and a reboot to a new kernel that interpreted the i_generation field differently, but that should be the only compatibility problem I can think of. Am I missing something? We'd want to make an implementation that is generic and fits all 'commonly used' UNIX filesystems. That means that we don't want to apply an ext2-specific hack at the VFS or nfsd levels. The inode-generation field is a feature on several filesystems which already support sub-second timestamps. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
" " == Andi Kleen [EMAIL PROTECTED] writes: Steal a couple of bytes from what ? From the 32-bit storage area currently allocated to i_generation on the on-disk ext2fs inode (as per Ted's suggestion). With current disk/computing speeds, you probably don't need the full 32 bits to offer a reasonable guarantee of inode number uniqueness. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
I think that both the NFS server changes that Trond is suggesting and
the NFS client changes that I am suggesting have their place.

The fact that the tuple (mtime, size) is used to test for
unchangedness of a file indicates that the people who designed NFS
understood that just using mtime wasn't enough. The problem is that
using (mtime, size) is not enough either, and I don't understand why
anyone ever thought it would be. Trivial example:

	write(fd, "a", 1);
	llseek(fd, 0, 0);
	write(fd, "b", 1);

My solution will enable the Linux client (and other clients
implementing the same fix) to achieve the coherency promised by the
comment in nfs_lock(), at the expense of slightly lower performance,
no matter what NFS server they use (even a hypothetical one with mtime
resolution of a year). Trond's nfs server changes would enable any
client communicating with it to achieve the coherency promised in
nfs_lock(), without the performance loss, but unfortunately the
changes can't be made today and don't solve anything for non-Linux NFS
servers.

Thus, the only thing that is missing from my proposal is a way to fall
back on the old invalidation code, for that time in the future where
mtime changes are always sufficient to distinguish changes in a file.
I'm not going to sweat this, since it sounds like this isn't
implemented anywhere else and won't be in Linux for some time. When it
comes along, we can upgrade mount and the nfs client with a new flag.

Jeff
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
Trond Myklebust writes:

> Relying on the sub-second timing fields is much more
> implementation-specific. It depends on the capabilities of both the
> filesystem and the server OS.

> Linux and the knfsd server code could easily be modified to provide
> such a service, but the problem (as I've stated before) is that you
> need to save the time somewhere on disk. I believe currently ext2
> provides only 32 bits of storage for mtime (though perhaps somebody
> else could comment on that).

Yes, ext2 has this limit. It would not be reasonable to try extending
ext2 to 64-bit times, but milliseconds might be doable. You'd need 4
bytes to support the 3 standard time stamps. Going to microseconds
would require 8 free bytes, which I don't think we have. (but we do
have more than one might think, due to the unimplemented junk)
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote:
> Making mtime a true 64-bit cookie on Linux servers would be a solution
> that works on all clients.

Making mtime 64bit would also be useful for local parallel make runs;
the current second resolution leads to race conditions on bigger SMP
boxes. Unfortunately the only FS that supports µs on-disk timestamps
currently is XFS afaik, and there is no VFS interface for it.

-Andi
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote:
> > " " == Alan Cox <[EMAIL PROTECTED]> writes:
>
> >  Providing everyone is careful to hold a lock I think it is
> >  lockf() is a read barrier providing the local cache is flushed,
> >  the unlock is a write barrier providing the local cache is
> >  flushed first. Providing all users are using lockf for their
> >  I/O then it seems to be coherent
>
> Sure, but what happens if you're in a mixed client environment?
> If we have a purely Linux-specific hack to ensure cache coherency,
> that will still corrupt the cache on those *NIX clients that use
> ordinary cache coherency checking (i.e. checking mtime + file size)
> rather than cache invalidation.

We don't alter anything about what the other clients cache. Sure, you
can connect to the same server with a client that doesn't have this
fix, and sure you might get different cached data on different clients
at the same time, but that's the current situation. You don't make it
*worse* for anybody else, you make it *better* for Linux.

> There's nothing in the NLM+NFSv2+NFSv3 notes about this situation,
> but on page 77 of the latest NFSv4 draft it states:
>
>   o  First, when a client obtains a file lock for a particular
>      region, the data cache corresponding to that region (if any
>      cache data exists) must be revalidated. If the change attribute
>      indicates that the file may have been updated since the cached
>      data was obtained, the client must flush or invalidate the
>      cached data for the newly locked region. A client might choose
>      to invalidate all of non-modified cached data that it has for
>      the file but the only requirement for correct operation is to
>      invalidate all of the data in the newly locked region.

But "ctime and file size are the same" does not prove that the file is
unchanged. That's the root of this problem, and why
NFS_CACHEINV(inode) is not enough to ensure coherency.
Furthermore, according to NFSv4, what I am suggesting is something
that a client "may choose" to do anyhow---i.e., it's at least
permitted by the spec, even if it's not required.

> Making mtime a true 64-bit cookie on Linux servers would be a
> solution that works on all clients.

Yes, but my solution works for all servers with a Linux client.
(We've been making this modification to NFS clients in several other
unices for years, we know it works and gives the guarantees we need)

> It also avoids a lot of unnecessary cache flushing. Imagine having
> to reread your entire mailbox every time you open the file whether
> or not a new message has arrived. Ugh...

It's better to get wrong results if you run a mail client at the same
time on two different machines, right?

Jeff
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
On Tue, Sep 12, 2000 at 07:08:09PM +0200, Trond Myklebust wrote:
> This is a known issue, and is not easy to fix. Neither of the
> solutions you propose are correct since they will both cause a cache
> invalidation. This is not the same as cache coherency checking.
>
> The correct solution in this case would be to add a 'sub-second time
> field' to ext2fs & friends. Both NFSv2 and NFSv3 pass a 64 bit field
[...]

Trond,

Thanks for your swift answer. I sent a message a few months ago with a
much poorer solution, which was perhaps rightly ignored.

Is there any solution that's available today, and doesn't depend on
using Linux in the server? I suspect that we will have to distribute a
modified nfs client with our app, and we're prepared to accept the
cache invalidation (and reduced performance it causes) for a technique
that will work with any NFS server and provide the level of guarantees
we need.

Is there a solution that would allow the kind of guarantee our
software wants with non-linux nfsds without the cache-blowing that the
change I'm suggesting causes?

Jeff
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Alan Cox <[EMAIL PROTECTED]> writes: > Providing everyone is careful to hold a lock I think it is > lockf() is a read barrier providing the local cache is flushed, > the unlock is a write barrier providing the local cache is > flushed first. Providing all users are using lockf for their > I/O then it seems to be coherent Sure, but what happens if you're in a mixed client environment? If we have a purely Linux-specific hack to ensure cache coherency, that will still corrupt the cache on those *NIX clients that use ordinary cache coherency checking (i.e. checking mtime + file size) rather than cache invalidation. There's nothing in the NLM+NFSv2+NFSv3 notes about this situation, but on page 77 of the latest NFSv4 draft it states: oFirst, when a client obtains a file lock for a particular region, the data cache corresponding to that region (if any cache data exists) must be revalidated. If the change attribute indicates that the file may have been updated since the cached data was obtained, the client must flush or invalidate the cached data for the newly locked region. A client might choose to invalidate all of non-modified cached data that it has for the file but the only requirement for correct operation is to invalidate all of the data in the newly locked region. --- Making mtime a true 64-bit cookie on Linux servers would be a solution that works on all clients. It also avoids a lot of unnecessary cache flushing. Imagine having to reread your entire mailbox every time you open the file whether or not a new message has arrived. Ugh... Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Jeff Epler <[EMAIL PROTECTED]> writes: > Is there any solution that's available today, and doesn't > depend on using Linux in the server? I suspect that we will > have to distribute a modified nfs client with our app, and > we're prepared to accept the cache invalidation (and reduced > performance it causes) for a technique that will work with any > NFS server and provide the level of guarantees we need. > Is there a solution that would allow the kind of guarantee our > software wants with non-linux nfsds without the cache-blowing > that the change I'm suggesting causes? If you ensure that the file length is changed each time it is written to, then the cache coherency checking will detect that, and a cache invalidation is guaranteed to occur. That is something that will normally work on all NFS implementations. Relying on the sub-second timing fields is a much more implementation-specific. It depends on the capabilities of both the filesystem and the server OS. Linux and the knfsd server code could easily be modified to provide such a service, but the problem (as I've stated before) is that you need to save the time somewhere on disk. I believe currently ext2 provides only 32 bits of storage for mtime (though perhaps somebody else could comment on that). UFS does provide a 64-bit storage space for mtime. It would surprise me if the other *NIX implementations aren't using this to improve NFS cache consistency. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Jeff Epler <[EMAIL PROTECTED]> writes: > But "ctime and file size are the same" does not prove that the ^ mtime > file is unchanged. That's the root of this problem, and why > NFS_CACHEINV(inode) is not enough to ensure coherency. Yes, but I just demonstrated that messing around with the client is a hack, not a solution to this situation. It's a hack because it is not compatible with the expected behaviour of NFS clients. The NFSv4 draft demonstrates what the expected behaviour is, and illustrates what we have to implement for a generic fix (one that works on all clients -- not hacked Linux ones). This is why I'm saying the correct solution is to get the server to change mtime. To use the 'microseconds' field that is supported by the NFS protocol, and that we currently set to 0. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> If we have a purely Linux-specific hack to ensure cache coherency,
> that will still corrupt the cache on those *NIX clients that use
> ordinary cache coherency checking (i.e. checking mtime + file size)
> rather than cache invalidation.

It's what Solaris implements, and what SunOS back down to 3.x
implements. I wouldn't be surprised if others do likewise?

> It also avoids a lot of unnecessary cache flushing. Imagine having to
> reread your entire mailbox every time you open the file whether or not
> a new message has arrived. Ugh...

Interesting point.
NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> " " == Jeff Epler <[EMAIL PROTECTED]> writes: > I believe I have discovered a problem in the Linux 2.2 nfs > client. The short story is that NFS_CACHEINV(inode) does not > "act as a cache coherency point" (as nfs/file.c:nfs_lock()'s > comment implies) when multiple modifications to a file take > place during the same second, because in that case the file's > mtime is not changed. > The fix we have found is to *either* have NFS_CACHEINV(inode) > change inode-> i_mtime to an artificial value (0) or to call > nfs_zap_caches instead. Since I am not sure which fix is > appropriate, I'm not enclosing an actual patch. This is a known issue, and is not easy to fix. Neither of the solutions you propose are correct since they will both cause a cache invalidation. This is not the same as cache coherency checking. The correct solution in this case would be to add a 'sub-second time field' to ext2fs & friends. Both NFSv2 and NFSv3 pass a 64 bit field for the mtime. Under NFSv2, the low 32 bits are meant to be interpreted in units of micro-seconds. Under NFSv3, the units are nanoseconds. For the purposes of ensuring cache coherency, the 'sub-second' field doesn't actually have to be a time-related thing at all. It could simply be an 'operation count' or something like that. The nfs_refresh_inode() uses the mtime field as a 64-bit cookie, so it doesn't care as long as the field is conserved across server reboots. Cheers, Trond - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/