Re: OOPS in nfsd, affects all 2.2 and 2.4 kernels
> This problem that you are addressing is caused when solaris sends a
> zero length write (I assume to implement the "access" system call, but
> I haven't checked).

More likely a long-standing bug in Solaris that hasn't been stomped.
Tony, you might let Sun know that you have a way to reproduce it at
will, though there are Sun people on this alias who I'm sure will make
it a high priority to stomp this one. :-)

	-mre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> >>>>> " " == Michael Eisler <[EMAIL PROTECTED]> writes:
>
>  >> I'm not clear on why you want to enforce page alignedness
>  >> though? As long as writes respect the lock boundaries (and not
>  >> page boundaries) why would use of a page cache change matters?
>
>  > For the reason that was pointed out earlier by someone else as to
>  > why your fix is inadequate. Since the I/O is page-based, if the
>  > locks are not, then two threads on two different clients will
>  > step over each other's locked regions.
>
> No they don't.
>
> As I've repeatedly stated, our cache does not require us to respect
> page boundaries when writing. We do make sure that all writes pending
> on the entire file are flushed to disk before we lock/unlock a
> region. If somebody has held a lock on 2 bytes lying across a page
> boundary, and has only written within that 2 byte region, a write is
> sent for 2 bytes.

What if someone has written to multiple, non-contiguous regions of a
page?

> There is no difference here between cached and uncached operation. The
> only difference is that we trust the lock to prevent other machines
> writing within the locked region.

So, assume a page is 4096 bytes long, and there are 2048 processes that
have each write-locked the even-numbered bytes of the first page of a
file. Once all the locks have been acquired, the page has been flushed
to disk. Now we have a clean page. Each process then writes the
one-byte region that it has locked. Now they unlock the region. What
happens? (Btw, on another NFS client, 2048 processes have locked the
odd-numbered bytes of the first page of the same file.)

If you claim that the first client doesn't overwrite the updates from
the 2nd client, then I'll shut up.

	-mre
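To make the failure mode concrete: a toy user-space simulation of the even/odd-byte scenario (all names are invented for illustration; this is not the NFS client code). With page-granular writeback, whichever client flushes its cached page last reverts the other client's bytes; with byte-granular dirty tracking, as Trond describes for Linux, both sets of updates survive:

```c
#include <string.h>

#define PAGE_SIZE 4096

static unsigned char server_page[PAGE_SIZE];

/* Page-granular writeback: both clients read the clean page, update
 * only the bytes they hold locks on in their private cached copy, then
 * flush the WHOLE page back.  Returns 1 if A's updates were lost. */
int page_granular_writeback_loses_updates(void)
{
    unsigned char cacheA[PAGE_SIZE], cacheB[PAGE_SIZE];
    int i;

    memset(server_page, 0, PAGE_SIZE);
    /* Both clients read the clean page before either flushes. */
    memcpy(cacheA, server_page, PAGE_SIZE);
    memcpy(cacheB, server_page, PAGE_SIZE);

    for (i = 0; i < PAGE_SIZE; i += 2)
        cacheA[i] = 'A';            /* client A: even bytes, which it locked */
    for (i = 1; i < PAGE_SIZE; i += 2)
        cacheB[i] = 'B';            /* client B: odd bytes, which it locked */

    memcpy(server_page, cacheA, PAGE_SIZE); /* A flushes its whole page */
    memcpy(server_page, cacheB, PAGE_SIZE); /* B flushes: A's bytes revert to 0 */

    return server_page[0] == 0 && server_page[1] == 'B';
}

/* Byte-granular writeback: each client sends only the bytes it
 * actually dirtied, so neither flush disturbs the other's region. */
int byte_granular_writeback_is_coherent(void)
{
    int i;

    memset(server_page, 0, PAGE_SIZE);
    for (i = 0; i < PAGE_SIZE; i += 2)
        server_page[i] = 'A';       /* A writes back only its dirty bytes */
    for (i = 1; i < PAGE_SIZE; i += 2)
        server_page[i] = 'B';       /* B writes back only its dirty bytes */

    return server_page[0] == 'A' && server_page[1] == 'B';
}
```

This is the crux of the disagreement in the thread: the scenario is only safe if the client really does track dirtiness at byte granularity and flushes before lock/unlock.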
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> Yes. fs/read_write calls the NFS subsystem. The problem then is that
> NFS uses the generic_file_{read,write,mmap}() interfaces. These are
> what enforce use of the page cache.

So, don't use generic*() when locking is active. It's what most other
UNIX-based NFS clients do. Even if it is "stupid beyond belief", it
works.

> You could drop these functions, but that would mean designing an
> entire VFS for NFS's use alone. Such a decision would have to be very
> well motivated in order to convince Linus.

Avoiding corruption.

>  >> As far as I can see, the current use of the page cache should
>  >> be safe as long as applications respect the locking boundaries,
>  >> and don't expect consistency outside locked areas.
>
>  > Then the code ought to enforce page aligned locks. Of course,
>  > while that will produce correctness, it will violate the
>  > principle of least surprise. It might be better to simply
>
> AFAICS that would be a bad idea, since it will lead to programs having
> to know about the hardware granularity. You could easily imagine
> deadlock situations that could arise if one program is unwittingly
> locking an area that is meant to be claimed by another.

I can't imagine any deadlock scenarios. If the app locks on a page
boundary, then accept it, otherwise return an error. But it does
violate least surprise, so I think bypassing the page cache when
locking is active is better.

> I'm not clear on why you want to enforce page alignedness though? As
> long as writes respect the lock boundaries (and not page boundaries)
> why would use of a page cache change matters?

For the reason that was pointed out earlier by someone else as to why
your fix is inadequate. Since the I/O is page-based, if the locks are
not, then two threads on two different clients will step over each
other's locked regions.

Folks might think that NLM locking is brain dead, and they wouldn't
get an argument from me. But if you are going to document that you
support it, then please get it right.
	-mre
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> >>>>> " " == Michael Eisler <[EMAIL PROTECTED]> writes:
>
>  > Focus on correctness and do the expedient thing first, which
>  > is:
>  > - The first time a file is locked, flush dirty pages
>  >   to the server, and then invalidate the page cache
>
> This would be implemented with the last patch I proposed.
>
>  > - While the file is locked, bypass the page cache for all
>  >   I/O.
>
> This is not possible given the current design of the Linux VFS. The
> design is such that all reads/writes go through the page cache. I'm

I'm not a Linux kernel literate. However, I found your assertion
surprising. Does procfs do page I/O as well? file.c in fs/nfs suggests
that the Linux VFS has non-page interfaces in addition to page
interfaces. fs/read_write.c suggests that the read and write system
calls use the non-page interface.

I cannot speak for Linux, but System V Release 4 derived systems use
the page cache primarily as a tool for each file system, yet still
hide the page interface from the code path leading from the read/write
system calls to the VFS.

> not sure that it is possible to get round this without some major
> changes in VFS philosophy. Hacks such as invalidating the cache after
> each read/write would definitely give rise to races.
>
> As far as I can see, the current use of the page cache should be safe
> as long as applications respect the locking boundaries, and don't
> expect consistency outside locked areas.

Then the code ought to enforce page aligned locks. Of course, while
that will produce correctness, it will violate the principle of least
surprise. It might be better to simply return an error when a lock
operation is attempted on an NFS file. Assuming, of course, the Linux
kernel isn't capable of honoring a read() or write() system call
whenever the file system doesn't support page-based I/O, which, again,
I'd be surprised by.
	-mre
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> > " " == James Yarbrough <[EMAIL PROTECTED]> writes:
>
>  > What is done for bypassing the cache when the size of a file
>  > lock held by the reading/writing process is not a multiple of
>  > the caching granularity? Consider two different clients with
>  > processes sharing a file and locking 2k byte regions of the
>  > file and possibly updating these regions. Suppose that each
>  > system caches 4k byte blocks. If system A locks the first 2k
>  > of a block and system B locks the second 2k, the updates from
>  > one of the systems may be lost if these systems cache the
>  > writes. This is because each system will write back the 4k
>  > block it cached, not the 2k block that was locked.
>
> Under Linux writebacks have byte-sized granularity. If a page has been
> partially dirtied, we save that information, and only write back the
> dirty areas. As long as each system has restricted its updates to
> within the 2k block that it has locked, there should be no conflict.
>
> If however one system has been writing over the full 4k block, then
> there will indeed be a race.

Using a page cache when locking is turned on is tricky at best. The
only working optimization I know of in this area is allowing the use of
the page cache when the entire file is locked.

My two cents ...

Focus on correctness and do the expedient thing first, which is:

- The first time a file is locked, flush dirty pages to the server,
  and then invalidate the page cache
- While the file is locked, bypass the page cache for all I/O.

Once that works, the gaping wound in the Linux NFS/NLM client will be
closed. This will give you the breathing room to experiment with
something that works more optimally (yet still correctly) in some
conditions. E.g., one possible optimization is to allow page I/O as
long as the locks are page aligned or whole-file aligned.
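The optimization suggested at the end can be expressed as a simple predicate. A hypothetical helper (name and interface invented for illustration, not actual kernel code), treating len == 0 as a POSIX-style lock to end-of-file:

```c
#include <sys/types.h>

#define PAGE_SIZE 4096

/* May the client keep using page-based I/O while this lock is held?
 * Per the suggestion above, only when the lock covers the whole file,
 * or is page aligned at both ends.  Simplified: a lock-to-EOF that
 * does not start at offset 0 is treated like any byte-range lock. */
int lock_allows_page_io(off_t start, off_t len, off_t file_size)
{
    if (start == 0 && (len == 0 || len >= file_size))
        return 1;                         /* whole-file lock */
    if (len != 0 && start % PAGE_SIZE == 0 && len % PAGE_SIZE == 0)
        return 1;                         /* page-aligned region */
    return 0;                             /* fall back to uncached I/O */
}
```

Anything that fails the predicate would take the slow-but-correct path: flush, invalidate, and bypass the page cache for the duration of the lock.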
	-mre
Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee
> > " " == Jeff Epler <[EMAIL PROTECTED]> writes:
>
>  > Is there a solution that would allow the kind of guarantee our
>  > software wants with non-linux nfsds without the cache-blowing
>  > that the change I'm suggesting causes?
>
> How about something like the following compromise?
>
> I haven't tried it out yet (and I've no idea whether or not Linus
> would accept this) but it compiles, and it should definitely be better
> behaved with respect to slowly-changing files.
>
> As you can see, the idea is to look at whether or not the file has
> changed recently (I arbitrarily chose a full minute as a concession
> towards clusters with lousy clock synchronization). If it has, then
> the page cache is zapped. If not, we force ordinary attribute cache
> consistency checking.

The fix still does not provide coherency guarantees in all situations,
and at minimum, there ought to be a way to force the client to provide
a coherency guarantee.

> Cheers,
>   Trond
>
> --- linux-2.4.0-test8/fs/nfs/file.c	Fri Jun 30 01:02:40 2000
> +++ linux-2.4.0-test8-fix_lock/fs/nfs/file.c	Thu Sep 14 09:18:50 2000
> @@ -240,6 +240,20 @@
>  }
>  
>  /*
> + * Ensure more conservative data cache consistency than NFS_CACHEINV()
> + * for files that change frequently. Avoids problems with sub-second
> + * changes that don't register on i_mtime.
> + */
> +static inline void
> +nfs_lock_cacheinv(struct inode *inode)
> +{
> +	if ((long)CURRENT_TIME - (long)(inode->i_mtime + 60) < 0)
> +		nfs_zap_caches(inode);
> +	else
> +		NFS_CACHEINV(inode);
> +}
> +
> +/*
>   * Lock a (portion of) a file
>   */
>  int
> @@ -295,6 +309,6 @@
>  	 * This makes locking act as a cache coherency point.
>  	 */
>   out_ok:
> -	NFS_CACHEINV(inode);
> +	nfs_lock_cacheinv(inode);
>  	return status;
>  }
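The heuristic in the patch can be mirrored in a standalone form to see what it decides. A minimal sketch, with times passed in as plain seconds so it can be exercised outside the kernel (the harness is invented for illustration; only the comparison mirrors the patch):

```c
/* Mirror of the nfs_lock_cacheinv() test in the patch above.  If the
 * file's mtime (one-second resolution) is within the last 60 seconds,
 * sub-second changes may be invisible to attribute-based revalidation,
 * so the conservative choice is to zap the page cache. */
int should_zap_caches(long now, long mtime)
{
    return now - (mtime + 60) < 0;   /* equivalent to: now < mtime + 60 */
}
```

The signed subtraction, rather than a direct `now < mtime + 60` comparison, keeps the test correct if the time values wrap, the same idiom the kernel uses for jiffies comparisons.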
Re: NFS client option to force 16-bit ugid.
> My home directory lives on a SunOS 4.1.4 server, which helpfully expands
> 16-bit UIDs to 32 bits as signed quantities, not unsigned. So any uid above
> 32768 gets 0xffff0000 added to it.

Doesn't
http://sunsolve.Sun.COM/pub-cgi/retrieve.pl?type=0&doc=fpatches/102394
fix this on the 4.1.4 server?
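The sign-extension arithmetic described above can be checked directly. A small sketch contrasting signed and unsigned expansion of a 16-bit uid (function names invented for illustration):

```c
#include <stdint.h>

/* The SunOS behavior described above: expanding a 16-bit uid to
 * 32 bits as a *signed* quantity sign-extends any uid >= 32768,
 * which adds 0xffff0000 to it. */
uint32_t expand_signed(uint16_t uid16)
{
    return (uint32_t)(int32_t)(int16_t)uid16;  /* buggy: sign-extends */
}

/* The correct behavior: zero-extend, preserving the uid's value. */
uint32_t expand_unsigned(uint16_t uid16)
{
    return (uint32_t)uid16;
}
```

Uids below 32768 have the high bit clear, so both expansions agree there; only the upper half of the 16-bit range is mangled.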