Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-18 Thread Trond Myklebust

> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

 > What if someone has written to multiple, non-contiguous regions
 > of a page?

This has been foreseen and hence we only allow 1 contiguous region per
page. If somebody tries to schedule a write to a second region that is
not contiguous with the first, then we flush out the already scheduled
write first, and wait on it to complete.

We also only allow 1 process at a time to schedule writes to a given
page. This is done in case the file access permissions turn out to be
untrustworthy: (due to e.g. root squashing and the likes) it means
that we don't end up clobbering the data from another process' pending
write.

 > So, assume a page is 4096 bytes long, and there are 2048
 > processes that each have write locked the even numbered bytes
 > of the first page of a file. Once all the locks have been
 > acquired, the page has been flushed to disk.

 > Now we have a clean page. Each process then writes only one
 > byte region that it has locked.

 > Now they unlock the region.

 > What happens?

 > (btw, on another nfs client, 2048 processes have locked the odd
 > number bytes of the first page of the same file).

 > If you claim that first client doesn't overwrite the updates
 > from the 2nd client, then I'll shut up.

Blech... It'll be slow, but as I said, only 1 contiguous region is
allowed to have pending writes on any given page at any given point in
time. Basically, you'll end up with 2048 byte-sized regions getting
written out from each machine. Each time a new write gets scheduled,
it will flush out the previous write and wait on its completion.

In your example, we will therefore in practice be doing completely
synchronous writes...

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-18 Thread Michael Eisler

> > " " == Michael Eisler <[EMAIL PROTECTED]> writes:
> 
> >> I'm not clear on why you want to enforce page alignedness
> >> though? As long as writes respect the lock boundaries (and not
> >> page boundaries) why would use of a page cache change matters?
> 
>  > For the reason that was pointed earlier by someone else as to
>  > why your fix in adequate. Since the I/O is page-based, if the
>  > locks are not, then two threads on two different clients will
>  > step over each other's locked regions.
> 
> No they don't.
> 
> As I've repeatedly stated, our cache does not require us to respect
> page boundaries when writing. We do make sure that all writes pending
> on the entire file are flushed to disk before we lock/unlock a
> region. If somebody has held a lock on 2 bytes lying across a page
> boundary, and has only written within that 2 byte region, a write is
> sent for 2 bytes.

What if someone has written to multiple, non-contiguous regions of a page?

> There is no difference here between cached and uncached operation. The
> only difference is that we trust the lock to prevent other machines
> writing within the locked region.

So, assume a page is 4096 bytes long, and there are 2048 processes that each
have write locked the even numbered bytes of the first page of a file. Once
all the locks have been acquired, the page has been flushed to disk.

Now we have a clean page. Each process then writes only one byte region that
it has locked.

Now they unlock the region.

What happens?

(btw, on another nfs client, 2048 processes have locked the odd number bytes
of the first page of the same file).

If you claim that first client doesn't overwrite the updates from the
2nd client, then I'll shut up.

-mre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-18 Thread Trond Myklebust

> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

>> I'm not clear on why you want to enforce page alignedness
>> though? As long as writes respect the lock boundaries (and not
>> page boundaries) why would use of a page cache change matters?

 > For the reason that was pointed earlier by someone else as to
 > why your fix in adequate. Since the I/O is page-based, if the
 > locks are not, then two threads on two different clients will
 > step over each other's locked regions.

No they don't.

As I've repeatedly stated, our cache does not require us to respect
page boundaries when writing. We do make sure that all writes pending
on the entire file are flushed to disk before we lock/unlock a
region. If somebody has held a lock on 2 bytes lying across a page
boundary, and has only written within that 2 byte region, a write is
sent for 2 bytes.
There is no difference here between cached and uncached operation. The
only difference is that we trust the lock to prevent other machines
writing within the locked region.


When reading, we have now (assuming the patch I just sent in to Linus
was accepted) ensured that all pages that were cached prior to
grabbing the lock are thrown out. Any new access to that file will
cause the relevant pages to be reread from the server.
Even if we're extending a lock, within a page boundary or locking a
new portion of the file then all existing writes are flushed to disk
first, and the entire page cache is invalidated. Again any access to
the newly locked region will result in a reread from the server.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-18 Thread Michael Eisler


> Yes. fs/read_write calls the NFS subsystem. The problem then is that
> NFS uses the generic_file_{read,write,mmap}() interfaces. These are
> what enforce use of the page cache.

So, don't use generic*() when locking is active. It's what most other
UNIX-based NFS clients do. Even if it is "stupid beyond
belief", it works. 

> You could drop these functions, but that would mean designing an
> entire VFS for NFS's use alone. Such a decision would have to be very
> well motivated in order to convince Linus.

Avoiding corruption.

> >> As far as I can see, the current use of the page cache should
> >> be safe as long as applications respect the locking boundaries,
> >> and don't expect consistency outside locked areas.
> 
>  > Then the code ought to enforce page aligned locks. Of course,
>  > while that will produce correctness, it will violate the
>  > principle of least surprise. It might be better to simply
> 
> AFAICS that would be a bad idea, since it will lead to programs having
> to know about the hardware granularity. You could easily imagine
> deadlock situations that could arise if one program is unwittingly
> locking an area that is meant to be claimed by another.

I can't imagine any deadlock scenarios. If the app locks on a page
boundary, then accept it, otherwise return an error. But
it does violate least surprise, so I think bypassing the page
cache when locking is active is better.

> I'm not clear on why you want to enforce page alignedness though? As
> long as writes respect the lock boundaries (and not page boundaries)
> why would use of a page cache change matters?

For the reason that was pointed earlier by someone else as to why your fix in
adequate. Since the I/O is page-based, if the locks are not, then two threads
on two different clients will step over each other's locked regions.

Folks might think that NLM locking is brain dead, and they wouldn't get an
argument from me. But if you are going to document that you support it, then
please get it right.

-mre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-18 Thread Trond Myklebust

 " " == Michael Eisler [EMAIL PROTECTED] writes:

 I'm not clear on why you want to enforce page alignedness
 though? As long as writes respect the lock boundaries (and not
 page boundaries) why would use of a page cache change matters?

  For the reason that was pointed earlier by someone else as to
  why your fix in adequate. Since the I/O is page-based, if the
  locks are not, then two threads on two different clients will
  step over each other's locked regions.

No they don't.

As I've repeatedly stated, our cache does not require us to respect
page boundaries when writing. We do make sure that all writes pending
on the entire file are flushed to disk before we lock/unlock a
region. If somebody has held a lock on 2 bytes lying across a page
boundary, and has only written within that 2 byte region, a write is
sent for 2 bytes.
There is no difference here between cached and uncached operation. The
only difference is that we trust the lock to prevent other machines
writing within the locked region.


When reading, we have now (assuming the patch I just sent in to Linus
was accepted) ensured that all pages that were cached prior to
grabbing the lock are thrown out. Any new access to that file will
cause the relevant pages to be reread from the server.
Even if we're extending a lock, within a page boundary or locking a
new portion of the file then all existing writes are flushed to disk
first, and the entire page cache is invalidated. Again any access to
the newly locked region will result in a reread from the server.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock()does not provide coherency guarantee

2000-09-16 Thread Linus Torvalds



On 16 Sep 2000, Trond Myklebust wrote:
> 
> Yes. fs/read_write calls the NFS subsystem. The problem then is that
> NFS uses the generic_file_{read,write,mmap}() interfaces. These are
> what enforce use of the page cache.
> 
> You could drop these functions, but that would mean designing an
> entire VFS for NFS's use alone. Such a decision would have to be very
> well motivated in order to convince Linus.

Ehh..

I would say that such a decision would be stupid beyond belief, and
impossible to motivate.

NFS certainly doesn't _have_ to use the page cache. However, not using the
page cache would basically in the end be equivalent to 
 (a) not having coherent mmap's over NFS
 (b) probably having much weaker caching
neither of which is really an option at all, I suspect.

You could do caching on your own, but I dare anybody to come up with a
better cache that is able to handle mmap and read/write coherency.

The page granularity issues come from mmap, not from the page cache per
se. ANYTHING that solves the coherency issue is pretty much bound by the
limitations of a page-cache-like thing - they are not limitations of the
implementation, but limitations pretty much inherent to the problem
itself.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-16 Thread Alan Cox

> I'm not a Linux kernel literate. However, I found your
> assertion surprising. Does procfs do page i/o as well?

No. An fs isnt required to use the page caches at all. It makes life a lot saner
to do so

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-16 Thread Trond Myklebust

> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

 > I'm not a Linux kernel literate. However, I found your
 > assertion surprising. Does procfs do page i/o as well?

No. It has its own setup.

 > file.c in fs/nfs suggests that the Linux VFS has non-page
 > interfaces in addition to page interfaces. fs/read_write.c
 > suggests that the read and write system calls uses the non-page
 > interface.

Yes. fs/read_write calls the NFS subsystem. The problem then is that
NFS uses the generic_file_{read,write,mmap}() interfaces. These are
what enforce use of the page cache.

You could drop these functions, but that would mean designing an
entire VFS for NFS's use alone. Such a decision would have to be very
well motivated in order to convince Linus.

>> As far as I can see, the current use of the page cache should
>> be safe as long as applications respect the locking boundaries,
>> and don't expect consistency outside locked areas.

 > Then the code ought to enforce page aligned locks. Of course,
 > while that will produce correctness, it will violate the
 > principle of least surprise. It might be better to simply

AFAICS that would be a bad idea, since it will lead to programs having
to know about the hardware granularity. You could easily imagine
deadlock situations that could arise if one program is unwittingly
locking an area that is meant to be claimed by another.

I'm not clear on why you want to enforce page alignedness though? As
long as writes respect the lock boundaries (and not page boundaries)
why would use of a page cache change matters?

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-16 Thread Michael Eisler

> > " " == Michael Eisler <[EMAIL PROTECTED]> writes:
> 
>  > Focus on correctness and do the expedient thing first, which
>  > is:
>  > - The first time a file is locked, flush dirty pages
>  >to the server, and then invalidate the page cache
> 
> This would be implemented with the last patch I proposed.
> 
>  > - While the file is locked, do vypass the page cache for all
>  >  I/O.
> 
> This is not possible given the current design of the Linux VFS. The
> design is such that all reads/writes go through the page cache. I'm

I'm not a Linux kernel literate. However, I found your
assertion surprising. Does procfs do page i/o as well?

file.c in fs/nfs suggests that the Linux VFS has non-page interfaces
in addition to page interfaces. fs/read_write.c suggests that the
read and write system calls uses the non-page interface.

I cannot speak for Linux, but System V Release 4 dervied systems
uses the page cache primarily as a tool for each file system, yet
still hide the page interface from the code path leading from the
read/write system calls to the VFS.

> not sure that it is possible to get round this without some major
> changes in VFS philosophy. Hacks such as invalidating the cache after
> each read/write would definitely give rise to races.
> 
> As far as I can see, the current use of the page cache should be safe
> as long as applications respect the locking boundaries, and don't
> expect consistency outside locked areas.

Then the code ought to enforce page aligned locks. Of course, while
that will produce correctness, it will violate the principle of
least surprise. It might be better to simply return an error when
a lock operation is attempted on an NFS file. Assuming of course, the 
Linux kernel isn't capable of honoring a read() or write() system
whenever the file system doesn't support page-based i/o, which, again,
I'd be surprised by.

-mre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-16 Thread Trond Myklebust

> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

 > Focus on correctness and do the expedient thing first, which
 > is:
 > - The first time a file is locked, flush dirty pages
 >  to the server, and then invalidate the page cache

This would be implemented with the last patch I proposed.

 > - While the file is locked, do vypass the page cache for all
 >I/O.

This is not possible given the current design of the Linux VFS. The
design is such that all reads/writes go through the page cache. I'm
not sure that it is possible to get round this without some major
changes in VFS philosophy. Hacks such as invalidating the cache after
each read/write would definitely give rise to races.

As far as I can see, the current use of the page cache should be safe
as long as applications respect the locking boundaries, and don't
expect consistency outside locked areas.

The exception here is mmap() for which no sane semantics can be found
unless the application explicitly unmaps and remaps after having
acquired the file lock.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-16 Thread Michael Eisler

  " " == Michael Eisler [EMAIL PROTECTED] writes:
 
   Focus on correctness and do the expedient thing first, which
   is:
   - The first time a file is locked, flush dirty pages
  to the server, and then invalidate the page cache
 
 This would be implemented with the last patch I proposed.
 
   - While the file is locked, do vypass the page cache for all
I/O.
 
 This is not possible given the current design of the Linux VFS. The
 design is such that all reads/writes go through the page cache. I'm

I'm not a Linux kernel literate. However, I found your
assertion surprising. Does procfs do page i/o as well?

file.c in fs/nfs suggests that the Linux VFS has non-page interfaces
in addition to page interfaces. fs/read_write.c suggests that the
read and write system calls uses the non-page interface.

I cannot speak for Linux, but System V Release 4 dervied systems
uses the page cache primarily as a tool for each file system, yet
still hide the page interface from the code path leading from the
read/write system calls to the VFS.

 not sure that it is possible to get round this without some major
 changes in VFS philosophy. Hacks such as invalidating the cache after
 each read/write would definitely give rise to races.
 
 As far as I can see, the current use of the page cache should be safe
 as long as applications respect the locking boundaries, and don't
 expect consistency outside locked areas.

Then the code ought to enforce page aligned locks. Of course, while
that will produce correctness, it will violate the principle of
least surprise. It might be better to simply return an error when
a lock operation is attempted on an NFS file. Assuming of course, the 
Linux kernel isn't capable of honoring a read() or write() system
whenever the file system doesn't support page-based i/o, which, again,
I'd be surprised by.

-mre
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-16 Thread Trond Myklebust

 " " == Michael Eisler [EMAIL PROTECTED] writes:

  I'm not a Linux kernel literate. However, I found your
  assertion surprising. Does procfs do page i/o as well?

No. It has its own setup.

  file.c in fs/nfs suggests that the Linux VFS has non-page
  interfaces in addition to page interfaces. fs/read_write.c
  suggests that the read and write system calls uses the non-page
  interface.

Yes. fs/read_write calls the NFS subsystem. The problem then is that
NFS uses the generic_file_{read,write,mmap}() interfaces. These are
what enforce use of the page cache.

You could drop these functions, but that would mean designing an
entire VFS for NFS's use alone. Such a decision would have to be very
well motivated in order to convince Linus.

 As far as I can see, the current use of the page cache should
 be safe as long as applications respect the locking boundaries,
 and don't expect consistency outside locked areas.

  Then the code ought to enforce page aligned locks. Of course,
  while that will produce correctness, it will violate the
  principle of least surprise. It might be better to simply

AFAICS that would be a bad idea, since it will lead to programs having
to know about the hardware granularity. You could easily imagine
deadlock situations that could arise if one program is unwittingly
locking an area that is meant to be claimed by another.

I'm not clear on why you want to enforce page alignedness though? As
long as writes respect the lock boundaries (and not page boundaries)
why would use of a page cache change matters?

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-16 Thread Alan Cox

 I'm not a Linux kernel literate. However, I found your
 assertion surprising. Does procfs do page i/o as well?

No. An fs isnt required to use the page caches at all. It makes life a lot saner
to do so

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock()does not provide coherency guarantee

2000-09-16 Thread Linus Torvalds



On 16 Sep 2000, Trond Myklebust wrote:
 
 Yes. fs/read_write calls the NFS subsystem. The problem then is that
 NFS uses the generic_file_{read,write,mmap}() interfaces. These are
 what enforce use of the page cache.
 
 You could drop these functions, but that would mean designing an
 entire VFS for NFS's use alone. Such a decision would have to be very
 well motivated in order to convince Linus.

Ehh..

I would say that such a decision would be stupid beyond belief, and
impossible to motivate.

NFS certainly doesn't _have_ to use the page cache. However, not using the
page cache would basically in the end be equivalent to 
 (a) not having coherent mmap's over NFS
 (b) probably having much weaker caching
neither of which is really an option at all, I suspect.

You could do caching on your own, but I dare anybody to come up with a
better cache that is able to handle mmap and read/write coherency.

The page granularity issues come from mmap, not from the page cache per
se. ANYTHING that solves the coherency issue is pretty much bound by the
limitations of a page-cache-like thing - they are not limitations of the
implementation, but limitations pretty much inherent to the problem
itself.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-15 Thread Michael Eisler

> > " " == James Yarbrough <[EMAIL PROTECTED]> writes:
> 
>  > What is done for bypassing the cache when the size of a file
>  > lock held by the reading/writing process is not a multiple of
>  > the caching granularity?  Consider two different clients with
>  > processes sharing a file and locking 2k byte regions of the
>  > file and possibly updating these regions.  Suppose that each
>  > system caches 4k byte blocks.  If system A locks the first 2k
>  > of a block and system B locks the second 2k, the updates from
>  > one of the systems may be lost if these systems cache the
>  > writes.  This is because each system will write back the 4k
>  > block it cached, not the 2k block that was locked.
> 
> Under Linux writebacks have byte-sized granularity. If a page has been
> partially dirtied, we save that information, and only write back the
> dirty areas. As long as each system has restricted its updates to
> within the 2k block that it has locked, there should be no conflict.
> 
> If however one system has been writing over the full 4k block, then
> there will indeed be a race.

Using a page cache when locking is turned on is tricky at best. The
only working optimizations I know of in this area are allowing
the use of page cache when the entire file is locked.

My two cents ...

Focus on correctness and do the expedient thing
first, which is:
-  The first time a file is locked, flush dirty pages
to the server, and then invalidate the page cache
- While the file is locked, do vypass the page cache for all I/O.

Once that works, the gaping wound in the Linux NFS/NLM client will be
closed. This will give you the breathing room to experiment with
something that works more optimally (yet still correctly) in some conditions.
E.g., one possible optimization is to allow page I/O as long as the
locks are page aligned or whole file aligned.

-mre 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-15 Thread Trond Myklebust

> " " == James Yarbrough <[EMAIL PROTECTED]> writes:

 > What is done for bypassing the cache when the size of a file
 > lock held by the reading/writing process is not a multiple of
 > the caching granularity?  Consider two different clients with
 > processes sharing a file and locking 2k byte regions of the
 > file and possibly updating these regions.  Suppose that each
 > system caches 4k byte blocks.  If system A locks the first 2k
 > of a block and system B locks the second 2k, the updates from
 > one of the systems may be lost if these systems cache the
 > writes.  This is because each system will write back the 4k
 > block it cached, not the 2k block that was locked.

Under Linux writebacks have byte-sized granularity. If a page has been
partially dirtied, we save that information, and only write back the
dirty areas. As long as each system has restricted its updates to
within the 2k block that it has locked, there should be no conflict.

If however one system has been writing over the full 4k block, then
there will indeed be a race.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-15 Thread James Yarbrough

> This will always invalidate the page cache whenever we try to obtain
> the lock, hence you are guaranteed that the cache will be reread after

> the lock was grabbed.
> After unlocking however one needs no guarantees other than ensuring
> that any modifications were committed while we held the lock, so we
> flush the writebacks, drop the lock, and leave it at that.

What is done for bypassing the cache when the size of a file lock held
by the
reading/writing process is not a multiple of the caching granularity?
Consider
two different clients with processes sharing a file and locking 2k byte
regions
of the file and possibly updating these regions.  Suppose that each
system caches
4k byte blocks.  If system A locks the first 2k of a block and system B
locks
the second 2k, the updates from one of the systems may be lost if these
systems
cache the writes.  This is because each system will write back the 4k
block it
cached, not the 2k block that was locked.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-15 Thread Trond Myklebust

> " " == Michael Eisler <[EMAIL PROTECTED]> writes:

 > The fix still does not provide coherency guarantees in all
 > situations, and at minimum, there ought to be a way to force
 > the client provide a coherency guarantee.

Yes. I came to the same conclusion after having sent it off. I hope
the following patch is more acceptable.

This will always invalidate the page cache whenever we try to obtain
the lock, hence you are guaranteed that the cache will be reread after
the lock was grabbed.
After unlocking however one needs no guarantees other than ensuring
that any modifications were committed while we held the lock, so we
flush the writebacks, drop the lock, and leave it at that.

Cheers,
  Trond

--- linux-2.4.0-test8/fs/nfs/file.c Fri Jun 30 01:02:40 2000
+++ linux-2.4.0-test8-fix_lock/fs/nfs/file.cFri Sep 15 13:11:39 2000
@@ -291,10 +291,13 @@
status = 0;
 
/*
-* Make sure we re-validate anything we've got cached.
+* Make sure we clear the cache whenever we try to get the lock.
 * This makes locking act as a cache coherency point.
 */
  out_ok:
-   NFS_CACHEINV(inode);
+   if ((cmd == F_SETLK || cmd == F_SETLKW) && fl->fl_type != F_UNLCK) {
+   nfs_wb_all(inode);  /* we may have slept */
+   nfs_zap_caches(inode);
+   }
return status;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-15 Thread Trond Myklebust

 " " == Michael Eisler [EMAIL PROTECTED] writes:

  The fix still does not provide coherency guarantees in all
  situations, and at minimum, there ought to be a way to force
  the client provide a coherency guarantee.

Yes. I came to the same conclusion after having sent it off. I hope
the following patch is more acceptable.

This will always invalidate the page cache whenever we try to obtain
the lock, hence you are guaranteed that the cache will be reread after
the lock was grabbed.
After unlocking however one needs no guarantees other than ensuring
that any modifications were committed while we held the lock, so we
flush the writebacks, drop the lock, and leave it at that.

Cheers,
  Trond

--- linux-2.4.0-test8/fs/nfs/file.c Fri Jun 30 01:02:40 2000
+++ linux-2.4.0-test8-fix_lock/fs/nfs/file.cFri Sep 15 13:11:39 2000
@@ -291,10 +291,13 @@
status = 0;
 
/*
-* Make sure we re-validate anything we've got cached.
+* Make sure we clear the cache whenever we try to get the lock.
 * This makes locking act as a cache coherency point.
 */
  out_ok:
-   NFS_CACHEINV(inode);
+   if ((cmd == F_SETLK || cmd == F_SETLKW)  fl-fl_type != F_UNLCK) {
+   nfs_wb_all(inode);  /* we may have slept */
+   nfs_zap_caches(inode);
+   }
return status;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-15 Thread Michael Eisler

  " " == James Yarbrough [EMAIL PROTECTED] writes:
 
   What is done for bypassing the cache when the size of a file
   lock held by the reading/writing process is not a multiple of
   the caching granularity?  Consider two different clients with
   processes sharing a file and locking 2k byte regions of the
   file and possibly updating these regions.  Suppose that each
   system caches 4k byte blocks.  If system A locks the first 2k
   of a block and system B locks the second 2k, the updates from
   one of the systems may be lost if these systems cache the
   writes.  This is because each system will write back the 4k
   block it cached, not the 2k block that was locked.
 
 Under Linux writebacks have byte-sized granularity. If a page has been
 partially dirtied, we save that information, and only write back the
 dirty areas. As long as each system has restricted its updates to
 within the 2k block that it has locked, there should be no conflict.
 
 If however one system has been writing over the full 4k block, then
 there will indeed be a race.

Using a page cache when locking is turned on is tricky at best. The
only working optimizations I know of in this area are allowing
the use of page cache when the entire file is locked.

My two cents ...

Focus on correctness and do the expedient thing
first, which is:
-  The first time a file is locked, flush dirty pages
to the server, and then invalidate the page cache
- While the file is locked, do vypass the page cache for all I/O.

Once that works, the gaping wound in the Linux NFS/NLM client will be
closed. This will give you the breathing room to experiment with
something that works more optimally (yet still correctly) in some conditions.
E.g., one possible optimization is to allow page I/O as long as the
locks are page aligned or whole file aligned.

-mre 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Jeff Epler


Jeff Epler <[EMAIL PROTECTED]> writes:
> >  > Is there a solution that would allow the kind of guarantee our
> >  > software wants with non-linux nfsds without the cache-blowing
> >  > that the change I'm suggesting causes?

Trond:
> > As you can see, the idea is to look at whether or not the file has
> > changed recently (I arbitrarily chose a full minute as a concession
> > towards clusters with lousy clock synchronization). If it has, then
> > the page cache is zapped. If not, we force ordinary attribute cache
> > consistency checking.

On Thu, Sep 14, 2000 at 09:48:50AM -0700, Michael Eisler wrote:
> The fix still does not provide coherency guarantees in all situations, and
> at minimum, there ought to be a way to force the client provide a coherency
> guarantee.

I agree.

I wish that the way to force the guarantee was through fcntl(F_???LK), but
we'll take whatever way we can get.

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Michael Eisler

> > " " == Jeff Epler <[EMAIL PROTECTED]> writes:
> 
>  > Is there a solution that would allow the kind of guarantee our
>  > software wants with non-linux nfsds without the cache-blowing
>  > that the change I'm suggesting causes?
> 
> How about something like the following compromise?
> 
> I haven't tried it out yet (and I've no idea whether or not Linus
> would accept this) but it compiles, and it should definitely be better
> behaved with respect to slowly-changing files.
> 
> As you can see, the idea is to look at whether or not the file has
> changed recently (I arbitrarily chose a full minute as a concession
> towards clusters with lousy clock synchronization). If it has, then
> the page cache is zapped. If not, we force ordinary attribute cache
> consistency checking.

The fix still does not provide coherency guarantees in all situations, and
at minimum, there ought to be a way to force the client provide a coherency
guarantee.

> Cheers,
>   Trond
> 
> --- linux-2.4.0-test8/fs/nfs/file.c   Fri Jun 30 01:02:40 2000
> +++ linux-2.4.0-test8-fix_lock/fs/nfs/file.c  Thu Sep 14 09:18:50 2000
> @@ -240,6 +240,20 @@
>  }
>  
>  /*
> + * Ensure more conservative data cache consistency than NFS_CACHEINV()
> + * for files that change frequently. Avoids problems with sub-second
> + * changes that don't register on i_mtime.
> + */
> +static inline void
> +nfs_lock_cacheinv(struct inode *inode)
> +{
> + if ((long)CURRENT_TIME - (long)(inode->i_mtime + 60) < 0)
> + nfs_zap_caches(inode);
> + else
> + NFS_CACHEINV(inode);
> +}
> +
> +/*
>   * Lock a (portion of) a file
>   */
>  int
> @@ -295,6 +309,6 @@
>* This makes locking act as a cache coherency point.
>*/
>   out_ok:
> - NFS_CACHEINV(inode);
> + nfs_lock_cacheinv(inode);
>   return status;
>  }

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Albert D. Cahalan

Theodore Y. Ts'o writes:
> From: "Albert D. Cahalan" <[EMAIL PROTECTED]>

>> The ext2 inode has 6 obviously free bytes, 6 that are only used
>> on filesystems marked as Hurd-type, and 8 that seem to be claimed
>> by competing security and EA projects. So, being wasteful, it would
>> be possible to have picoseconds on all 4 time stamps. Doing mere
>> milliseconds on 3 timestamps would leave us 16 bytes for security.
>
> Obviously free to you, perhaps, but you're not the one who allocates
> bits out of the ext2 inode.  There are various projects that are all
> trying claim bits of ext2, including folks who want to implement
> fragment support, which you've conveniently edited out of your
> un-official ext2 inode structure which you sent out.

First of all, I give you fragments:

frag_data = 0;
file_size = i_size;
if(i_size_high>>31){  /* high bit marks fragment data */
  frag_data = i_size_high;
}else{
  file_size |= (__u64)i_size_high << 32; /* frag is insignificant */
}

Second of all, you had been planning on extents. With extents,
there should be no need to play around with fragments. You just
use a small block size.

Third, if somebody wants UFS, they can use the UFS driver!
More likely, these people should be using Reiserfs.

> There has been some talk of doubling the size of the ext2 inode, which
> will of course cause some backwards compatibility problems and would
> mean that you would only be able to use certain advanced features on new
> or converted ext2 filesystems.  However, there are enough downsides with
> this that it's something of a last resort.

Oh well. People used to gripe about ext2 being bloated compared
to the early Linux filesystems. Now we are getting JFS, with its
huge quad-size inodes.

> This is why I'd much prefer a solution which uses a few bits out of
> i_generation.  True, it's somewhat filesystem specific; but I expect
> that any solution that's going to work over multiple filesystems is
> going to have at least some filesystem specific bits.  And given that we
> need i_generation anyway for other purposes, we might as well allow one
> of the solutions to simply borrow some number of bits from i_generation.

Would you give up 6, 17, 21, or 30 bits? This would chop away at
the protection offered by i_generation.

Since accurate timestamps are generally useful for NFS and "make",
they ought to get at least 30 bits, if not 60. (for the three)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Theodore Y. Ts'o

   Date: Thu, 14 Sep 2000 17:03:11 +0200 (CEST)
   From: Trond Myklebust <[EMAIL PROTECTED]>

   For the timestamps, yes, but inode caching will take most of that
   hit. After all, the only time stat() reads from disk is when the inode
   has completely fallen out of the cache.

For commonly used inodes, sure, but there are lots of applications where
the inode cache doesn't buy you that much.  

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Trond Myklebust

> " " == Theodore Y Ts'o <[EMAIL PROTECTED]> writes:

 >Would it perhaps make sense to use one of these last 'free'
 >fields as a pointer to an 'inode entension'?  If you still
 >want ext2fs to be able to accommodate new projects and
 >ideas, then it seems that being able to extend the inode is
 >a desirable feature, but perhaps this overlaps with the
 >apparent plans for adding resource forks?

 > For stuff that's not commonly used, perhaps.  The problem is
 > that you take a speed hit when you have to seek somewhere else
 > to get at the inode extension.  So for something which is going
 > to have to be referenced for every stat() or getattr()
 > operation, there are a real performance issues with doing
 > something like that.

For the timestamps, yes, but inode caching will take most of that
hit. After all, the only time stat() reads from disk is when the inode
has completely fallen out of the cache.

I agree that doing something like this is less than ideal, but given
that the ext2fs inode space hunt seems to be a frequently recurring
issue it should be considered as an alternative that gives us whatever
space we need for 3 full-sized timestamps and possibly for any future
ext2fs projects while preserving backward compatibility.

Of course, as I said: when suggesting this, I'm moving completely out
of my depth. I'm not ready to argue seriously for or against anything
whatsoever, but I'd appreciate an informed opinion.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Theodore Y. Ts'o

   Date: Thu, 14 Sep 2000 15:09:35 +0200 (CEST)
   From: Trond Myklebust <[EMAIL PROTECTED]>

   Would it perhaps make sense to use one of these last 'free' fields
   as a pointer to an 'inode entension'?
   If you still want ext2fs to be able to accommodate new projects and
   ideas, then it seems that being able to extend the inode is a
   desirable feature, but perhaps this overlaps with the apparent plans
   for adding resource forks?

For stuff that's not commonly used, perhaps.  The problem is that you
take a speed hit when you have to seek somewhere else to get at the
inode extension.   So for something which is going to have to be
referenced for every stat() or getattr() operation, there are a real
performance issues with doing something like that.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Trond Myklebust

> " " == Theodore Y Ts'o <[EMAIL PROTECTED]> writes:

 > There has been some talk of doubling the size of the ext2
 > inode, which will of course cause some backwards compatibility
 > problems and would mean that you would only be able to use
 > certain advanced features on new or converted ext2 filesystems.
 > However, there are enough downsides with this that it's
 > something of a last resort.  It would make life a lot easier
 > for those various people doing new ext2 features from muscling
 > each other over space all the time.

I'm sure the idea has been raised before, but given the above
paragraph I can't resist poking my nose into where it doesn't really
belong:

Would it perhaps make sense to use one of these last 'free' fields
as a pointer to an 'inode entension'?
If you still want ext2fs to be able to accommodate new projects and
ideas, then it seems that being able to extend the inode is a
desirable feature, but perhaps this overlaps with the apparent plans
for adding resource forks?

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Theodore Y. Ts'o

   From: "Albert D. Cahalan" <[EMAIL PROTECTED]>
   Date:Wed, 13 Sep 2000 19:20:42 -0400 (EDT)

   The ext2 inode has 6 obviously free bytes, 6 that are only used
   on filesystems marked as Hurd-type, and 8 that seem to be claimed
   by competing security and EA projects. So, being wasteful, it would
   be possible to have picoseconds on all 4 time stamps. Doing mere
   milliseconds on 3 timestamps would leave us 16 bytes for security.

Obviously free to you, perhaps, but you're not the one who allocates
bits out of the ext2 inode.  There are various projects that are all
trying claim bits of ext2, including folks who want to implement
fragment support, which you've conveniently edited out of your
un-official ext2 inode structure which you sent out.

There has been some talk of doubling the size of the ext2 inode, which
will of course cause some backwards compatibility problems and would
mean that you would only be able to use certain advanced features on new
or converted ext2 filesystems.  However, there are enough downsides with
this that it's something of a last resort.  It would make life a lot
easier for those various people doing new ext2 features from muscling
each other over space all the time.

This is why I'd much prefer a solution which uses a few bits out of
i_generation.  True, it's somewhat filesystem specific; but I expect
that any solution that's going to work over multiple filesystems is
going to have at least some filesystem specific bits.  And given that we
need i_generation anyway for other purposes, we might as well allow one
of the solutions to simply borrow some number of bits from i_generation.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Andi Kleen

On Thu, Sep 14, 2000 at 11:51:20AM +0200, Trond Myklebust wrote:
> The reason I'm sceptical to this and other in-core type solutions is
> that knfsd doesn't keep files open for long: basically it opens a file
> and close it upon processing of each and every request from a
> client. You can therefore easily risk losing the inode between two
> requests while the file is still open on the client.
> If this means that you also lose part of the info in the mtime field,
> then you will trigger a data cache invalidation on the client.

knfsd keeps it only open for a small time, but the page cache references
the inode as long as it has any data pages for the file. As long as you're 
not trashing extensively it should stay in memory for a longer time.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Trond Myklebust

> " " == Andi Kleen <[EMAIL PROTECTED]> writes:

 > On Thu, Sep 14, 2000 at 10:49:59AM +0200, Trond Myklebust
 > wrote:
>> > " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:
>>
>> > The ext2 inode has 6 obviously free bytes, 6 that are only
>> > used on filesystems marked as Hurd-type, and 8 that seem to
>> > be claimed by competing security and EA projects. So, being
>> > wasteful, it would be possible to have picoseconds on all 4
>> > time stamps. Doing mere milliseconds on 3 timestamps would
>> > leave us 16 bytes for security.
>>
>> It's probably more realistic given the time resolution on most
>> PCs.  One could perhaps allow for greater resolution on those
>> filesystems that have the storage for it at the VFS level.

 > You could also just keep the more finegrained mtime in core and
 > when you flush the inode round it up to the next second.

That would work for ordinary (stateful) filesystems where you open it,
then read/write whatever you need then close the file.

The reason I'm sceptical to this and other in-core type solutions is
that knfsd doesn't keep files open for long: basically it opens a file
and close it upon processing of each and every request from a
client. You can therefore easily risk losing the inode between two
requests while the file is still open on the client.
If this means that you also lose part of the info in the mtime field,
then you will trigger a data cache invalidation on the client.

It's very important to avoid excessive cache invalidation if you want
stuff like mmap() to work on the client.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Andi Kleen

On Thu, Sep 14, 2000 at 10:49:59AM +0200, Trond Myklebust wrote:
> > " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:
> 
>  > The ext2 inode has 6 obviously free bytes, 6 that are only used
>  > on filesystems marked as Hurd-type, and 8 that seem to be
>  > claimed by competing security and EA projects. So, being
>  > wasteful, it would be possible to have picoseconds on all 4
>  > time stamps. Doing mere milliseconds on 3 timestamps would
>  > leave us 16 bytes for security.
> 
> It's probably more realistic given the time resolution on most PCs.
> One could perhaps allow for greater resolution on those filesystems
> that have the storage for it at the VFS level.

You could also just keep the more finegrained mtime in core and when
you flush the inode round it up to the next second. 


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Trond Myklebust

> " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:

 > The ext2 inode has 6 obviously free bytes, 6 that are only used
 > on filesystems marked as Hurd-type, and 8 that seem to be
 > claimed by competing security and EA projects. So, being
 > wasteful, it would be possible to have picoseconds on all 4
 > time stamps. Doing mere milliseconds on 3 timestamps would
 > leave us 16 bytes for security.

It's probably more realistic given the time resolution on most PCs.
One could perhaps allow for greater resolution on those filesystems
that have the storage for it at the VFS level.

What are the 'i_reserved'[12] fields all about?

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Trond Myklebust

> " " == Jeff Epler <[EMAIL PROTECTED]> writes:

 > Is there a solution that would allow the kind of guarantee our
 > software wants with non-linux nfsds without the cache-blowing
 > that the change I'm suggesting causes?

How about something like the following compromise?

I haven't tried it out yet (and I've no idea whether or not Linus
would accept this) but it compiles, and it should definitely be better
behaved with respect to slowly-changing files.

As you can see, the idea is to look at whether or not the file has
changed recently (I arbitrarily chose a full minute as a concession
towards clusters with lousy clock synchronization). If it has, then
the page cache is zapped. If not, we force ordinary attribute cache
consistency checking.

Cheers,
  Trond

--- linux-2.4.0-test8/fs/nfs/file.c Fri Jun 30 01:02:40 2000
+++ linux-2.4.0-test8-fix_lock/fs/nfs/file.cThu Sep 14 09:18:50 2000
@@ -240,6 +240,20 @@
 }
 
 /*
+ * Ensure more conservative data cache consistency than NFS_CACHEINV()
+ * for files that change frequently. Avoids problems with sub-second
+ * changes that don't register on i_mtime.
+ */
+static inline void
+nfs_lock_cacheinv(struct inode *inode)
+{
+   if ((long)CURRENT_TIME - (long)(inode->i_mtime + 60) < 0)
+   nfs_zap_caches(inode);
+   else
+   NFS_CACHEINV(inode);
+}
+
+/*
  * Lock a (portion of) a file
  */
 int
@@ -295,6 +309,6 @@
 * This makes locking act as a cache coherency point.
 */
  out_ok:
-   NFS_CACHEINV(inode);
+   nfs_lock_cacheinv(inode);
return status;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Trond Myklebust

 " " == Jeff Epler [EMAIL PROTECTED] writes:

  Is there a solution that would allow the kind of guarantee our
  software wants with non-linux nfsds without the cache-blowing
  that the change I'm suggesting causes?

How about something like the following compromise?

I haven't tried it out yet (and I've no idea whether or not Linus
would accept this) but it compiles, and it should definitely be better
behaved with respect to slowly-changing files.

As you can see, the idea is to look at whether or not the file has
changed recently (I arbitrarily chose a full minute as a concession
towards clusters with lousy clock synchronization). If it has, then
the page cache is zapped. If not, we force ordinary attribute cache
consistency checking.

Cheers,
  Trond

--- linux-2.4.0-test8/fs/nfs/file.c Fri Jun 30 01:02:40 2000
+++ linux-2.4.0-test8-fix_lock/fs/nfs/file.cThu Sep 14 09:18:50 2000
@@ -240,6 +240,20 @@
 }
 
 /*
+ * Ensure more conservative data cache consistency than NFS_CACHEINV()
+ * for files that change frequently. Avoids problems with sub-second
+ * changes that don't register on i_mtime.
+ */
+static inline void
+nfs_lock_cacheinv(struct inode *inode)
+{
+   if ((long)CURRENT_TIME - (long)(inode-i_mtime + 60)  0)
+   nfs_zap_caches(inode);
+   else
+   NFS_CACHEINV(inode);
+}
+
+/*
  * Lock a (portion of) a file
  */
 int
@@ -295,6 +309,6 @@
 * This makes locking act as a cache coherency point.
 */
  out_ok:
-   NFS_CACHEINV(inode);
+   nfs_lock_cacheinv(inode);
return status;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Andi Kleen

On Thu, Sep 14, 2000 at 10:49:59AM +0200, Trond Myklebust wrote:
  " " == Albert D Cahalan [EMAIL PROTECTED] writes:
 
   The ext2 inode has 6 obviously free bytes, 6 that are only used
   on filesystems marked as Hurd-type, and 8 that seem to be
   claimed by competing security and EA projects. So, being
   wasteful, it would be possible to have picoseconds on all 4
   time stamps. Doing mere milliseconds on 3 timestamps would
   leave us 16 bytes for security.
 
 It's probably more realistic given the time resolution on most PCs.
 One could perhaps allow for greater resolution on those filesystems
 that have the storage for it at the VFS level.

You could also just keep the more finegrained mtime in core and when
you flush the inode round it up to the next second. 


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Andi Kleen

On Thu, Sep 14, 2000 at 11:51:20AM +0200, Trond Myklebust wrote:
 The reason I'm sceptical to this and other in-core type solutions is
 that knfsd doesn't keep files open for long: basically it opens a file
 and close it upon processing of each and every request from a
 client. You can therefore easily risk losing the inode between two
 requests while the file is still open on the client.
 If this means that you also lose part of the info in the mtime field,
 then you will trigger a data cache invalidation on the client.

knfsd keeps it only open for a small time, but the page cache references
the inode as long as it has any data pages for the file. As long as you're 
not trashing extensively it should stay in memory for a longer time.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Theodore Y. Ts'o

   From: "Albert D. Cahalan" [EMAIL PROTECTED]
   Date:Wed, 13 Sep 2000 19:20:42 -0400 (EDT)

   The ext2 inode has 6 obviously free bytes, 6 that are only used
   on filesystems marked as Hurd-type, and 8 that seem to be claimed
   by competing security and EA projects. So, being wasteful, it would
   be possible to have picoseconds on all 4 time stamps. Doing mere
   milliseconds on 3 timestamps would leave us 16 bytes for security.

Obviously free to you, perhaps, but you're not the one who allocates
bits out of the ext2 inode.  There are various projects that are all
trying claim bits of ext2, including folks who want to implement
fragment support, which you've conveniently edited out of your
un-official ext2 inode structure which you sent out.

There has been some talk of doubling the size of the ext2 inode, which
will of course cause some backwards compatibility problems and would
mean that you would only be able to use certain advanced features on new
or converted ext2 filesystems.  However, there are enough downsides with
this that it's something of a last resort.  It would make life a lot
easier for those various people doing new ext2 features from muscling
each other over space all the time.

This is why I'd much prefer a solution which uses a few bits out of
i_generation.  True, it's somewhat filesystem specific; but I expect
that any solution that's going to work over multiple filesystems is
going to have at least some filesystem specific bits.  And given that we
need i_generation anyway for other purposes, we might as well allow one
of the solutions to simply borrow some number of bits from i_generation.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Theodore Y. Ts'o

   Date: Thu, 14 Sep 2000 15:09:35 +0200 (CEST)
   From: Trond Myklebust [EMAIL PROTECTED]

   Would it perhaps make sense to use one of these last 'free' fields
   as a pointer to an 'inode entension'?
   If you still want ext2fs to be able to accommodate new projects and
   ideas, then it seems that being able to extend the inode is a
   desirable feature, but perhaps this overlaps with the apparent plans
   for adding resource forks?

For stuff that's not commonly used, perhaps.  The problem is that you
take a speed hit when you have to seek somewhere else to get at the
inode extension.   So for something which is going to have to be
referenced for every stat() or getattr() operation, there are a real
performance issues with doing something like that.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Trond Myklebust

 " " == Theodore Y Ts'o [EMAIL PROTECTED] writes:

 Would it perhaps make sense to use one of these last 'free'
 fields as a pointer to an 'inode entension'?  If you still
 want ext2fs to be able to accommodate new projects and
 ideas, then it seems that being able to extend the inode is
 a desirable feature, but perhaps this overlaps with the
 apparent plans for adding resource forks?

  For stuff that's not commonly used, perhaps.  The problem is
  that you take a speed hit when you have to seek somewhere else
  to get at the inode extension.  So for something which is going
  to have to be referenced for every stat() or getattr()
  operation, there are a real performance issues with doing
  something like that.

For the timestamps, yes, but inode caching will take most of that
hit. After all, the only time stat() reads from disk is when the inode
has completely fallen out of the cache.

I agree that doing something like this is less than ideal, but given
that the ext2fs inode space hunt seems to be a frequently recurring
issue it should be considered as an alternative that gives us whatever
space we need for 3 full-sized timestamps and possibly for any future
ext2fs projects while preserving backward compatibility.

Of course, as I said: when suggesting this, I'm moving completely out
of my depth. I'm not ready to argue seriously for or against anything
whatsoever, but I'd appreciate an informed opinion.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Michael Eisler

  " " == Jeff Epler [EMAIL PROTECTED] writes:
 
   Is there a solution that would allow the kind of guarantee our
   software wants with non-linux nfsds without the cache-blowing
   that the change I'm suggesting causes?
 
 How about something like the following compromise?
 
 I haven't tried it out yet (and I've no idea whether or not Linus
 would accept this) but it compiles, and it should definitely be better
 behaved with respect to slowly-changing files.
 
 As you can see, the idea is to look at whether or not the file has
 changed recently (I arbitrarily chose a full minute as a concession
 towards clusters with lousy clock synchronization). If it has, then
 the page cache is zapped. If not, we force ordinary attribute cache
 consistency checking.

The fix still does not provide coherency guarantees in all situations, and
at minimum, there ought to be a way to force the client provide a coherency
guarantee.

 Cheers,
   Trond
 
 --- linux-2.4.0-test8/fs/nfs/file.c   Fri Jun 30 01:02:40 2000
 +++ linux-2.4.0-test8-fix_lock/fs/nfs/file.c  Thu Sep 14 09:18:50 2000
 @@ -240,6 +240,20 @@
  }
  
  /*
 + * Ensure more conservative data cache consistency than NFS_CACHEINV()
 + * for files that change frequently. Avoids problems with sub-second
 + * changes that don't register on i_mtime.
 + */
 +static inline void
 +nfs_lock_cacheinv(struct inode *inode)
 +{
 + if ((long)CURRENT_TIME - (long)(inode-i_mtime + 60)  0)
 + nfs_zap_caches(inode);
 + else
 + NFS_CACHEINV(inode);
 +}
 +
 +/*
   * Lock a (portion of) a file
   */
  int
 @@ -295,6 +309,6 @@
* This makes locking act as a cache coherency point.
*/
   out_ok:
 - NFS_CACHEINV(inode);
 + nfs_lock_cacheinv(inode);
   return status;
  }

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-14 Thread Jeff Epler


Jeff Epler [EMAIL PROTECTED] writes:
Is there a solution that would allow the kind of guarantee our
software wants with non-linux nfsds without the cache-blowing
that the change I'm suggesting causes?

Trond:
  As you can see, the idea is to look at whether or not the file has
  changed recently (I arbitrarily chose a full minute as a concession
  towards clusters with lousy clock synchronization). If it has, then
  the page cache is zapped. If not, we force ordinary attribute cache
  consistency checking.

On Thu, Sep 14, 2000 at 09:48:50AM -0700, Michael Eisler wrote:
 The fix still does not provide coherency guarantees in all situations, and
 at minimum, there ought to be a way to force the client provide a coherency
 guarantee.

I agree.

I wish that the way to force the guarantee was through fcntl(F_???LK), but
we'll take whatever way we can get.

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Jeff Epler

On Wed, Sep 13, 2000 at 11:56:56PM +0200, Trond Myklebust wrote:
> After consultation with the appropriate authorities, it turns out that
> the Sun-code based clients do in fact turn off data caching entirely
> when using NLM file locking.

Entirely?  That's interesting, because for our consistency requirements we
only need the data cache to be flushed for a particular inode at the time
of a locking call.  At least if we port to Solaris (hah!) we won't have
this particular problem in our clients.

> Attribute cache consistency checking on
> NFSv4 differs from v2/v3 in that you have a new attribute that is
> guaranteed to change every time the file changes.

That would be very welcome, but it's doubtful we could get an nfsv4 server
to run on our old bsdi 1.x-based servers :)  

Thank you very much for continuing to pursue this.  I wish I were an NFS
expert, but as I may have said elsewhere, this task falls to me as the
Linux evangelist at the office.  

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Jeff Epler

I think that both the NFS server changes that Trond is suggesting and the
NFS client changes that I am suggesting have their place.

The fact that the tuple (mtime, size) is used to test to test for
unchangedness of a file indicates that the people who designed NFS
understood that just using mtime wasn't enough.  The problem is that using
(mtime, size) is not enough either, and I don't understand why anyone ever
thought it would be.  Trivial example:
write(fd, "a", 1);
llseek(fd, 0, 0);
write(fd, "b", 1);

My solution will enable the Linux client (and other clients implementing
the same fix) to achieve the coherency promised by the comment in
nfs_lock(), at the expense of slightly lower performance, no matter what
NFS server they use (even a hypothetical one with mtime resolution of a
year).

Trond's nfs server changes would enable any client communicating with it to
achieve the coherency promised in nfs_lock(), without the performance
loss, but unfortunately the changes can't be made today and don't solve
anything for non-Linux NFS servers.

Thus, the only thing that is missing from my proposal is a way to fall back
on the old invalidation code, for that time in the future where mtime
changes are always sufficient to distinguish changes n a file.  I'm not
going to sweat this, since it sounds like this isn't implemented anywhere
else and won't be in Linux for some time.  When it comes along, we can
upgrade mount and the nfs client with a new flag.

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Albert D. Cahalan

Trond Myklebust writes:
> > " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:

>> 20 bits * 3 timestamps == 60 bits 60 bits <= 8 bytes
>>
>> So you do need 8 bytes.
>
> Yes. Assuming that you want to implement the microseconds field on
> all 3 timestamps. For NFS I would only need that precision on mtime.
> Does anybody else see a need for it on ctime and/or atime?

If you don't implement all 3, you get an ugly asymetry that
messes up timestamp comparisons.

You'd get ctime> I thought of this, but I don't think it is safe.
>>
>> write to file X us=1 write to file Y us=1 write to file Y us=2
>> write to file Y us=3 write to file X us=2
>>
>> Oops, "make" on some clients will think Y is newer than X.
>>
>> Using the wrong unit would be OK, as long as it is a real
>> sub-second time unit that doesn't overflow... but then you
>> might as well convert it to microseconds.
>
> Sure, but the alternative might be to reserve some bits as an
> 'operations counter' that is incremented every time an update is
> performed. That way you might have milisecond resolution
> (which is the sort of resolution that 'jiffies' & friends offers)
> + a few extra bits that would be there for make/NFS/... update
> consistency.

No way. You still get the same errors seen on clients that don't
know you are abusing the time stamps.

The ext2 inode has 6 obviously free bytes, 6 that are only used
on filesystems marked as Hurd-type, and 8 that seem to be claimed
by competing security and EA projects. So, being wasteful, it would
be possible to have picoseconds on all 4 time stamps. Doing mere
milliseconds on 3 timestamps would leave us 16 bytes for security.

This is what we have today:

struct ext2_inode {
__u16   i_mode; /* mode (type & permission bitss) */
__u16   i_uid;  /* owner's UID */
__u32   i_size; /* size (low 32 bits) */
__u32   i_atime;/* access time */
__u32   i_ctime;/* inode change time */
__u32   i_mtime;/* modification time */
__u32   i_dtime;/* deletion time */
__u16   i_gid;  /* group ID */
__u16   i_links_count;  /* link count */
__u32   i_blocks;   /* block count */
__u32   i_flags;/* ext2 flags */
union {
__u32   i_translator;   /* translator (Hurd only) */
__u32   i_res_fork; /* resource fork (Linux privs) */
} u1;
__u32   i_block[15];/* block pointers */
__u32   i_version;  /* inode generation (for NFS) */
__u32   i_file_acl; /* file ACL */
union {
__u32   i_dir_acl;  /* directory ACL */
__u32   i_size_high;/* high 32 bits of 64-bit size */
} u2;
__u32   i_reserved1;
__u16   i_reserved2;
__u16   i_mode_high;/* extra mode bits (Hurd only) */
__u16   i_uid_high; /* high 16 bits of 32-bit UID */
__u16   i_gid_high; /* high 16 bits of 32-bit GID */
__u32   i_author;   /* author (Hurd only) */
};
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

> " " == Andi Kleen <[EMAIL PROTECTED]> writes:

 > Steal a couple of bytes from what ?

>From the 32-bit storage area currently allocated to i_generation on
the on-disk ext2fs inode (as per Ted's suggestion). With current
disk/computing speeds, you probably don't need the full 32 bits to
offer a reasonable guarantee of inode number uniqueness.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

> " " == Jeff Epler <[EMAIL PROTECTED]> writes:

 > But "ctime and file size are the same" does not prove that the
 > file is unchanged.  That's the root of this problem, and why
 > NFS_CACHEINV(inode) is not enough to ensure coherency.

 > Furthermore, according to NFSv4, what I am suggesting is
 > something that a client "may choose" to do anyhow---i.e., it's
 > at least permitted by the spec, even if it's not required.

It appears I owe you an apology, and I'd better make it in public.

After consultation with the appropriate authorities, it turns out that
the Sun-code based clients do in fact turn off data caching entirely
when using NLM file locking. Attribute cache consistency checking on
NFSv4 differs from v2/v3 in that you have a new attribute that is
guaranteed to change every time the file changes.

I'm still not happy with that solution, as it means that performance
is sub-optimal for a number of interesting cases, and it screws up
mmap() royally. However it does shoot to pieces my argument that we
currently do conform to standard behaviour.

Cheers and sorry,
  Trond

BTW: the microsecond mtime resolution on the NFS server remains an
issue, but now only for the ordinary case of attribute cache
consistency checking.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

> " " == Theodore Y Ts'o <[EMAIL PROTECTED]> writes:

 > Why?  i_generation is a field which is only used by the NFS
 > server.  As far as ext2 is concerned, it's a black box value.
 > Currently, as I understand things (and you're much more the
 > export on the NFS code than I am) all 32 bits of i_generation
 > is used to enforce uniqueness of the NFS file handler across
 > the recycling of the inode (i.e., the inode is deleted, and
 > then reused for something else).  Is that right?

Yes. In effect you have a 32-bit counter that serves as a guarantees
that an inode is not reused 'within a reasonable period of time'.

 > If we were to steal, say, 8 bits of i_generation to provide a
 > subsecond indicator of freshness of the attributes by shoving
 > those 8 bits into the NFS microseconds field, this shouldn't
 > cause any incompatibilities, should it?  (Of couse, you would
 > mask out those 8 bits for the NFS file handle generation
 > algorithm, which you would only have 24 bits, which still
 > should be enough for its purposes.)

 > My understanding of the games played by the i_generation field
 > and the NFS file handle is that it's not visiable to NFS
 > clients.  True, it would mean a certain amount of confusion for
 > those clients that kept the filesystem mounted across a
 > transition between an old 2.2.16 kernel and a reboot to a new
 > kernel that interpreted the i_generation field differently, but
 > that should be the only compatibility problem I can think of.

 > Am I missing something?

We'd want to make an implementation that is generic and fits all
'commonly used' UNIX filesystems. That means that we don't want to
apply an ext2-specific hack at the VFS or nfsd levels.  The
inode-generation field is a feature on several filesystems which
already support sub-second timestamps.

Cheers,
 Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Andi Kleen

On Wed, Sep 13, 2000 at 10:35:10PM +0200, Trond Myklebust wrote:
>  > No, it does not help. i_generation is only in the NFS file
>  > handle, but not in the fattr/file id this is used for cache
>  > checks. The NFS file handle has to stay identical anyways, as
>  > long as the inode is not deleted/reused.
> 
> 
> You might be able to steal a couple of bytes and then rewrite ext2fs
> to mask those out from the 'i_generation' field, but it would mean that
> you could no longer boot your old 2.2.16 kernel without trouble on
> existing clients.

Steal a couple of bytes from what ? 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Frank van Maarseveen

On Wed, Sep 13, 2000 at 12:42:15PM +0200, Trond Myklebust wrote:
> 
> Is there any 'standard' function for reading the microseconds fields
> in userland? I couldn't find anything with a brief search in the
> X/Open docs.
> 
Both Digital OSF/1 and Solaris use 3 undocumented spare fields in the
struct stat and filled it with a microsecond/nanosecond fraction for
atime, mtime and ctime.

-- 
Frank
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Theodore Y. Ts'o

   Date: Wed, 13 Sep 2000 22:35:10 +0200 (CEST)
   From: Trond Myklebust <[EMAIL PROTECTED]>

   You might be able to steal a couple of bytes and then rewrite ext2fs
   to mask those out from the 'i_generation' field, but it would mean that
   you could no longer boot your old 2.2.16 kernel without trouble on
   existing clients.

Why?  i_generation is a field which is only used by the NFS server.  As
far as ext2 is concerned, it's a black box value.  Currently, as I
understand things (and you're much more the export on the NFS code than
I am) all 32 bits of i_generation is used to enforce uniqueness of the
NFS file handler across the recycling of the inode (i.e., the inode is
deleted, and then reused for something else).  Is that right?

If we were to steal, say, 8 bits of i_generation to provide a subsecond
indicator of freshness of the attributes by shoving those 8 bits into
the NFS microseconds field, this shouldn't cause any incompatibilities,
should it?  (Of couse, you would mask out those 8 bits for the NFS file
handle generation algorithm, which you would only have 24 bits, which
still should be enough for its purposes.)

My understanding of the games played by the i_generation field and the
NFS file handle is that it's not visiable to NFS clients.  True, it
would mean a certain amount of confusion for those clients that kept the
filesystem mounted across a transition between an old 2.2.16 kernel and
a reboot to a new kernel that interpreted the i_generation field
differently, but that should be the only compatibility problem I can
think of.

Am I missing something?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

> " " == Andi Kleen <[EMAIL PROTECTED]> writes:

 > On Wed, Sep 13, 2000 at 11:51:45AM -0400, Theodore Y. Ts'o
 > wrote:
>>
>> So this is a really stupid question, but I'll ask it anyway.
>> If you just need a cookie, is there any way that you might be
>> able to steal a few bits from i_generation field for that
>> purpose?  (This assumes that we only worry about solving the
>> problem for NFS.)

 > No, it does not help. i_generation is only in the NFS file
 > handle, but not in the fattr/file id this is used for cache
 > checks. The NFS file handle has to stay identical anyways, as
 > long as the inode is not deleted/reused.


You might be able to steal a couple of bytes and then rewrite ext2fs
to mask those out from the 'i_generation' field, but it would mean that
you could no longer boot your old 2.2.16 kernel without trouble on
existing clients.

I'm no expert on the consequences of such a change on ext2fs and what
it might do to other OSes, but my gut feeling is that this is perhaps
one solution that we should try to avoid.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

> " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:

 > 20 bits * 3 timestamps == 60 bits 60 bits <= 8 bytes

 > So you do need 8 bytes.

Yes. Assuming that you want to implement the microseconds field on all 
3 timestamps. For NFS I would only need that precision on mtime.
Does anybody else see a need for it on ctime and/or atime?

 > I thought of this, but I don't think it is safe.

 > write to file X us=1 write to file Y us=1 write to file Y us=2
 > write to file Y us=3 write to file X us=2

 > Oops, "make" on some clients will think Y is newer than X.

 > Using the wrong unit would be OK, as long as it is a real
 > sub-second time unit that doesn't overflow... but then you
 > might as well convert it to microseconds.

Sure, but the alternative might be to reserve some bits as an
'operations counter' that is incremented every time an update is
performed. That way you might have milisecond resolution (which is the
sort of resolution that 'jiffies' & friends offers) + a few extra bits
that would be there for make/NFS/... update consistency.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Andi Kleen

On Wed, Sep 13, 2000 at 11:51:45AM -0400, Theodore Y. Ts'o wrote:
> 
> So this is a really stupid question, but I'll ask it anyway.  If you
> just need a cookie, is there any way that you might be able to steal a
> few bits from i_generation field for that purpose?  (This assumes that
> we only worry about solving the problem for NFS.)

No, it does not help. i_generation is only in the NFS file handle, but not
in the fattr/file id this is used for cache checks. The NFS file handle
has to stay identical anyways, as long as the inode is not deleted/reused.

-Andi


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Theodore Y. Ts'o

   Date:Wed, 13 Sep 2000 12:54:49 +0200 (CEST)
   From: Trond Myklebust <[EMAIL PROTECTED]>

   Don't forget that 2^20 > 10^6, hence if you really want units of
   microseconds, you actually only need to save 3 bytes worth of data per
   timestamp.

   For the purposes of NFS, however the 'microseconds' field doesn't
   actually have to mean anything. It's just used as a cookie.
   Furthermore, only the mtime field, and to a limited extent the ctime
   field, are used for attribute consistency checking.

So this is a really stupid question, but I'll ask it anyway.  If you
just need a cookie, is there any way that you might be able to steal a
few bits from i_generation field for that purpose?  (This assumes that
we only worry about solving the problem for NFS.)

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Albert D. Cahalan

Trond Myklebust writes:
> > " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:

>> It would not be reasonable to try extending ext2 to 64-bit
>> times, but milliseconds might be doable. You'd need 4 bytes to
>> support the 3 standard time stamps.
>>
>> Going to microseconds would require 8 free bytes, which I don't
>> think we have. (but we do have more that one might think, due
>> to the unimplemented junk)
>
> Don't forget that 2^20 > 10^6, hence if you really want units of
> microseconds, you actually only need to save 3 bytes worth of data
> per timestamp.

20 bits * 3 timestamps == 60 bits
60 bits <= 8 bytes

So you do need 8 bytes.

> For the purposes of NFS, however the 'microseconds' field doesn't
> actually have to mean anything. It's just used as a cookie.

I thought of this, but I don't think it is safe.

write to file X   us=1
write to file Y   us=1
write to file Y   us=2
write to file Y   us=3
write to file X   us=2

Oops, "make" on some clients will think Y is newer than X.

Using the wrong unit would be OK, as long as it is a real
sub-second time unit that doesn't overflow... but then you
might as well convert it to microseconds.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

> " " == Jeff Epler <[EMAIL PROTECTED]> writes:

 > On Wed, Sep 13, 2000 at 02:53:02PM +0200, Trond Myklebust
 > wrote:
>> No. Things fall in and out of the inode cache all the
>> time. That's a vicious circle that's going to lead to a lot of
>> unnecessary traffic.

 > The traffic is not "unnecessary" if it's needed to work with
 > today's NFS servers and get correct results.  We need an
 > absolute guarantee in nfs_lock() that we won't read stale data,
 > we need it when working with non-linux NFS servers, and we need
 > it today.

That's the joy of free software: you're free to patch in any hack you
and your clients need.

The goal of the official 2.4.x kernel however should IMO be to follow
the specs which is what we do now.
If current Linux servers are broken w.r.t. NFS specs then a hack to
the Linux client won't fix them, and they will continue to break on
non-Linux clients.
NFS has a mechanism for solving this problem, and it is a mechanism
that most (if not all) non-linux servers implement. That is what we
need to implement too if we're serious about conforming to the
standard.

I'm quite willing to help fix the server (something I indeed think
should be done), but to further obscure the code in the client with
yet another bandaid for broken servers is not something I personally
will help do.

Perhaps Linus will indeed accept such a patch, in which case you (or
anybody else) are quite free to construct it and pass it on to him.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Jeff Epler

On Wed, Sep 13, 2000 at 02:53:02PM +0200, Trond Myklebust wrote:
> No. Things fall in and out of the inode cache all the time. That's a
> vicious circle that's going to lead to a lot of unnecessary traffic.

The traffic is not "unnecessary" if it's needed to work with today's NFS
servers and get correct results.  We need an absolute guarantee in nfs_lock()
that we won't read stale data, we need it when working with non-linux NFS
servers, and we need it today.

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

> " " == Jamie Lokier <[EMAIL PROTECTED]> writes:

 > If we're getting serious about adding finer-grained mtimes to
 > ext2, could we please consider using those bits as a way to
 > know, for sure, whether a file has changed?

Agreed.

 > Btw, the whole cache coherency thing would probably work very
 > well just updating times in in-core inodes.

No. Things fall in and out of the inode cache all the time. That's a
vicious circle that's going to lead to a lot of unnecessary traffic.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Jamie Lokier

Trond Myklebust wrote:
>  > It would not be reasonable to try extending ext2 to 64-bit
>  > times, but milliseconds might be doable. You'd need 4 bytes to
>  > support the 3 standard time stamps.
> 
>  > Going to microseconds would require 8 free bytes, which I don't
>  > think we have. (but we do have more that one might think, due
>  > to the unimplemented junk)
> 
> Don't forget that 2^20 > 10^6, hence if you really want units of
> microseconds, you actually only need to save 3 bytes worth of data per
> timestamp.
> 
> For the purposes of NFS, however the 'microseconds' field doesn't
> actually have to mean anything. It's just used as a cookie.
> Furthermore, only the mtime field, and to a limited extent the ctime
> field, are used for attribute consistency checking.

If we're getting serious about adding finer-grained mtimes to ext2,
could we please consider using those bits as a way to know, for sure,
whether a file has changed?

Btw, the whole cache coherency thing would probably work very well just
updating times in in-core inodes.

thanks,
-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Andi Kleen

On Wed, Sep 13, 2000 at 12:42:15PM +0200, Trond Myklebust wrote:
> > " " == Andi Kleen <[EMAIL PROTECTED]> writes:
> 
>  > On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust
>  > wrote:
> >> Making mtime a true 64-bit cookie on Linux servers would be a
> >> solution that works on all clients.
> 
>  > Making mtime 64bit would also be useful for local parallel make
>  > runs, the current second resolution leads to race conditions on
>  > bigger SMP boxes.
> 
>  > Unfortunately the only FS that supports   > currently is XFS afaik, and there is no VFS interface for it.
> 
> 
> You'd have to fix make as well: 'stat' & friends alway return the
> timestamp in seconds.

struct stat64 leaves enough padding for micro or nanosecond fields.

> 
> Is there any 'standard' function for reading the microseconds fields
> in userland? I couldn't find anything with a brief search in the
> X/Open docs.

Solaris seems to support it in newer releases, and GNU make 3.79 already does 
on Solaris. They seem to use a st_mtime_nsec.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

> " " == Albert D Cahalan <[EMAIL PROTECTED]> writes:

 > It would not be reasonable to try extending ext2 to 64-bit
 > times, but milliseconds might be doable. You'd need 4 bytes to
 > support the 3 standard time stamps.

 > Going to microseconds would require 8 free bytes, which I don't
 > think we have. (but we do have more that one might think, due
 > to the unimplemented junk)

Don't forget that 2^20 > 10^6, hence if you really want units of
microseconds, you actually only need to save 3 bytes worth of data per
timestamp.

For the purposes of NFS, however the 'microseconds' field doesn't
actually have to mean anything. It's just used as a cookie.
Furthermore, only the mtime field, and to a limited extent the ctime
field, are used for attribute consistency checking.

If NFS was the only concern therefore, you would get far on just
saving one extra 24-bit field.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

> " " == Andi Kleen <[EMAIL PROTECTED]> writes:

 > On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust
 > wrote:
>> Making mtime a true 64-bit cookie on Linux servers would be a
>> solution that works on all clients.

 > Making mtime 64bit would also be useful for local parallel make
 > runs, the current second resolution leads to race conditions on
 > bigger SMP boxes.

 > Unfortunately the only FS that supports  currently is XFS afaik, and there is no VFS interface for it.


You'd have to fix make as well: 'stat' & friends alway return the
timestamp in seconds.

Is there any 'standard' function for reading the microseconds fields
in userland? I couldn't find anything with a brief search in the
X/Open docs.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

 " " == Andi Kleen [EMAIL PROTECTED] writes:

  On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust
  wrote:
 Making mtime a true 64-bit cookie on Linux servers would be a
 solution that works on all clients.

  Making mtime 64bit would also be useful for local parallel make
  runs, the current second resolution leads to race conditions on
  bigger SMP boxes.

  Unfortunately the only FS that supports s on disk timestamps
  currently is XFS afaik, and there is no VFS interface for it.


You'd have to fix make as well: 'stat'  friends alway return the
timestamp in seconds.

Is there any 'standard' function for reading the microseconds fields
in userland? I couldn't find anything with a brief search in the
X/Open docs.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Andi Kleen

On Wed, Sep 13, 2000 at 12:42:15PM +0200, Trond Myklebust wrote:
  " " == Andi Kleen [EMAIL PROTECTED] writes:
 
   On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust
   wrote:
  Making mtime a true 64-bit cookie on Linux servers would be a
  solution that works on all clients.
 
   Making mtime 64bit would also be useful for local parallel make
   runs, the current second resolution leads to race conditions on
   bigger SMP boxes.
 
   Unfortunately the only FS that supports s on disk timestamps
   currently is XFS afaik, and there is no VFS interface for it.
 
 
 You'd have to fix make as well: 'stat'  friends alway return the
 timestamp in seconds.

struct stat64 leaves enough padding for micro or nanosecond fields.

 
 Is there any 'standard' function for reading the microseconds fields
 in userland? I couldn't find anything with a brief search in the
 X/Open docs.

Solaris seems to support it in newer releases, and GNU make 3.79 already does 
on Solaris. They seem to use a st_mtime_nsec.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

 " " == Jamie Lokier [EMAIL PROTECTED] writes:

  If we're getting serious about adding finer-grained mtimes to
  ext2, could we please consider using those bits as a way to
  know, for sure, whether a file has changed?

Agreed.

  Btw, the whole cache coherency thing would probably work very
  well just updating times in in-core inodes.

No. Things fall in and out of the inode cache all the time. That's a
vicious circle that's going to lead to a lot of unnecessary traffic.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

 " " == Jeff Epler [EMAIL PROTECTED] writes:

  On Wed, Sep 13, 2000 at 02:53:02PM +0200, Trond Myklebust
  wrote:
 No. Things fall in and out of the inode cache all the
 time. That's a vicious circle that's going to lead to a lot of
 unnecessary traffic.

  The traffic is not "unnecessary" if it's needed to work with
  today's NFS servers and get correct results.  We need an
  absolute guarantee in nfs_lock() that we won't read stale data,
  we need it when working with non-linux NFS servers, and we need
  it today.

That's the joy of free software: you're free to patch in any hack you
and your clients need.

The goal of the official 2.4.x kernel however should IMO be to follow
the specs which is what we do now.
If current Linux servers are broken w.r.t. NFS specs then a hack to
the Linux client won't fix them, and they will continue to break on
non-Linux clients.
NFS has a mechanism for solving this problem, and it is a mechanism
that most (if not all) non-linux servers implement. That is what we
need to implement too if we're serious about conforming to the
standard.

I'm quite willing to help fix the server (something I indeed think
should be done), but to further obscure the code in the client with
yet another bandaid for broken servers is not something I personally
will help do.

Perhaps Linus will indeed accept such a patch, in which case you (or
anybody else) are quite free to construct it and pass it on to him.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Theodore Y. Ts'o

   Date:Wed, 13 Sep 2000 12:54:49 +0200 (CEST)
   From: Trond Myklebust [EMAIL PROTECTED]

   Don't forget that 2^20  10^6, hence if you really want units of
   microseconds, you actually only need to save 3 bytes worth of data per
   timestamp.

   For the purposes of NFS, however the 'microseconds' field doesn't
   actually have to mean anything. It's just used as a cookie.
   Furthermore, only the mtime field, and to a limited extent the ctime
   field, are used for attribute consistency checking.

So this is a really stupid question, but I'll ask it anyway.  If you
just need a cookie, is there any way that you might be able to steal a
few bits from i_generation field for that purpose?  (This assumes that
we only worry about solving the problem for NFS.)

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

 " " == Albert D Cahalan [EMAIL PROTECTED] writes:

  20 bits * 3 timestamps == 60 bits 60 bits = 8 bytes

  So you do need 8 bytes.

Yes. Assuming that you want to implement the microseconds field on all 
3 timestamps. For NFS I would only need that precision on mtime.
Does anybody else see a need for it on ctime and/or atime?

  I thought of this, but I don't think it is safe.

  write to file X us=1 write to file Y us=1 write to file Y us=2
  write to file Y us=3 write to file X us=2

  Oops, "make" on some clients will think Y is newer than X.

  Using the wrong unit would be OK, as long as it is a real
  sub-second time unit that doesn't overflow... but then you
  might as well convert it to microseconds.

Sure, but the alternative might be to reserve some bits as an
'operations counter' that is incremented every time an update is
performed. That way you might have milisecond resolution (which is the
sort of resolution that 'jiffies'  friends offers) + a few extra bits
that would be there for make/NFS/... update consistency.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

 " " == Andi Kleen [EMAIL PROTECTED] writes:

  On Wed, Sep 13, 2000 at 11:51:45AM -0400, Theodore Y. Ts'o
  wrote:

 So this is a really stupid question, but I'll ask it anyway.
 If you just need a cookie, is there any way that you might be
 able to steal a few bits from i_generation field for that
 purpose?  (This assumes that we only worry about solving the
 problem for NFS.)

  No, it does not help. i_generation is only in the NFS file
  handle, but not in the fattr/file id this is used for cache
  checks. The NFS file handle has to stay identical anyways, as
  long as the inode is not deleted/reused.


You might be able to steal a couple of bytes and then rewrite ext2fs
to mask those out from the 'i_generation' field, but it would mean that
you could no longer boot your old 2.2.16 kernel without trouble on
existing clients.

I'm no expert on the consequences of such a change on ext2fs and what
it might do to other OSes, but my gut feeling is that this is perhaps
one solution that we should try to avoid.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Theodore Y. Ts'o

   Date: Wed, 13 Sep 2000 22:35:10 +0200 (CEST)
   From: Trond Myklebust [EMAIL PROTECTED]

   You might be able to steal a couple of bytes and then rewrite ext2fs
   to mask those out from the 'i_generation' field, but it would mean that
   you could no longer boot your old 2.2.16 kernel without trouble on
   existing clients.

Why?  i_generation is a field which is only used by the NFS server.  As
far as ext2 is concerned, it's a black box value.  Currently, as I
understand things (and you're much more the export on the NFS code than
I am) all 32 bits of i_generation is used to enforce uniqueness of the
NFS file handler across the recycling of the inode (i.e., the inode is
deleted, and then reused for something else).  Is that right?

If we were to steal, say, 8 bits of i_generation to provide a subsecond
indicator of freshness of the attributes by shoving those 8 bits into
the NFS microseconds field, this shouldn't cause any incompatibilities,
should it?  (Of couse, you would mask out those 8 bits for the NFS file
handle generation algorithm, which you would only have 24 bits, which
still should be enough for its purposes.)

My understanding of the games played by the i_generation field and the
NFS file handle is that it's not visiable to NFS clients.  True, it
would mean a certain amount of confusion for those clients that kept the
filesystem mounted across a transition between an old 2.2.16 kernel and
a reboot to a new kernel that interpreted the i_generation field
differently, but that should be the only compatibility problem I can
think of.

Am I missing something?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Frank van Maarseveen

On Wed, Sep 13, 2000 at 12:42:15PM +0200, Trond Myklebust wrote:
 
 Is there any 'standard' function for reading the microseconds fields
 in userland? I couldn't find anything with a brief search in the
 X/Open docs.
 
Both Digital OSF/1 and Solaris use 3 undocumented spare fields in the
struct stat and filled it with a microsecond/nanosecond fraction for
atime, mtime and ctime.

-- 
Frank
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Andi Kleen

On Wed, Sep 13, 2000 at 10:35:10PM +0200, Trond Myklebust wrote:
   No, it does not help. i_generation is only in the NFS file
   handle, but not in the fattr/file id this is used for cache
   checks. The NFS file handle has to stay identical anyways, as
   long as the inode is not deleted/reused.
 
 
 You might be able to steal a couple of bytes and then rewrite ext2fs
 to mask those out from the 'i_generation' field, but it would mean that
 you could no longer boot your old 2.2.16 kernel without trouble on
 existing clients.

Steal a couple of bytes from what ? 

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

 " " == Theodore Y Ts'o [EMAIL PROTECTED] writes:

  Why?  i_generation is a field which is only used by the NFS
  server.  As far as ext2 is concerned, it's a black box value.
  Currently, as I understand things (and you're much more the
  export on the NFS code than I am) all 32 bits of i_generation
  is used to enforce uniqueness of the NFS file handler across
  the recycling of the inode (i.e., the inode is deleted, and
  then reused for something else).  Is that right?

Yes. In effect you have a 32-bit counter that serves as a guarantees
that an inode is not reused 'within a reasonable period of time'.

  If we were to steal, say, 8 bits of i_generation to provide a
  subsecond indicator of freshness of the attributes by shoving
  those 8 bits into the NFS microseconds field, this shouldn't
  cause any incompatibilities, should it?  (Of couse, you would
  mask out those 8 bits for the NFS file handle generation
  algorithm, which you would only have 24 bits, which still
  should be enough for its purposes.)

  My understanding of the games played by the i_generation field
  and the NFS file handle is that it's not visiable to NFS
  clients.  True, it would mean a certain amount of confusion for
  those clients that kept the filesystem mounted across a
  transition between an old 2.2.16 kernel and a reboot to a new
  kernel that interpreted the i_generation field differently, but
  that should be the only compatibility problem I can think of.

  Am I missing something?

We'd want to make an implementation that is generic and fits all
'commonly used' UNIX filesystems. That means that we don't want to
apply an ext2-specific hack at the VFS or nfsd levels.  The
inode-generation field is a feature on several filesystems which
already support sub-second timestamps.

Cheers,
 Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Trond Myklebust

 " " == Andi Kleen [EMAIL PROTECTED] writes:

  Steal a couple of bytes from what ?

From the 32-bit storage area currently allocated to i_generation on
the on-disk ext2fs inode (as per Ted's suggestion). With current
disk/computing speeds, you probably don't need the full 32 bits to
offer a reasonable guarantee of inode number uniqueness.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-13 Thread Jeff Epler

I think that both the NFS server changes that Trond is suggesting and the
NFS client changes that I am suggesting have their place.

The fact that the tuple (mtime, size) is used to test to test for
unchangedness of a file indicates that the people who designed NFS
understood that just using mtime wasn't enough.  The problem is that using
(mtime, size) is not enough either, and I don't understand why anyone ever
thought it would be.  Trivial example:
write(fd, "a", 1);
llseek(fd, 0, 0);
write(fd, "b", 1);

My solution will enable the Linux client (and other clients implementing
the same fix) to achieve the coherency promised by the comment in
nfs_lock(), at the expense of slightly lower performance, no matter what
NFS server they use (even a hypothetical one with mtime resolution of a
year).

Trond's nfs server changes would enable any client communicating with it to
achieve the coherency promised in nfs_lock(), without the performance
loss, but unfortunately the changes can't be made today and don't solve
anything for non-Linux NFS servers.

Thus, the only thing that is missing from my proposal is a way to fall back
on the old invalidation code, for that time in the future where mtime
changes are always sufficient to distinguish changes n a file.  I'm not
going to sweat this, since it sounds like this isn't implemented anywhere
else and won't be in Linux for some time.  When it comes along, we can
upgrade mount and the nfs client with a new flag.

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Albert D. Cahalan

Trond Myklebust writes:

> Relying on the sub-second timing fields is a much more
> implementation-specific. It depends on the capabilities of both the
> filesystem and the server OS.
> Linux and the knfsd server code could easily be modified to provide
> such a service, but the problem (as I've stated before) is that you
> need to save the time somewhere on disk. I believe currently ext2
> provides only 32 bits of storage for mtime (though perhaps somebody
> else could comment on that).

Yes, ext2 has this limit.

It would not be reasonable to try extending ext2 to 64-bit times,
but milliseconds might be doable. You'd need 4 bytes to support
the 3 standard time stamps.

Going to microseconds would require 8 free bytes, which I don't
think we have. (but we do have more that one might think, due
to the unimplemented junk)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Andi Kleen

On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote:
> Making mtime a true 64-bit cookie on Linux servers would be a solution
> that works on all clients.

Making mtime 64bit would also be useful for local parallel make runs, 
the current second resolution leads to race conditions on bigger SMP boxes.

Unfortunately the only FS that supports http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Jeff Epler

On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote:
> > " " == Alan Cox <[EMAIL PROTECTED]> writes:
> 
>  > Providing everyone is careful to hold a lock I think it is
> 
>  > lockf() is a read barrier providing the local cache is flushed,
>  > the unlock is a write barrier providing the local cache is
>  > flushed first. Providing all users are using lockf for their
>  > I/O then it seems to be coherent
> 
> 
> Sure, but what happens if you're in a mixed client environment?
> If we have a purely Linux-specific hack to ensure cache coherency,
> that will still corrupt the cache on those *NIX clients that use
> ordinary cache coherency checking (i.e. checking mtime + file size)
> rather than cache invalidation.

We don't alter anything about what the other clients cache.  Sure, you can
connect to the same server with a client that doesn't have this fix, and
sure you might get different cached data on different clients at the same
time, but that's the current situation.  You don't make it *worse* for
anybody else, you make it *better* for Linux.

> There's nothing in the NLM+NFSv2+NFSv3 notes about this situation, but 
> on page 77 of the latest NFSv4 draft it states:
> 
>oFirst, when a client obtains a file lock for a particular
> region, the data cache corresponding to that region (if any
> cache data exists) must be revalidated.  If the change attribute
> indicates that the file may have been updated since the cached
> data was obtained, the client must flush or invalidate the
> cached data for the newly locked region.  A client might choose
> to invalidate all of non-modified cached data that it has for
> the file but the only requirement for correct operation is to
> invalidate all of the data in the newly locked region.

But "ctime and file size are the same" does not prove that the file is
unchanged.  That's the root of this problem, and why NFS_CACHEINV(inode) is
not enough to ensure coherency.

Furthermore, according to NFSv4, what I am suggesting is something that a
client "may choose" to do anyhow---i.e., it's at least permitted by the
spec, even if it's not required.

> Making mtime a true 64-bit cookie on Linux servers would be a solution
> that works on all clients.

Yes, but my solution works for all servers with a Linux client. (We've been
making this modification to NFS clients in several other unices for years,
we know it works and gives the guarantees we need)

> It also avoids a lot of unnecessary cache flushing. Imagine having to
> reread your entire mailbox every time you open the file whether or not
> a new message has arrived. Ugh...

It's better to get wrong results if you run a mail client at the same time
on two different machines, right?

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Jeff Epler

On Tue, Sep 12, 2000 at 07:08:09PM +0200, Trond Myklebust wrote:
> This is a known issue, and is not easy to fix. Neither of the
> solutions you propose are correct since they will both cause a cache
> invalidation. This is not the same as cache coherency checking.
> 
> 
> The correct solution in this case would be to add a 'sub-second time
> field' to ext2fs & friends. Both NFSv2 and NFSv3 pass a 64 bit field
[...]

Trond,
Thanks for your swift answer.  I sent a message a few months ago with
a much poorer solution which was perhaps rightly ignored

Is there any solution that's available today, and doesn't depend on
using Linux in the server?  I suspect that we will have to distribute
a modified nfs client with our app, and we're prepared to accept the
cache invalidation (and reduced performance it causes) for a technique
that will work with any NFS server and provide the level of guarantees
we need.

Is there a solution that would allow the kind of guarantee our software
wants with non-linux nfsds without the cache-blowing that the change I'm
suggesting causes?

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Trond Myklebust

> " " == Alan Cox <[EMAIL PROTECTED]> writes:

 > Providing everyone is careful to hold a lock I think it is

 > lockf() is a read barrier providing the local cache is flushed,
 > the unlock is a write barrier providing the local cache is
 > flushed first. Providing all users are using lockf for their
 > I/O then it seems to be coherent


Sure, but what happens if you're in a mixed client environment?
If we have a purely Linux-specific hack to ensure cache coherency,
that will still corrupt the cache on those *NIX clients that use
ordinary cache coherency checking (i.e. checking mtime + file size)
rather than cache invalidation.

There's nothing in the NLM+NFSv2+NFSv3 notes about this situation, but 
on page 77 of the latest NFSv4 draft it states:

   oFirst, when a client obtains a file lock for a particular
region, the data cache corresponding to that region (if any
cache data exists) must be revalidated.  If the change attribute
indicates that the file may have been updated since the cached
data was obtained, the client must flush or invalidate the
cached data for the newly locked region.  A client might choose
to invalidate all of non-modified cached data that it has for
the file but the only requirement for correct operation is to
invalidate all of the data in the newly locked region.


---

Making mtime a true 64-bit cookie on Linux servers would be a solution
that works on all clients.
It also avoids a lot of unnecessary cache flushing. Imagine having to
reread your entire mailbox every time you open the file whether or not
a new message has arrived. Ugh...

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Trond Myklebust

> " " == Jeff Epler <[EMAIL PROTECTED]> writes:

 > Is there any solution that's available today, and doesn't
 > depend on using Linux in the server?  I suspect that we will
 > have to distribute a modified nfs client with our app, and
 > we're prepared to accept the cache invalidation (and reduced
 > performance it causes) for a technique that will work with any
 > NFS server and provide the level of guarantees we need.

 > Is there a solution that would allow the kind of guarantee our
 > software wants with non-linux nfsds without the cache-blowing
 > that the change I'm suggesting causes?

If you ensure that the file length is changed each time it is written
to, then the cache coherency checking will detect that, and a cache
invalidation is guaranteed to occur. That is something that will
normally work on all NFS implementations.

Relying on the sub-second timing fields is a much more
implementation-specific. It depends on the capabilities of both the
filesystem and the server OS.
Linux and the knfsd server code could easily be modified to provide
such a service, but the problem (as I've stated before) is that you
need to save the time somewhere on disk. I believe currently ext2
provides only 32 bits of storage for mtime (though perhaps somebody
else could comment on that).
UFS does provide a 64-bit storage space for mtime. It would surprise
me if the other *NIX implementations aren't using this to improve NFS
cache consistency.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Trond Myklebust

> " " == Jeff Epler <[EMAIL PROTECTED]> writes:

 > But "ctime and file size are the same" does not prove that the
^ mtime
 > file is unchanged.  That's the root of this problem, and why
 > NFS_CACHEINV(inode) is not enough to ensure coherency.

Yes, but I just demonstrated that messing around with the client is a
hack, not a solution to this situation.
It's a hack because it is not compatible with the expected behaviour
of NFS clients. The NFSv4 draft demonstrates what the expected
behaviour is, and illustrates what we have to implement for a generic
fix (one that works on all clients -- not hacked Linux ones).

This is why I'm saying the correct solution is to get the server to
change mtime. To use the 'microseconds' field that is supported by the
NFS protocol, and that we currently set to 0.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Alan Cox

> If we have a purely Linux-specific hack to ensure cache coherency,
> that will still corrupt the cache on those *NIX clients that use
> ordinary cache coherency checking (i.e. checking mtime + file size)
> rather than cache invalidation.

Its what Solaris implements and what SunOS back down to 3.x implements. I
wouldnt be suprised if others do likewise ?

> It also avoids a lot of unnecessary cache flushing. Imagine having to
> reread your entire mailbox every time you open the file whether or not
> a new message has arrived. Ugh...

Interesting point.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Trond Myklebust

> " " == Jeff Epler <[EMAIL PROTECTED]> writes:

 > I believe I have discovered a problem in the Linux 2.2 nfs
 > client.  The short story is that NFS_CACHEINV(inode) does not
 > "act as a cache coherency point" (as nfs/file.c:nfs_lock()'s
 > comment implies) when multiple modifications to a file take
 > place during the same second, because in that case the file's
 > mtime is not changed.

 > The fix we have found is to *either* have NFS_CACHEINV(inode)
 > change inode-> i_mtime to an artificial value (0) or to call
 > nfs_zap_caches instead.  Since I am not sure which fix is
 > appropriate, I'm not enclosing an actual patch.

This is a known issue, and is not easy to fix. Neither of the
solutions you propose are correct since they will both cause a cache
invalidation. This is not the same as cache coherency checking.


The correct solution in this case would be to add a 'sub-second time
field' to ext2fs & friends. Both NFSv2 and NFSv3 pass a 64 bit field
for the mtime. Under NFSv2, the low 32 bits are meant to be
interpreted in units of micro-seconds. Under NFSv3, the units are
nanoseconds.
For the purposes of ensuring cache coherency, the 'sub-second' field
doesn't actually have to be a time-related thing at all. It could
simply be an 'operation count' or something like that. The
nfs_refresh_inode() uses the mtime field as a 64-bit cookie, so it
doesn't care as long as the field is conserved across server reboots.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Trond Myklebust

 " " == Jeff Epler [EMAIL PROTECTED] writes:

  I believe I have discovered a problem in the Linux 2.2 nfs
  client.  The short story is that NFS_CACHEINV(inode) does not
  "act as a cache coherency point" (as nfs/file.c:nfs_lock()'s
  comment implies) when multiple modifications to a file take
  place during the same second, because in that case the file's
  mtime is not changed.

  The fix we have found is to *either* have NFS_CACHEINV(inode)
  change inode- i_mtime to an artificial value (0) or to call
  nfs_zap_caches instead.  Since I am not sure which fix is
  appropriate, I'm not enclosing an actual patch.

This is a known issue, and is not easy to fix. Neither of the
solutions you propose are correct since they will both cause a cache
invalidation. This is not the same as cache coherency checking.


The correct solution in this case would be to add a 'sub-second time
field' to ext2fs  friends. Both NFSv2 and NFSv3 pass a 64 bit field
for the mtime. Under NFSv2, the low 32 bits are meant to be
interpreted in units of micro-seconds. Under NFSv3, the units are
nanoseconds.
For the purposes of ensuring cache coherency, the 'sub-second' field
doesn't actually have to be a time-related thing at all. It could
simply be an 'operation count' or something like that. The
nfs_refresh_inode() uses the mtime field as a 64-bit cookie, so it
doesn't care as long as the field is conserved across server reboots.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Alan Cox

 If we have a purely Linux-specific hack to ensure cache coherency,
 that will still corrupt the cache on those *NIX clients that use
 ordinary cache coherency checking (i.e. checking mtime + file size)
 rather than cache invalidation.

Its what Solaris implements and what SunOS back down to 3.x implements. I
wouldnt be suprised if others do likewise ?

 It also avoids a lot of unnecessary cache flushing. Imagine having to
 reread your entire mailbox every time you open the file whether or not
 a new message has arrived. Ugh...

Interesting point.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Trond Myklebust

 " " == Jeff Epler [EMAIL PROTECTED] writes:

  But "ctime and file size are the same" does not prove that the
^ mtime
  file is unchanged.  That's the root of this problem, and why
  NFS_CACHEINV(inode) is not enough to ensure coherency.

Yes, but I just demonstrated that messing around with the client is a
hack, not a solution to this situation.
It's a hack because it is not compatible with the expected behaviour
of NFS clients. The NFSv4 draft demonstrates what the expected
behaviour is, and illustrates what we have to implement for a generic
fix (one that works on all clients -- not hacked Linux ones).

This is why I'm saying the correct solution is to get the server to
change mtime. To use the 'microseconds' field that is supported by the
NFS protocol, and that we currently set to 0.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Trond Myklebust

 " " == Jeff Epler [EMAIL PROTECTED] writes:

  Is there any solution that's available today, and doesn't
  depend on using Linux in the server?  I suspect that we will
  have to distribute a modified nfs client with our app, and
  we're prepared to accept the cache invalidation (and reduced
  performance it causes) for a technique that will work with any
  NFS server and provide the level of guarantees we need.

  Is there a solution that would allow the kind of guarantee our
  software wants with non-linux nfsds without the cache-blowing
  that the change I'm suggesting causes?

If you ensure that the file length is changed each time it is written
to, then the cache coherency checking will detect that, and a cache
invalidation is guaranteed to occur. That is something that will
normally work on all NFS implementations.

Relying on the sub-second timing fields is a much more
implementation-specific. It depends on the capabilities of both the
filesystem and the server OS.
Linux and the knfsd server code could easily be modified to provide
such a service, but the problem (as I've stated before) is that you
need to save the time somewhere on disk. I believe currently ext2
provides only 32 bits of storage for mtime (though perhaps somebody
else could comment on that).
UFS does provide a 64-bit storage space for mtime. It would surprise
me if the other *NIX implementations aren't using this to improve NFS
cache consistency.

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Trond Myklebust

 " " == Alan Cox [EMAIL PROTECTED] writes:

  Providing everyone is careful to hold a lock I think it is

  lockf() is a read barrier providing the local cache is flushed,
  the unlock is a write barrier providing the local cache is
  flushed first. Providing all users are using lockf for their
  I/O then it seems to be coherent


Sure, but what happens if you're in a mixed client environment?
If we have a purely Linux-specific hack to ensure cache coherency,
that will still corrupt the cache on those *NIX clients that use
ordinary cache coherency checking (i.e. checking mtime + file size)
rather than cache invalidation.

There's nothing in the NLM+NFSv2+NFSv3 notes about this situation, but 
on page 77 of the latest NFSv4 draft it states:

   oFirst, when a client obtains a file lock for a particular
region, the data cache corresponding to that region (if any
cache data exists) must be revalidated.  If the change attribute
indicates that the file may have been updated since the cached
data was obtained, the client must flush or invalidate the
cached data for the newly locked region.  A client might choose
to invalidate all of non-modified cached data that it has for
the file but the only requirement for correct operation is to
invalidate all of the data in the newly locked region.


---

Making mtime a true 64-bit cookie on Linux servers would be a solution
that works on all clients.
It also avoids a lot of unnecessary cache flushing. Imagine having to
reread your entire mailbox every time you open the file whether or not
a new message has arrived. Ugh...

Cheers,
  Trond
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Jeff Epler

On Tue, Sep 12, 2000 at 07:08:09PM +0200, Trond Myklebust wrote:
 This is a known issue, and is not easy to fix. Neither of the
 solutions you propose are correct since they will both cause a cache
 invalidation. This is not the same as cache coherency checking.
 
 
 The correct solution in this case would be to add a 'sub-second time
 field' to ext2fs  friends. Both NFSv2 and NFSv3 pass a 64 bit field
[...]

Trond,
Thanks for your swift answer.  I sent a message a few months ago with
a much poorer solution which was perhaps rightly ignored

Is there any solution that's available today, and doesn't depend on
using Linux in the server?  I suspect that we will have to distribute
a modified nfs client with our app, and we're prepared to accept the
cache invalidation (and reduced performance it causes) for a technique
that will work with any NFS server and provide the level of guarantees
we need.

Is there a solution that would allow the kind of guarantee our software
wants with non-linux nfsds without the cache-blowing that the change I'm
suggesting causes?

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Jeff Epler

On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote:
  " " == Alan Cox [EMAIL PROTECTED] writes:
 
   Providing everyone is careful to hold a lock I think it is
 
   lockf() is a read barrier providing the local cache is flushed,
   the unlock is a write barrier providing the local cache is
   flushed first. Providing all users are using lockf for their
   I/O then it seems to be coherent
 
 
 Sure, but what happens if you're in a mixed client environment?
 If we have a purely Linux-specific hack to ensure cache coherency,
 that will still corrupt the cache on those *NIX clients that use
 ordinary cache coherency checking (i.e. checking mtime + file size)
 rather than cache invalidation.

We don't alter anything about what the other clients cache.  Sure, you can
connect to the same server with a client that doesn't have this fix, and
sure you might get different cached data on different clients at the same
time, but that's the current situation.  You don't make it *worse* for
anybody else, you make it *better* for Linux.

 There's nothing in the NLM+NFSv2+NFSv3 notes about this situation, but 
 on page 77 of the latest NFSv4 draft it states:
 
oFirst, when a client obtains a file lock for a particular
 region, the data cache corresponding to that region (if any
 cache data exists) must be revalidated.  If the change attribute
 indicates that the file may have been updated since the cached
 data was obtained, the client must flush or invalidate the
 cached data for the newly locked region.  A client might choose
 to invalidate all of non-modified cached data that it has for
 the file but the only requirement for correct operation is to
 invalidate all of the data in the newly locked region.

But "ctime and file size are the same" does not prove that the file is
unchanged.  That's the root of this problem, and why NFS_CACHEINV(inode) is
not enough to ensure coherency.

Furthermore, according to NFSv4, what I am suggesting is something that a
client "may choose" to do anyhow---i.e., it's at least permitted by the
spec, even if it's not required.

 Making mtime a true 64-bit cookie on Linux servers would be a solution
 that works on all clients.

Yes, but my solution works for all servers with a Linux client. (We've been
making this modification to NFS clients in several other unices for years,
we know it works and gives the guarantees we need)

 It also avoids a lot of unnecessary cache flushing. Imagine having to
 reread your entire mailbox every time you open the file whether or not
 a new message has arrived. Ugh...

It's better to get wrong results if you run a mail client at the same time
on two different machines, right?

Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Andi Kleen

On Tue, Sep 12, 2000 at 09:10:56PM +0200, Trond Myklebust wrote:
 Making mtime a true 64-bit cookie on Linux servers would be a solution
 that works on all clients.

Making mtime 64bit would also be useful for local parallel make runs, 
the current second resolution leads to race conditions on bigger SMP boxes.

Unfortunately the only FS that supports s on disk timestamps currently
is XFS afaik, and there is no VFS interface for it.


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: NFS locking bug -- limited mtime resolution means nfs_lock() does not provide coherency guarantee

2000-09-12 Thread Albert D. Cahalan

Trond Myklebust writes:

 Relying on the sub-second timing fields is a much more
 implementation-specific. It depends on the capabilities of both the
 filesystem and the server OS.
 Linux and the knfsd server code could easily be modified to provide
 such a service, but the problem (as I've stated before) is that you
 need to save the time somewhere on disk. I believe currently ext2
 provides only 32 bits of storage for mtime (though perhaps somebody
 else could comment on that).

Yes, ext2 has this limit.

It would not be reasonable to try extending ext2 to 64-bit times,
but milliseconds might be doable. You'd need 4 bytes to support
the 3 standard time stamps.

Going to microseconds would require 8 free bytes, which I don't
think we have. (but we do have more that one might think, due
to the unimplemented junk)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/