Re: If not readdir() then what?

2007-04-16 Thread Neil Brown
On Monday April 16, [EMAIL PROTECTED] wrote:
> 
> The challenge is making it be stable across inserts/deletes, never
> mind reboots.  And it's not a "little bit of cacheing"; in order to be
> correct we would have to cache *forever*, since at least in theory an
> NFS client could hold on to a cookie for an arbitrarily long period of
> time (weeks, months, years, decades), yes?

Yes.  But I think we've already established that the on-disk structure
chosen by ext3/htree is not able to perfectly support NFS (which is a
pity, given that it was written for Linux and Linux is supposed to
support NFS).  Our goal is to find the best mapping possible and,
where caching can improve stability for real-world uses, use caching
to help stabilise that mapping.

> 
> You're welcome to try, but I suspect it won't take long before you'll
> see why I'm asserting that a directory fd cache in nfsd is *way* less
> work.  :-)

You have provided some very helpful insights into how ext3/htree
currently works - thanks for that.
I will definitely make a closer inspection of the code and see how
feasible it is to realise my ideas.  I'll let you know how I go.

NeilBrown


Re: If not readdir() then what?

2007-04-16 Thread Theodore Tso
On Mon, Apr 16, 2007 at 04:18:53PM +1000, Neil Brown wrote:
> On Thursday April 12, [EMAIL PROTECTED] wrote:
> > 
> > Unfortunately, in the NFS case if there are hash collisions, under the
> > wrong circumstances the NFS client could loop forever trying to
> > readdir() a directory stream.
> 
> I think that can be easily avoided by the filesystem by simply
> ensuring that the cookies returned in a request are all *after* the
> 'llseek' cookie.  There may still be some risk of missing or extra
> entries in the directory if the filesystem cannot perform a precise
> cookie->position mapping, but an infinite loop should not be a problem.

It is true that you can trade the infinite loop problem for a
guarantee that certain filenames are simply always missing.
Neither seemed desirable to me, but it is true that if we have to err
in one direction or another, simply skipping files in the case of a
hash collision is probably a better result.

> You still haven't convinced me that the limited size is a fundamental
> problem.
> In the case of ext3,  I have suggested (various things including)
> using 56bits of the hash and an 8 bit sequence number through any
> hash chain that may eventuate.
> The seq number could not be stable across reboots with the current
> on-disk data, but could be made stable for almost all real usage with
> a little bit of caching (I imagine using upper 4 bits to give a
> sequence number while reading from disk, and lower 4 bits to give a
> sequence number when adding new entries).

The challenge is making it be stable across inserts/deletes, never
mind reboots.  And it's not a "little bit of cacheing"; in order to be
correct we would have to cache *forever*, since at least in theory an
NFS client could hold on to a cookie for an arbitrarily long period of
time (weeks, months, years, decades), yes?

> Supposing I were to try my hand at making the modifications I
> suggested to ext3 so that per page/hashvalue rbtrees were cached in
> the pagecache and were used to provide a stable-as-possible telldir
> cookie.

Well, the first problem you'll run into is that ext3 doesn't store
directories in the page cache.  That's because the journaling layer
fs/jbd operates on a buffer cache basis (i.e., on a per filesystem
block level).  

Secondly, storing the rbtree in the page cache doesn't help, because
the whole point of the rbtree is to have a stable pointer that doesn't
change across a b-tree node split.  Whether you use a page cache or
the buffer cache, it doesn't change the fact that when you do a b-tree
node split, half the directory entries *go* *someplace* *else*.  Put
another way, Linus's standard reasons for wanting to store the
directory in the page cache so we can have better readahead don't
apply when you're using htree, because with htree we're never reading
entries in page cache order.

So you wouldn't be able to store the rbtree in the page cache, since
doing so would defeat the whole purpose of having the rbtree
in the first place.  The rbtree needs to be per directory stream,
*not* per directory block.   

You could store a mapping between each directory entry and a sequence
number.  That could work, modulo the need to store these entries
*forever*.  But now you take a performance hit since every time you
create or delete a directory entry, you'll need to keep the cache up
to date.  And since we don't store the hash in the on-disk format
(since you can always calculate it from the filename), you'll need to
store the hash as well in the cache information.  So that means
storing a 64-bit hash plus the sequence counter information --- and if
we assume 32-bit alignment requirements for data structures, it means
the simplest way to do this would be to hang a data structure off of
each directory block which consumes 12 bytes of data for every
directory entry, plus either (a) extra space for pointers so you can
easily insert and delete entries as directory entries are created or
deleted, or (b) resigning yourself to using memmove() to insert and
delete entries in the linear array.  
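
To make the arithmetic concrete, something like the following is what
I have in mind --- purely a hypothetical sketch, not ext3 code; the
names are made up.  Two 32-bit halves of the hash plus a 32-bit
sequence number is where the 12 bytes per entry comes from:

#include <stdint.h>

/*
 * Hypothetical cache record hung off a directory block: the 64-bit
 * name hash split into two 32-bit halves plus a 32-bit sequence
 * number for entries whose hashes collide.  With 32-bit alignment
 * that is 12 bytes per directory entry.
 */
struct dx_cache_rec {
	uint32_t hash_hi;	/* upper 32 bits of the name hash */
	uint32_t hash_lo;	/* lower 32 bits of the name hash */
	uint32_t seq;		/* position in the collision chain */
};

/* Option (b): keep the records in a flat array per directory block
 * and memmove() on insert/delete rather than paying for pointers. */
struct dx_block_cache {
	uint32_t nr;			/* records currently in use */
	struct dx_cache_rec rec[];	/* one per directory entry */
};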

You're welcome to try, but I suspect it won't take long before you'll
see why I'm asserting that a directory fd cache in nfsd is *way* less
work.  :-)

- Ted


Re: If not readdir() then what?

2007-04-16 Thread Theodore Tso
On Mon, Apr 16, 2007 at 03:47:21PM +1000, Neil Brown wrote:
> "my guess", "pretty much" really bother me.
> 
> It sounds like "The largest anyone has asked for is 128bits, or let's
> double it and hope that is enough until the next protocol revision".
> Which was probably reasonable when NFSv2 was being developed and maybe
> even when v3 was developed, but I kind of hoped we were beyond that.
> 
> If a filesystem wanted to order filenames lexically, it really needs
> 256 *bytes*.  And it is fairly silly having a cookie that big.
> 
> I still think that
>filename + 64bits
> is required and plenty (aka necessary and sufficient).  

Sure, but if you're going to include the filename in the cookie, on
the client side you now have to store a variable-length state, which
probably means you'll need to allocate and free memory each time; and
if the filename is 256 characters, you'll have to send that back in
the next readdir() request. 

If we could get filename + 64bits, sure, that would be great.  I was
just assuming we couldn't get it --- and if we can't get it, 256 bits
gives room for two 128-bit hashes.  So that's one hash for the filename,
and 128 bits for a filesystem's internal hash and collision handling.
There might be other
ways that the space could be divided up, which might be somewhat
wasteful of space --- say you need a host identifier for a clustered
filesystem, although arguably adding a host might be infrequent enough
that you just use the cookie verifier hammer and force the client to
get a new set of readdir cookies.  :-)
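
Purely for illustration, one way the 256 bits might be carved up ---
the field sizes and names below are guesses on my part, not a protocol
proposal:

#include <stdint.h>

/* Hypothetical 256-bit READDIR cookie layout: a 128-bit hash of the
 * last filename returned, the filesystem's own 64-bit directory hash,
 * and 64 bits spare for a collision counter or a host identifier in a
 * clustered filesystem. */
struct readdir_cookie256 {
	uint8_t  name_hash[16];	/* 128-bit hash of the last filename */
	uint64_t fs_hash;	/* filesystem-internal directory hash */
	uint64_t extra;		/* collision counter, host id, ... */
};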

> I wouldn't argue against 128bits (64 for a search key and 64 to
> guarantee uniqueness) but I really think 256 is excessive with no added value.
> We still need the last-filename in the READDIR key.

I wouldn't complain too much about 128 bits, but if we're going to go
fixed size, I can imagine filesystems where that might not be enough.
And the difference between 16 and 32 bytes isn't that great.  But I
could easily live with either filename + 64bits, or 128 bits.

- Ted


Re: If not readdir() then what?

2007-04-16 Thread Neil Brown
On Thursday April 12, [EMAIL PROTECTED] wrote:
> 
> Unfortunately, in the NFS case if there are hash collisions, under the
> wrong circumstances the NFS client could loop forever trying to
> readdir() a directory stream.

I think that can be easily avoided by the filesystem by simply
ensuring that the cookies returned in a request are all *after* the
'llseek' cookie.  There may still be some risk of missing or extra
entries in the directory if the filesystem cannot perform a precise
cookie->position mapping, but an infinite loop should not be a problem.
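
Concretely, the rule I mean is just the following (a sketch with
made-up types, not code from any filesystem): when a client resumes a
directory stream at some cookie, only entries whose cookie is strictly
greater are returned, so a hash collision may cause an entry to be
skipped but can never send the stream backwards.

#include <stdint.h>
#include <stdbool.h>

struct dir_entry {
	uint64_t cookie;	/* hash-derived position of this entry */
	const char *name;
};

/* Emit an entry only if it sorts strictly after the resume cookie. */
static bool emit_after(const struct dir_entry *e, uint64_t resume_cookie)
{
	return e->cookie > resume_cookie;
}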

> 
> > This is a simple consequence of the design decision to use hashes as
> > the search key.  They aren't dense and they will collide.  So the
> > solution will be a bit fuzzy around the edges.  And maybe that is an
> > acceptable tradeoff.  But the filesystem should take full
> > responsibility for it, whether in performance or correctness :-)
> 
> Well, we could also say that it is NFS's fault that they used a
> limited size cookie as a fundamental part of their protocol 
> 

You still haven't convinced me that the limited size is a fundamental
problem.
In the case of ext3,  I have suggested (various things including)
using 56bits of the hash and an 8 bit sequence number through any
hash chain that may eventuate.
The seq number could not be stable across reboots with the current
on-disk data, but could be made stable for almost all real usage with
a little bit of caching (I imagine using upper 4 bits to give a
sequence number while reading from disk, and lower 4 bits to give a
sequence number when adding new entries).
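
In (hypothetical) code, the cookie encoding I am imagining looks
something like this --- the helper names are made up, and the 4+4
split is just one reading of the idea above:

#include <stdint.h>

#define SEQ_BITS	8
#define SEQ_MASK	((1u << SEQ_BITS) - 1)

/* Pack 56 bits of directory hash (high bits) and an 8-bit
 * collision-chain sequence number (low bits) into one cookie. */
static inline uint64_t make_cookie(uint64_t hash56, unsigned int seq)
{
	return (hash56 << SEQ_BITS) | (seq & SEQ_MASK);
}

static inline uint64_t cookie_hash(uint64_t cookie)
{
	return cookie >> SEQ_BITS;
}

static inline unsigned int cookie_seq(uint64_t cookie)
{
	return (unsigned int)(cookie & SEQ_MASK);
}

/* One reading of the 4+4 split: sequence numbers handed out while
 * scanning existing entries from disk use the upper nibble, sequence
 * numbers handed out when new entries are added use the lower nibble,
 * so the two allocators never collide with each other. */
#define SEQ_FROM_DISK(n)	(((n) & 0x0f) << 4)
#define SEQ_NEWLY_ADDED(n)	((n) & 0x0f)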

Even if you adjusted this to 32bits hash and 32bits seq number to
guarantee that the seq number never overflowed, collisions would still
be sufficiently rare that you wouldn't notice them in performance
measurements, but it might just give the collision handling code some
testing in the real world.

> > But there are alternatives.  e.g. internal chaining.
..
> 
> This solution requires an incompatible file format change. 

I wasn't suggesting that ext3 use internal chaining.  That is
obviously not an option.  I was putting the idea forward for any
filesystem developers who may not have finalised their directory
design yet so they could see a mechanism for indexing directories that
gives 100% correct semantics.

> Again, compared to a directory fd cache, what you're proposing is a huge
> hit to the filesystem, and at the moment, given that telldir/seekdir
> is rarely used by everyone else, it's mainly NFS which is the main bad
> actor here by insisting on the use of a small 31/63-bit cookie as a
> condition of protocol correctness.

What I would really like to propose (which may or may not match what
it might have seemed like I was proposing in the past) is that ext3
(and every filesystem) should produce and consume cookies that are as
close to perfect as practically possible.  That means guaranteeing no
infinite loops and minimising the chance of lost or duplicated entries
in the case of completely uncached NFS/READDIR access.

On top of that I propose that it might be appropriate to provide some
caching somewhere on the understanding that it isn't required for
correctness and is there primarily for performance (though it might
smooth over some correctness rough edges).

I think it is best for that caching to reside in the page cache, but I
am not completely against putting fd-caching in nfsd.

Supposing I were to try my hand at making the modifications I
suggested to ext3 so that per page/hashvalue rbtrees were cached in
the pagecache and were used to provide a stable-as-possible telldir
cookie.
Would you be willing to review and comment on such patches, and would
you be open to including them in ext3 (or ext4) if they proved to be
suitably stable, maintainable, efficient, etc??

NeilBrown


Re: If not readdir() then what?

2007-04-15 Thread Neil Brown
On Sunday April 15, [EMAIL PROTECTED] wrote:
> On Thu, Apr 12, 2007 at 10:35:49AM -0700, H. Peter Anvin wrote:
> > 
> > Any fixed size is too small.  It should be a dynamic size.
> 
> Ideally it should be dynamic, but my guess is that if the cookie were a
> fixed 256 bits, it would be sufficient for pretty much all filesystems.  
> 

"my guess", "pretty much" really bother me.

It sounds like "The largest anyone has asked for is 128bits, or let's
double it and hope that is enough until the next protocol revision".
Which was probably reasonable when NFSv2 was being developed and maybe
even when v3 was developed, but I kind of hoped we were beyond that.

If a filesystem wanted to order filenames lexically, it really needs
256 *bytes*.  And it is fairly silly having a cookie that big.

I still think that
   filename + 64bits
is required and plenty (i.e. necessary and sufficient).  You would need
a truly enormous directory before that would be insufficient for an
off-set based scheme or a hash-plus-collision scheme, and it is
clearly enough for a lexically-ordered scheme.
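
As a sketch, the cookie I am imagining would carry something like the
following (made-up names, hypothetical layout): the last filename
returned plus 64 bits of search key.

#include <stdint.h>

/* Variable-length READDIR cookie: enough for an offset scheme, a
 * hash-plus-collision scheme, or a lexically-ordered scheme that
 * restarts from the name itself. */
struct readdir_key {
	uint64_t key;		/* offset / hash / position hint */
	uint16_t name_len;	/* length of name[] in bytes */
	char     name[];	/* last filename returned, not NUL-terminated */
};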

I wouldn't argue against 128bits (64 for a search key and 64 to
guarantee uniqueness) but I really think 256 is excessive with no added value.
We still need the last-filename in the READDIR key.

(Mind you, I'm still wondering why NFSv4 felt the need for 128 *byte*
 file handles so maybe I'm just small minded :-)

NeilBrown


Re: If not readdir() then what?

2007-04-15 Thread Theodore Tso
On Thu, Apr 12, 2007 at 10:35:49AM -0700, H. Peter Anvin wrote:
> J. Bruce Fields wrote:
> >On Thu, Apr 12, 2007 at 08:21:16AM -0400, Theodore Tso wrote:
> >>Again, compared to a directory fd cache, what you're proposing a huge
> >>hit to the filesystem, and at the moment, given that telldir/seekdir
> >>is rarely used by everyone else, it's mainly NFS which is the main bad
> >>actor here by insisting on the use of a small 31/63-bit cookie as a
> >>condition of protocol correctness.
> >
> >If we want to get bigger cookies into the protocol, then the sooner we
> >start working on that the better  How big is big enough?  And is a
> >larger cookie sufficient on its own?
> >
> 
> Any fixed size is too small.  It should be a dynamic size.

Ideally it should be dynamic, but my guess is that if the cookie were a
fixed 256 bits, it would be sufficient for pretty much all filesystems.  

- Ted



Re: If not readdir() then what?

2007-04-12 Thread H. Peter Anvin

J. Bruce Fields wrote:
> On Thu, Apr 12, 2007 at 08:21:16AM -0400, Theodore Tso wrote:
> > Again, compared to a directory fd cache, what you're proposing is a huge
> > hit to the filesystem, and at the moment, given that telldir/seekdir
> > is rarely used by everyone else, it's mainly NFS which is the main bad
> > actor here by insisting on the use of a small 31/63-bit cookie as a
> > condition of protocol correctness.
> 
> If we want to get bigger cookies into the protocol, then the sooner we
> start working on that, the better.  How big is big enough?  And is a
> larger cookie sufficient on its own?

Any fixed size is too small.  It should be a dynamic size.

-hpa


Re: If not readdir() then what?

2007-04-12 Thread J. Bruce Fields
On Thu, Apr 12, 2007 at 08:21:16AM -0400, Theodore Tso wrote:
> Again, compared to a directory fd cache, what you're proposing is a huge
> hit to the filesystem, and at the moment, given that telldir/seekdir
> is rarely used by everyone else, it's mainly NFS which is the main bad
> actor here by insisting on the use of a small 31/63-bit cookie as a
> condition of protocol correctness.

If we want to get bigger cookies into the protocol, then the sooner we
start working on that, the better.  How big is big enough?  And is a
larger cookie sufficient on its own?

--b.


Re: If not readdir() then what?

2007-04-12 Thread Theodore Tso
On Thu, Apr 12, 2007 at 03:57:41PM +1000, Neil Brown wrote:
> But my perspective is that a solution in nfsd is at best a work-around.
> Caching the whole 'struct file' when there is just a small bit that we
> might want seems like a heavy hammer.  The filesystem is in the best
> place to know what needs to be cached, and it should be the one doing
> the caching.

Sure, but ext3 needs to cache the information in the file handle,
because it's dynamic, per-directory stream information that needs to
be cached.  The fundamental problem is the broken nature of the 64-bit
cookie; it simply isn't big enough.  So what's being disputed is who
gets to pay the cost of that particular design mistake, in POSIX and
in NFS?

In the POSIX case, right now only applications that use
telldir/seekdir pay the cost, which is that they might see some repeated
directory entries in the case of hash collisions.

Unfortunately, in the NFS case if there are hash collisions, under the
wrong circumstances the NFS client could loop forever trying to
readdir() a directory stream.

> This is a simple consequence of the design decision to use hashes as
> the search key.  They aren't dense and they will collide.  So the
> solution will be a bit fuzzy around the edges.  And maybe that is an
> acceptable tradeoff.  But the filesystem should take full
> responsibility for it, whether in performance or correctness :-)

Well, we could also say that it is NFS's fault that they used a
limited size cookie as a fundamental part of their protocol 

> But there are alternatives.  e.g. internal chaining.
> Insist on a unique 64bit hash for every file.  If the hash is in use,
> increment and try again.  On lookup, if the hash leads you to a file
> with the wrong name, increment and try again until you find a hole
> (hash value that is not stored).  When you delete an entry, leave a
> place holder if the next hash is in use.  Conversely if the next hash
> is not in use, delete the entry and delete the previous one if it is a
> place holder.

This solution requires an incompatible file format change.  In
addition, it means trying to garbage-collect directory entries that sit at
the beginning or end of a directory block without dragging in the
previous or next directory block.  This is a huge amount of hair, will
screw over performance, and is not compatible with how we do things
today.  

(It also means that you have to store the hash in the directory entry,
which we don't do today, since we can always calculate the hash from
the file name.  But if in some cases the hash is going to be some
small integer plus the calculated hash, you have to bloat the
directory entry by an extra 8 bytes per dirent.)

Again, compared to a directory fd cache, what you're proposing is a huge
hit to the filesystem, and at the moment, given that telldir/seekdir
is rarely used by everyone else, it's mainly NFS which is the main bad
actor here by insisting on the use of a small 31/63-bit cookie as a
condition of protocol correctness.

- Ted


Re: If not readdir() then what?

2007-04-12 Thread Jörn Engel
On Thu, 12 April 2007 15:57:41 +1000, Neil Brown wrote:
> 
> > However, the collision chain gives me quite a bit of headache.  One
> > would have to store each entry's position on the chain, deal with older
> > entries getting deleted, newer entries getting removed, etc.  All this
> > requires a lot of complicated code that basically never gets tested in
> > the wild.
> 
> This is a simple consequence of the design decision to use hashes as
> the search key.  They aren't dense and they will collide.  So the
> solution will be a bit fuzzy around the edges.  And maybe that is an
> acceptable tradeoff.  But the filesystem should take full
> responsibility for it, whether in performance or correctness :-)

Sure.  And seeing that not using hashes would kill performance long
before 4 billion dentries are reached, there don't seem to be many
downsides to hashing in principle.

> > Just settling for a 64bit hash and returning -EEXIST when someone causes
> > a collision on creat() sounds more appealing.  Directories with 4
> > billion entries will cause problems, but that is hardly news to anyone.
> 
> I think you want -EFBIG or -ENOSPC.  -EEXIST sounds just wrong.

None of them are 100% correct.  But you are right, -ENOSPC seems to do
less harm.

> But there are alternatives.  e.g. internal chaining.
> Insist on a unique 64bit hash for every file.  If the hash is in use,
> increment and try again.  On lookup, if the hash leads you to a file
> with the wrong name, increment and try again until you find a hole
> (hash value that is not stored).  When you delete an entry, leave a
> place holder if the next hash is in use.  Conversely if the next hash
> is not in use, delete the entry and delete the previous one if it is a
> place holder.

That would work and is limited to reasonable complexity.  It still
suffers from getting virtually no testing in the wild and therefore
being one of the dark corners little critters thrive in.  But one can at
least add a config option to fold the hash to 16bit or so.  And cross
fingers that at least one person will occasionally test with that
option.

> You have to require 64bit cookies/fpos, but I think that today, that
> is a reasonable thing to require (5 years ago it might not have been).

Which brings us back to the start of this thread.

Jörn

-- 
Why do musicians compose symphonies and poets write poems?
They do it because life wouldn't have any meaning for them if they didn't.
That's why I draw cartoons.  It's my life.
-- Charles Shultz


Re: If not readdir() then what?

2007-04-12 Thread Neil Brown
On Thursday April 12, [EMAIL PROTECTED] wrote:
> On Thu, 12 April 2007 11:46:41 +1000, Neil Brown wrote:
> > 
> > I could argue that nfs came before ext3+dirindex, so ext3 should have
> > been designed to work properly with NFS.  You could argue that fixing
> > it in nfsd fixes it for all filesystems.  But I'm not sure either of
> > those arguments are likely to be at all convincing...
> 
> Caring about a non-ext3 filesystem, I sure would like an nfs solution as
> well. :)

I have a non-ext3 filesystem I care about too.

But my perspective is that a solution in nfsd is at best a work-around.
Caching the whole 'struct file' when there is just a small bit that we
might want seems like a heavy hammer.  The filesystem is in the best
place to know what needs to be cached, and it should be the one doing
the caching.

> 
> > Hmmm. I wonder.  Which is more likely?
> >   - That two 64bit hashes from some set are the same
> >   - or that 65536 48bit hashes from a set of equal size are the same.
> 
> The former.  Each bit going from hash strength to collision chain length
> reduces the likelihood of an overflow.  In the extreme case of a 0bit
> hash and 64bit collision chain, you need 2^64 entries compared to 2^32
> for the other extreme.
> 
> However, the collision chain gives me quite a bit of headache.  One
> would have to store each entry's position on the chain, deal with older
> entries getting deleted, newer entries getting removed, etc.  All this
> requires a lot of complicated code that basically never gets tested in
> the wild.

This is a simple consequence of the design decision to use hashes as
the search key.  They aren't dense and they will collide.  So the
solution will be a bit fuzzy around the edges.  And maybe that is an
acceptable tradeoff.  But the filesystem should take full
responsibility for it, whether in performance or correctness :-)

> 
> Just settling for a 64bit hash and returning -EEXIST when someone causes
> a collision on creat() sounds more appealing.  Directories with 4
> billion entries will cause problems, but that is hardly news to anyone.
> 

I think you want -EFBIG or -ENOSPC.  -EEXIST sounds just wrong.

But there are alternatives.  e.g. internal chaining.
Insist on a unique 64bit hash for every file.  If the hash is in use,
increment and try again.  On lookup, if the hash leads you to a file
with the wrong name, increment and try again until you find a hole
(hash value that is not stored).  When you delete an entry, leave a
place holder if the next hash is in use.  Conversely if the next hash
is not in use, delete the entry and delete the previous one if it is a
place holder.
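
As a toy user-space model of that scheme (an array stands in for the
on-disk index, the table index stands in for the 64-bit hash, and all
names are made up --- this is not ext3 code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TABLE_BITS	16
#define TABLE_SIZE	(1u << TABLE_BITS)
#define TABLE_MASK	(TABLE_SIZE - 1)

enum slot_state { EMPTY, IN_USE, PLACE_HOLDER };

struct slot {
	enum slot_state state;
	char name[64];
};

static struct slot table[TABLE_SIZE];

/* Stand-in for the real directory hash (ext3 would use its own). */
static uint64_t name_hash(const char *name)
{
	uint64_t h = 1469598103934665603ull;		/* FNV-1a */
	while (*name)
		h = (h ^ (unsigned char)*name++) * 1099511628211ull;
	return h;
}

/* Insert: if the hash is in use, increment and try again.  The value
 * finally claimed is the entry's stable, unique cookie. */
static uint64_t dir_insert(const char *name)
{
	uint64_t h = name_hash(name) & TABLE_MASK;

	while (table[h].state == IN_USE)
		h = (h + 1) & TABLE_MASK;
	table[h].state = IN_USE;
	strncpy(table[h].name, name, sizeof(table[h].name) - 1);
	table[h].name[sizeof(table[h].name) - 1] = '\0';
	return h;
}

/* Lookup: probe forward until the name matches or a hole is found. */
static bool dir_lookup(const char *name, uint64_t *hash_out)
{
	uint64_t h = name_hash(name) & TABLE_MASK;

	while (table[h].state != EMPTY) {
		if (table[h].state == IN_USE &&
		    strcmp(table[h].name, name) == 0) {
			*hash_out = h;
			return true;
		}
		h = (h + 1) & TABLE_MASK;
	}
	return false;
}

/* Delete: leave a place holder if the next hash is occupied; otherwise
 * clear the slot and sweep back over any place holders that are no
 * longer needed to keep probe chains intact. */
static void dir_delete(uint64_t h)
{
	if (table[(h + 1) & TABLE_MASK].state != EMPTY) {
		table[h].state = PLACE_HOLDER;
		return;
	}
	table[h].state = EMPTY;
	h = (h - 1) & TABLE_MASK;
	while (table[h].state == PLACE_HOLDER) {
		table[h].state = EMPTY;
		h = (h - 1) & TABLE_MASK;
	}
}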

Then you get 100% correct semantics and a performance hit in the face
of hash collisions that is probably no worse than that which ext3
currently gets.  It probably does cost you a bit of storage to store
those 64bit hashes, though I suspect some clever compression can help
out there (You only need one bit more than the filename when there is
no chaining).

You have to require 64bit cookies/fpos, but I think that today, that
is a reasonable thing to require (5 years ago it might not have been).

NeilBrown


Re: If not readdir() then what?

2007-04-11 Thread Jörn Engel
On Thu, 12 April 2007 11:46:41 +1000, Neil Brown wrote:
> 
> I could argue that nfs came before ext3+dirindex, so ext3 should have
> been designed to work properly with NFS.  You could argue that fixing
> it in nfsd fixes it for all filesystems.  But I'm not sure either of
> those arguments are likely to be at all convincing...

Caring about a non-ext3 filesystem, I sure would like an nfs solution as
well. :)

> Hmmm. I wonder.  Which is more likely?
>   - That two 64bit hashes from some set are the same
>   - or that 65536 48bit hashes from a set of equal size are the same.

The former.  Each bit going from hash strength to collision chain length
reduces the likelihood of an overflow.  In the extreme case of a 0bit
hash and 64bit collision chain, you need 2^64 entries compared to 2^32
for the other extreme.
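
As a rough back-of-the-envelope check (standard birthday/Poisson
approximations, my numbers): with n entries and a d-bit hash the
chance of any collision is roughly n^2 / 2^(d+1), so a 64-bit hash
expects its first collision somewhere around n = 2^32 entries.
Overflowing a 16-bit chain on a 48-bit hash instead needs a single
48-bit value to be shared by more than 2^16 names; with only n / 2^48
names expected per value, that is astronomically unlikely for any
directory you could actually create.  So the plain 64-bit collision
dominates.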

However, the collision chain gives me quite a bit of headache.  One
would have to store each entry's position on the chain, deal with older
entries getting deleted, newer entries getting removed, etc.  All this
requires a lot of complicated code that basically never gets tested in
the wild.

Just settling for a 64bit hash and returning -EEXIST when someone causes
a collision on creat() sounds more appealing.  Directories with 4
billion entries will cause problems, but that is hardly news to anyone.

Jörn

-- 
Fantasy is more important than knowledge. Knowledge is limited,
while fantasy embraces the whole world.
-- Albert Einstein


Re: If not readdir() then what?

2007-04-11 Thread Neil Brown
On Wednesday April 11, [EMAIL PROTECTED] wrote:
> 
> Actually, no, we can't keep the collision chain count stable across a
> create/delete even while the tree is cached.  At least, not without
> storing a huge amount of state associated with each page.  (It would
> be a lot more work than simply having nfsd keep a fd cache for
> directory streams ;-).

Well, there's the rub, isn't it :-)
You think it is easier to fix the problem in nfsd, and I think it is
easier to fix the problem in ext3.  Both positions are quite
understandable and quite genuine.
And I am quite sure that all the issues that have been raised can be
solved with a bit of effort provided the motivation is there.

I could argue that nfs came before ext3+dirindex, so ext3 should have
been designed to work properly with NFS.  You could argue that fixing
it in nfsd fixes it for all filesystems.  But I'm not sure either of
those arguments are likely to be at all convincing...

Maybe the best compromise is to both fix the 'problem' :-?

Let me explore some designs a bit more...

NFS:
   All we have to do is cache the open files.  This should be only a
   performance issue, not a correctness issue (once we get 64bit
   cookies from ext3).  So the important thing is to cache them for
   a reasonable period of time.

   We currently cache the read-ahead info from regular files (though
   I was hoping that would go away when the page-cache-based readahead
   became a reality).  We can reasonably replace this with caching
   the open files if we are going to do it for directories anyway.

   So the primary key is "struct inode * + loff_t".  This is suitable
   both for file-readahead and for ext3-directory-caching.  Might also
   be useful for a filesystem that stores pre-allocation information in
   the struct-file.
   We keep these in an LRU list and a hash table.  We register a
   callback with register_shrinker (or whatever it is called today) so
   that VM pressure can shrink the cache, and also arrange a timer to
   remove entries older than -- say -- 5 seconds.

   I think that placing a fixed size on the cache based on number of
   active clients would be a mistake, as it is virtually impossible to
   know how many active clients there are, and the number can change
   very dynamically.

   When a filesystem is un-exported (a rare event) we walk the whole
   list and discard entries for that filesystem.

   Look into the possibility of a callback on unlink to drop the
   cached open when the link count hits zero.  I wonder if inotify
   or leases can help with that.

   To help with large NUMA machines, we probably don't want a single
   hash LRU chain, but rather a number of LRU chains.  That way the
   action of moving an entry to the end of the chain is less likely to
   conflict with another processor trying to do the same thing to a
   different entry.  This is the sort of consideration that is already
   handled in the page cache, and having to handle it in every other
   cache is troublesome because the next time a need like that
   arises, the page cache will get fixed but other little caches won't
   until someone like Greg Banks comes along with a big hammer...
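
A rough user-space sketch of such a cache entry, keyed by (inode,
cookie) and expired after ~5 seconds (names invented; a real version
would hold struct file pointers, use per-CPU LRU lists and register a
shrinker as described above):

#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define CACHE_SIZE 64
#define CACHE_TTL  5             /* seconds */

struct dir_cursor_cache_ent {
        const void *inode;       /* stands in for struct inode *      */
        uint64_t    cookie;      /* stands in for the loff_t key      */
        void       *state;       /* cached open file / rbtree / etc.  */
        time_t      last_used;
};

static struct dir_cursor_cache_ent cache[CACHE_SIZE];

/* Find the entry for (inode, cookie); on a miss, hand back the least
 * recently used slot so the caller can re-initialise it. */
static struct dir_cursor_cache_ent *
cursor_cache_lookup(const void *inode, uint64_t cookie)
{
        struct dir_cursor_cache_ent *victim = &cache[0];
        size_t i;

        for (i = 0; i < CACHE_SIZE; i++) {
                if (cache[i].inode == inode && cache[i].cookie == cookie) {
                        cache[i].last_used = time(NULL);
                        return &cache[i];                /* hit */
                }
                if (cache[i].last_used < victim->last_used)
                        victim = &cache[i];
        }
        return victim;                                   /* miss */
}

/* Periodic sweep (or shrinker callback): drop anything idle too long. */
static void cursor_cache_expire(void)
{
        time_t now = time(NULL);
        size_t i;

        for (i = 0; i < CACHE_SIZE; i++)
                if (cache[i].inode && now - cache[i].last_used > CACHE_TTL)
                        cache[i] = (struct dir_cursor_cache_ent){ 0 };
}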

EXT3:
   Have an rbtree for storing directory entries. This is attached to a
   page via the ->private page field.
   Normally each page of a directory has its own rbtree, but when two
   pages contain entries with the same hash, the one rbtree is shared
   between the pages.
   Thus when you load a block you must also load other blocks under
   the same hash, but I think you do that already.

   When you split a block (because it has become too big) the rbtree
   attached to that block is dismantled and each entry is inserted
   into the appropriate new rbtree, one for each of the two blocks.
   The entries are unchanged - they just get placed in a different
   tree - so cursors in the struct file will still be valid.

   Each entry has a count of the number of cursors pointing to it, and
   when this is non-zero, a refcount on the page is held, thus making
   sure the page doesn't get freed and the rbtree lost.  The entry
   should possibly also contain a pointer to the page.. not sure if
   that is needed.

   Each entry in the rbtree contains (in minor_hash) a sequence number
   that is used when multiple entries hash to the same value.  We
   store a 'current-seq-number' in the root of the rbtree and when an
   attempt to insert an entry finds a collision, we increase
   current-seq-number, set the minor_hash to that, and retry the
   insert.
   This minor_hash is combined with some bits of the major hash to
   form the fpos/cookie.

   The releasepage address_space_operation will ensure that all pages
   which share the same major hash are treated as a unit and all released
   at the same time.  So it will fail if any of the pages in the
   group are in use.  If they can all be freed, it will free the
   rbtree for that group.
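
A toy version of that sequence-number assignment (a flat array stands
in for the rbtree; the names and the cookie layout are made up for
illustration, not taken from ext3):

#include <stddef.h>
#include <stdint.h>

struct toy_node {
        uint32_t    major;       /* hash of the name          */
        uint16_t    minor;       /* per-tree sequence number  */
        const char *name;
};

struct toy_tree {
        struct toy_node ent[256];
        size_t          count;
        uint16_t        current_seq;     /* bumped on each collision */
};

static int key_in_use(const struct toy_tree *t, uint32_t major, uint16_t minor)
{
        size_t i;

        for (i = 0; i < t->count; i++)
                if (t->ent[i].major == major && t->ent[i].minor == minor)
                        return 1;
        return 0;
}

/* Insert a name; on a collision, retry with a fresh sequence number,
 * as in the retry loop described above.  Returns the fpos/cookie. */
static uint64_t toy_insert(struct toy_tree *t, uint32_t major, const char *name)
{
        uint16_t minor = 0;

        if (t->count >= 256)
                return (uint64_t)-1;             /* toy table full */

        while (key_in_use(t, major, minor))
                minor = ++t->current_seq;

        t->ent[t->count++] = (struct toy_node){ major, minor, name };

        /* cookie: some bits of the major hash, then the sequence number */
        return ((uint64_t)major << 16) | minor;
}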


   This not only benefits nfsd, which opens 

Re: If not readdir() then what?

2007-04-11 Thread Neil Brown
On Wednesday April 11, [EMAIL PROTECTED] wrote:
> On Thu, 12 Apr 2007, Neil Brown wrote:
> 
> > For the second.
> >  You say that you " would need at least 96 bits in order to make that
> >  guarantee; 64 bits of hash, plus a 32-bit count value in the hash
> >  collision chain".  I think 96 is a bit greedy.  Surely 48 bits of
> >  hash and 16 bits of collision-chain-position would plenty.  You would
> >  need 65537 entries before a collision was even possible, and
> >  billions before it was at all likely. (How big does a set of 48bit
> >  numbers have to get before the probability that "No subset of 65536
> >  numbers are all the same" drops below 0.95?)
> 
> Neil,
>you can get a hash collision with two entries.

You need at least 65537 entries before there is any possibility of
collision between two
   "48-bit-hash ++ 16-bit-sequence-number"
objects where the 16-bit-sequence-number is chosen to be different from all
other 16 bit sequence numbers combined with the same 48 bit hash.
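
For illustration, the packing would be something like this (a sketch
assuming the 48+16 split; not existing ext3 code):

#include <stdint.h>

/* 48 bits of name hash in the high part, 16-bit per-hash sequence
 * number in the low part, forming a 64-bit readdir cookie. */
static inline uint64_t make_cookie(uint64_t hash48, uint16_t seq)
{
        return ((hash48 & 0xffffffffffffULL) << 16) | seq;
}

static inline uint64_t cookie_hash48(uint64_t c) { return c >> 16; }
static inline uint16_t cookie_seq(uint64_t c)    { return (uint16_t)(c & 0xffff); }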

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-11 Thread Jörn Engel
On Wed, 11 April 2007 16:23:21 -0700, H. Peter Anvin wrote:
> David Lang wrote:
> >On Thu, 12 Apr 2007, Neil Brown wrote:
> >
> >>For the second.
> >> You say that you " would need at least 96 bits in order to make that
> >> guarantee; 64 bits of hash, plus a 32-bit count value in the hash
> >> collision chain".  I think 96 is a bit greedy.  Surely 48 bits of
> >> hash and 16 bits of collision-chain-position would plenty.  You would
> >> need 65537 entries before a collision was even possible, and
> >> billions before it was at all likely. (How big does a set of 48bit
> >> numbers have to get before the probability that "No subset of 65536
> >> numbers are all the same" drops below 0.95?)
> >
> >  you can get a hash collision with two entries.
> 
> Yes, but the probability is 2^-n for an n-bit hash, assuming it's 
> uniformly distributed.
> 
> The probability approaches 1/2 as the number of entries hashes 
> approaches 2^(n/2) (birthday number.)

I believe you are both barking up the wrong tree.  Neil proposed a 16bit
collision chain.  With that, it takes 65537 entries before a collision
chain overflow is possible.

Calling a collision chain overflow "collision" is inviting confusion, of
course. :)

Jörn

-- 
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-11 Thread H. Peter Anvin

David Lang wrote:
> On Thu, 12 Apr 2007, Neil Brown wrote:
> 
> > For the second.
> >  You say that you " would need at least 96 bits in order to make that
> >  guarantee; 64 bits of hash, plus a 32-bit count value in the hash
> >  collision chain".  I think 96 is a bit greedy.  Surely 48 bits of
> >  hash and 16 bits of collision-chain-position would plenty.  You would
> >  need 65537 entries before a collision was even possible, and
> >  billions before it was at all likely. (How big does a set of 48bit
> >  numbers have to get before the probability that "No subset of 65536
> >  numbers are all the same" drops below 0.95?)
> 
> Neil,
>   you can get a hash collision with two entries.

Yes, but the probability is 2^-n for an n-bit hash, assuming it's
uniformly distributed.

The probability approaches 1/2 as the number of entries hashes
approaches 2^(n/2) (birthday number.)

	-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-11 Thread Theodore Tso
On Thu, Apr 12, 2007 at 08:32:05AM +1000, Neil Brown wrote:
> For the first:
>   You are storing an internal tree representation of part of the
>   directory attached to the 'struct file'.
>   Would it be possible to store this attached to the page via the
>   ->private pointer?  What would avoid the allocate/create/free
>   overhead on every request.

The reason why we are storing it associated with the file pointer
instead of the page/block is because a filename insertion might
cause a node split, in which case half of the 4k block gets copied to
another block.  We need a stable pointer to where we are in the tree
that can cope with hash collisions, and that's the reason for creating
the red/black tree in the first place, since it *doesn't* get split and
reorganized when the directory's hash tree gets reorg'ed.  So
attaching the tree to the page breaks the reason why we have the
separate data structure in the first place.

>   You suggest caching the open files in nfsd.  While that is probably
>   possible (I've thought of it a number of times) is would also be
>   quite complex, e.g. requiring some sort of call-back to close all
>   those files when the filesystem is unexported.  And it is very easy
>   to get caching heuristics wrong.  Leveraging the page-cache which is
>   a very mature cache seems to make a lot of sense.

Is it really that complex?  The simplest way of handling it is simply
keeping an open directory fd cache in a per-filesystem rbtree, indexed
by file handle and holding the file pointer.  When you unexport the
filesystem, you simply walk the rbtree and close all of the file
descriptors; no callback is required.

The caching heuristics are an issue; but a fixed-size cache with a
simple LFU replacement strategy isn't all that complex to implement.
If 95% of the time the readdirs come in quick succession, even a
small cache will probably provide huge performance gains, and
increasing the cache size past some critical point will probably only
provide marginal improvements.  
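
Something along these lines, as a user-space sketch (a linear scan over
a small array stands in for the per-filesystem rbtree, an int fd for
the kernel's struct file, and all names are invented):

#include <string.h>
#include <unistd.h>

#define FH_SIZE     32           /* opaque NFS file handle bytes     */
#define FDCACHE_LEN 16           /* the fixed-size cache             */

struct dirfd_ent {
        unsigned char fh[FH_SIZE];
        int           fd;        /* open directory, -1 if slot free  */
        unsigned long uses;      /* LFU counter                      */
};

static struct dirfd_ent fdcache[FDCACHE_LEN];

static void dirfd_cache_init(void)
{
        int i;

        for (i = 0; i < FDCACHE_LEN; i++)
                fdcache[i] = (struct dirfd_ent){ .fd = -1 };
}

/* Look up the cached fd for a file handle.  On a miss, evict the least
 * frequently used entry and return -1; the caller then opens the
 * directory and stores the new fd in *slot. */
static int dirfd_cache_get(const unsigned char *fh, struct dirfd_ent **slot)
{
        struct dirfd_ent *victim = &fdcache[0];
        int i;

        for (i = 0; i < FDCACHE_LEN; i++) {
                if (fdcache[i].fd >= 0 &&
                    memcmp(fdcache[i].fh, fh, FH_SIZE) == 0) {
                        fdcache[i].uses++;
                        *slot = &fdcache[i];
                        return fdcache[i].fd;            /* hit */
                }
                if (fdcache[i].uses < victim->uses)
                        victim = &fdcache[i];
        }
        if (victim->fd >= 0)
                close(victim->fd);                       /* evict LFU */
        memcpy(victim->fh, fh, FH_SIZE);
        victim->fd   = -1;
        victim->uses = 1;
        *slot = victim;
        return -1;
}

/* Unexporting the filesystem: walk the cache and close everything. */
static void dirfd_cache_drop_all(void)
{
        int i;

        for (i = 0; i < FDCACHE_LEN; i++) {
                if (fdcache[i].fd >= 0)
                        close(fdcache[i].fd);
                fdcache[i] = (struct dirfd_ent){ .fd = -1 };
        }
}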

> For the second.
>   You say that you " would need at least 96 bits in order to make that
>   guarantee; 64 bits of hash, plus a 32-bit count value in the hash
>   collision chain".  I think 96 is a bit greedy.  Surely 48 bits of 
>   hash and 16 bits of collision-chain-position would plenty.  You would
>   need 65537 entries before a collision was even possible, and
>   billions before it was at all likely. (How big does a set of 48bit
>   numbers have to get before the probability that "No subset of 65536
>   numbers are all the same" drops below 0.95?)
> 
>   This would really require that the collision-chain-index was stable
>   across create/delete.  Doing that while you have the tree in the
>   page cache is probably easy enough.  Doing it across reboots is
>   probably not possible without on-disk changes.

Actually, no, we can't keep the collision chain count stable across a
create/delete even while the tree is cached.  At least, not without
storing a huge amount of state associated with each page.  (It would
be a lot more work than simply having nfsd keep a fd cache for
directory streams ;-).

If we need create/delete stability, probably our only sane
implementation choice is to just stick with a 63-bit hash, and cross
our fingers and hope for the best.  

If nfsd caches the last N used directory fd's, where N is roughly
proportional to the number of active clients, and the clients all only
use the last cookie returned in the readdir reply (since it would be
stupid to use one of the earlier ones and request the server to
re-send something which the client already has), at least in the
absence of telldir/seekdir calls, then that might be quite sufficient,
even if we return multiple directory entries which contain hash
collisions to the client.  As long as the directory fd is cached, and
the client just uses the last cookie to fetch the next batch of
dirents, we'll be fine.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-11 Thread David Lang

On Thu, 12 Apr 2007, Neil Brown wrote:

> For the second.
>  You say that you " would need at least 96 bits in order to make that
>  guarantee; 64 bits of hash, plus a 32-bit count value in the hash
>  collision chain".  I think 96 is a bit greedy.  Surely 48 bits of
>  hash and 16 bits of collision-chain-position would plenty.  You would
>  need 65537 entries before a collision was even possible, and
>  billions before it was at all likely. (How big does a set of 48bit
>  numbers have to get before the probability that "No subset of 65536
>  numbers are all the same" drops below 0.95?)

Neil,
  you can get a hash collision with two entries.

David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-11 Thread Neil Brown
On Wednesday April 11, [EMAIL PROTECTED] wrote:
> On Wed, Apr 11, 2007 at 07:15:57AM +1000, Neil Brown wrote:
> > > Neil, is this correct?
> > 
> > It's just NFS.  nfsd does open/getdents/close on every readdir
> > request.
> 
> That's... unfortunate.

I like to think of it as "challenging" :-)

> 
> 
> So the problem with nfsd doing an open/getdents/close on every readdir
> request is two-fold.  First of all, it means that you are incurring
> extra overhead by forcing ext3 to allocate, create, and free the
> red-black tree on every single NFS readdir request.  Secondly, it
> increases the chances of problems with the hash collision logic.

I guess we need a two-fold solution then  :-)

For the first:
  You are storing an internal tree representation of part of the
  directory attached to the 'struct file'.
  Would it be possible to store this attached to the page via the
  ->private pointer?  That would avoid the allocate/create/free
  overhead on every request.

  You suggest caching the open files in nfsd.  While that is probably
  possible (I've thought of it a number of times) it would also be
  quite complex, e.g. requiring some sort of call-back to close all
  those files when the filesystem is unexported.  And it is very easy
  to get caching heuristics wrong.  Leveraging the page-cache which is
  a very mature cache seems to make a lot of sense.

  The cursor would still be in the 'struct file', but if the second
  fold works, that should be much less critical.

For the second.
  You say that you " would need at least 96 bits in order to make that
  guarantee; 64 bits of hash, plus a 32-bit count value in the hash
  collision chain".  I think 96 is a bit greedy.  Surely 48 bits of 
  hash and 16 bits of collision-chain-position would be plenty.  You would
  need 65537 entries before a collision was even possible, and
  billions before it was at all likely. (How big does a set of 48bit
  numbers have to get before the probability that "No subset of 65536
  numbers are all the same" drops below 0.95?)

  This would really require that the collision-chain-index was stable
  across create/delete.  Doing that while you have the tree in the
  page cache is probably easy enough.  Doing it across reboots is
  probably not possible without on-disk changes.

So while my two-fold solution is not perfect(*), I think it is
substantially better than what we have now, and is extremely unlikely to
ever show an observable problem.
(*)And perfection can only be achieved with two independent indexes.


This requires 64bit cookies which NFSv2 cannot handle.  I think it is
OK to say "tough".  We should probably make sure that READDIRv2 *always*
fails on a directory which is indexed, and possibly on a filesystem
which allows indexed directories.  People can always disable indexing if
they want a v2 export.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-11 Thread Theodore Tso
On Wed, Apr 11, 2007 at 07:15:57AM +1000, Neil Brown wrote:
> > Neil, is this correct?
> 
> It's just NFS.  nfsd does open/getdents/close on every readdir
> request.

That's... unfortunate.

Ext3 w/htree enabled creates an in-memory red-black tree of a portion
of the directory; this is typically one filesystem block worth of
entries, but if there is a hash collision which spills across a
directory block, we will fill the red-black tree until we have read
all of the directory entries that collide on a particular hash value
into the tree.

We then maintain an internal pointer which provides us with exactly
which directory entries we have and haven't returned via
getdents/readdir via a pointer referenced off the file pointer.  So
long as the user never changes filp->f_pos, we can meet the POSIX
guarantees about returning each file (that hasn't been deleted or
added since the last opendir or rewinddir) once and exactly once.  If
the user calls telldir/seekdir, we can no longer make this guarantee
since the telldir cookie can't encapsulate the location inside our AVL
tree.  (We would need at least 96 bits in order to make that
guarantee; 64 bits of hash, plus a 32-bit count value in the hash
collision chain.)  The relevant code which does this is in
ext3_dx_readdir:

#define hash2pos(major, minor)  (major >> 1)
#define pos2maj_hash(pos)   ((pos << 1) & 0xffffffff)
#define pos2min_hash(pos)   (0)

/* Some one has messed with f_pos; reset the world */
if (info->last_pos != filp->f_pos) {
        free_rb_tree_fname(&info->root);
        info->curr_node = NULL;
        info->extra_fname = NULL;
        info->curr_hash = pos2maj_hash(filp->f_pos);
        info->curr_minor_hash = pos2min_hash(filp->f_pos);
}

(Note, we're only using a 31-bit telldir cookie because of NFSv2
compatibility.  If we could know we're using NFSv3/v4 by checking
O_LARGEFILE, we could easily supply a 63-bit telldir cookie.  As I
recall there were some signed/unsigned issues which is why we had to
limit ourselves to a 31-bit/63-bit value).
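
For illustration, one possible 63-bit layout (not the actual ext3
macros; keeping the cookie positive costs the top bit, hence 63 rather
than 64):

#include <stdint.h>

/* 31 bits of major hash in the high half of the cookie, 32 bits of
 * minor hash in the low half. */
#define hash2pos64(major, minor)  ((((uint64_t)(major) >> 1) << 32) | (minor))
#define pos2maj_hash64(pos)       ((((pos) >> 32) << 1) & 0xffffffffULL)
#define pos2min_hash64(pos)       ((pos) & 0xffffffffULL)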


So the problem with nfsd doing an open/getdents/close on every readdir
request is two-fold.  First of all, it means that you are incurring
extra overhead by forcing ext3 to allocate, create, and free the
red-black tree on every single NFS readdir request.  Secondly, it
increases the chances of problems with the hash collision logic.

Yes, right now we're probably courting danger if we try to operate on
a directory with hash collisions because it means that multiple
dirents returned by the NFS server to the NFS client will have the
same NFS readdir cookie (currently supplied by filp->f_pos/llseek).
But if most NFS clients typically use the cookie associated with the
last dirent returned by the readdir RPC, we might be getting away with
potential hash collisions --- and if nfsd were to cache open directory
fd's for some short period of time, we might get away with it even
more, since if the NFS client does a linear readthrough of the
dirents, we might end up winning even though the results might be a
little cheesy.  

It is true that if the NFS client tries to use a telldir cookie
associated with some directory entry not at the end of the returned
block of dirents, and the directory entry is part of a set of
filenames that result in a hash collision, we could end up losing.
But I'm guessing this happens very rarely in practice.

One of the things I did when I was debugging this code way back when
was that I had a debugging hash type which always returned the value
42.  This allowed me to be sure that all of the edge conditions
involving hash collisions could be well tested, and modulo
telldir/seekdir simply not working, I was able to verify that the
filesystem worked correctly, albeit not very efficiently.  I didn't
test to see what would happen with exporting such a filesystem via
NFS.  It might be interesting to try that and then see how many NFS
clients were able to deal with such a filesystem export.  

Given that we haven't gotten any complaints of clients stuck in an
infinite readdir loop, I have to conclude that either NFS clients are
a lot more robust/predictable than we are giving them credit for, or
people aren't exporting via NFS directories big enough for hash
collisions on a 31-bit cookie value to have happened yet, which
seems rather unbelievable.

Regards,

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-11 Thread Jan Engelhardt

On Apr 11 2007 07:15, Neil Brown wrote:
>On Sunday April 8, [EMAIL PROTECTED] wrote:
>> On Sun, 8 April 2007 11:11:20 -0700, H. Peter Anvin wrote:
>> > 
>> > Well, the question is if you can keep the seekdir/telldir cookie around 
>> > as a pointer -- preferrably in userspace, of course.  You would 
>> > presumably garbage-collect them on closedir() -- there is no other point 
>> > at which you could.
>> 
>> Garbage-collecting them on closedir() does not work.  It surprised me as
>> well, but there seem to be applications that keep the telldir() cookie
>> around after closedir().  Iirc, "rm -r" was one of them.
>> 
>> Neil, is this correct?
>
>It's just NFS.  nfsd does open/getdents/close on every readdir
>request.

Uhoh, that sounds even more expensive than doing one getdent during opendir.



Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Bernd Eckenfels
In article <[EMAIL PROTECTED]> you wrote:
> Otherwise, the client would have to cache _all_ previous READDIR results
> since the last opendir()/rewinddir() in order to be able to do its own
> loop detection and that will obviously never scale for large directories
> or for directories that change frequently...

Unless you have a COW style filesystem with versioning (think oracle tables)
you will have to invalidate cookies often or do copies - on client or
server. And I am not sure what's worse (for apps)... disappearing/missing
files or duplicates.

Gruss
bernd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust
On Wed, 2007-04-11 at 08:33 +1000, Neil Brown wrote:

>   A READDIR (aka getdents2) should take a directory handle, a cookie,
>   and a filename, and should return filenames and cookies.  The
>   cookies may all be identical or may not.  The filename might be used
>   by the filesystem, or it might not.
> 
>   Filesystems that require a cursor in the 'struct file' to support
>   (local) getdents cannot be used with NFS.
> 
> While it doesn't make it possible to support all conceivable
> filesystems, it should make it easier for some filesystems to support
> the demands of NFS.

In order to be useful, I think you need to add a demand that the READDIR
call cannot loop back on itself for the case of a series of sequential
reads.

IOW: if a client attempts to step sequentially through the directory,
and is supplying valid filenames+cookies from the preceding READDIR
call, then the next READDIR call should be guaranteed never to loop back
to an earlier entry. Alternatively, if there is a danger that it might
due to some sudden and radical change in the directory layout, then it
should notify the client by returning something like a BAD_COOKIE error.

Otherwise, the client would have to cache _all_ previous READDIR results
since the last opendir()/rewinddir() in order to be able to do its own
loop detection and that will obviously never scale for large directories
or for directories that change frequently...

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Neil Brown
On Tuesday April 10, [EMAIL PROTECTED] wrote:
> On Wed, 2007-04-11 at 07:37 +1000, Neil Brown wrote:
> > On Tuesday April 10, [EMAIL PROTECTED] wrote:
> > > On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> > > > 
> > > > Is there something that makes that interface problematic?
> > > 
> > > File deletions...
> > 
> > How are they a problem ?
> > 
> > There are only two ways to organise a directory.
> > 1/ Unsorted linear list of entries in which no repacking is ever done.
> > 2/ Some data structure using an ordered search key that is based on
> >   the filename (e.g. a B-tree with a search key that is a hash of the
> >   filename). 
> > 
> > In the first case, you just use a fixed opaque cookie for location in
> > a directory.
> > In the second you use the filename.  If the file has been deleted,
> > that shouldn't stop you finding the place where it would have been in
> > the overall sort order.
> 
> I beg to differ. RAMFS is an instance of a filesystem which uses an
> unsorted linear list of entries which is automatically repacked when you
> delete a file.

hmm... yes...
RAMFS uses dcache_readdir which keeps a cursor in the list of entries
such that delete/insert doesn't change the effective location of the
cursor.

Doing that in NFS would require the server keeping an open file on
behalf of the client.  That wouldn't be too hard for NFSv4, except
that you would still need some way to deal with server reboots.

But then RAMFS wouldn't survive a server reboot anyway.

> 
> You may also have filesystems which are only partially sorted. The
> dcache is for instance organised as a hashtable with multiple entries
> per bucket...

That is like the filesystem Bob Copeland described.

And if the chains are not sorted, then I think the only way you could
reliably implement getdents is to use a cursor, and that doesn't map
over NFS very well.

Even if the chains were sorted, they could conceivably be sorted by
position-in-file in which case my 'cookie or filename' would not be
enough.  You need 'cookie and filename'.

So my new position is that:

  A READDIR (aka getdents2) should take a directory handle, a cookie,
  and a filename, and should return filenames and cookies.  The
  cookies may all be identical or may not.  The filename might be used
  by the filesystem, or it might not.

  Filesystems that require a cursor in the 'struct file' to support
  (local) getdents cannot be used with NFS.

While it doesn't make it possible to support all conceivable
filesystems, it should make it easier for some filesystems to support
the demands of NFS.
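
To make the shape of that concrete, a sketch of such an interface (the
name getdents2, the struct and the types are invented here; no such
syscall exists):

#include <stdint.h>
#include <sys/types.h>

struct getdents2_ent {
        uint64_t      cookie;    /* opaque resume cookie for this entry */
        unsigned char d_type;
        char          name[256];
};

/*
 * Resume listing 'dirfd' from just after the entry identified by
 * (prev_cookie, prev_name).  The filesystem may key its lookup off the
 * cookie, the name, or both.  Fills 'buf' with up to 'nr' entries and
 * returns how many were written; 0 means end of directory.
 */
ssize_t getdents2(int dirfd, uint64_t prev_cookie, const char *prev_name,
                  struct getdents2_ent *buf, size_t nr);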

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust
On Wed, 2007-04-11 at 07:37 +1000, Neil Brown wrote:
> On Tuesday April 10, [EMAIL PROTECTED] wrote:
> > On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> > > 
> > > Is there something that makes that interface problematic?
> > 
> > File deletions...
> 
> How are they a problem ?
> 
> There are only two ways to organise a directory.
> 1/ Unsorted linear list of entries in which no repacking is ever done.
> 2/ Some data structure using an ordered search key that is based on
>   the filename (e.g. a B-tree with a search key that is a hash of the
>   filename). 
> 
> In the first case, you just use a fixed opaque cookie for location in
> a directory.
> In the second you use the filename.  If the file has been deleted,
> that shouldn't stop you finding the place where it would have been in
> the overall sort order.

I beg to differ. RAMFS is an instance of a filesystem which uses an
unsorted linear list of entries which is automatically repacked when you
delete a file.

You may also have filesystems which are only partially sorted. The
dcache is for instance organised as a hashtable with multiple entries
per bucket...

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Bob Copeland

On 4/10/07, Neil Brown <[EMAIL PROTECTED]> wrote:

> 2/ Some data structure using an ordered search key that is based on
>   the filename (e.g. a B-tree with a search key that is a hash of the
>   filename).
> 
> In the first case, you just use a fixed opaque cookie for location in
> a directory.
> In the second you use the filename.  If the file has been deleted,
> that shouldn't stop you finding the place where it would have been in
> the overall sort order.


I can think of one (admittedly insane) FS that is between those two:

3/ an unsorted hash table, where each directory entry has an indirect
pointer to its neighbor in case of hash collisions.

  a
  b -> d -> c -> e
  g
  f -> x

Given 'c' as the "last" thing returned, you can hash c to find out
that you are in the bucket with 'b', but if 'c' was deleted, the best
you can do is return b twice or skip the chain entirely.  I maintain
an out-of-tree driver for such an fs (I promise I did not invent it);
the best I could come up with is to encode the hash chain index in the
top byte of f_pos.   Needless to say, readdir is not very performant
with all the seeking this hash scheme entails.
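
The encoding amounts to something like this (helper names invented; the
low bits identify the hash bucket, the top byte the position along its
collision chain):

#include <stdint.h>

static inline uint32_t fpos_encode(uint32_t bucket, uint8_t chain_idx)
{
        return (bucket & 0x00ffffffu) | ((uint32_t)chain_idx << 24);
}

static inline uint32_t fpos_bucket(uint32_t fpos)
{
        return fpos & 0x00ffffffu;
}

static inline uint8_t fpos_chain_idx(uint32_t fpos)
{
        return (uint8_t)(fpos >> 24);
}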

-Bob
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Neil Brown
On Sunday April 8, [EMAIL PROTECTED] wrote:
> On Sun, 8 April 2007 11:11:20 -0700, H. Peter Anvin wrote:
> > 
> > Well, the question is if you can keep the seekdir/telldir cookie around 
> > as a pointer -- preferrably in userspace, of course.  You would 
> > presumably garbage-collect them on closedir() -- there is no other point 
> > at which you could.
> 
> Garbage-collecting them on closedir() does not work.  It surprised me as
> well, but there seem to be applications that keep the telldir() cookie
> around after closedir().  Iirc, "rm -r" was one of them.
> 
> Neil, is this correct?

It's just NFS.  nfsd does open/getdents/close on every readdir
request.
"rm -r" is the application that tends to detect any problems with
cookie handling between the NFS client and server.

(I'm sure there was once an 'rm -r' that used telldir/seekdir, but I
cannot find it.  Maybe it was level-7 UNIX :-)

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Alan Cox
> A READDIR request contains either a cookie or a filename.  Either mean
> "This identifies the last name I got from you, give me the next one".
> 
> Is there something that makes that interface problematic?

Another client may have removed the name and re-added a new file with
that name. At this point most file systems may not have placed it at the
same spot in the directory.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Neil Brown
On Tuesday April 10, [EMAIL PROTECTED] wrote:
> 
> Seems like it would simply make more sense for the server to be allowed 
> to determine what the size of the cookie should be.

That is one possibility.  But if the cookie gets too big, you
substantially reduce the number of entries you can fit in a single
reply.  And if you choose to use the filename as the cookie, you end
up sending the filename twice for each entry, which seems somewhat
pointless. 

> 
> Of course, that doesn't help NFSv2/3/4.0.

No.  And while we can probably deprecate support for NFSv2 (at least
in newer filesystems) I think it will be quite some years before we
can think about doing the same for v3 (in fact we can probably
deprecate 4.0 and 4.1 before v3 :-)
So while it doesn't hurt to plan for the longer-term future, I think
we need to accept that over the next 5 years, filesystems needs to
cope with 64bit dir-entry cookies to be generally useful.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Neil Brown
On Tuesday April 10, [EMAIL PROTECTED] wrote:
> On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> > 
> > Is there something that makes that interface problematic?
> 
> File deletions...

How are they a problem ?

There are only two ways to organise a directory.
1/ Unsorted linear list of entries in which no repacking is ever done.
2/ Some data structure using an ordered search key that is based on
  the filename (e.g. a B-tree with a search key that is a hash of the
  filename). 

In the first case, you just use a fixed opaque cookie for location in
a directory.
In the second you use the filename.  If the file has been deleted,
that shouldn't stop you finding the place where it would have been in
the overall sort order.
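
As a toy illustration (plain C over a sorted array of names, not any real
on-disk format), resuming strictly after the last name returned works the
same whether or not that name still exists:

    #include <stddef.h>
    #include <string.h>

    /* index of the first entry that sorts strictly after 'last' */
    static size_t resume_after(char *const names[], size_t n, const char *last)
    {
            size_t lo = 0, hi = n;

            while (lo < hi) {
                    size_t mid = lo + (hi - lo) / 2;
                    if (strcmp(names[mid], last) <= 0)
                            lo = mid + 1;
                    else
                            hi = mid;
            }
            return lo;      /* n means "past the end", i.e. EOF */
    }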

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Neil Brown
On Monday April 9, [EMAIL PROTECTED] wrote:
> On Mon, Apr 09, 2007 at 08:31:37AM -0400, Trond Myklebust wrote:
> > That is a protocol limitation, not a client limitation.
> 
> <Groan>
> 
> And after quickly checking RFC 3010, I see this limitation hasn't been
> lifted in NFSv4.
> 
> Speaking of which, right now ext3 doesn't know whether it's talking to
> an NFSv2 or NFS v3/v4 server, so it's always passing a 32-bit cookie.
> If NFSv3/v4 could use an explicit interface to request a 64-bit
> cookie, instead of just relying on the f_pos field in the file handle,
> we can reduce the chance of hash collisions when reading an ext3
> directory significantly.   
> 

We don't use f_pos (any more), we call llseek.

I think it would make a lot of sense - as Trond suggests - to not pass
O_LARGEFILE to dentry_open for an NFSv2 request.  Then ext3 could
trigger off that and return 64bits of cookie ... and I think nfsd will
actually pass them all back to the client now.  There is a
truncate-to-32bits bug that has only just been fixed.
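
A rough sketch of what "trigger off that" could mean, with a hypothetical
hash64 standing in for whatever 64-bit name hash the filesystem computes
(illustrative only, not ext3 code):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <fcntl.h>              /* O_LARGEFILE */

    static uint64_t choose_cookie(uint64_t hash64, int open_flags)
    {
            if (open_flags & O_LARGEFILE)
                    return hash64;          /* v3/v4 open: full 64-bit cookie */
            return hash64 >> 32;            /* v2 open: must fit in 32 bits */
    }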

But if a separate call is wanted, we have the export_operations struct
to put it in.  All we need is a good case and a useful specification.


> If there are 2 or 3 directory entries that have a hash collision,
> would the NFS protocol allow the server to juggle things so that those
> 2-3 directory entries with the hash collision are sent back in a
> single readdir RPC reply?  Is it acceptable/legal to have multiple
> entries in the same READDIR reply packet have the same cookie value?

I think Trond has answered this, but I think it is also worth noting
that every entry returned in a READDIR reply includes a cookie, and the
NFS client may use any of those cookies in a subsequent READDIR.  One
might hope the client will only ever use the last, but one can never
be sure.


NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust
On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> On Tuesday April 10, [EMAIL PROTECTED] wrote:
> > The problem is that it is extremely hard to come up with an alternative
> > that doesn't impose new conditions on what filesystems you can support.
> 
> I seem to remember Hans Reiser making a credible suggestion years ago
> when NFSv4 was still in draft.  It didn't fly, but I don't really
> remember why.
> 
> The NFS server gets to either return a cookie like it currently does,
> or sets a flag (or maybe returns a special cookie) which says 'just
> use the name'.
> A READDIR request contains either a cookie or a filename.  Either mean
> "This identifies the last name I got from you, give me the next one".
> 
> Is there something that makes that interface problematic?

File deletions...

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread H. Peter Anvin

Neil Brown wrote:

On Tuesday April 10, [EMAIL PROTECTED] wrote:

The problem is that it is extremely hard to come up with an alternative
that doesn't impose new conditions on what filesystems you can support.


I seem to remember Hans Reiser making a credible suggestion years ago
when NFSv4 was still in draft.  It didn't fly, but I don't really
remember why.

The NFS server gets to either return a cookie like it currently does,
or sets a flag (or maybe returns a special cookie) which says 'just
use the name'.
A READDIR request contains either a cookie or a filename.  Either mean
"This identifies the last name I got from you, give me the next one".

Is there something that makes that interface problematic?


Seems like it would simply make more sense for the server to be allowed 
to determine what the size of the cookie should be.


Of course, that doesn't help NFSv2/3/4.0.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Neil Brown
On Tuesday April 10, [EMAIL PROTECTED] wrote:
> The problem is that it is extremely hard to come up with an alternative
> that doesn't impose new conditions on what filesystems you can support.

I seem to remember Hans Reiser making a credible suggestion years ago
when NFSv4 was still in draft.  It didn't fly, but I don't really
remember why.

The NFS server gets to either return a cookie like it currently does,
or sets a flag (or maybe returns a special cookie) which says 'just
use the name'.
A READDIR request contains either a cookie or a filename.  Either mean
"This identifies the last name I got from you, give me the next one".

Is there something that makes that interface problematic?

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Ulrich Drepper

On 4/10/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:

It rather makes any user space accesses irrelevant.  The main question
seems to be if we can realistically increase the cookie size even to 64
bits.


On 32-bit platforms, *not* using _FILE_OFFSET_BITS=64 is already today
a stupid thing to do.  There are several places where things simply
break in today's world.

Once you do use 53-bit offsets (and we always do this for 64-bit
platforms) telldir returns a 64-bit cookie.  So, no problem.  The one
big issue is that there are no error values for telldir.  There was no
need so far.  But a) we can change this or b) we can really go
stronger in the direction to deprecate 32-bit ops and force people to
update their code.
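
For concreteness, a minimal telldir/seekdir round trip, assuming
Linux/glibc and a build with -D_FILE_OFFSET_BITS=64; with LFS enabled the
d_off positions behind the cookies are 64-bit off_t even on 32-bit hosts
(note that telldir() itself still returns a long):

    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
            DIR *d = opendir(".");
            struct dirent *de;
            long mark;

            if (!d)
                    return 1;
            readdir(d);                     /* consume one entry ... */
            mark = telldir(d);              /* ... and remember where we are */
            while ((de = readdir(d)))
                    printf("%s (d_off=%lld)\n", de->d_name, (long long)de->d_off);
            seekdir(d, mark);               /* back to the remembered spot */
            closedir(d);
            return 0;
    }
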
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Valdis . Kletnieks
On Tue, 10 Apr 2007 17:54:54 +0200, Jan Engelhardt said:

> NFS server sends the whole directory contents on NFS client opendir,
> so that the whole readdir/telldir/seekdir magic can happen on the
> client only... which would perhaps also enable a cheap telldir/seekdir,
> and would also give a 'fixed view' when adding/deleting files
> during readdir.

What should happen if the directory contents go stale?  If *two* systems
cache the directory and then proceed to add/delete files, what happens?




Re: If not readdir() then what?

2007-04-10 Thread H. Peter Anvin

Jan Engelhardt wrote:


I had a thought, but I think it's not quite ripe..

NFS server sends the whole directory contents on NFS client opendir,
so that the whole readdir/telldir/seekdir magic can happen on the
client only... which would perhaps also enable a cheap telldir/seekdir,
and would also give a 'fixed view' when adding/deleting files
during readdir.



That doesn't sound right... what if there are two million files in the 
directory?


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Jan Engelhardt

On Apr 10 2007 10:37, Trond Myklebust wrote:
>
>NFS, OTOH, simply could not work without that requirement, since there
>exists no file pointer to tell you where you are in a stream beyond
>whatever the server manages to encode inside the opaque cookie+verifier.
>
>> But the fact of the matter is that if NFS protocols demands that a
>> per-directory entry cookie can be uniquely and permanently (including
>> across server reboots) identified with a small integer number, it's
>> dreaming.  Filesystem authors will cheat one way or another, because
>> there's nothing else for them to do.  
>
>Few people in the NFS community would disagree that the design of
>READDIR sucks (personally, I consider it to be one of the biggest
>scalability issues we have).
>The problem is that it is extremely hard to come up with an alternative
>that doesn't impose new conditions on what filesystems you can support.

I had a thought, but I think it's not quite ripe..

NFS server sends the whole directory contents on NFS client opendir,
so that the whole readdir/telldir/seekdir magic can happen on the
client only... which would perhaps also enable a cheap telldir/seekdir,
and would also give a 'fixed view' when adding/deleting files
during readdir.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread H. Peter Anvin

Ulrich Drepper wrote:

On 4/10/07, Theodore Tso <[EMAIL PROTECTED]> wrote:

That might work.  But if in the long term we want to separate out what
we can send back via telldir/seekdir, and some future new Posix
interface, [...]


With all these discussions about fixes for telldir, do we want to
persue an alternative interface where the user can explicitly specify
that no telldir/seekdir will ever be used?  From what I know so far it
would make technical sense since we could speed up/reduce the memory
footprint of readdir in 99% of most programs.  But is the benefit
large enough to warrant a second interface?


The real problem sounds like NFS is, in effect, doing telldir/seekdir on 
every directory reference.


It rather makes any user space accesses irrelevant.  The main question 
seems to be if we can realistically increase the cookie size even to 64 
bits.


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust
On Tue, 2007-04-10 at 09:56 -0400, Theodore Tso wrote:


> That might work.  But if in the long term we want to separate out what
> we can send back via telldir/seekdir, and some future new Posix
> interface, I wonder if we might be better off defining a formal
> interface which can be used by NFSv2 and NFSv3/v4 that isn't
> necessarily tied to f_pos.

Note: POSIX does not mandate telldir/seekdir. The latter are only
mandated by the XSI extensions (that are part of the Single Unix spec).

Also note that SuSv3 states

One of the perceived problems of implementation is that
returning to a given point in a directory is quite difficult to
describe formally, in spite of its intuitive appeal, when
systems that use B-trees, hashing functions, or other similar
mechanisms to order their directories are considered. The
definition of seekdir() and telldir() does not specify whether,
when using these interfaces, a given directory entry will be
seen at all, or more than once.

So whereas collisions are not supported, it does appear that the SuSv3
does not mandate that you should be able to replay the exact same
stream.

NFS, OTOH, simply could not work without that requirement, since there
exists no file pointer to tell you where you are in a stream beyond
whatever the server manages to encode inside the opaque cookie+verifier.

> But the fact of the matter is that if NFS protocols demands that a
> per-directory entry cookie can be uniquely and permanently (including
> across server reboots) identified with a small integer number, it's
> dreaming.  Filesystem authors will cheat one way or another, because
> there's nothing else for them to do.  

Few people in the NFS community would disagree that the design of
READDIR sucks (personally, I consider it to be one of the biggest
scalability issues we have).
The problem is that it is extremely hard to come up with an alternative
that doesn't impose new conditions on what filesystems you can support.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Ulrich Drepper

On 4/10/07, Theodore Tso <[EMAIL PROTECTED]> wrote:

That might work.  But if in the long term we want to separate out what
we can send back via telldir/seekdir, and some future new Posix
interface, [...]


With all these discussions about fixes for telldir, do we want to
persue an alternative interface where the user can explicitly specify
that no telldir/seekdir will ever be used?  From what I know so far it
would make technical sense since we could speed up/reduce the memory
footprint of readdir in 99% of most programs.  But is the benefit
large enough to warrant a second interface?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Theodore Tso
On Mon, Apr 09, 2007 at 10:03:15AM -0400, Trond Myklebust wrote:
> We could perhaps teach nfsd to open the file without the O_LARGEFILE
> attribute in the case of NFSv2?

That might work.  But if in the long term we want to separate out what
we can send back via telldir/seekdir, and some future new Posix
interface, I wonder if we might be better off defining a formal
interface which can be used by NFSv2 and NFSv3/v4 that isn't
necessarily tied to f_pos.  Given that the semantics of
telldir/seekdir are different from what NFS needs
(telldir/seekdir cookies don't have to be persistent), it may be
useful to allow filesystems the option of having two separate methods
for how to export this information.  

> Not really.
> 
> However on NFSv3 and v4 there is actually a mechanism for declaring that
> the existing set of cookies have expired and are no longer valid: you
> have an 8-byte opaque 'verifier' which is supplied by the server, and
> which is supposed to be returned by the client on every call to READDIR.
> If the server wants to change its cookie scheme, then it signals it to
> the client by changing its verifier, and returning an error whenever the
> client tries to use the old verifier. Upon receiving that error, the
> client is supposed to clear out all cached cookies, and read the
> directory in again from the start.

I looked at that, and it's not really helpful.  Basically if NFS
demands that cookies never collide, and states that cookies must be
some small (32 or 64) bit value that is persistent across time and
server reboots, then that's fundamentally incompatible with any kind
of non-linear directory structure.  So whether the filesystem is
ext3/htree, or ntfs, or reiserfs, people will be cheating one way or
another.

One of the things which they could do I suppose is use a linear
offset, and then change the verifier every single time there is a
b-tree split or merge which changes the configuration of the tree.  As
you say, though, forcing the client to re-read the entire contents of
the directory each time we change the verifier doesn't scale too well.
But the fact of the matter is that if NFS protocols demands that a
per-directory entry cookie can be uniquely and permanently (including
across server reboots) identified with a small integer number, it's
dreaming.  Filesystem authors will cheat one way or another, because
there's nothing else for them to do.  
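
Sketched very loosely (illustrative types only, nothing resembling the
real nfsd state), that cheat would amount to:

    #include <stdint.h>

    struct dir_state {
            uint64_t verifier;      /* opaque cookie verifier handed to clients */
            uint64_t generation;    /* bumped on every split/merge */
    };

    /* called whenever the tree is reorganised */
    static void dir_reorganised(struct dir_state *ds)
    {
            ds->generation++;
            ds->verifier = ds->generation;  /* all outstanding cookies go stale */
    }

    /* READDIR path: reject cookies minted under an old verifier */
    static int cookie_still_valid(const struct dir_state *ds, uint64_t client_verf)
    {
            return client_verf == ds->verifier;   /* 0 would map to BAD_COOKIE */
    }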

> Note also that we would have to fix the client implementation. Nobody
> has bothered working on the code to handle verifier changes since there
> are no servers out there in the wild that use it.

... which means changing the verifier every node merge/split operation
would probably cause all sorts of interesting breakages, even more
than the occasional hash collision (which as far as I know no one has
complained about so far --- but with the 32-bit cookie, the birthday
paradox states that the probability of a collision is 1 in 65536, so
it's probably happened out in the wild already).
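
(For reference, the standard birthday approximation for an ideal 32-bit
cookie, p ~= 1 - exp(-n(n-1)/2^33), is easy to tabulate with a few lines
of C; link with -lm:)

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
            double space = 4294967296.0;            /* 2^32 possible cookies */
            long sizes[] = { 1000, 10000, 65536, 100000 };

            for (int i = 0; i < 4; i++) {
                    double n = (double)sizes[i];
                    double p = 1.0 - exp(-n * (n - 1.0) / (2.0 * space));
                    printf("%7ld entries: P(collision) ~ %.4f\n", sizes[i], p);
            }
            return 0;
    }

which comes out around 39% for a 65536-entry directory and roughly 69%
at 100000 entries.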

Regards,

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust
On Wed, 2007-04-11 at 07:37 +1000, Neil Brown wrote:
> On Tuesday April 10, [EMAIL PROTECTED] wrote:
> > On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> > > 
> > > Is there something that makes that interface problematic?
> > 
> > File deletions...
> 
> How are they a problem ?
> 
> There are only two ways to organise a directory.
> 1/ Unsorted linear list of entries in which no repacking is ever done.
> 2/ Some data structure using an ordered search key that is based on
>   the filename (e.g. a B-tree with a search key that is a hash of the
>   filename). 
> 
> In the first case, you just use a fixed opaque cookie for location in
> a directory.
> In the second you use the filename.  If the file has been deleted,
> that shouldn't stop you finding the place where it would have been in
> the overall sort order.

I beg to differ. RAMFS is an instance of a filesystem which uses an
unsorted linear list of entries which is automatically repacked when you
delete a file.

You may also have filesystems which are only partially sorted. The
dcache is for instance organised as a hashtable with multiple entries
per bucket...

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Neil Brown
On Tuesday April 10, [EMAIL PROTECTED] wrote:
> On Wed, 2007-04-11 at 07:37 +1000, Neil Brown wrote:
> > On Tuesday April 10, [EMAIL PROTECTED] wrote:
> > > On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> > > > 
> > > > Is there something that makes that interface problematic?
> > > 
> > > File deletions...
> > 
> > How are they a problem ?
> > 
> > There are only two ways to organise a directory.
> > 1/ Unsorted linear list of entries in which no repacking is ever done.
> > 2/ Some data structure using an ordered search key that is based on
> >   the filename (e.g. a B-tree with a search key that is a hash of the
> >   filename). 
> > 
> > In the first case, you just use a fixed opaque cookie for location in
> > a directory.
> > In the second you use the filename.  If the file has been deleted,
> > that shouldn't stop you finding the place where it would have been in
> > the overall sort order.
> 
> I beg to differ. RAMFS is an instance of a filesystem which uses an
> unsorted linear list of entries which is automatically repacked when you
> delete a file.

hmm... yes...
RAMFS uses dcache_readdir which keeps a cursor in the list of entries
such that delete/insert doesn't change the effective location of the
cursor.

Doing that in NFS would require the server keeping an open file on
behalf of the client.  That wouldn't be too hard for NFSv4, except
that you would still need some way to deal with server reboots.

But then RAMFS wouldn't survive a server reboot anyway.

> 
> You may also have filesystems which are only partially sorted. The
> dcache is for instance organised as a hashtable with multiple entries
> per bucket...

That is like the filesystem Bob Copeland described.

And if the chains are not sorted, then I think the only way you could
reliably implement getdents is to use a cursor, and that doesn't map
over NFS very well.

Even if the chains were sorted, they could conceivably be sorted by
position-in-file in which case my 'cookie or filename' would not be
enough.  You need 'cookie and filename'.

So my new position is that:

  A READDIR (aka getdents2) should take a directory handle, a cookie,
  and a filename, and should return filenames and cookies.  The
  cookies may all be identical or may not.  The filename might be used
  by the filesystem, or it might not.

  Filesystems that require a cursor in the 'struct file' to support
  (local) getdents cannot be used with NFS.

While it doesn't make it possible to support all conceivable
filesystems, it should make it easier for some filesystems to support
the demands of NFS.
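
In C terms the proposal is an interface shaped roughly like this (purely
hypothetical names and types, a sketch of the idea rather than any real
syscall):

    #include <stdint.h>
    #include <stddef.h>

    struct dir_entry2 {
            uint64_t cookie;        /* may be the same for every entry */
            char     name[256];
    };

    /*
     * Resume listing 'dirfd' after the entry identified by (last_cookie,
     * last_name); the filesystem may use either, both, or neither.
     * Returns the number of entries placed in 'buf', 0 at EOF, -1 on error.
     */
    int getdents2(int dirfd, uint64_t last_cookie, const char *last_name,
                  struct dir_entry2 *buf, size_t nentries);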

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust
On Wed, 2007-04-11 at 08:33 +1000, Neil Brown wrote:

>   A READDIR (aka getdents2) should take a directory handle, a cookie,
>   and a filename, and should return filenames and cookies.  The
>   cookies may all be identical or may not.  The filename might be used
>   by the filesystem, or it might not.
> 
>   Filesystems that require a cursor in the 'struct file' to support
>   (local) getdents cannot be used with NFS.
> 
> While it doesn't make it possible to support all conceivable
> filesystems, it should make it easier for some filesystems to support
> the demands of NFS.

In order to be useful, I think you need to add a demand that the READDIR
call cannot loop back on itself for the case of a series of sequential
reads.

IOW: if a client attempts to step sequentially through the directory,
and is supplying valid filenames+cookies from the preceding READDIR
call, then the next READDIR call should be guaranteed never to loop back
to an earlier entry. Alternatively, if there is a danger that it might
due to some sudden and radical change in the directory layout, then it
should notify the client by returning something like a BAD_COOKIE error.

Otherwise, the client would have to cache _all_ previous READDIR results
since the last opendir()/rewinddir() in order to be able to do its own
loop detection and that will obviously never scale for large directories
or for directories that change frequently...

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-10 Thread Bernd Eckenfels
In article [EMAIL PROTECTED] you wrote:
> Otherwise, the client would have to cache _all_ previous READDIR results
> since the last opendir()/rewinddir() in order to be able to do its own
> loop detection and that will obviously never scale for large directories
> or for directories that change frequently...

Unless you have a COW style filesystem with versioning (think oracle tables)
you will have to invalidate cookies often or do copies - on client or
server. And I am not sure what's worse (for apps)... disappearing/missing
files or duplicates.

Gruss
bernd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Trond Myklebust
On Mon, 2007-04-09 at 18:34 +0200, Jan Engelhardt wrote:
> On Apr 9 2007 10:03, Trond Myklebust wrote:
> 
> >In practice, though, this sort of behaviour has to be managed carefully
> >by the server. Forcing a client to re-read the entire contents of the
> >directory doesn't really scale too well...
> 
> What does the spec (readdir, and NFS READDIR) say about duplicate entries,
> and what's done in practice?
> Actually, I'd be more concerned about code that does
> 
> while(d = readdir(...))
> append_a_line_to(d);
> 
> (which would be not-so-good on duplicate entries), rather than
> bad scaling.

From SuSv3:

If a file is removed from or added to the directory after the
most recent call to opendir() or rewinddir(), whether a
subsequent call to readdir() returns an entry for that file is
unspecified.

For NFS, the results in the above case would be governed by the caching
rules: the client is not strictly required to revalidate the cache
except on opendir(). However if the client itself is the one changing
the directory, it will in practice cause the cache to be invalidated,
and the directory contents to be read in again from scratch.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Jan Engelhardt

On Apr 9 2007 10:03, Trond Myklebust wrote:

>In practice, though, this sort of behaviour has to be managed carefully
>by the server. Forcing a client to re-read the entire contents of the
>directory doesn't really scale too well...

What does the spec (readdir, and NFS READDIR) say about duplicate entries,
and what's done in practice?
Actually, I'd be more concerned about code that does

while(d = readdir(...))
append_a_line_to(d);

(which would be not-so-good on duplicate entries), rather than
bad scaling.
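
Spelled out (a throwaway sketch, not taken from any real program), the
pattern in question is simply:

    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
            DIR *dir = opendir(".");
            struct dirent *d;

            if (!dir)
                    return 1;
            while ((d = readdir(dir)))
                    printf("%s\n", d->d_name);      /* "append a line" per entry */
            closedir(dir);
            return 0;
    }

and a loop like that will happily emit the same name twice if part of the
stream gets replayed.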


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Trond Myklebust
On Mon, 2007-04-09 at 09:19 -0400, Theodore Tso wrote:
> On Mon, Apr 09, 2007 at 08:31:37AM -0400, Trond Myklebust wrote:
> > On Mon, 2007-04-09 at 13:09 +0200, Jörn Engel wrote:
> > > That surely doesn't make life any easier for filesystem developers, I
> > > agree.  From that point of view, all telldir cookies should end their
> > > life at closedir time.  For "rm -r" it would be sufficient if the nfs
> > > client simply didn't seekdir at all.  For "ls -lR", this would return
> > > duplicate dentries.
> > 
> > Please go read the NFS spec. The only thing an NFS client has in order
> > to read a directory is a READDIR operation that in essence takes a
> > filehandle and a cookie as its arguments. Unless the server is able to
> > return the entire rest of the directory in one RPC reply, the client
> > needs to send a second READDIR operation with a cookie from the previous
> > READDIR operation. The server is expected to return cookies for _each_
> > entry in the directory.
> > 
> > That is a protocol limitation, not a client limitation.
> 
> <Groan>
> 
> And after quickly checking RFC 3010, I see this limitation hasn't been
> lifted in NFSv4.

If we can come up with an interface that makes sense in the context of
NFS, then we should be able to push it into a future minor revision of
NFSv4. It is unfortunately looking too late to push it into v4.1, since
the final drafts of the RFC are already circulating.

> Speaking of which, right now ext3 doesn't know whether it's talking to
> an NFSv2 or NFS v3/v4 server, so it's always passing a 32-bit cookie.
> If NFSv3/v4 could use an explicit interface to request a 64-bit
> cookie, instead of just relying on the f_pos field in the file handle,
> we can reduce the chance of hash collisions when reading an ext3
> directory significantly.   

We could perhaps teach nfsd to open the file without the O_LARGEFILE
attribute in the case of NFSv2?

> If there are 2 or 3 directory entries that have a hash collision,
> would the NFS protocol allow the server to juggle things so that those
> 2-3 directory entries with the hash collision are sent back in a
> single readdir RPC reply?  Is it acceptable/legal to have multiple
> entries in the same READDIR reply packet have the same cookie value?

Not really.

However on NFSv3 and v4 there is actually a mechanism for declaring that
the existing set of cookies have expired and are no longer valid: you
have an 8-byte opaque 'verifier' which is supplied by the server, and
which is supposed to be returned by the client on every call to READDIR.
If the server wants to change its cookie scheme, then it signals it to
the client by changing its verifier, and returning an error whenever the
client tries to use the old verifier. Upon receiving that error, the
client is supposed to clear out all cached cookies, and read the
directory in again from the start.

In practice, though, this sort of behaviour has to be managed carefully
by the server. Forcing a client to re-read the entire contents of the
directory doesn't really scale too well...
Note also that we would have to fix the client implementation. Nobody
has bothered working on the code to handle verifier changes since there
are no servers out there in the wild that use it.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Theodore Tso
On Mon, Apr 09, 2007 at 08:31:37AM -0400, Trond Myklebust wrote:
> On Mon, 2007-04-09 at 13:09 +0200, Jörn Engel wrote:
> > That surely doesn't make life any easier for filesystem developers, I
> > agree.  From that point of view, all telldir cookies should end their
> > life at closedir time.  For "rm -r" it would be sufficient if the nfs
> > client simply didn't seekdir at all.  For "ls -lR", this would return
> > duplicate dentries.
> 
> Please go read the NFS spec. The only thing an NFS client has in order
> to read a directory is a READDIR operation that in essence takes a
> filehandle and a cookie as its arguments. Unless the server is able to
> return the entire rest of the directory in one RPC reply, the client
> needs to send a second READDIR operation with a cookie from the previous
> READDIR operation. The server is expected to return cookies for _each_
> entry in the directory.
> 
> That is a protocol limitation, not a client limitation.

<Groan>

And after quickly checking RFC 3010, I see this limitation hasn't been
lifted in NFSv4.

Speaking of which, right now ext3 doesn't know whether it's talking to
an NFSv2 or NFS v3/v4 server, so it's always passing a 32-bit cookie.
If NFSv3/v4 could use an explicit interface to request a 64-bit
cookie, instead of just relying on the f_pos field in the file handle,
we can reduce the chance of hash collisions when reading an ext3
directory significantly.   

If there are 2 or 3 directory entries that have a hash collision,
would the NFS protocol allow the server to juggle things so that those
2-3 directory entries with the hash collision are sent back in a
single readdir RPC reply?  Is it acceptable/legal to have multiple
entries in the same READDIR reply packet have the same cookie value?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Andreas Schwab
Jörn Engel <[EMAIL PROTECTED]> writes:

> On Sun, 8 April 2007 21:44:26 -0400, Theodore Tso wrote:
>> 
>> Well, Joern thought that rm -rf might relying on the telldir cookie
>> being valid in precisely that circumstance.  If that is true, I'd
>> argue that this is a BUG in GNU coreutils that should be fixed...
>
> I heard it and accepted that claim without checking it.

There is not a single call to telldir/seekdir in the coreutils source.

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Trond Myklebust
On Mon, 2007-04-09 at 13:09 +0200, Jörn Engel wrote:
> That surely doesn't make life any easier for filesystem developers, I
> agree.  From that point of view, all telldir cookies should end their
> life at closedir time.  For "rm -r" it would be sufficient if the nfs
> client simply didn't seekdir at all.  For "ls -lR", this would return
> duplicate dentries.

Please go read the NFS spec. The only thing an NFS client has in order
to read a directory is a READDIR operation that in essence takes a
filehandle and a cookie as its arguments. Unless the server is able to
return the entire rest of the directory in one RPC reply, the client
needs to send a second READDIR operation with a cookie from the previous
READDIR operation. The server is expected to return cookies for _each_
entry in the directory.

That is a protocol limitation, not a client limitation.
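
For anyone without the spec to hand, the shape of the exchange is roughly
the following (a simplified C rendering of the RFC 1813 READDIR structures,
with the file handle and XDR details elided):

#include <stdint.h>

/* Simplified sketch of the NFSv3 READDIR arguments and reply; see
 * RFC 1813 for the real XDR definitions. */
struct readdir3_args_sketch {
        /* directory file handle elided */
        uint64_t cookie;        /* 0 on the first call, else taken from a prior reply */
        uint8_t  cookieverf[8]; /* verifier returned by the server on that reply */
        uint32_t count;         /* maximum size of the reply */
};

struct entry3_sketch {
        uint64_t fileid;
        const char *name;
        uint64_t cookie;        /* every single entry carries its own resume cookie */
        struct entry3_sketch *nextentry;
};

struct readdir3_res_sketch {
        uint8_t  cookieverf[8];
        struct entry3_sketch *entries;
        int      eof;           /* set once the whole directory has been returned */
};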

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Jörn Engel
On Sun, 8 April 2007 21:44:26 -0400, Theodore Tso wrote:
> 
> Well, Joern thought that rm -rf might be relying on the telldir cookie
> being valid in precisely that circumstance.  If that is true, I'd
> argue that this is a BUG in GNU coreutils that should be fixed...

I heard it and accepted that claim without checking it.  Might have been
a mistake.  But the claim came from an NFS developer, which may explain
a thing or two.

NFS clients have to deal with a server rebooting underneath them and
should still behave as expected.  An "rm -r" running on the client
concurrently with a server reboot is indeed a problem, and one that
could be solved with seekdir/telldir.

That surely doesn't make life any easier for filesystem developers, I
agree.  From that point of view, all telldir cookies should end their
life at closedir time.  For "rm -r" it would be sufficient if the nfs
client simply didn't seekdir at all.  For "ls -lR", this would return
duplicate dentries.
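
For what it's worth, a remover that never trusts directory positions at
all is easy to write; here is a userspace sketch (just an illustration,
not what coreutils actually does) that rescans from the start until
nothing is left:

#include <dirent.h>
#include <string.h>
#include <unistd.h>

/* Sketch: empty a directory of plain entries without ever calling
 * telldir/seekdir, so stale cookies after a server reboot cannot
 * matter.  Each pass restarts from the beginning of the directory. */
static int ex_clear_dir(const char *path)
{
        int removed;

        do {
                DIR *dir = opendir(path);
                struct dirent *d;

                if (!dir)
                        return -1;
                removed = 0;
                while ((d = readdir(dir)) != NULL) {
                        if (!strcmp(d->d_name, ".") || !strcmp(d->d_name, ".."))
                                continue;
                        if (unlinkat(dirfd(dir), d->d_name, 0) == 0)
                                removed = 1;
                }
                closedir(dir);
        } while (removed);  /* stop once a full pass removes nothing */

        return 0;
}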

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Trond Myklebust
On Mon, 2007-04-09 at 09:19 -0400, Theodore Tso wrote:
> On Mon, Apr 09, 2007 at 08:31:37AM -0400, Trond Myklebust wrote:
> > On Mon, 2007-04-09 at 13:09 +0200, Jörn Engel wrote:
> > > That surely doesn't make life any easier for filesystem developers, I
> > > agree.  From that point of view, all telldir cookies should end their
> > > life at closedir time.  For "rm -r" it would be sufficient if the nfs
> > > client simply didn't seekdir at all.  For "ls -lR", this would return
> > > duplicate dentries.
> > 
> > Please go read the NFS spec. The only thing an NFS client has in order
> > to read a directory is a READDIR operation that in essence takes a
> > filehandle and a cookie as its arguments. Unless the server is able to
> > return the entire rest of the directory in one RPC reply, the client
> > needs to send a second READDIR operation with a cookie from the previous
> > READDIR operation. The server is expected to return cookies for _each_
> > entry in the directory.
> > 
> > That is a protocol limitation, not a client limitation.
> 
> Groan
> 
> And after quickly checking RFC 3010, I see this limitation hasn't been
> lifted in NFSv4.

If we can come up with an interface that makes sense in the context of
NFS, then we should be able to push it into a future minor revision of
NFSv4. It is unfortunately looking too late to push it into v4.1, since
the final drafts of the RFC are already circulating.

> Speaking of which, right now ext3 doesn't know whether it's talking to
> an NFSv2 or NFS v3/v4 server, so it's always passing a 32-bit cookie.
> If NFSv3/v4 could use an explicit interface to request a 64-bit
> cookie, instead of just relying on the f_pos field in the file handle,
> we can reduce the chance of hash collisions when reading an ext3
> directory significantly.

We could perhaps teach nfsd to open the file without the O_LARGEFILE
flag in the case of NFSv2?
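
The nfsd side of that could be as small as picking the open flags by
protocol version; purely a sketch with an invented helper, since the real
nfsd open path does not look like this:

#define _GNU_SOURCE
#include <fcntl.h>

/* Sketch: let the NFS protocol version decide whether the filesystem
 * may hand back directory positions wider than 32 bits.  Without
 * O_LARGEFILE it would have to stay within a 32-bit f_pos. */
static int ex_dir_open_flags(int nfs_version)
{
        int flags = O_RDONLY | O_DIRECTORY;

        if (nfs_version >= 3)
                flags |= O_LARGEFILE;
        return flags;
}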

> If there are 2 or 3 directory entries that have a hash collision,
> would the NFS protocol allow the server to juggle things so that those
> 2-3 directory entries with the hash collision are sent back in a
> single readdir RPC reply?  Is it acceptable/legal to have multiple
> entries in the same READDIR reply packet have the same cookie value?

Not really.

However on NFSv3 and v4 there is actually a mechanism for declaring that
the existing set of cookies have expired and are no longer valid: you
have an 8-byte opaque 'verifier' which is supplied by the server, and
which is supposed to be returned by the client on every call to READDIR.
If the server wants to change its cookie scheme, then it signals it to
the client by changing its verifier, and returning an error whenever the
client tries to use the old verifier. Upon receiving that error, the
client is supposed to clear out all cached cookies, and read the
directory in again from the start.

In practice, though, this sort of behaviour has to be managed carefully
by the server. Forcing a client to re-read the entire contents of the
directory doesn't really scale too well...
Note also that we would have to fix the client implementation. Nobody
has bothered working on the code to handle verifier changes since there
are no servers out there in the wild that use it.
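
The missing piece is conceptually small; roughly (invented names, not the
actual Linux client code) it comes down to something like:

#include <stdint.h>
#include <string.h>

#define EX_BAD_COOKIE 10003  /* stand-in for the protocol's bad-cookie error */

struct ex_dir_cache {
        uint8_t  verf[8];        /* verifier under which our cookies were issued */
        uint64_t resume_cookie;  /* where the next READDIR should continue */
        int      valid;          /* is the cached directory data trustworthy? */
};

/* Sketch: on a bad-cookie error, adopt the server's new verifier,
 * throw away everything we cached, and restart from cookie 0. */
static void ex_handle_bad_cookie(struct ex_dir_cache *cache,
                                 const uint8_t new_verf[8])
{
        memcpy(cache->verf, new_verf, sizeof(cache->verf));
        cache->resume_cookie = 0;
        cache->valid = 0;
}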

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Jan Engelhardt

On Apr 9 2007 10:03, Trond Myklebust wrote:

> In practice, though, this sort of behaviour has to be managed carefully
> by the server. Forcing a client to re-read the entire contents of the
> directory doesn't really scale too well...

What does the spec (readdir, and NFS READDIR) say about duplicate entries,
and what's done in practice?
Actually, I'd be more concerned about code that does

while(d = readdir(...))
append_a_line_to(d);

(which would be not-so-good on duplicate entries), rather than
bad scaling.


Jan
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: If not readdir() then what?

2007-04-09 Thread Trond Myklebust
On Mon, 2007-04-09 at 18:34 +0200, Jan Engelhardt wrote:
> On Apr 9 2007 10:03, Trond Myklebust wrote:
> 
> > In practice, though, this sort of behaviour has to be managed carefully
> > by the server. Forcing a client to re-read the entire contents of the
> > directory doesn't really scale too well...
> 
> What does the spec (readdir, and NFS READDIR) say about duplicate entries,
> and what's done in practice?
> Actually, I'd be more concerned about code that does
> 
> while(d = readdir(...))
> append_a_line_to(d);
> 
> (which would be not-so-good on duplicate entries), rather than
> bad scaling.

From SuSv3:

If a file is removed from or added to the directory after the
most recent call to opendir() or rewinddir(), whether a
subsequent call to readdir() returns an entry for that file is
unspecified.

For NFS, the results in the above case would be governed by the caching
rules: the client is not strictly required to revalidate the cache
except on opendir(). However if the client itself is the one changing
the directory, it will in practice cause the cache to be invalidated,
and the directory contents to be read in again from scratch.
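
Put as pseudocode, the rule the client follows is roughly this (a sketch
with invented names, not the real Linux NFS client logic):

#include <stdbool.h>

struct ex_nfs_dir {
        bool cache_valid;        /* do we still trust the cached entries? */
        bool locally_modified;   /* did this client add or remove entries itself? */
};

/* Sketch of the rule described above: revalidate at opendir() time,
 * and drop the cache whenever we changed the directory ourselves. */
static bool ex_should_reread(const struct ex_nfs_dir *dir, bool at_opendir)
{
        if (dir->locally_modified)
                return true;               /* our own changes force a fresh read */
        if (at_opendir)
                return !dir->cache_valid;  /* opendir() is the revalidation point */
        return false;                      /* mid-stream, keep using what we have */
}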

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

