subject:"\"If not readdir\\\(\\\) then what\\\?\""

Re: If not readdir() then what?

2007-04-16 Thread Neil Brown

On Monday April 16, [EMAIL PROTECTED] wrote:
> 
> The challenge is making it be stable across inserts/deletes, never
> mind reboots.  And it's not a "little bit of cacheing"; in order to be
> correct we would have to cache *forever*, since at least in theory an
> NFS client could hold on to a cookie for an arbitrarily long period of
> time (weeks, months, years, decades), yes?

Yes.  But I think we've already establish that the on-disk structure
chosen by ext3/htree is not able to perfectly support NFS (which is a
pity given that it was written for Linux and Linux is thought to
support NFS).  Our goal is to find the best mapping possible and,
where caching can improve stability for real-world uses, use caching
to help stabilise that mapping.

> 
> You're welcome to try, but I suspect it won't take long before you'll
> see why I'm asserting that a directory fd cache in nfsd is *way* less
> work.  :-)

You have provided some very helpful insights into how ext3/htree
currently works - thanks for that.
I will definitely make a closer inspection of the code and so how
possible it is to realise by ideas.  I'll let you know how I go.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-16 Thread Theodore Tso

On Mon, Apr 16, 2007 at 04:18:53PM +1000, Neil Brown wrote:
> On Thursday April 12, [EMAIL PROTECTED] wrote:
> > 
> > Unfortunately, in the NFS case if there are hash collisions, under the
> > wrong circumstances the NFS client could loop forever trying to
> > readdir() a directory stream.
> 
> I think that can be easily avoided by the filesystem by simply
> ensuring that the cookies returned in a request are all *after* the
> 'llseek' cookie.  There still maybe some risk of missing or extra
> entries in the directory if the filesystem cannot perform a precise
> cookie->position mapping, but an infinite loop should not be a problem.

It is true that you can replace the infinite loop problem by
guaranteeing that certain filenames are simply always missing.
Neither seemed desirable to me, but it is true that if we have to err
in one direction or another, simply skipping files in the case of a
hash collision is probably a better result.

> You still haven't convinced me that the limited size is a fundamental
> problem.
> In the case of ext3,  I have suggested (various things including)
> using 56bits of the hash and an 8 bit sequence number through any
> hash chain that may eventuate.
> The seq number could not be stable across reboots with the current
> on-disk data, but could be made stable for almost all real usage with
> a little bit of caching (I imagine using upper 4 bits to give a
> sequence number while reading from disk, and lower 4 bits to give a
> sequence number when adding new entries).

The challenge is making it be stable across inserts/deletes, never
mind reboots.  And it's not a "little bit of cacheing"; in order to be
correct we would have to cache *forever*, since at least in theory an
NFS client could hold on to a cookie for an arbitrarily long period of
time (weeks, months, years, decades), yes?

> Supposing I were to try my hand at making the modifications I
> suggested to ext3 so that per page/hashvalue rbtrees were cached in
> the pagecache and were used to provide a stable-as-possible telldir
> cookie.

Well, the first problem you'll run into is that ext3 doesn't store
directories in the page cache.  That's because the journaling layer
fs/jbd operates on a buffer cache basis (i.e., on a per filesystem
block level).  

Secondly, storing the rbtree in the page cache doesn't help, because
the whole point of the rbtree is to have a stable pointer that doesn't
change across a b-tree node split.  Whether you use a page cache or
the buffer cache, it doesn't change the fact that when you do a b-tree
node split, half the directory entries *go* *someplace* *else*.  Put
another way, Linus's standard reasons for wanting to store the
directory in the page cache so we can have better readahead don't
apply when you're using htree, because with htree we're never reading
entries in page cache order.

So you wouldn't be able to store the rbtree in the page cache, since
the doing so would defeat the whole purpose of the having the rbtree
in the first place.  The rbtree needs to be per directory stream,
*not* per directory block.   

You could store a mapping between each directory entry and a sequence
number.  That could work, modulo the need to store these entries
*forever*.  But now you take a performance hit since every time you
create or delete a directory entry, you'll need to keep the cache up
to date.  And since we don't store the hash in the on-disk format
(since you can always calculate it from the filename), you'll need to
store the hash as well in the cache information.  So that means
storing a 64-bit hash plus the sequence counter information --- and if
we assume 32-bit alignment requirements for data structures, it means
the simplest way to do this would be to hang a data structure off of
each directory block which consumes 12 bytes of data for every
directory entry, plus either (a) extra space for pointers so you can
easily insert and delete entries as directory entries are created or
deleted, or (b) resigning yourself to use memmove() to insert and
delete entries in the linear array.  

You're welcome to try, but I suspect it won't take long before you'll
see why I'm asserting that a directory fd cache in nfsd is *way* less
work.  :-)

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-16 Thread Theodore Tso

On Mon, Apr 16, 2007 at 03:47:21PM +1000, Neil Brown wrote:
> "my guess", "pretty much" really bother me.
> 
> It sounds like "The largest anyone has asked for is 128bits, or let's
> double it and hope that is enough until the next protocol revision".
> Which was probably reasonable when NFSv2 was being developed and maybe
> even when v3 was developed, but I kind of hoped we were beyond that.
> 
> If a filesystem wanted to order filenames lexically, it really needs
> 256 *bytes*.  And it is fairly silly having a cookie that big.
> 
> I still thinking that 
>filename + 64bits
> is required and plenty (aka necessary and sufficient).  

Sure, but if you're going to include the filename in the cookie, on
the client side you now have to store a variable-length state, which
probably means you'll need to allocate and free memory each time; and
if the filename is 256 characters, you'll have to send that back in
the next readdir() request. 

If we could get filename + 64bits, sure, that would be great.  I was
just assuming we couldn't get it --- and if we can't get it, 256 bits
is two SHA-1 hashes.  So that's one hash for the filename, and 128
bits for a filesystem's internal hash collision.  There might be other
ways that the space could be divided up, which might be somewhat
wasteful of space --- say you need a host identifier for a clustered
filesystem, although arguably adding a host might be infrequent enough
that you just use the cookie verifier hammer and force the client to
get a new set of readdir cookies.  :-)

> I wouldn't argue against 128bits (64 for a search key and 64 to
> guarantee uniqueness) but I really think 256 excessive with no value.
> We we still need the last-filename in the READDIR key.

I wouldn't complain too much about 128 bits, but if we're going to go
fixed size, I can imagine filesystems where that might not be enough.
And the differecen between 16 and 32 bytes isn't that great.  But I
could easily live with either filename + 64bits, or 128 bits.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-15 Thread Neil Brown

On Thursday April 12, [EMAIL PROTECTED] wrote:
> 
> Unfortunately, in the NFS case if there are hash collisions, under the
> wrong circumstances the NFS client could loop forever trying to
> readdir() a directory stream.

I think that can be easily avoided by the filesystem by simply
ensuring that the cookies returned in a request are all *after* the
'llseek' cookie.  There still maybe some risk of missing or extra
entries in the directory if the filesystem cannot perform a precise
cookie->position mapping, but an infinite loop should not be a problem.

> 
> > This is a simple consequence of the design decision to use hashes as
> > the search key.  They aren't dense and they will collide.  So the
> > solution will be a bit fuzzy around the edges.  And maybe that is an
> > acceptable tradeoff.  But the filesystem should take full
> > responsibility for it, whether in performance or correctness :-)
> 
> Well, we could also say that it is NFS's fault that they used a
> limited size cookie as a fundamental part of their protocol 
> 

You still haven't convinced me that the limited size is a fundamental
problem.
In the case of ext3,  I have suggested (various things including)
using 56bits of the hash and an 8 bit sequence number through any
hash chain that may eventuate.
The seq number could not be stable across reboots with the current
on-disk data, but could be made stable for almost all real usage with
a little bit of caching (I imagine using upper 4 bits to give a
sequence number while reading from disk, and lower 4 bits to give a
sequence number when adding new entries).

Even if you adjusted this to 32bits hash and 32bits seq number to
guarantee that the seq number never overflowed, collisions would still
be sufficiently rare that you wouldn't notice them in performance
measurements, but it might just give the collision handling code some
testing in the real world.

> > But there are alternatives.  e.g. internal chaining.
..
> 
> This solution requires an incompatible file format change. 

I wasn't suggesting that ext3 use internal chaining.  That is
obviously not an option.  I was putting the idea forward for any
filesystem developers who may not have finalised their directory
design yet so they could see a mechanism for indexing directories that
gives 100% correct semantics.

> Again, compared to a directory fd cache, what you're proposing a huge
> hit to the filesystem, and at the moment, given that telldir/seekdir
> is rarely used by everyone else, it's mainly NFS which is the main bad
> actor here by insisting on the use of a small 31/63-bit cookie as a
> condition of protocol correctness.

What I would really like to propose (which may or may not match what
it might have seemed like I was proposing in the past) is that ext3
(and every filesystem) should produce and consume cookies that are as
close to perfect as practically possible.  That means guaranteeing no
infinite loops and minimising the chance of lost or duplicated entries
in the case of completely uncached NFS/READDIR access.

On top of that I propose that it might be appropriate to provide some
caching somewhere on the understanding that is isn't required for
correctness and is there primarily for performance (though it might
smooth over some correctness rough edges).

I think it is best for that caching to reside in the page cache, but I
am not completely against putting fd-caching in nfsd.

Supposing I were to try my hand at making the modifications I
suggested to ext3 so that per page/hashvalue rbtrees were cached in
the pagecache and were used to provide a stable-as-possible telldir
cookie.
Would you be willing to review and comment on such patches, and would
you be open to including them in ext3 (or ext4) if they proved to be
suitably stable, maintainable, efficient, etc??

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-15 Thread Neil Brown

On Sunday April 15, [EMAIL PROTECTED] wrote:
> On Thu, Apr 12, 2007 at 10:35:49AM -0700, H. Peter Anvin wrote:
> > 
> > Any fixed size is too small.  It should be a dynamic size.
> 
> Idally it should be dynamic, but my guess is that if the cookie were a
> fixed 256 bits, it would be sufficient for pretty much all filesystems.  
> 

"my guess", "pretty much" really bother me.

It sounds like "The largest anyone has asked for is 128bits, or let's
double it and hope that is enough until the next protocol revision".
Which was probably reasonable when NFSv2 was being developed and maybe
even when v3 was developed, but I kind of hoped we were beyond that.

If a filesystem wanted to order filenames lexically, it really needs
256 *bytes*.  And it is fairly silly having a cookie that big.

I still thinking that 
   filename + 64bits
is required and plenty (aka necessary and sufficient).  You would need
a truly enormous directory before that would be insufficient for an
off-set based scheme or a hash-plus-collision scheme, and it is
clearly enough for a lexically-ordered scheme.

I wouldn't argue against 128bits (64 for a search key and 64 to
guarantee uniqueness) but I really think 256 excessive with no value.
We we still need the last-filename in the READDIR key.

(Mind you, I'm still wondering why NFSv4 felt the need for 128 *byte*
 file handles so maybe I'm just small minded :-)

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-15 Thread Theodore Tso

On Thu, Apr 12, 2007 at 10:35:49AM -0700, H. Peter Anvin wrote:
> J. Bruce Fields wrote:
> >On Thu, Apr 12, 2007 at 08:21:16AM -0400, Theodore Tso wrote:
> >>Again, compared to a directory fd cache, what you're proposing a huge
> >>hit to the filesystem, and at the moment, given that telldir/seekdir
> >>is rarely used by everyone else, it's mainly NFS which is the main bad
> >>actor here by insisting on the use of a small 31/63-bit cookie as a
> >>condition of protocol correctness.
> >
> >If we want to get bigger cookies into the protocol, then the sooner we
> >start working on that the better  How big is big enough?  And is a
> >larger cookie sufficient on its own?
> >
> 
> Any fixed size is too small.  It should be a dynamic size.

Idally it should be dynamic, but my guess is that if the cookie were a
fixed 256 bits, it would be sufficient for pretty much all filesystems.  

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-12 Thread H. Peter Anvin


J. Bruce Fields wrote:

On Thu, Apr 12, 2007 at 08:21:16AM -0400, Theodore Tso wrote:

Again, compared to a directory fd cache, what you're proposing a huge
hit to the filesystem, and at the moment, given that telldir/seekdir
is rarely used by everyone else, it's mainly NFS which is the main bad
actor here by insisting on the use of a small 31/63-bit cookie as a
condition of protocol correctness.


If we want to get bigger cookies into the protocol, then the sooner we
start working on that the better  How big is big enough?  And is a
larger cookie sufficient on its own?



Any fixed size is too small.  It should be a dynamic size.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-12 Thread J. Bruce Fields

On Thu, Apr 12, 2007 at 08:21:16AM -0400, Theodore Tso wrote:
> Again, compared to a directory fd cache, what you're proposing a huge
> hit to the filesystem, and at the moment, given that telldir/seekdir
> is rarely used by everyone else, it's mainly NFS which is the main bad
> actor here by insisting on the use of a small 31/63-bit cookie as a
> condition of protocol correctness.

If we want to get bigger cookies into the protocol, then the sooner we
start working on that the better  How big is big enough?  And is a
larger cookie sufficient on its own?

--b.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-12 Thread Theodore Tso

On Thu, Apr 12, 2007 at 03:57:41PM +1000, Neil Brown wrote:
> But my perspective is that a solution in nfsd at-best a work-around.
> Caching the whole 'struct file' when there is just a small bit that we
> might want seems like a heavy hammer.  The filesystem is in the best
> place to know what needs to be cached, and it should be the one doing
> the caching.

Sure, but ext3 needs to cache the information in the file handle,
because it's dynamic, per-directory stream information that needs to
be cached.  The fundamental problem is the broken nature of the 64-bit
cookie; it simply isn't big enough.  So what's being disputed is who
gets to pay the cost of that particular design mistake, in POSIX and
in NFS?

In the POSIX case, right now only applications that use
telldir/seekdir pay the cost, which is they might see some repeated
directory entries in the case of hash collisions.

Unfortunately, in the NFS case if there are hash collisions, under the
wrong circumstances the NFS client could loop forever trying to
readdir() a directory stream.

> This is a simple consequence of the design decision to use hashes as
> the search key.  They aren't dense and they will collide.  So the
> solution will be a bit fuzzy around the edges.  And maybe that is an
> acceptable tradeoff.  But the filesystem should take full
> responsibility for it, whether in performance or correctness :-)

Well, we could also say that it is NFS's fault that they used a
limited size cookie as a fundamental part of their protocol 

> But there are alternatives.  e.g. internal chaining.
> Insist on a unique 64bit hash for every file.  If the hash is in use,
> increment and try again.  On lookup, if the hash leads you to a file
> with the wrong name, increment and try again until you find a hole
> (hash value that is not stored).  When you delete an entry, leave a
> place holder if the next hash is in use.  Conversely if the next hash
> is not in use, delete the entry and delete the previous one if it is a
> place holder.

This solution requires an incompatible file format change.  In
addition, it means trying to garbage collect directory entries when at
the beginning or end of a directory block without dragging in the
previous or next directory block.  This is a huge amount of hair, will
screw over performance, and is not compatible with how we do things
today.  

(It also means that you have to store the hash in the directory entry,
which we don't do today, since we can always calculate the hash from
the file name.  But if in some cases the hash is going to be some
small integer plus the calculated hash, you have to bloat the
directory entry by an extra 8 bytes per dirent.)

Again, compared to a directory fd cache, what you're proposing a huge
hit to the filesystem, and at the moment, given that telldir/seekdir
is rarely used by everyone else, it's mainly NFS which is the main bad
actor here by insisting on the use of a small 31/63-bit cookie as a
condition of protocol correctness.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-12 Thread Jörn Engel

On Thu, 12 April 2007 15:57:41 +1000, Neil Brown wrote:
> 
> > However, the collision chain gives me quite a bit of headache.  One
> > would have to store each entry's position on the chain, deal with older
> > entries getting deleted, newer entries getting removed, etc.  All this
> > requires a lot of complicated code that basically never gets tested in
> > the wild.
> 
> This is a simple consequence of the design decision to use hashes as
> the search key.  They aren't dense and they will collide.  So the
> solution will be a bit fuzzy around the edges.  And maybe that is an
> acceptable tradeoff.  But the filesystem should take full
> responsibility for it, whether in performance or correctness :-)

Sure.  And seeing that not using hashes would kill performance long
before 4 billion dentries are reached, there don't seem to be many
downsides to hashing in principle.

> > Just settling for a 64bit hash and returning -EEXIST when someone causes
> > a collision an creat() sounds more appealing.  Directories with 4
> > billion entries will cause problems, but that is hardly news to anyone.
> 
> I think you want -EFBIG or -ENOSPC.  -EEXIST sounds just wrong.

None of them are 100% correct.  But you are right, -ENOSPC seems to do
less harm.

> But there are alternatives.  e.g. internal chaining.
> Insist on a unique 64bit hash for every file.  If the hash is in use,
> increment and try again.  On lookup, if the hash leads you to a file
> with the wrong name, increment and try again until you find a hole
> (hash value that is not stored).  When you delete an entry, leave a
> place holder if the next hash is in use.  Conversely if the next hash
> is not in use, delete the entry and delete the previous one if it is a
> place holder.

That would work and is limited to reasonable complexity.  It still
suffers from getting virtually no testing in the wild and therefore
being one of the dark corners little critters thrive in.  But one can at
least add a config option to fold the hash to 16bit or so.  And cross
fingers that at least one person will occasionally test with that
option.

> You have to require 64bit cookies/fpos, but I think that today, that
> is a reasonable thing to require (5 years ago it might not have been).

Which brings us back to the start of this thread.

Jörn

-- 
Why do musicians compose symphonies and poets write poems?
They do it because life wouldn't have any meaning for them if they didn't.
That's why I draw cartoons.  It's my life.
-- Charles Shultz
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Neil Brown

On Thursday April 12, [EMAIL PROTECTED] wrote:
> On Thu, 12 April 2007 11:46:41 +1000, Neil Brown wrote:
> > 
> > I could argue that nfs came before ext3+dirindex, so ext3 should have
> > been designed to work properly with NFS.  You could argue that fixing
> > it in nfsd fixes it for all filesystems.  But I'm not sure either of
> > those arguments are likely to be at all convincing...
> 
> Caring about a non-ext3 filesystem, I sure would like an nfs solution as
> well. :)

I have a non-ext3 filesystem I care about too.

But my perspective is that a solution in nfsd at-best a work-around.
Caching the whole 'struct file' when there is just a small bit that we
might want seems like a heavy hammer.  The filesystem is in the best
place to know what needs to be cached, and it should be the one doing
the caching.

> 
> > Hmmm. I wonder.  Which is more likely?
> >   - That two 64bit hashes from some set are the same
> >   - or that 65536 48bit hashes from a set of equal size are the same.
> 
> The former.  Each bit going from hash strength to collision chain length
> reduces the likelihood of an overflow.  In the extreme case of a 0bit
> hash and 64bit collision chain, you need 2^64 entries compared to 2^32
> for the other extreme.
> 
> However, the collision chain gives me quite a bit of headache.  One
> would have to store each entry's position on the chain, deal with older
> entries getting deleted, newer entries getting removed, etc.  All this
> requires a lot of complicated code that basically never gets tested in
> the wild.

This is a simple consequence of the design decision to use hashes as
the search key.  They aren't dense and they will collide.  So the
solution will be a bit fuzzy around the edges.  And maybe that is an
acceptable tradeoff.  But the filesystem should take full
responsibility for it, whether in performance or correctness :-)

> 
> Just settling for a 64bit hash and returning -EEXIST when someone causes
> a collision an creat() sounds more appealing.  Directories with 4
> billion entries will cause problems, but that is hardly news to anyone.
> 

I think you want -EFBIG or -ENOSPC.  -EEXIST sounds just wrong.

But there are alternatives.  e.g. internal chaining.
Insist on a unique 64bit hash for every file.  If the hash is in use,
increment and try again.  On lookup, if the hash leads you to a file
with the wrong name, increment and try again until you find a hole
(hash value that is not stored).  When you delete an entry, leave a
place holder if the next hash is in use.  Conversely if the next hash
is not in use, delete the entry and delete the previous one if it is a
place holder.

Then you get 100% correct semantics and a performance hit in the face
of hash collisions that is probably no worse than that which ext3
currently gets.  It probably does cost you a bit of storage to store
those 64bit hashes, though I suspect some clever compression can help
out there (You only need one bit more than the filename when there is
no chaining).

You have to require 64bit cookies/fpos, but I think that today, that
is a reasonable thing to require (5 years ago it might not have been).

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Jörn Engel

On Thu, 12 April 2007 11:46:41 +1000, Neil Brown wrote:
> 
> I could argue that nfs came before ext3+dirindex, so ext3 should have
> been designed to work properly with NFS.  You could argue that fixing
> it in nfsd fixes it for all filesystems.  But I'm not sure either of
> those arguments are likely to be at all convincing...

Caring about a non-ext3 filesystem, I sure would like an nfs solution as
well. :)

> Hmmm. I wonder.  Which is more likely?
>   - That two 64bit hashes from some set are the same
>   - or that 65536 48bit hashes from a set of equal size are the same.

The former.  Each bit going from hash strength to collision chain length
reduces the likelihood of an overflow.  In the extreme case of a 0bit
hash and 64bit collision chain, you need 2^64 entries compared to 2^32
for the other extreme.

However, the collision chain gives me quite a bit of headache.  One
would have to store each entry's position on the chain, deal with older
entries getting deleted, newer entries getting removed, etc.  All this
requires a lot of complicated code that basically never gets tested in
the wild.

Just settling for a 64bit hash and returning -EEXIST when someone causes
a collision an creat() sounds more appealing.  Directories with 4
billion entries will cause problems, but that is hardly news to anyone.

Jörn

-- 
Fantasy is more important than knowledge. Knowledge is limited,
while fantasy embraces the whole world.
-- Albert Einstein
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Neil Brown

On Wednesday April 11, [EMAIL PROTECTED] wrote:
> 
> Actually, no, we can't keep the collision chain count stable across a
> create/delete even while the tree is cached.  At least, not without
> storing a huge amount of state associated with each page.  (It would
> be a lot more work than simply having nfsd keep a fd cache for
> directory streams ;-).

Well, there's the rub, isn't it :-)
You think it is easier to fix the problem in nfsd, and I think it is
easier to fix the problem in ext3.  Both positions are quite
understandable and quite genuine.
And I am quite sure that all the issues that have been raised can be
solved with a bit of effort providing the motivation is there.

I could argue that nfs came before ext3+dirindex, so ext3 should have
been designed to work properly with NFS.  You could argue that fixing
it in nfsd fixes it for all filesystems.  But I'm not sure either of
those arguments are likely to be at all convincing...

Maybe the best compromise is to both fix the 'problem' :-?

Let me explores some designs a bit more..

NFS:
   All we have to do is cache the open files.  This should be only a
   performance issue, not a correctness issue (once we get 64bit
   cookies from ext3).  So the important thing is to cache then for
   a reasonable period of time.

   We currently cache the read-ahead info from regular files (though
   I was hoping that would go away when the page-cache-based readahead
   became a reality).  We can reasonably replace this with caching
   the open files if we are going to do it for directories anyway.

   So the primary key is "struct inode * + loff_t".  This is suitable
   both for file-readahead and for ext3-directory-caching.  Might also
   be useful for filesystem that stores pre-allocation information in
   the struct-file.
   We keep these in an LRU list and a hash table.  We register a
   callback with register_shrinker (or whatever it is called today) so
   that VM pressure can shrink the cache, and also arrange a timer to
   remove entries older than -- say -- 5 seconds.

   I think that placing a fixed size on the cache based on number of
   active clients would be a mistake, as it is virtually impossible to
   know how many active clients there are, and the number can change
   very dynamically.

   When a filesystem is un-exported, (rare event) we walk the whole
   list and discard entries for that filesystem.

   Look into the possibility of a callback on unlink to drop the
   cached open when the link count hits zero I wonder if inotify
   or leases can help with that.

   To help with large NUMA machine, we probably don't want a single
   hash LRU chain, but rather a number of LRU chains.  That way the
   action of moving an entry to the end of the chain is less likely to
   conflict with another processor trying to do the same thing to a
   different entry.  This is the sort of consideration that is already
   handled in the page cache, and having to handle it in every other
   cache is troublesome because the next time a need like that
   arises, the page cache will get fixed but other little caches won't
   until someone like Greg Banks come along with a big hammer...

EXT3:
   Have a rbtree for storing directory entries. This is attached to a
   pages via the ->private page field.
   Normally each page of a directory has it's own rbtree, but when two
   pages contain entries with the same hash, the one rbtree is shared
   between the pages.
   Thus when you load a block you must also load other blocks under
   the same hash, but I think you do that already.

   When you split a block (because it has become too big) the rbtree
   attached to that block is dismantled and each entry is inserted
   into the appropriate new rbtree, one for each of the two blocks.
   The entries are unchanged - they just get placed in a different
   tree - so cursors in the struct file will still be valid.

   Each entry has a count of the number of cursors pointing to it, and
   when this is non-zero, a refcount on the page is held, thus making
   sure the page doesn't get freed and the btree lost.  The entry
   should possibly also contain a pointer to the page.. not sure if
   that is needed.

   Each entry in the rbtree contains (in minor_hash) a sequence number
   that is used when multiple entries hash to the same value.  We
   store a 'current-seq-number' in the root of the rbtree and when an
   attempt to insert an entry finds a collision, we increase
   current-seq-number, set the minor_hash to that, and retry the
   insert.
   This minor_hash is combined with some bits of the major hash to
   form the fpos/cookie.

   The releasepage address_space_operation will check that all pages
   which share the same major hash are treated as a unit, all released
   at the same time.  So it will fail if any of the pages in the
   group are in use.  If they can all be freed, it will free the
   rbtree for that group.

   This not only benefits nfsd, which opens

Re: If not readdir() then what?

2007-04-11 Thread Neil Brown

On Wednesday April 11, [EMAIL PROTECTED] wrote:
> On Thu, 12 Apr 2007, Neil Brown wrote:
> 
> > For the second.
> >  You say that you " would need at least 96 bits in order to make that
> >  guarantee; 64 bits of hash, plus a 32-bit count value in the hash
> >  collision chain".  I think 96 is a bit greedy.  Surely 48 bits of
> >  hash and 16 bits of collision-chain-position would plenty.  You would
> >  need 65537 entries before a collision was even possible, and
> >  billions before it was at all likely. (How big does a set of 48bit
> >  numbers have to get before the probability that "No subset of 65536
> >  numbers are all the same" drops below 0.95?)
> 
> Neil,
>you can get a hash collision with two entries.

You need at least 65537 entries before there is any possibility of
collision between two
   "48-bit-hash ++ 16-bit-sequence-number"
objects where the 16-bit-sequence-number is chosen to be different from all
other 16 bit sequence numbers combined with the same 48 bit hash.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Jörn Engel

On Wed, 11 April 2007 16:23:21 -0700, H. Peter Anvin wrote:
> David Lang wrote:
> >On Thu, 12 Apr 2007, Neil Brown wrote:
> >
> >>For the second.
> >> You say that you " would need at least 96 bits in order to make that
> >> guarantee; 64 bits of hash, plus a 32-bit count value in the hash
> >> collision chain".  I think 96 is a bit greedy.  Surely 48 bits of
> >> hash and 16 bits of collision-chain-position would plenty.  You would
> >> need 65537 entries before a collision was even possible, and
> >> billions before it was at all likely. (How big does a set of 48bit
> >> numbers have to get before the probability that "No subset of 65536
> >> numbers are all the same" drops below 0.95?)
> >
> >  you can get a hash collision with two entries.
> 
> Yes, but the probability is 2^-n for an n-bit hash, assuming it's 
> uniformly distributed.
> 
> The probability approaches 1/2 as the number of entries hashes 
> approaches 2^(n/2) (birthday number.)

I believe you are both barking up the wrong tree.  Neil proposed a 16bit
collision chain.  With that, it takes 65537 entries before a collision
chain overflow is possible.

Calling a collision chain overflow "collision" is inviting confusion, of
course. :)

Jörn

-- 
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread H. Peter Anvin


David Lang wrote:

On Thu, 12 Apr 2007, Neil Brown wrote:


For the second.
 You say that you " would need at least 96 bits in order to make that
 guarantee; 64 bits of hash, plus a 32-bit count value in the hash
 collision chain".  I think 96 is a bit greedy.  Surely 48 bits of
 hash and 16 bits of collision-chain-position would plenty.  You would
 need 65537 entries before a collision was even possible, and
 billions before it was at all likely. (How big does a set of 48bit
 numbers have to get before the probability that "No subset of 65536
 numbers are all the same" drops below 0.95?)


Neil,
  you can get a hash collision with two entries.



Yes, but the probability is 2^-n for an n-bit hash, assuming it's 
uniformly distributed.


The probability approaches 1/2 as the number of entries hashes 
approaches 2^(n/2) (birthday number.)


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Theodore Tso

On Thu, Apr 12, 2007 at 08:32:05AM +1000, Neil Brown wrote:
> For the first:
>   You are storing an internal tree representation of part of the
>   directory attached to the 'struct file'.
>   Would it be possible to store this attached to the page via the
>   ->private pointer?  What would avoid the allocate/create/free
>   overhead on every request.

The reason why we are storing it associated with the file pointer
instead of the page/block is because the a filename insertion might
cause a node split, in which case half of the 4k block gets copied to
another block.  We need a stable pointer to where we are in the tree
that can cope with hash collisions, and that's the reason for creating
red/black tree in the first place, since it *doesn't* get split and
reorganized when the directory's hash tree gets reorg'ed.  So
attaching the tree to the page breaks the reason why we have the
separate data structure in the first place.

>   You suggest caching the open files in nfsd.  While that is probably
>   possible (I've thought of it a number of times) is would also be
>   quite complex, e.g. requiring some sort of call-back to close all
>   those files when the filesystem is unexported.  And it is very easy
>   to get caching heuristics wrong.  Leveraging the page-cache which is
>   a very mature cache seems to make a lot of sense.

Is it really that complex?  The simplest way of handling it is simply
keeping a open directory fd cache in a per-filesystem rbtree index
which is indexed by file handle and contains the file pointer.  When
you unexport the filesystem, you simply walk the rbtree and close all
of the file descriptors; no callback is required.

The caching hueristics are an issue; but fixed-size cache with a
simple LFU replacement strategy isn't all that complex to implement.
If 95% of the time, the readdir's come in quick succession, even a
small cache will probably provide huge performance gains, and
increaing the cache size past some critical point will probably only
provide marginal improvements.  

> For the second.
>   You say that you " would need at least 96 bits in order to make that
>   guarantee; 64 bits of hash, plus a 32-bit count value in the hash
>   collision chain".  I think 96 is a bit greedy.  Surely 48 bits of 
>   hash and 16 bits of collision-chain-position would plenty.  You would
>   need 65537 entries before a collision was even possible, and
>   billions before it was at all likely. (How big does a set of 48bit
>   numbers have to get before the probability that "No subset of 65536
>   numbers are all the same" drops below 0.95?)
> 
>   This would really require that the collision-chain-index was stable
>   across create/delete.  Doing that while you have the tree in the
>   page cache is probably easy enough.  Doing it across reboots is
>   probably not possible without on-disk changes.

Actually, no, we can't keep the collision chain count stable across a
create/delete even while the tree is cached.  At least, not without
storing a huge amount of state associated with each page.  (It would
be a lot more work than simply having nfsd keep a fd cache for
directory streams ;-).

If we need create/delete stability, probably our only sane
implementation choice is to just stick with a 63-bit hash, and cross
our fingers and hope for the best.  

If nfsd caches the last N used directory caches, where N is roughly
proportional to the number of active clients, and the clients all only
use the last cookie returned in the readdir entry (since it would be
stupid to use one of the earlier ones and request the server to
re-send something which the client already has), at least in the
absense of telldir/seekdir calls, then that might be quite sufficient,
even if we return multiple direntory entries which contain hash
collisions to the client.  As long as the directory fd is cached, and
the client just uses the last cookie to fetch the next batch of
dirents, we'll be fine.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread David Lang


On Thu, 12 Apr 2007, Neil Brown wrote:


For the second.
 You say that you " would need at least 96 bits in order to make that
 guarantee; 64 bits of hash, plus a 32-bit count value in the hash
 collision chain".  I think 96 is a bit greedy.  Surely 48 bits of
 hash and 16 bits of collision-chain-position would plenty.  You would
 need 65537 entries before a collision was even possible, and
 billions before it was at all likely. (How big does a set of 48bit
 numbers have to get before the probability that "No subset of 65536
 numbers are all the same" drops below 0.95?)


Neil,
  you can get a hash collision with two entries.

David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Neil Brown

On Wednesday April 11, [EMAIL PROTECTED] wrote:
> On Wed, Apr 11, 2007 at 07:15:57AM +1000, Neil Brown wrote:
> > > Neil, is this correct?
> > 
> > It's just NFS.  nfsd does open/getdents/close on every readdir
> > request.
> 
> That's... unfortunate.

I like to think of it as "challenging" :-)

> 
> 
> So the problem with nfsd doing an open/getdents/close on every readdir
> request is two-fold.  First of all, it means that you are incurring
> extra overhead by forcing ext3 to allocate, create, and free the
> red-black tree on every single NFS readdir request.  Secondly, it
> increases the chances of problems with the hash collision logic.

I guess we need a two-fold solution then  :-)

For the first:
  You are storing an internal tree representation of part of the
  directory attached to the 'struct file'.
  Would it be possible to store this attached to the page via the
  ->private pointer?  What would avoid the allocate/create/free
  overhead on every request.

  You suggest caching the open files in nfsd.  While that is probably
  possible (I've thought of it a number of times) is would also be
  quite complex, e.g. requiring some sort of call-back to close all
  those files when the filesystem is unexported.  And it is very easy
  to get caching heuristics wrong.  Leveraging the page-cache which is
  a very mature cache seems to make a lot of sense.

  The cursor would still be in the 'struct file', but if the second
  second fold works, that should be much less critical.

For the second.
  You say that you " would need at least 96 bits in order to make that
  guarantee; 64 bits of hash, plus a 32-bit count value in the hash
  collision chain".  I think 96 is a bit greedy.  Surely 48 bits of 
  hash and 16 bits of collision-chain-position would plenty.  You would
  need 65537 entries before a collision was even possible, and
  billions before it was at all likely. (How big does a set of 48bit
  numbers have to get before the probability that "No subset of 65536
  numbers are all the same" drops below 0.95?)

  This would really require that the collision-chain-index was stable
  across create/delete.  Doing that while you have the tree in the
  page cache is probably easy enough.  Doing it across reboots is
  probably not possible without on-disk changes.

So while my two-fold solution is not perfect(*), I think it is
substantially better than what we have now, and is extremely unlike to
ever show an observable problem.
(*)And perfection can only be achieved we two independent indexes.

This requires 64bit cookies which NFSv2 cannot handle.  I think it is
OK to say "tough".  We should probably make sure that READDIRv2 *always*
fails on a directory which is indexed, and possibly on a filesystem
which allows index directories.  People can always disable indexing if
they want a v2 export.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Theodore Tso

On Wed, Apr 11, 2007 at 07:15:57AM +1000, Neil Brown wrote:
> > Neil, is this correct?
> 
> It's just NFS.  nfsd does open/getdents/close on every readdir
> request.

That's... unfortunate.

Ext3 w/htree enable creates an in-memory red-black tree of a portion
of the directory; this is typically one filesystem block worth of
entries, but if there is a hash collision which spells across a
directory block, we will fill the red-block tree until we have read in
all of the directory entries that collide on a particular hash value
into the red-black tree.

We then maintain an internal pointer which provides us with exactly
which directory entries we have and haven't returned via
getdents/readdir via a pointer referenced off the file pointer.  So
long as the user never changes filp->f_pos, we can meet the POSIX
guarantees about returning each file (that hasn't been deleted or
added since the last opendir or rewinddir) once and exactly once.  If
the user calls telldir/seekdir, we can no longer make this guarantee
since the telldir cookie can't encapsulate the location inside our AVL
tree.  (We would need at least 96 bits in order to make that
guarantee; 64 bits of hash, plus a 32-bit count value in the hash
collision chain.)  The relevant code which does this is in
ext3_dx_readdir:

#define hash2pos(major, minor)  (major >> 1)
#define pos2maj_hash(pos)   ((pos << 1) & 0x)
#define pos2min_hash(pos)   (0)

/* Some one has messed with f_pos; reset the world */
if (info->last_pos != filp->f_pos) {
free_rb_tree_fname(&info->root);
info->curr_node = NULL;
info->extra_fname = NULL;
info->curr_hash = pos2maj_hash(filp->f_pos);
info->curr_minor_hash = pos2min_hash(filp->f_pos);
}

(Note, we're only using a 31-bit telldir cookie because of NFSv2
compatibility.  If we could know we're using NFSv3/v4, by via checking
O_LARGEFILE, we could easily supply a 63-bit telldir cookie.  As I
recall there were some signed/unsigned issues which is why we had to
limit ourselves to a 31-bit/63-bit value).

So the problem with nfsd doing an open/getdents/close on every readdir
request is two-fold.  First of all, it means that you are incurring
extra overhead by forcing ext3 to allocate, create, and free the
red-black tree on every single NFS readdir request.  Secondly, it
increases the chances of problems with the hash collision logic.

Yes, right now we're probably courting danger if we try to operate on
a directory with hash collisions because it means that multiple
dirents returned by the NFS server to the NFS client will have the
same NFS readdir cookie (currently supplied by filp->f_pos/llseek).
But if most NFS clients typically use the cookie associated with the
last dirent returned by the readdir RPC, we might be getting away with
potential hash collisions --- and if nfsd were to cache open directory
fd's for some short period of time, we might get away with it even
more, since if the NFS client does a linear readthrough of the
dirents, we might end up winning even though the results might be a
little cheasy.  

It is true that if the NFS client tries to use a telldir cookie
associated with some directory entry not at the end of the returned
block of dirents, and the directory entry is part of a set of
filenames that result in a hash collision, we could end up losing.
But I'm guessing this happens very rarely in practice.

One of the things I did when I was debugging this code way back when
was that I had a debugging hash type which always returned the value
42.  This allowed me to be sure that all of the edge conditions
involving hash collisions could be well tested, and modulo
telldir/seekdir simply not working, I was able to verify that the
filesystem worked correctly, albeit not very efficiently.  I didn't
test to see what would happen with exporting such a filesystem via
NFS.  It might be interesting to try that and then see how many NFS
clients were able to deal with such a filesystem export.  

Given that we haven't gotten any complaints of clients stuck in an
infinite readdir loop, I have to conclude that either NFS clients are
a lot more robust/predictable than we are giving them credit for, or
people aren't exporting via NFS directories big enough that hash
collisions given a 31-bit cookie value haven't happened yet, which
seems to be rather unbelievable.

Regards,

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-11 Thread Jan Engelhardt


On Apr 11 2007 07:15, Neil Brown wrote:
>On Sunday April 8, [EMAIL PROTECTED] wrote:
>> On Sun, 8 April 2007 11:11:20 -0700, H. Peter Anvin wrote:
>> > 
>> > Well, the question is if you can keep the seekdir/telldir cookie around 
>> > as a pointer -- preferrably in userspace, of course.  You would 
>> > presumably garbage-collect them on closedir() -- there is no other point 
>> > at which you could.
>> 
>> Garbage-collecting them on closedir() does not work.  It surprised me as
>> well, but there seem to be applications that keep the telldir() cookie
>> around after closedir().  Iirc, "rm -r" was one of them.
>> 
>> Neil, is this correct?
>
>It's just NFS.  nfsd does open/getdents/close on every readdir
>request.

Uhoh, that sounds even more expensive than doing one getdent during opendir.



Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Bernd Eckenfels

In article <[EMAIL PROTECTED]> you wrote:
> Otherwise, the client would have to cache _all_ previous READDIR results
> since the last opendir()/rewinddir() in order to be able to do its own
> loop detection and that will obviously never scale for large directories
> or for directories that change frequently...

Unless you have a COW style filesystem with versioning (think oracle tables)
you will have to invalidate cookies often or do copies - on client or
server. And I am not sure whats worse (for apps)... disappearing/missing
files or duplicates.

Gruss
bernd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust

On Wed, 2007-04-11 at 08:33 +1000, Neil Brown wrote:

>   A READDIR (aka getdents2) should take a directory handle, a cookie,
>   and a filename, and should return filenames and cookies.  The
>   cookies may all be identical or may not.  The filename might be used
>   by the filesystem, or it might not.
> 
>   Filesystems that require a cursor in the 'struct file' to support
>   (local) getdents cannot be used with NFS.
> 
> While it doesn't make it possible to support all conceivable
> filesystems, it should make it easier for some filesystems to support
> the demands of NFS.

In order to be useful, I think you need to add a demand that the READDIR
call cannot loop back on itself for the case of a series of sequential
reads.

IOW: if a client attempts to step sequentially through the directory,
and is supplying valid filenames+cookies from the preceding READDIR
call, then the next READDIR call should be guaranteed never to loop back
to an earlier entry. Alternatively, if there is a danger that it might
due to some sudden and radical change in the directory layout, then it
should notify the client by returning something like a BAD_COOKIE error.

Otherwise, the client would have to cache _all_ previous READDIR results
since the last opendir()/rewinddir() in order to be able to do its own
loop detection and that will obviously never scale for large directories
or for directories that change frequently...

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Neil Brown

On Tuesday April 10, [EMAIL PROTECTED] wrote:
> On Wed, 2007-04-11 at 07:37 +1000, Neil Brown wrote:
> > On Tuesday April 10, [EMAIL PROTECTED] wrote:
> > > On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> > > > 
> > > > Is there something that makes that interface problematic?
> > > 
> > > File deletions...
> > 
> > How are they a problem ?
> > 
> > There are only two ways to organise a directory.
> > 1/ Unsorted linear list of entries in which no repacking is ever done.
> > 2/ Some data structure using an ordered search key that is based on
> >   the filename (e.g. a B-tree with a search key that is a hash of the
> >   filename). 
> > 
> > In the first case, you just use a fixed opaque cookie for location in
> > a directory.
> > In the second you use the filename.  If the file has been deleted,
> > that shouldn't stop you finding the place where it would have been in
> > the overall sort order.
> 
> I beg to differ. RAMFS is an instance of a filesystem which uses an
> unsorted linear list of entries which is automatically repacked when you
> delete a file.

hmm... yes...
RAMFS uses dcache_readdir which keeps a cursor in the list of entries
such that delete/insert doesn't change the effective location of the
cursor.

Doing that in NFS would require the server keeping an open file on
behalf of the client.  That wouldn't be too hard for NFSv4, except
that you would still need some way to deal with server reboots.

But then RAMFS wouldn't survive a server reboot anyway.

> 
> You may also have filesystems which are only partially sorted. The
> dcache is for instance organised as a hashtable with multiple entries
> per bucket...

That is like the filesystem Bob Copeland described.

And if the chains are not sorted, then I think the only way you could
reliably implement getdents is to use a cursor, and that doesn't map
over NFS very well.

Even if the chains were sorted, they could conceivably be sorted by
position-in-file in which case my 'cookie or filename' would not be
enough.  You need 'cookie and filename'.

So my new position is that:

  A READDIR (aka getdents2) should take a directory handle, a cookie,
  and a filename, and should return filenames and cookies.  The
  cookies may all be identical or may not.  The filename might be used
  by the filesystem, or it might not.

  Filesystems that require a cursor in the 'struct file' to support
  (local) getdents cannot be used with NFS.

While it doesn't make it possible to support all conceivable
filesystems, it should make it easier for some filesystems to support
the demands of NFS.

Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust

On Wed, 2007-04-11 at 07:37 +1000, Neil Brown wrote:
> On Tuesday April 10, [EMAIL PROTECTED] wrote:
> > On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> > > 
> > > Is there something that makes that interface problematic?
> > 
> > File deletions...
> 
> How are they a problem ?
> 
> There are only two ways to organise a directory.
> 1/ Unsorted linear list of entries in which no repacking is ever done.
> 2/ Some data structure using an ordered search key that is based on
>   the filename (e.g. a B-tree with a search key that is a hash of the
>   filename). 
> 
> In the first case, you just use a fixed opaque cookie for location in
> a directory.
> In the second you use the filename.  If the file has been deleted,
> that shouldn't stop you finding the place where it would have been in
> the overall sort order.

I beg to differ. RAMFS is an instance of a filesystem which uses an
unsorted linear list of entries which is automatically repacked when you
delete a file.

You may also have filesystems which are only partially sorted. The
dcache is for instance organised as a hashtable with multiple entries
per bucket...

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Bob Copeland


On 4/10/07, Neil Brown <[EMAIL PROTECTED]> wrote:

2/ Some data structure using an ordered search key that is based on
  the filename (e.g. a B-tree with a search key that is a hash of the
  filename).

In the first case, you just use a fixed opaque cookie for location in
a directory.
In the second you use the filename.  If the file has been deleted,
that shouldn't stop you finding the place where it would have been in
the overall sort order.


I can think of one (admittedly insane) FS that is between those two:

3/ an unsorted hash table, where each directory entry has an indirect
pointer to its neighbor in case of hash collisions.

  a
  b -> d -> c -> e
  g
  f -> x

Given 'c' as the "last" thing returned, you can hash c to find out
that you are in the bucket with 'b', but if 'c' was deleted, the best
you can do is return b twice or skip the chain entirely.  I maintain
an out-of-tree driver for such an fs (I promise I did not invent it);
the best I could come up with is to encode the hash chain index in the
top byte of f_pos.   Needless to say, readdir is very not performant
with all the seeking this hash scheme entails.

-Bob
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Neil Brown

On Sunday April 8, [EMAIL PROTECTED] wrote:
> On Sun, 8 April 2007 11:11:20 -0700, H. Peter Anvin wrote:
> > 
> > Well, the question is if you can keep the seekdir/telldir cookie around 
> > as a pointer -- preferrably in userspace, of course.  You would 
> > presumably garbage-collect them on closedir() -- there is no other point 
> > at which you could.
> 
> Garbage-collecting them on closedir() does not work.  It surprised me as
> well, but there seem to be applications that keep the telldir() cookie
> around after closedir().  Iirc, "rm -r" was one of them.
> 
> Neil, is this correct?

It's just NFS.  nfsd does open/getdents/close on every readdir
request.
"rm -r" is the application that tends to detect any problems with
cookie handling between the NFS client and server.

(I'm sure there was once an 'rm -r' that used telldir/seekdir, but I
cannot find it.  Maybe it was level-7 UNIX :-)

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Alan Cox

> A READDIR request contains either a cookie or a filename.  Either mean
> "This identifies the last name I got from you, give me the next one".
> 
> Is there something that makes that interface problematic?

Another client may have removed the name and re-added a new file with
that name. At this point most file systems may not have placed it at the
same spot in the directory.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Neil Brown

On Tuesday April 10, [EMAIL PROTECTED] wrote:
> 
> Seems like it would simply make more sense for the server to be allowed 
> to determine what the size of the cookie should be.

That is one possibility.  but if the cookie gets too big, you
substantially reduce the number of entries you can fit in a single
reply.  And if you choose to use the filename as the cookie, you end
up sending the filename twice for each entry, which seems somewhat
pointless. 

> 
> Of course, that doesn't help NFSv2/3/4.0.

No.  And while we can probably deprecate support for NFSv2 (at least
in newer filesystems) I think it will be quite some years before we
can think about doing the same for v3 (in fact we can probably
deprecate 4.0 and 4.1 before v3 :-)
So while it doesn't hurt to plan for the longer-term future, I think
we need to accept that over the next 5 years, filesystems needs to
cope with 64bit dir-entry cookies to be generally useful.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Neil Brown

On Tuesday April 10, [EMAIL PROTECTED] wrote:
> On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> > 
> > Is there something that makes that interface problematic?
> 
> File deletions...

How are they a problem ?

There are only two ways to organise a directory.
1/ Unsorted linear list of entries in which no repacking is ever done.
2/ Some data structure using an ordered search key that is based on
  the filename (e.g. a B-tree with a search key that is a hash of the
  filename). 

In the first case, you just use a fixed opaque cookie for location in
a directory.
In the second you use the filename.  If the file has been deleted,
that shouldn't stop you finding the place where it would have been in
the overall sort order.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Neil Brown

On Monday April 9, [EMAIL PROTECTED] wrote:
> On Mon, Apr 09, 2007 at 08:31:37AM -0400, Trond Myklebust wrote:
> > That is a protocol limitation, not a client limitation.
> 
> 
> 
> And after quickly checking RFC 3010, I see this limitation hasn't been
> lifted in NFSv4.
> 
> Speaking of which, right now ext3 doesn't know whether it's talking to
> an NFSv2 or NFS v3/v4 server, so it's always passing a 32-bit cookie.
> If NFSv3/v4 could use an explicit interface to request a 64-bit
> cookie, instead of just relying on the f_pos field in the file handle,
> we can reduce the chance of hash collisions when reading an ext3
> directory significantly.   
> 

We don't use f_pos (any more), we call llseek.

I think it would make a lot of sense - as Trond suggests - to not pass
O_LARGEFILE to dentry_open for an NFSv2 request.  Then ext3 could
trigger off that and return 64bits of cookie ... and I think nfsd will
actually pass them all back to the client now.  There is a
truncate-to-32bits bug that has only just been fixed.

But if a separate call is wanted, we have the export_operations struct
to put it in.  All we need is a good case an useful specification.

> If there are 2 or 3 directory entries that have a hash collision,
> would the NFS protocol allow the server to juggle things so that those
> 2-3 directory entries with the hash collision are sent back in a
> single readdir RPC reply?  Is it aceptable/legal to have multiple
> entries in the same READDIR reply packet have the same cookie value?

I think Trond has answered this, but I think it is also worth noting
that every entry returned in a READDIR reply includes a cookie, the
NFS client may use any of those cookies in a subsequent READDIR.  One
might hope the client will only ever use the last, but one can never
be sure

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust

On Wed, 2007-04-11 at 07:12 +1000, Neil Brown wrote:
> On Tuesday April 10, [EMAIL PROTECTED] wrote:
> > The problem is that it is extremely hard to come up with an alternative
> > that doesn't impose new conditions on what filesystems you can support.
> 
> I seem to remember Hans Reiser making a credible suggestion years ago
> when NFSv4 was still in draft.  It didn't fly, but I don't really
> remember why.
> 
> The NFS server gets to either return a cookie like it currently does,
> or sets a flag (or maybe returns a special cookie) which says 'just
> use the name'.
> A READDIR request contains either a cookie or a filename.  Either mean
> "This identifies the last name I got from you, give me the next one".
> 
> Is there something that makes that interface problematic?

File deletions...

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread H. Peter Anvin


Neil Brown wrote:

On Tuesday April 10, [EMAIL PROTECTED] wrote:

The problem is that it is extremely hard to come up with an alternative
that doesn't impose new conditions on what filesystems you can support.


I seem to remember Hans Reiser making a credible suggestion years ago
when NFSv4 was still in draft.  It didn't fly, but I don't really
remember why.

The NFS server gets to either return a cookie like it currently does,
or sets a flag (or maybe returns a special cookie) which says 'just
use the name'.
A READDIR request contains either a cookie or a filename.  Either mean
"This identifies the last name I got from you, give me the next one".

Is there something that makes that interface problematic?


Seems like it would simply make more sense for the server to be allowed 
to determine what the size of the cookie should be.


Of course, that doesn't help NFSv2/3/4.0.

-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Neil Brown

On Tuesday April 10, [EMAIL PROTECTED] wrote:
> The problem is that it is extremely hard to come up with an alternative
> that doesn't impose new conditions on what filesystems you can support.

I seem to remember Hans Reiser making a credible suggestion years ago
when NFSv4 was still in draft.  It didn't fly, but I don't really
remember why.

The NFS server gets to either return a cookie like it currently does,
or sets a flag (or maybe returns a special cookie) which says 'just
use the name'.
A READDIR request contains either a cookie or a filename.  Either mean
"This identifies the last name I got from you, give me the next one".

Is there something that makes that interface problematic?

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Ulrich Drepper


On 4/10/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:

It rather makes any user space accesses irrelevant.  The main question
seems to be if we can realistically increase the cookie size even to 64
bits.


On 32-bit platforms, *not* using _FILE_OFFSET_BITS=64 is already today
a stupid thing to do.  There are several places where things simply
break in today's world.

Once you do use 53-bit offsets (and we always do this for 64-bit
platforms) telldir returns a 64-bit cookie.  So, no problem.  The one
big issue is that there are no error values for telldir.  There was no
need so far.  But a) we can change this or b) we can really go
stronger in the direction to deprecate 32-bit ops and force people to
update their code.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Valdis . Kletnieks

On Tue, 10 Apr 2007 17:54:54 +0200, Jan Engelhardt said:

> NFS server sends the whole directory contents on NFS client opendir,
> so that the whole readdir/telldir/seekdir magic can happen on the
> client only... which would perhaps also enable a cheap telldir/seekdir,
> and would also give a 'fixed view' when adding/deleting files
> during readdir.

What should happen if the directory contents go stale?  If *two* systems
cache the directory and then proceed to add/delete files, what happens?


pgpKUxNQKTp2A.pgp
Description: PGP signature

Re: If not readdir() then what?

2007-04-10 Thread H. Peter Anvin


Jan Engelhardt wrote:


I had a thought, but I think it's not quite ripe..

NFS server sends the whole directory contents on NFS client opendir,
so that the whole readdir/telldir/seekdir magic can happen on the
client only... which would perhaps also enable a cheap telldir/seekdir,
and would also give a 'fixed view' when adding/deleting files
during readdir.



That doesn't sound right... what if there are two million files in the 
directory?


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Jan Engelhardt


On Apr 10 2007 10:37, Trond Myklebust wrote:
>
>NFS, OTOH, simply could not work without that requirement, since there
>exists no file pointer to tell you where you are in a stream beyond
>whatever the server manages to encode inside the opaque cookie+verifier.
>
>> But the fact of the matter is that if NFS protocols demands that a
>> per-directory entry cookie can be uniquely and permanently (including
>> across server reboots) identified with a small integer number, it's
>> dreaming.  Filesystem authors will cheat one way or another, because
>> there's nothing else for them to do.  
>
>Few people in the NFS community would disagree that the design of
>READDIR sucks (personally, I consider it to be one of the biggest
>scalability issues we have).
>The problem is that it is extremely hard to come up with an alternative
>that doesn't impose new conditions on what filesystems you can support.

I had a thought, but I think it's not quite ripe..

NFS server sends the whole directory contents on NFS client opendir,
so that the whole readdir/telldir/seekdir magic can happen on the
client only... which would perhaps also enable a cheap telldir/seekdir,
and would also give a 'fixed view' when adding/deleting files
during readdir.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread H. Peter Anvin


Ulrich Drepper wrote:

On 4/10/07, Theodore Tso <[EMAIL PROTECTED]> wrote:

That might work.  But if in the long term we want to separate out what
we can send back via telldir/seekdir, and some future new Posix
interface, [...]


With all these discussions about fixes for telldir, do we want to
persue an alternative interface where the user can explicitly specify
that no telldir/seekdir will ever be used?  From what I know so far it
would make technical sense since we could speed up/reduce the memory
footprint of readdir in 99% of most programs.  But is the benefit
large enough to warrant a second interface?


The real problem sounds like NFS is, in effect, doing telldir/seekdir on 
every directory reference.


It rather makes any user space accesses irrelevant.  The main question 
seems to be if we can realistically increase the cookie size even to 64 
bits.


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Trond Myklebust

On Tue, 2007-04-10 at 09:56 -0400, Theodore Tso wrote:

> That might work.  But if in the long term we want to separate out what
> we can send back via telldir/seekdir, and some future new Posix
> interface, I wonder if we might be better off defining a formal
> interface which can be used by NFSv2 and NFSv3/v4 that isn't
> necessarily tied to f_pos.

Note: POSIX does not mandate telldir/seekdir. The latter are only
mandated by the XSI extensions (that are part of the Single Unix spec).

Also note that SuSv3 states

One of the perceived problems of implementation is that
returning to a given point in a directory is quite difficult to
describe formally, in spite of its intuitive appeal, when
systems that use B-trees, hashing functions, or other similar
mechanisms to order their directories are considered. The
definition of seekdir() and telldir() does not specify whether,
when using these interfaces, a given directory entry will be
seen at all, or more than once.

So whereas collisions are not supported, it does appear that the SuSv3
does not mandate that you should be able to replay the exact same
stream.

NFS, OTOH, simply could not work without that requirement, since there
exists no file pointer to tell you where you are in a stream beyond
whatever the server manages to encode inside the opaque cookie+verifier.

> But the fact of the matter is that if NFS protocols demands that a
> per-directory entry cookie can be uniquely and permanently (including
> across server reboots) identified with a small integer number, it's
> dreaming.  Filesystem authors will cheat one way or another, because
> there's nothing else for them to do.  

Few people in the NFS community would disagree that the design of
READDIR sucks (personally, I consider it to be one of the biggest
scalability issues we have).
The problem is that it is extremely hard to come up with an alternative
that doesn't impose new conditions on what filesystems you can support.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Ulrich Drepper


On 4/10/07, Theodore Tso <[EMAIL PROTECTED]> wrote:

That might work.  But if in the long term we want to separate out what
we can send back via telldir/seekdir, and some future new Posix
interface, [...]


With all these discussions about fixes for telldir, do we want to
persue an alternative interface where the user can explicitly specify
that no telldir/seekdir will ever be used?  From what I know so far it
would make technical sense since we could speed up/reduce the memory
footprint of readdir in 99% of most programs.  But is the benefit
large enough to warrant a second interface?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-10 Thread Theodore Tso

On Mon, Apr 09, 2007 at 10:03:15AM -0400, Trond Myklebust wrote:
> We could perhaps teach nfsd to open the file without the O_LARGEFILE
> attribute in the case of NFSv2?

That might work.  But if in the long term we want to separate out what
we can send back via telldir/seekdir, and some future new Posix
interface, I wonder if we might be better off defining a formal
interface which can be used by NFSv2 and NFSv3/v4 that isn't
necessarily tied to f_pos.  Given that the semantics for what
telldir/seekdir are different from what what NFS needs
(telldir/seekdir cookies don't have to be persistent), it may be
useful to allow filesystems the option of having two separate options
for how to export this information.  

> Not really.
> 
> However on NFSv3 and v4 there is actually a mechanism for declaring that
> the existing set of cookies have expired and are no longer valid: you
> have an 8-byte opaque 'verifier' which is supplied by the server, and
> which is supposed to be returned by the client on every call to READDIR.
> If the server wants to change its cookie scheme, then it signals it to
> the client by changing its verifier, and returning an error whenever the
> client tries to use the old verifier. Upon receiving that error, the
> client is supposed to clear out all cached cookies, and read the
> directory in again from the start.

I looked at that, and it's not really helpful.  Basically if NFS
demands that cookies never collide, and states that cookies must be
some small (32 or 64) bit value that are persistent across time and
server reboots, then that's fundmaentally incompatible with any kind
of non-linear directory structure.  So whether the filesystem is
ext3/htree, or ntfs, or reiserfs, people will be cheating one way or
another.

One of the things which they could do I suppose is use a linear
offset, and then change the verifier every single time there is a
b-tree split or merge which changes the configuration of the tree.  As
you say, though, forcing the client to re-read the entire contents of
the directory each time we change the verifier doesn't scale too well.
But the fact of the matter is that if NFS protocols demands that a
per-directory entry cookie can be uniquely and permanently (including
across server reboots) identified with a small integer number, it's
dreaming.  Filesystem authors will cheat one way or another, because
there's nothing else for them to do.  

> Note also that we would have to fix the client implementation. Nobody
> has bothered working on the code to handle verifier changes since there
> are no servers out there in the wild that use it.

... which means changing the verifier every node merge/split operation
would probably cause all sorts of interesting breakages, even more
than the occasional hash collision (which as far as I know no one has
complained about so far --- but with the 32-bit cookie, the birthday
paradox states that the probability of a collision is 1 in 65536, so
it's probably happened out in the wild already).

Regards,

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-09 Thread Trond Myklebust

On Mon, 2007-04-09 at 18:34 +0200, Jan Engelhardt wrote:
> On Apr 9 2007 10:03, Trond Myklebust wrote:
> 
> >In practice, though, this sort of behaviour has to be managed carefully
> >by the server. Forcing a client to re-read the entire contents of the
> >directory doesn't really scale too well...
> 
> What does the spec (readdir, and NFS READDIR) say about duplicate entries,
> and what's done in practice?
> Actually, I'd be more concerned about code that does
> 
> while(d = readdir(...))
> append_a_line_to(d);
> 
> (which would be not-so-good on duplicate entries), rather than
> bad scaling.

>From SuSv3:

If a file is removed from or added to the directory after the
most recent call to opendir() or rewinddir(), whether a
subsequent call to readdir() returns an entry for that file is
unspecified.

For NFS, the results in the above case would be governed by the caching
rules: the client is not strictly required to revalidate the cache
except on opendir(). However if the client itself is the one changing
the directory, it will in practice cause the cache to be invalidated,
and the directory contents to be read in again from scratch.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-09 Thread Jan Engelhardt

On Apr 9 2007 10:03, Trond Myklebust wrote:

>In practice, though, this sort of behaviour has to be managed carefully
>by the server. Forcing a client to re-read the entire contents of the
>directory doesn't really scale too well...

What does the spec (readdir, and NFS READDIR) say about duplicate entries,
and what's done in practice?
Actually, I'd be more concerned about code that does

while(d = readdir(...))
append_a_line_to(d);

(which would be not-so-good on duplicate entries), rather than
bad scaling.

Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-09 Thread Trond Myklebust

On Mon, 2007-04-09 at 09:19 -0400, Theodore Tso wrote:
> On Mon, Apr 09, 2007 at 08:31:37AM -0400, Trond Myklebust wrote:
> > On Mon, 2007-04-09 at 13:09 +0200, Jörn Engel wrote:
> > > That surely doesn't make life any easier for filesystem developers, I
> > > agree.  From that point of view, all telldir cookies should end their
> > > life at closedir time.  For "rm -r" it would be sufficient if the nfs
> > > client simply didn't seekdir at all.  For "ls -lR", this would return
> > > duplicate dentries.
> > 
> > Please go read the NFS spec. The only thing an NFS client has in order
> > to read a directory is a READDIR operation that in essence takes a
> > filehandle and a cookie as its arguments. Unless the server is able to
> > return the entire rest of the directory in one RPC reply, the client
> > needs to send a second READDIR operation with a cookie from the previous
> > READDIR operation. The server is expected to return cookies for _each_
> > entry in the directory.
> > 
> > That is a protocol limitation, not a client limitation.
> 
> 
> 
> And after quickly checking RFC 3010, I see this limitation hasn't been
> lifted in NFSv4.

If we can come up with an interface that makes sense in the context of
NFS, then we should be able to push it into a future minor revision of
NFSv4. It is unfortunately looking too late to push it into v4.1, since
the final drafts of the RFC are already circulating.

> Speaking of which, right now ext3 doesn't know whether it's talking to
> an NFSv2 or NFS v3/v4 server, so it's always passing a 32-bit cookie.
> If NFSv3/v4 could use an explicit interface to request a 64-bit
> cookie, instead of just relying on the f_pos field in the file handle,
> we can reduce the chance of hash collisions when reading an ext3
> directory significantly.   

We could perhaps teach nfsd to open the file without the O_LARGEFILE
attribute in the case of NFSv2?

> If there are 2 or 3 directory entries that have a hash collision,
> would the NFS protocol allow the server to juggle things so that those
> 2-3 directory entries with the hash collision are sent back in a
> single readdir RPC reply?  Is it aceptable/legal to have multiple
> entries in the same READDIR reply packet have the same cookie value?

Not really.

However on NFSv3 and v4 there is actually a mechanism for declaring that
the existing set of cookies have expired and are no longer valid: you
have an 8-byte opaque 'verifier' which is supplied by the server, and
which is supposed to be returned by the client on every call to READDIR.
If the server wants to change its cookie scheme, then it signals it to
the client by changing its verifier, and returning an error whenever the
client tries to use the old verifier. Upon receiving that error, the
client is supposed to clear out all cached cookies, and read the
directory in again from the start.

In practice, though, this sort of behaviour has to be managed carefully
by the server. Forcing a client to re-read the entire contents of the
directory doesn't really scale too well...
Note also that we would have to fix the client implementation. Nobody
has bothered working on the code to handle verifier changes since there
are no servers out there in the wild that use it.

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-09 Thread Theodore Tso

On Mon, Apr 09, 2007 at 08:31:37AM -0400, Trond Myklebust wrote:
> On Mon, 2007-04-09 at 13:09 +0200, Jörn Engel wrote:
> > That surely doesn't make life any easier for filesystem developers, I
> > agree.  From that point of view, all telldir cookies should end their
> > life at closedir time.  For "rm -r" it would be sufficient if the nfs
> > client simply didn't seekdir at all.  For "ls -lR", this would return
> > duplicate dentries.
> 
> Please go read the NFS spec. The only thing an NFS client has in order
> to read a directory is a READDIR operation that in essence takes a
> filehandle and a cookie as its arguments. Unless the server is able to
> return the entire rest of the directory in one RPC reply, the client
> needs to send a second READDIR operation with a cookie from the previous
> READDIR operation. The server is expected to return cookies for _each_
> entry in the directory.
> 
> That is a protocol limitation, not a client limitation.

And after quickly checking RFC 3010, I see this limitation hasn't been
lifted in NFSv4.

Speaking of which, right now ext3 doesn't know whether it's talking to
an NFSv2 or NFS v3/v4 server, so it's always passing a 32-bit cookie.
If NFSv3/v4 could use an explicit interface to request a 64-bit
cookie, instead of just relying on the f_pos field in the file handle,
we can reduce the chance of hash collisions when reading an ext3
directory significantly.   

If there are 2 or 3 directory entries that have a hash collision,
would the NFS protocol allow the server to juggle things so that those
2-3 directory entries with the hash collision are sent back in a
single readdir RPC reply?  Is it aceptable/legal to have multiple
entries in the same READDIR reply packet have the same cookie value?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-09 Thread Andreas Schwab

Jörn Engel <[EMAIL PROTECTED]> writes:

> On Sun, 8 April 2007 21:44:26 -0400, Theodore Tso wrote:
>> 
>> Well, Joern thought that rm -rf might relying on the telldir cookie
>> being valid in precisely that circumstance.  If that is true, I'd
>> argue that this is a BUG in GNU coreutils that should be fixed...
>
> I heard it and accepted that claim without checking it.

There is not a single call to telldir/seekdir in the coreutils source.

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-09 Thread Trond Myklebust

On Mon, 2007-04-09 at 13:09 +0200, Jörn Engel wrote:
> That surely doesn't make life any easier for filesystem developers, I
> agree.  From that point of view, all telldir cookies should end their
> life at closedir time.  For "rm -r" it would be sufficient if the nfs
> client simply didn't seekdir at all.  For "ls -lR", this would return
> duplicate dentries.

Please go read the NFS spec. The only thing an NFS client has in order
to read a directory is a READDIR operation that in essence takes a
filehandle and a cookie as its arguments. Unless the server is able to
return the entire rest of the directory in one RPC reply, the client
needs to send a second READDIR operation with a cookie from the previous
READDIR operation. The server is expected to return cookies for _each_
entry in the directory.

That is a protocol limitation, not a client limitation.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-09 Thread Trond Myklebust

On Mon, 2007-04-09 at 13:09 +0200, Jörn Engel wrote:
> That surely doesn't make life any easier for filesystem developers, I
> agree.  From that point of view, all telldir cookies should end their
> life at closedir time.  For "rm -r" it would be sufficient if the nfs
> client simply didn't seekdir at all.  For "ls -lR", this would return
> duplicate dentries.

Please go read the NFS spec. The only thing an NFS client has in order
to read a directory is a READDIR operation that in essence takes a
filehandle and a cookie as its arguments. Unless the server is able to
return the entire directory in one RPC reply, the client That is a
protocol limitation, not a client limitation.

Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-09 Thread Jörn Engel

On Sun, 8 April 2007 21:44:26 -0400, Theodore Tso wrote:
> 
> Well, Joern thought that rm -rf might relying on the telldir cookie
> being valid in precisely that circumstance.  If that is true, I'd
> argue that this is a BUG in GNU coreutils that should be fixed...

I heard it and accepted that claim without checking it.  Might have been
a mistake.  But the claim came from an NFS developer, which may explain
a thing or two.

NFS clients have to deal with a server rebooting underneith them and
should still behave as expected.  An "rm -r" running on the client
concurrently to a rebooting server is a problem indeed and could be
solved with seekdir/telldir.

That surely doesn't make life any easier for filesystem developers, I
agree.  From that point of view, all telldir cookies should end their
life at closedir time.  For "rm -r" it would be sufficient if the nfs
client simply didn't seekdir at all.  For "ls -lR", this would return
duplicate dentries.

Jörn

-- 
My second remark is that our intellectual powers are rather geared to
master static relations and that our powers to visualize processes
evolving in time are relatively poorly developed.
-- Edsger W. Dijkstra
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread Theodore Tso

On Sun, Apr 08, 2007 at 12:28:32PM -0700, H. Peter Anvin wrote:
> Theodore Tso wrote:
> >It doesn't state explicitly that you can use the telldir cookie()
> >after closing the directory stream using closedir() and then reopening
> >it using opendir(), but given that it states that results are
> >undefined after a rewinddir() --- which is much less violent than a
> >closedir()/opendir(), I would definitely argue that an application
> >programmer would be very ill-advised to rely on this working.
> >
> >(Of course, I'd argue that an application programmer shouldn't use
> >telldir/seekdir at all.)
> >
> >Ulrich, is it too late to insert a clarification that the telldir()
> >cookie isn't guaranteed to be valid after closedir() *or* rewinddir()?
> 
> More fundamentally, the telldir cookie should never be valid when 
> applied to a different DIR * (even one that refers to the same directory.)

Well, Joern thought that rm -rf might relying on the telldir cookie
being valid in precisely that circumstance.  If that is true, I'd
argue that this is a BUG in GNU coreutils that should be fixed...

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread J. Bruce Fields

On Sat, Apr 07, 2007 at 04:36:33PM -0400, Theodore Tso wrote:
> 1) Deprecate telldir/seekdir() altogether.  Relatively few progams use
> this functionality, and it is highly questionable how useful it is,
> anyway.  If you use telldir/seekdir and keep the cookie for a long
> time, even the POSIX-provided guarantees about files that are created
> and deleted between the telldir() and seekdir() points in time makes
> its utility highly dubious.

How will nfsd implement readdir?

> 2) If application programs must have telldir/seekdir, than expand the
> size of the cookie from 32-bits to a minimum of 128 bits, and
> preferably larger --- say 512 bits, to accomodate systems that might
> be using 512-bit variant of SHA-2.

NFS readdir cookies are currently 64 bits.

It'd be interesting to think about how to modify the protocol to make
this all easier, but any network filesystem protocol will need to give
clients some way to read through big directories one piece at a time.
Might also be nice if it worked even if the server rebooted partway
through

--b.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread Ulrich Drepper


On 4/8/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:

More fundamentally, the telldir cookie should never be valid when
applied to a different DIR * (even one that refers to the same directory.)


Don't worry about this.  This is clearly the semantics which was
always wanted. I've filed a defect report and it'll be handled.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread H. Peter Anvin


Theodore Tso wrote:

It doesn't state explicitly that you can use the telldir cookie()
after closing the directory stream using closedir() and then reopening
it using opendir(), but given that it states that results are
undefined after a rewinddir() --- which is much less violent than a
closedir()/opendir(), I would definitely argue that an application
programmer would be very ill-advised to rely on this working.

(Of course, I'd argue that an application programmer shouldn't use
telldir/seekdir at all.)

Ulrich, is it too late to insert a clarification that the telldir()
cookie isn't guaranteed to be valid after closedir() *or* rewinddir()?


More fundamentally, the telldir cookie should never be valid when 
applied to a different DIR * (even one that refers to the same directory.)


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread Ulrich Drepper


On 4/8/07, Theodore Tso <[EMAIL PROTECTED]> wrote:

Ulrich, is it too late to insert a clarification that the telldir()
cookie isn't guaranteed to be valid after closedir() *or* rewinddir()?


It's never too late.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread Theodore Tso

On Sun, Apr 08, 2007 at 08:41:30PM +0200, Jörn Engel wrote:
> 
> Garbage-collecting them on closedir() does not work.  It surprised me as
> well, but there seem to be applications that keep the telldir() cookie
> around after closedir().  Iirc, "rm -r" was one of them.
> 
> Neil, is this correct?

Well, according to the Single Unix Specification:

If the value of loc was not obtained from an earlier call to
telldir(), or if a call to rewinddir() occurred between the call
to telldir() and the call to seekdir(), the results of subsequent
calls to readdir() are unspecified.

It doesn't state explicitly that you can use the telldir cookie()
after closing the directory stream using closedir() and then reopening
it using opendir(), but given that it states that results are
undefined after a rewinddir() --- which is much less violent than a
closedir()/opendir(), I would definitely argue that an application
programmer would be very ill-advised to rely on this working.

(Of course, I'd argue that an application programmer shouldn't use
telldir/seekdir at all.)

Ulrich, is it too late to insert a clarification that the telldir()
cookie isn't guaranteed to be valid after closedir() *or* rewinddir()?

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread H. Peter Anvin


Theodore Tso wrote:


You could, but then you're succeptible to a memory allocation attack.
If you have an arbitrarily large directory (say, one with multiple
millions of entries), and the attacker program calls seekdir() after
every single readdir() call, you would then force the kernel to
allocate and then pin arbitrarily large amounts of memory, which as
you point out, as currently specified by the POSIX specification, you
are not allowed to release until closedir().

This could be done in userspace, by forcing glibc to readdir() the
entire directory into memory, at which point seekdir()/telldir() will
work just fine.  But for a really big directory, this could consume a
huge amount of space. 


If we had the 64-byte telldir cookie that I had proposed, then in
userspace we could simply associate that 64-byte telldir cookie with a
small 32-bit integer, either in memory, or in some berkdb or tdb
interface, at least until the use of telldir/seekdir had actually
disappeared.  (Which probably wouldn't take that long; I really doubt
there are that many users of it out there, so it's probably OK if they
suffer a performance penality if they use this really wretched and
horrible interface.)



If you want to have a large cookies, you could have glibc allocate a 
memory block to store it, and have glibc responsible for keeping track 
of it.  As far as I know, off_t can hold a pointer on all our 
implementations (only 32-bit machines have 32-bit off_t as an option; 
Alpha *might* be an exception but I don't think so.)



I'll also note, by the way, that there are those who have been much
more cavalier with breaking the wireless interface or the udev/sys
interface after one year.  Not that I would agree with that, but over
some deprecation period measured in years, I think it is possible to
nuke what was a horribly misguided interface that should have never
existed.  Whoever invented it really should receive the brown paper
award for one of the worst design decisions of all time.


Readdir/telldir are much, much, more fundamental than that.  We're 
talking interfaces which have been standardized for longer than Linux 
itself has existed.


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread Ulrich Drepper


On 4/7/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

It's not going to solve anything at all.  We can't stop supporting
functionality that has been there forever.


Not necessarily.

One problem here is that the interface for using readdir() with and
without telldir()/seekdir() is the same.  A second problem is that the
functionality is universally required.  Both of these problems can be
addressed.

For the second problem, I certainly could imagine that making the
functionality to to use seekdir()/telldir() optional.  It might be
hard in POSIX but this does not mean anything about implementations.
Implementations just have to provide a way to allow these functions to
be used.  It does not mean it always and everywhere has to work.  What
this means is that if, for instance, a filesystem would be (for now)
be able to have a mount option to not allow seekdir()/telldir() the
system still can conform to POSIX.  At the same time we can gather
information as to whether seekdir()/telldir() are really needed.  I
personally think the number of apps which depend on this functionality
is miniscule.

Using a mount option isn't the nicest solution, though.  If a
filesystem can support seekdir()/telldir() the better solution from
the userlevel API POV would be to provide a better, alternative
interface.  Maybe an alternative opendir() call (opendir2?) which
takes a second parameter as to whether seeking is needed or not.  Then
this opendir2() function can use a new getdents() syscall and return
the entries.  The difference would be that if the user wants to use
seekdir()/telldir() the userlevel code would cache the old results and
the seekdir()/telldir() handling would be entirely at userlevel.

It's not a good idea to make this the default behavior for the old
opendir() since the vast majority of the current users don't want to
seek and therefore the caching would significantly impact the
performance.  With the extra argument saying when caching is needed
this is no problem anymore.  Over time people would migrate off of
opendir() and towards opendir2() (with some "careful" encouragement)
and the whole problem will go away.

And the best: this is certainly a path I can see being viable for
POSIX.  But it requires that we have
a) established existing practice
b) shown the impact is really low

So, I think it would be great to get started writing this new getdents
call.  Yes, for now it means maintaining two separate versions.  If
all goes well those filsystems which feel a high burden can simply
stop supporting the old syscall or at least the seek functionality.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread Theodore Tso

On Sun, Apr 08, 2007 at 11:11:20AM -0700, H. Peter Anvin wrote:
> Christoph Hellwig wrote:
> >On Sat, Apr 07, 2007 at 04:36:33PM -0400, Theodore Tso wrote:
> >>this functionality, and it is highly questionable how useful it is,
> >>anyway.  If you use telldir/seekdir and keep the cookie for a long
> >>time, even the POSIX-provided guarantees about files that are created
> >>and deleted between the telldir() and seekdir() points in time makes
> >>its utility highly dubious.
> >
> >It's not going to solve anything at all.  We can't stop supporting
> >functionality that has been there forever. 
> 
> Well, the question is if you can keep the seekdir/telldir cookie around 
> as a pointer -- preferrably in userspace, of course.  You would 
> presumably garbage-collect them on closedir() -- there is no other point 
> at which you could.
> 
> I personally suspect that hch is right -- this stuff has been there 
> since time immemorial and it'll be hard or impossible to deprecate it.

You could, but then you're succeptible to a memory allocation attack.
If you have an arbitrarily large directory (say, one with multiple
millions of entries), and the attacker program calls seekdir() after
every single readdir() call, you would then force the kernel to
allocate and then pin arbitrarily large amounts of memory, which as
you point out, as currently specified by the POSIX specification, you
are not allowed to release until closedir().

This could be done in userspace, by forcing glibc to readdir() the
entire directory into memory, at which point seekdir()/telldir() will
work just fine.  But for a really big directory, this could consume a
huge amount of space. 

If we had the 64-byte telldir cookie that I had proposed, then in
userspace we could simply associate that 64-byte telldir cookie with a
small 32-bit integer, either in memory, or in some berkdb or tdb
interface, at least until the use of telldir/seekdir had actually
disappeared.  (Which probably wouldn't take that long; I really doubt
there are that many users of it out there, so it's probably OK if they
suffer a performance penality if they use this really wretched and
horrible interface.)

I'll also note, by the way, that there are those who have been much
more cavalier with breaking the wireless interface or the udev/sys
interface after one year.  Not that I would agree with that, but over
some deprecation period measured in years, I think it is possible to
nuke what was a horribly misguided interface that should have never
existed.  Whoever invented it really should receive the brown paper
award for one of the worst design decisions of all time.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread Jörn Engel

On Sun, 8 April 2007 11:11:20 -0700, H. Peter Anvin wrote:
> 
> Well, the question is if you can keep the seekdir/telldir cookie around 
> as a pointer -- preferrably in userspace, of course.  You would 
> presumably garbage-collect them on closedir() -- there is no other point 
> at which you could.

Garbage-collecting them on closedir() does not work.  It surprised me as
well, but there seem to be applications that keep the telldir() cookie
around after closedir().  Iirc, "rm -r" was one of them.

Neil, is this correct?

Jörn

-- 
Data dominates. If you've chosen the right data structures and organized
things well, the algorithms will almost always be self-evident. Data
structures, not algorithms, are central to programming.
-- Rob Pike
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-08 Thread H. Peter Anvin


Christoph Hellwig wrote:

On Sat, Apr 07, 2007 at 04:36:33PM -0400, Theodore Tso wrote:

this functionality, and it is highly questionable how useful it is,
anyway.  If you use telldir/seekdir and keep the cookie for a long
time, even the POSIX-provided guarantees about files that are created
and deleted between the telldir() and seekdir() points in time makes
its utility highly dubious.


It's not going to solve anything at all.  We can't stop supporting
functionality that has been there forever. 


Well, the question is if you can keep the seekdir/telldir cookie around 
as a pointer -- preferrably in userspace, of course.  You would 
presumably garbage-collect them on closedir() -- there is no other point 
at which you could.


I personally suspect that hch is right -- this stuff has been there 
since time immemorial and it'll be hard or impossible to deprecate it.


-hpa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-07 Thread Jan Engelhardt


On Apr 7 2007 16:36, Theodore Tso wrote:
>
>So how do we solve this problem?  I can think of two solutions:
>
>1) Deprecate telldir/seekdir() altogether.  Relatively few progams use
>this functionality, and it is highly questionable how useful it is,
>anyway.  If you use telldir/seekdir and keep the cookie for a long
>time, even the POSIX-provided guarantees about files that are created
>and deleted between the telldir() and seekdir() points in time makes
>its utility highly dubious.
>
>2) If application programs must have telldir/seekdir, than expand the
>size of the cookie from 32-bits to a minimum of 128 bits, and
>preferably larger --- say 512 bits, to accomodate systems that might
>be using 512-bit variant of SHA-2.  [...]

Maybe a combination of both? That is, replace telldir by an
independent emulation layer.


Jan
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-07 Thread Christoph Hellwig

On Sat, Apr 07, 2007 at 04:36:33PM -0400, Theodore Tso wrote:
> this functionality, and it is highly questionable how useful it is,
> anyway.  If you use telldir/seekdir and keep the cookie for a long
> time, even the POSIX-provided guarantees about files that are created
> and deleted between the telldir() and seekdir() points in time makes
> its utility highly dubious.

It's not going to solve anything at all.  We can't stop supporting
functionality that has been there forever. 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: If not readdir() then what?

2007-04-07 Thread Theodore Tso

On Sat, Apr 07, 2007 at 09:57:32AM -0700, Ulrich Drepper wrote:
> In their closed chambers (well, workshops,
> http://lwn.net/Articles/226351/), the filesystem developers complain
> about readdir.  I fully appreciate the difficulties.  But what I fail
> to see so far is any proposal for an alternative interface.
> 
> The phase to get new functionality included in the next revision of
> POSIX is over.  But that does not mean we should not try to get some
> sensible new implementation in place.  There is, for example, the
> "High End Computing Extensions Working Group" (the guys who showed up
> here with their statlite and readdirplus proposals).  This is an
> official working group at the OpenGroup which can produce a document
> which can be the basis of inclusion in the next revision and become a
> OpenGroup specification earlier than that.
> 
> So, if anybody has a proposal for better interfaces let's hear them.
> "Now" is a very good time to start working on this.

The problem isn't as much readdir() as it is telldir()/seekdir(),
which fundamentally assumes that the directory is a linear file.  JFS
for example was forced to implement an entire separate btree whose
only purpose was to record telldir cookies.  

With ext3 and htree we return the directories in hash tree sort order.
This sacrifices some performance with readdir/stat workloads unless
the userspace program does a readdir/qsort on inode number/stat
replacement.  In addition, there is the risk of hash collusions
because of the pathetically small size of the telldir cookie (32
bits).  If this happens, the readdir() guarantees of only returning
each directory entry at most once are compromised.  We use a keyed
hash with a per-filesystem superblock secret to prevent someone from
deliberately finding a hash collision and proving that we're not POSIX
compliant in the edge cases.  Cheasy, yes, but only loser programs
should be using telldir/seekdir anyway.  :-)

So how do we solve this problem?  I can think of two solutions:

1) Deprecate telldir/seekdir() altogether.  Relatively few progams use
this functionality, and it is highly questionable how useful it is,
anyway.  If you use telldir/seekdir and keep the cookie for a long
time, even the POSIX-provided guarantees about files that are created
and deleted between the telldir() and seekdir() points in time makes
its utility highly dubious.

2) If application programs must have telldir/seekdir, than expand the
size of the cookie from 32-bits to a minimum of 128 bits, and
preferably larger --- say 512 bits, to accomodate systems that might
be using 512-bit variant of SHA-2.   So something like this?

/* TELLDIR_COOKIE_SIZE must be >= 64 */
#define TELLDIR_COOKIE_SIZE 64
typedef unsigned char ltelldir_cookie_t[TELLDIR_COOKIE_SIZE];

int ltelldir(int fd, ltelldircookie_t *cookie);
int lseekdir(int fd, ltelldircookie_t *cookie);

My personal preference would be #1, though.  :-)

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

If not readdir() then what?

2007-04-07 Thread Ulrich Drepper


In their closed chambers (well, workshops,
http://lwn.net/Articles/226351/), the filesystem developers complain
about readdir.  I fully appreciate the difficulties.  But what I fail
to see so far is any proposal for an alternative interface.

The phase to get new functionality included in the next revision of
POSIX is over.  But that does not mean we should not try to get some
sensible new implementation in place.  There is, for example, the
"High End Computing Extensions Working Group" (the guys who showed up
here with their statlite and readdirplus proposals).  This is an
official working group at the OpenGroup which can produce a document
which can be the basis of inclusion in the next revision and become a
OpenGroup specification earlier than that.

So, if anybody has a proposal for better interfaces let's hear them.
"Now" is a very good time to start working on this.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

65 matches

Mail list logo