Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-06 Thread sfjro

Bharata B Rao:
> - The cache can grow arbitrarily large in size for big directories thereby
> consuming lots of memory. Pruning individual cache entries is out of question
> as entire cache is needed for subsequent readdirs for duplicate elimination.

Additionally, the memory usage may be a problem too since your
implementation calls kmalloc() for every names.


> - Whenever _any_ directory that is part of the union gets
> modified (addition/deletion of entries), the dirent cache of all the unions
> which this directory is part of, needs to be purged and rebuilt. This is
> expensive not only due to re-reads of dirents but also because
> readdir(2)/getdents(2) needs to be synchronized with other operations
> like mkdir/mknod/link/unlink etc.

The cache in struct file doesn't need to be refreshed unless rewinddir()
is issued. Also you can maintain the cache in every add/del entries,
instead of discarding the cache entirely.


> After all this, I am beginning to think if it would be better to delegate
> this readdir and whiteout processing to userspace. Can this be better handled

Yes, I had such idea once. And copy-up too. They can be done in
userspace (while you need to be careful about the privilege).

Anyway I agree with you. As I wrote before, this approach consumes a lot
of memory and cpu (for comparing whiteouted names).


Junjiro Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-06 Thread Dave Hansen
On Thu, 2007-12-06 at 11:01 +0100, Jan Blunck wrote:
> > Rather than give each _dirent_ an offset, could we give each sub-mount
> > an offset?  Let's say we have three members comprising a union mount
> > directory.  The first has 100 dirents, the second 200, and the third
> > 10,000.  When the first readdir is done, we populate the table like
> > this:
> > 
> >   mount_offset[0] = 0;
> >   mount_offset[1] = 100;
> >   mount_offset[2] = 300;
> > 
> > If someone seeks back to 150, then we subtrack the mount[1]'s offset
> > (100), and realize that we want the 50th dirent from mount[1].
> 
> Yes, that is a nice idea and it is exactly what I have implemented in my patch
> series. But you forgot one thing: directories are not flat files. The dentry
> offset in a directory is a random cookie. Therefore it is not possible to have
> a linear mapping without allocating memory.

Is is truly a random cookie where the fs gets to put anything in there
that it can fit in a long?  Isn't that an advantage?  Being a random
cookie, we can encode anything we want in there.  We could encode the
"mount index" (or whatever you want to call it) in high or low bits and
have the rest store the fs-specific offset.  The only problem there
would be running out of storage space in long.

But, what do people expect when they have huge directories (or huge
directory offsets)?  Surely, if their fs is already pressing the fs's
directory data types to the limits, they can't simply union mount a
couple of those directories together and expect magic.  Union mounts
also can't be expected to compress these directory positions to fit.

So, I think it's reasonable behavior to readdir() until the sum of the
highest position seen on each mount would overflow the off_t that we
have to store it in.  

-- Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-06 Thread Bharata B Rao
On Thu, Dec 06, 2007 at 11:01:18AM +0100, Jan Blunck wrote:
> On Wed, Dec 05, Dave Hansen wrote:
> 
> > I think the key here is what kind of consistency we're trying to
> > provide.  If a directory is being changed underneath a reader, what
> > kinds of guarantees do they get about the contents of their directory
> > read?  When do those guarantees start?  Are there any at open() time?
> 
> But we still want to be compliant to what POSIX defines. The problem isn't the
> consistency of the readdir result but the seekdir/telldir interface. IMHO that
> interface is totally broken: you need to be able to find every offset given by
> telldir since the last open. The problem is that seekdir isn't able to return
> errors. Otherwise you could just forbid seeking on union directories.

Also, what kind of consistency is expected when a directory is open(2)ed
and readdir(2) and lseek(2) are applied to it when the directory gets
changed underneath the reader. From this:
http://www.opengroup.org/onlinepubs/009695399/functions/lseek.html
the behaviour/guarantees wasn't apparent to me.

> 
> > Rather than give each _dirent_ an offset, could we give each sub-mount
> > an offset?  Let's say we have three members comprising a union mount
> > directory.  The first has 100 dirents, the second 200, and the third
> > 10,000.  When the first readdir is done, we populate the table like
> > this:
> > 
> > mount_offset[0] = 0;
> > mount_offset[1] = 100;
> > mount_offset[2] = 300;
> > 
> > If someone seeks back to 150, then we subtrack the mount[1]'s offset
> > (100), and realize that we want the 50th dirent from mount[1].
> 
> Yes, that is a nice idea and it is exactly what I have implemented in my patch
> series. But you forgot one thing: directories are not flat files. The dentry
> offset in a directory is a random cookie. Therefore it is not possible to have
> a linear mapping without allocating memory.

And I defined this linear behaviour on the cache of dirents we maintain
in the approach I posted. And the main reason we maintain cache of
dirents in memory is for duplicate elimination.

> 
> > I don't know whether we're bound to this:
> > 
> > http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html
> > 
> > "If a file is removed from or added to the directory after the
> > most recent call to opendir() or rewinddir(), whether a
> > subsequent call to readdir() returns an entry for that file is
> > unspecified."
> > 
> > But that would seem to tell me that once you populate a table such as
> > the one I've described and create it at open(dir) time, you don't
> > actually ever need to update it.
> 
> Yes, I'm using such a patch on our S390 buildservers to work around some
> readdir/seek/rm problem with old glibc versions. It seems to work but on the
> other hand this are really huge systems and I haven't run out of memory while
> doing a readdir yet ;)
> 
> The proper way to implement this would be to cache the offsets on a per inode
> base. Otherwise the user could easily DoS this by opening a number of
> directories and never close them.
> 

You mean cache the offsets or dirents ? How would that solve
the seek problem ? How would it enable you to define a seek behaviour
for the entire union of directories ?

Regards,
Bharata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-06 Thread Jan Blunck
On Wed, Dec 05, Dave Hansen wrote:

> I think the key here is what kind of consistency we're trying to
> provide.  If a directory is being changed underneath a reader, what
> kinds of guarantees do they get about the contents of their directory
> read?  When do those guarantees start?  Are there any at open() time?

But we still want to be compliant to what POSIX defines. The problem isn't the
consistency of the readdir result but the seekdir/telldir interface. IMHO that
interface is totally broken: you need to be able to find every offset given by
telldir since the last open. The problem is that seekdir isn't able to return
errors. Otherwise you could just forbid seeking on union directories.

> Rather than give each _dirent_ an offset, could we give each sub-mount
> an offset?  Let's say we have three members comprising a union mount
> directory.  The first has 100 dirents, the second 200, and the third
> 10,000.  When the first readdir is done, we populate the table like
> this:
> 
>   mount_offset[0] = 0;
>   mount_offset[1] = 100;
>   mount_offset[2] = 300;
> 
> If someone seeks back to 150, then we subtrack the mount[1]'s offset
> (100), and realize that we want the 50th dirent from mount[1].

Yes, that is a nice idea and it is exactly what I have implemented in my patch
series. But you forgot one thing: directories are not flat files. The dentry
offset in a directory is a random cookie. Therefore it is not possible to have
a linear mapping without allocating memory.

> I don't know whether we're bound to this:
> 
> http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html
> 
> "If a file is removed from or added to the directory after the
> most recent call to opendir() or rewinddir(), whether a
> subsequent call to readdir() returns an entry for that file is
> unspecified."
> 
> But that would seem to tell me that once you populate a table such as
> the one I've described and create it at open(dir) time, you don't
> actually ever need to update it.

Yes, I'm using such a patch on our S390 buildservers to work around some
readdir/seek/rm problem with old glibc versions. It seems to work but on the
other hand this are really huge systems and I haven't run out of memory while
doing a readdir yet ;)

The proper way to implement this would be to cache the offsets on a per inode
base. Otherwise the user could easily DoS this by opening a number of
directories and never close them.

Regards,
Jan

-- 
Jan Blunck <[EMAIL PROTECTED]>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-06 Thread Jan Blunck
On Wed, Dec 05, Dave Hansen wrote:

 I think the key here is what kind of consistency we're trying to
 provide.  If a directory is being changed underneath a reader, what
 kinds of guarantees do they get about the contents of their directory
 read?  When do those guarantees start?  Are there any at open() time?

But we still want to be compliant to what POSIX defines. The problem isn't the
consistency of the readdir result but the seekdir/telldir interface. IMHO that
interface is totally broken: you need to be able to find every offset given by
telldir since the last open. The problem is that seekdir isn't able to return
errors. Otherwise you could just forbid seeking on union directories.

 Rather than give each _dirent_ an offset, could we give each sub-mount
 an offset?  Let's say we have three members comprising a union mount
 directory.  The first has 100 dirents, the second 200, and the third
 10,000.  When the first readdir is done, we populate the table like
 this:
 
   mount_offset[0] = 0;
   mount_offset[1] = 100;
   mount_offset[2] = 300;
 
 If someone seeks back to 150, then we subtrack the mount[1]'s offset
 (100), and realize that we want the 50th dirent from mount[1].

Yes, that is a nice idea and it is exactly what I have implemented in my patch
series. But you forgot one thing: directories are not flat files. The dentry
offset in a directory is a random cookie. Therefore it is not possible to have
a linear mapping without allocating memory.

 I don't know whether we're bound to this:
 
 http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html
 
 If a file is removed from or added to the directory after the
 most recent call to opendir() or rewinddir(), whether a
 subsequent call to readdir() returns an entry for that file is
 unspecified.
 
 But that would seem to tell me that once you populate a table such as
 the one I've described and create it at open(dir) time, you don't
 actually ever need to update it.

Yes, I'm using such a patch on our S390 buildservers to work around some
readdir/seek/rm problem with old glibc versions. It seems to work but on the
other hand this are really huge systems and I haven't run out of memory while
doing a readdir yet ;)

The proper way to implement this would be to cache the offsets on a per inode
base. Otherwise the user could easily DoS this by opening a number of
directories and never close them.

Regards,
Jan

-- 
Jan Blunck [EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-06 Thread Bharata B Rao
On Thu, Dec 06, 2007 at 11:01:18AM +0100, Jan Blunck wrote:
 On Wed, Dec 05, Dave Hansen wrote:
 
  I think the key here is what kind of consistency we're trying to
  provide.  If a directory is being changed underneath a reader, what
  kinds of guarantees do they get about the contents of their directory
  read?  When do those guarantees start?  Are there any at open() time?
 
 But we still want to be compliant to what POSIX defines. The problem isn't the
 consistency of the readdir result but the seekdir/telldir interface. IMHO that
 interface is totally broken: you need to be able to find every offset given by
 telldir since the last open. The problem is that seekdir isn't able to return
 errors. Otherwise you could just forbid seeking on union directories.

Also, what kind of consistency is expected when a directory is open(2)ed
and readdir(2) and lseek(2) are applied to it when the directory gets
changed underneath the reader. From this:
http://www.opengroup.org/onlinepubs/009695399/functions/lseek.html
the behaviour/guarantees wasn't apparent to me.

 
  Rather than give each _dirent_ an offset, could we give each sub-mount
  an offset?  Let's say we have three members comprising a union mount
  directory.  The first has 100 dirents, the second 200, and the third
  10,000.  When the first readdir is done, we populate the table like
  this:
  
  mount_offset[0] = 0;
  mount_offset[1] = 100;
  mount_offset[2] = 300;
  
  If someone seeks back to 150, then we subtrack the mount[1]'s offset
  (100), and realize that we want the 50th dirent from mount[1].
 
 Yes, that is a nice idea and it is exactly what I have implemented in my patch
 series. But you forgot one thing: directories are not flat files. The dentry
 offset in a directory is a random cookie. Therefore it is not possible to have
 a linear mapping without allocating memory.

And I defined this linear behaviour on the cache of dirents we maintain
in the approach I posted. And the main reason we maintain cache of
dirents in memory is for duplicate elimination.

 
  I don't know whether we're bound to this:
  
  http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html
  
  If a file is removed from or added to the directory after the
  most recent call to opendir() or rewinddir(), whether a
  subsequent call to readdir() returns an entry for that file is
  unspecified.
  
  But that would seem to tell me that once you populate a table such as
  the one I've described and create it at open(dir) time, you don't
  actually ever need to update it.
 
 Yes, I'm using such a patch on our S390 buildservers to work around some
 readdir/seek/rm problem with old glibc versions. It seems to work but on the
 other hand this are really huge systems and I haven't run out of memory while
 doing a readdir yet ;)
 
 The proper way to implement this would be to cache the offsets on a per inode
 base. Otherwise the user could easily DoS this by opening a number of
 directories and never close them.
 

You mean cache the offsets or dirents ? How would that solve
the seek problem ? How would it enable you to define a seek behaviour
for the entire union of directories ?

Regards,
Bharata.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-06 Thread Dave Hansen
On Thu, 2007-12-06 at 11:01 +0100, Jan Blunck wrote:
  Rather than give each _dirent_ an offset, could we give each sub-mount
  an offset?  Let's say we have three members comprising a union mount
  directory.  The first has 100 dirents, the second 200, and the third
  10,000.  When the first readdir is done, we populate the table like
  this:
  
mount_offset[0] = 0;
mount_offset[1] = 100;
mount_offset[2] = 300;
  
  If someone seeks back to 150, then we subtrack the mount[1]'s offset
  (100), and realize that we want the 50th dirent from mount[1].
 
 Yes, that is a nice idea and it is exactly what I have implemented in my patch
 series. But you forgot one thing: directories are not flat files. The dentry
 offset in a directory is a random cookie. Therefore it is not possible to have
 a linear mapping without allocating memory.

Is is truly a random cookie where the fs gets to put anything in there
that it can fit in a long?  Isn't that an advantage?  Being a random
cookie, we can encode anything we want in there.  We could encode the
mount index (or whatever you want to call it) in high or low bits and
have the rest store the fs-specific offset.  The only problem there
would be running out of storage space in long.

But, what do people expect when they have huge directories (or huge
directory offsets)?  Surely, if their fs is already pressing the fs's
directory data types to the limits, they can't simply union mount a
couple of those directories together and expect magic.  Union mounts
also can't be expected to compress these directory positions to fit.

So, I think it's reasonable behavior to readdir() until the sum of the
highest position seen on each mount would overflow the off_t that we
have to store it in.  

-- Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-06 Thread sfjro

Bharata B Rao:
 - The cache can grow arbitrarily large in size for big directories thereby
 consuming lots of memory. Pruning individual cache entries is out of question
 as entire cache is needed for subsequent readdirs for duplicate elimination.

Additionally, the memory usage may be a problem too since your
implementation calls kmalloc() for every names.


 - Whenever _any_ directory that is part of the union gets
 modified (addition/deletion of entries), the dirent cache of all the unions
 which this directory is part of, needs to be purged and rebuilt. This is
 expensive not only due to re-reads of dirents but also because
 readdir(2)/getdents(2) needs to be synchronized with other operations
 like mkdir/mknod/link/unlink etc.

The cache in struct file doesn't need to be refreshed unless rewinddir()
is issued. Also you can maintain the cache in every add/del entries,
instead of discarding the cache entirely.


 After all this, I am beginning to think if it would be better to delegate
 this readdir and whiteout processing to userspace. Can this be better handled

Yes, I had such idea once. And copy-up too. They can be done in
userspace (while you need to be careful about the privilege).

Anyway I agree with you. As I wrote before, this approach consumes a lot
of memory and cpu (for comparing whiteouted names).


Junjiro Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-05 Thread Dave Hansen
On Wed, 2007-12-05 at 20:07 +0530, Bharata B Rao wrote:
> In this approach, the cached dirents are given offsets in the form of
> linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to
> uniformly define offsets across all the directories of the union
> irrespective of the type of filesystem involved. Also this is needed to
> define a seek behaviour on the union mounted directory. This cache is stored
> as part of the struct file of the topmost directory of the union and will
> remain as long as the directory is kept open. 

That's a little brute force, don't you think?  Especially the memory
exhaustion seems to really make this an undesirable approach.

I think the key here is what kind of consistency we're trying to
provide.  If a directory is being changed underneath a reader, what
kinds of guarantees do they get about the contents of their directory
read?  When do those guarantees start?  Are there any at open() time?

Rather than give each _dirent_ an offset, could we give each sub-mount
an offset?  Let's say we have three members comprising a union mount
directory.  The first has 100 dirents, the second 200, and the third
10,000.  When the first readdir is done, we populate the table like
this:

mount_offset[0] = 0;
mount_offset[1] = 100;
mount_offset[2] = 300;

If someone seeks back to 150, then we subtrack the mount[1]'s offset
(100), and realize that we want the 50th dirent from mount[1].

I don't know whether we're bound to this:

http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html

"If a file is removed from or added to the directory after the
most recent call to opendir() or rewinddir(), whether a
subsequent call to readdir() returns an entry for that file is
unspecified."

But that would seem to tell me that once you populate a table such as
the one I've described and create it at open(dir) time, you don't
actually ever need to update it.

The storage for this is only comparable to the number of mounts that you
have.  One issue comes if we manage to overflow our data types with too
many entries in too many submounts.  But, I guess we can just truncate
the directory in that case.  

-- Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-05 Thread Bharata B Rao
Hi,

In Union Mount, the merged view of directories of the union is obtained
by enhancing readdir(2)/getdents(2) to read and merge the entries of
all the directories  by eliminating the duplicates. While we have tried
a few approaches for this, none of them could perfectly solve all the problems.
One of the challenges has been to provide a correct directory seek support for
the union mounted directories. Sometime back when I posted an
RFC (http://lkml.org/lkml/2007/9/7/22) on this problem, one of the
suggestions I got was to have the dirents stored in a cache (which we
already do for duplicate elimination) and define the directory seek
behaviour on this cache constructed out of unioned directories.

I set out to try this and the result is this set of patches. I am myself
not impressed by the implementation complexity in these patches but posting
them here only to get further comments and suggestions. Moreover I haven't
debugged them thoroughly to uncover all the problems. While I don't expect
anybody try these patches, for the completeness sake I have to mention that
these apply on top of Jan Blunck's patches on 2.6.24-rc2-mm1
(ftp://ftp.suse.com/pub/people/jblunck/patches/).

In this approach, the cached dirents are given offsets in the form of
linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to
uniformly define offsets across all the directories of the union
irrespective of the type of filesystem involved. Also this is needed to
define a seek behaviour on the union mounted directory. This cache is stored
as part of the struct file of the topmost directory of the union and will
remain as long as the directory is kept open.

While this approach works, it has the following downsides:

- The cache can grow arbitrarily large in size for big directories thereby
consuming lots of memory. Pruning individual cache entries is out of question
as entire cache is needed for subsequent readdirs for duplicate elimination.

- When an exising union gets overlaid by a new directory, there is a
possibility of the cache getting duplicated for the new union, thereby wasting
more space.

- Whenever _any_ directory that is part of the union gets
modified (addition/deletion of entries), the dirent cache of all the unions
which this directory is part of, needs to be purged and rebuilt. This is
expensive not only due to re-reads of dirents but also because
readdir(2)/getdents(2) needs to be synchronized with other operations
like mkdir/mknod/link/unlink etc.

- Since lseek(2) of the unioned directory also works on the same dirent
cache, it too needs to be invalidated when the directory gets modified.

- Supporting SEEK_END of lseek(2) is expensive as it involves reading in
all the dirents to know the EOF of the directory file.

After all this, I am beginning to think if it would be better to delegate
this readdir and whiteout processing to userspace. Can this be better handled
by readdir(3) in glibc ? But even with this, I am not sure if correct seek
behaviour (from lseek(2) or seekdir(3))  can be achieved in an easy manner.
Any thoughts on this ?

Earlier Erez Zodak had suggested that things will become easier if readdir
state is stored as a disk file (http://lkml.org/lkml/2007/9/7/114). This
approach simplifies directory seek support in Unionfs. But I am not sure if
such an approach would go well with VFS based unification approach like
Union Mount.

Regards,
Bharata.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-05 Thread Bharata B Rao
Hi,

In Union Mount, the merged view of directories of the union is obtained
by enhancing readdir(2)/getdents(2) to read and merge the entries of
all the directories  by eliminating the duplicates. While we have tried
a few approaches for this, none of them could perfectly solve all the problems.
One of the challenges has been to provide a correct directory seek support for
the union mounted directories. Sometime back when I posted an
RFC (http://lkml.org/lkml/2007/9/7/22) on this problem, one of the
suggestions I got was to have the dirents stored in a cache (which we
already do for duplicate elimination) and define the directory seek
behaviour on this cache constructed out of unioned directories.

I set out to try this and the result is this set of patches. I am myself
not impressed by the implementation complexity in these patches but posting
them here only to get further comments and suggestions. Moreover I haven't
debugged them thoroughly to uncover all the problems. While I don't expect
anybody try these patches, for the completeness sake I have to mention that
these apply on top of Jan Blunck's patches on 2.6.24-rc2-mm1
(ftp://ftp.suse.com/pub/people/jblunck/patches/).

In this approach, the cached dirents are given offsets in the form of
linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to
uniformly define offsets across all the directories of the union
irrespective of the type of filesystem involved. Also this is needed to
define a seek behaviour on the union mounted directory. This cache is stored
as part of the struct file of the topmost directory of the union and will
remain as long as the directory is kept open.

While this approach works, it has the following downsides:

- The cache can grow arbitrarily large in size for big directories thereby
consuming lots of memory. Pruning individual cache entries is out of question
as entire cache is needed for subsequent readdirs for duplicate elimination.

- When an exising union gets overlaid by a new directory, there is a
possibility of the cache getting duplicated for the new union, thereby wasting
more space.

- Whenever _any_ directory that is part of the union gets
modified (addition/deletion of entries), the dirent cache of all the unions
which this directory is part of, needs to be purged and rebuilt. This is
expensive not only due to re-reads of dirents but also because
readdir(2)/getdents(2) needs to be synchronized with other operations
like mkdir/mknod/link/unlink etc.

- Since lseek(2) of the unioned directory also works on the same dirent
cache, it too needs to be invalidated when the directory gets modified.

- Supporting SEEK_END of lseek(2) is expensive as it involves reading in
all the dirents to know the EOF of the directory file.

After all this, I am beginning to think if it would be better to delegate
this readdir and whiteout processing to userspace. Can this be better handled
by readdir(3) in glibc ? But even with this, I am not sure if correct seek
behaviour (from lseek(2) or seekdir(3))  can be achieved in an easy manner.
Any thoughts on this ?

Earlier Erez Zodak had suggested that things will become easier if readdir
state is stored as a disk file (http://lkml.org/lkml/2007/9/7/114). This
approach simplifies directory seek support in Unionfs. But I am not sure if
such an approach would go well with VFS based unification approach like
Union Mount.

Regards,
Bharata.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] Union Mount: A Directory listing approach with lseek support

2007-12-05 Thread Dave Hansen
On Wed, 2007-12-05 at 20:07 +0530, Bharata B Rao wrote:
 In this approach, the cached dirents are given offsets in the form of
 linearly increasing indices/cookies (like 0, 1, 2,...). This helps us to
 uniformly define offsets across all the directories of the union
 irrespective of the type of filesystem involved. Also this is needed to
 define a seek behaviour on the union mounted directory. This cache is stored
 as part of the struct file of the topmost directory of the union and will
 remain as long as the directory is kept open. 

That's a little brute force, don't you think?  Especially the memory
exhaustion seems to really make this an undesirable approach.

I think the key here is what kind of consistency we're trying to
provide.  If a directory is being changed underneath a reader, what
kinds of guarantees do they get about the contents of their directory
read?  When do those guarantees start?  Are there any at open() time?

Rather than give each _dirent_ an offset, could we give each sub-mount
an offset?  Let's say we have three members comprising a union mount
directory.  The first has 100 dirents, the second 200, and the third
10,000.  When the first readdir is done, we populate the table like
this:

mount_offset[0] = 0;
mount_offset[1] = 100;
mount_offset[2] = 300;

If someone seeks back to 150, then we subtrack the mount[1]'s offset
(100), and realize that we want the 50th dirent from mount[1].

I don't know whether we're bound to this:

http://www.opengroup.org/onlinepubs/007908775/xsh/readdir.html

If a file is removed from or added to the directory after the
most recent call to opendir() or rewinddir(), whether a
subsequent call to readdir() returns an entry for that file is
unspecified.

But that would seem to tell me that once you populate a table such as
the one I've described and create it at open(dir) time, you don't
actually ever need to update it.

The storage for this is only comparable to the number of mounts that you
have.  One issue comes if we manage to overflow our data types with too
many entries in too many submounts.  But, I guess we can just truncate
the directory in that case.  

-- Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/