Re: Multiple devfs mounts

2000-05-01 Thread Theodore Y. Ts'o

   Date:Mon, 1 May 2000 11:27:04 -0400 (EDT)
   From: Alexander Viro <[EMAIL PROTECTED]>

   Keep in mind that userland may need to be taught how to deal with getdents()
   returning duplicates - there is no reasonable way to serve that in
   the kernel. 

*BSD does this in libc, for exactly the same reason; there's no good way
to do this in the kernel.
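
Roughly, the userland side amounts to something like this -- a minimal
sketch with invented names (a real libc would keep the seen-set inside the
DIR handle and free it in closedir(); error handling omitted):

    #include <dirent.h>
    #include <stdlib.h>
    #include <string.h>

    struct dedup_dir {
        DIR    *dir;
        char  **seen;           /* names already returned */
        size_t  nseen, cap;
    };

    static int seen_before(struct dedup_dir *d, const char *name)
    {
        size_t i;
        for (i = 0; i < d->nseen; i++)
            if (strcmp(d->seen[i], name) == 0)
                return 1;
        return 0;
    }

    struct dirent *dedup_readdir(struct dedup_dir *d)
    {
        struct dirent *de;

        while ((de = readdir(d->dir)) != NULL) {
            if (seen_before(d, de->d_name))
                continue;       /* duplicate entry from another mount */
            if (d->nseen == d->cap) {
                d->cap = d->cap ? d->cap * 2 : 16;
                d->seen = realloc(d->seen, d->cap * sizeof(*d->seen));
            }
            d->seen[d->nseen++] = strdup(de->d_name);
            return de;
        }
        return NULL;
    }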

- Ted



Re: ext2 feature request

2000-05-01 Thread Andrew Clausen

[EMAIL PROTECTED] wrote:
> The answer to that is pretty simple --- you create a compatible
> filesystem feature, and indicate the number of "extra" saved blocks in
> the superblock.  Old e2fsck's who don't know about the compatible
> filesystem features will refuse to touch the filesystem, so they won't
> clear the extra bits in the block bitmap.  However, since it's a
> compatible filesystem feature, the kernel will happily mount such a
> filesystem --- since it trusts the accuracy of the block allocation
> bitmaps, it won't try to allocate any of the reserved blocks.

Clever.
 
> P.S.  The e2fsprogs 1.19 will have support for off-line resizing, even
> without this feature.  It just simply moves the inode table if necessary
> (with very paranoid, very carefully written code).  (Yes, that's right,
> the GPL timeout on resize2fs finally timed out.)  If you want an early
> peek at it, the 1.19 pre-release snapshot at e2fsprogs.sourceforge.net
> already has it.  I just need to get a couple of final bug
> fixes/enhancements before I release 1.19 for real.
> 
> (I'm currently debating whether or not to code up the above idea before
> or after the 1.19 release.  :-)

So can your ext2 resizer move the start?  Can it move it, even if the
blocks don't align?

Andrew Clausen



Re: ext2 feature request

2000-05-01 Thread Andrew Clausen

"Stephen C. Tweedie" wrote:
> On Fri, Apr 28, 2000 at 08:09:21PM +1000, Andrew Clausen wrote:
> >
> > Is it possible to have a gap between the super-block and the
> > start of group 0's metadata?
> 
> Yes.  It's called the "s_first_data_block" field in the ext2
> superblock, and lets you offset the data zone from the start
> of the filesystem.

Sorry, I don't think this is what I want.  The unit is blocks,
not sectors.  OTOH, it is useful for boot loaders...

Andrew Clausen



Re: fs changes in 2.3

2000-05-01 Thread Hans Reiser

"Roman V. Shaposhnick" wrote:

>   Hans, I do not want to be unpleasant, but you behave like a second-level
> manager who has not been able to get to the first level for quite a long time.

Ok, let me put it in different lingo.  Viro is a fucking asshole who makes life
miserable for people trying to add functionality to Linux.  What's more, he has
no imagination and his contributions do not begin to make up for the shit he
gives people like Gooch, who is trying to add something that wasn't in System 7
and is therefore beyond Viro's ability to cope with.  It takes more than a
license to make source code open.  He shouldn't be let near the source code, his
existence is a net loss for Linux.

There, is that less like a second-line manager?  If any of you prefer second-line
manager lingo, stick with the previous email. :-)

In 2.5 we'll add stuff to VFS.  We might even try to sneak some stuff into 2.3
if Viro keeps 2.4 from being ready for another two weeks.  

Hans



Re: ext2 feature request

2000-05-01 Thread tytso

   Date: Mon, 1 May 2000 21:29:42 +0200 (MEST)
   From: Lennert Buytenhek <[EMAIL PROTECTED]>

   Eeeeh. not quite.

   Because we reserve the first 1024 bytes of an ext2 fs for putting the boot
   loader, we put the superblock at offset 1024. For 1k blocks, this means
   the superblock ends up in block #1. For 2k/4k block sizes, this means the
   superblock ends up in block #0 (albeit at offset 1024). So for 1k block
   fs'es, s_first_data_block is 1. For 2k/4k blocks fs'es, s_first_data_block
   is 0. So this field is actually being used.

Yup, that's right.  In order to make this work, we'd need to mark a
certain number of blocks as "in use" so that the kernel doesn't allocate
into them, and then somehow signal to e2fsck not to release those
blocks when it is doing the check.

The answer to that is pretty simple --- you create a compatible
filesystem feature, and indicate the number of "extra" saved blocks in
the superblock.  Old e2fsck's who don't know about the compatible
filesystem features will refuse to touch the filesystem, so they won't
clear the extra bits in the block bitmap.  However, since it's a
compatible filesystem feature, the kernel will happily mount such a
filesystem --- since it trusts the accuracy of the block allocation
bitmaps, it won't try to allocate any of the reserved blocks.
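
In pseudo-ext2 terms the two checks would look roughly like this (the
feature bit and the "known" masks are hypothetical, invented purely for
illustration -- this is not code from e2fsprogs or the kernel):

    #define KNOWN_COMPAT            0x0007  /* whatever this e2fsck understands */
    #define KNOWN_INCOMPAT          0x0003
    #define COMPAT_RESERVED_BLOCKS  0x0010  /* the new "extra saved blocks" flag */

    int e2fsck_may_touch(struct ext2_super_block *sb)
    {
        /* An old e2fsck sees an unknown compat bit, refuses to run, and
         * therefore never clears the reserved bits in the block bitmap. */
        return (sb->s_feature_compat & ~KNOWN_COMPAT) == 0;
    }

    int kernel_may_mount(struct ext2_super_block *sb)
    {
        /* The kernel only rejects unknown *incompatible* features; an
         * unknown compat feature is harmless, since it trusts the block
         * allocation bitmaps as they stand. */
        return (sb->s_feature_incompat & ~KNOWN_INCOMPAT) == 0;
    }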

So this can be done as a pure e2fsprogs enhancement.  Maybe I'll
just code it up and see what happens.  :-)

- Ted

P.S.  The e2fsprogs 1.19 will have support for off-line resizing, even
without this feature.  It just simply moves the inode table if necessary
(with very paranoid, very carefully written code).  (Yes, that's right,
the GPL timeout on resize2fs finally timed out.)  If you want an early
peek at it, the 1.19 pre-release snapshot at e2fsprogs.sourceforge.net
already has it.  I just need to get a couple of final bug
fixes/enhancements before I release 1.19 for real.  

(I'm currently debating whether or not to code up the above idea before
or after the 1.19 release.  :-)



Re: ext2 feature request

2000-05-01 Thread Lennert Buytenhek


Hi,


On Mon, 1 May 2000, Stephen C. Tweedie wrote:

> > Is it possible to have a gap between the super-block and the
> > start of group 0's metadata?
> 
> Yes.  It's called the "s_first_data_block" field in the ext2
> superblock, and lets you offset the data zone from the start of the
> filesystem.
> 
> The mke2fs code doesn't currently set up superblocks which use this,
> but the kernel support should already work.

Eeeeh. not quite.

Because we reserve the first 1024 bytes of an ext2 fs for putting the boot
loader, we put the superblock at offset 1024. For 1k blocks, this means
the superblock ends up in block #1. For 2k/4k block sizes, this means the
superblock ends up in block #0 (albeit at offset 1024). So for 1k block
fs'es, s_first_data_block is 1. For 2k/4k blocks fs'es, s_first_data_block
is 0. So this field is actually being used.

(For 1k blocks and 8192 blocks/group, the groups start at blocks 1,8193,
16385, etc. For 2k blocks and 8192 blocks/group, the groups start at
blocks 0,8192,16384, etc.)
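
If mke2fs ever computed it explicitly, the whole rule would reduce to this
(an illustrative helper, not actual mke2fs code):

    /* The superblock always lives at byte offset 1024, so its block
     * number -- and with it s_first_data_block -- follows directly
     * from the block size. */
    unsigned long first_data_block(unsigned long block_size)
    {
        return 1024 / block_size;   /* 1 for 1k blocks, 0 for 2k/4k */
    }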


greetings,
Lennert




Re: new VFS method sync_page and stacking

2000-05-01 Thread Roman V. Shaposhnick

On Mon, May 01, 2000 at 01:57:23AM -0700, Hans Reiser wrote:
> "Roman V. Shaposhnick" wrote:
> :
> > 
> >  1. In UNIX everything is a file. Thus we need just one type of a
> > cache -- cache for files that can be cached.
> 
> Directories and other metadata are not necessarily implemented as files.  

  Directories are files, indeed.  Read-only files.  Sometimes with special
alignment rules for the reader, but they are files.  I can say nothing
about other metadata, because I was speaking about the generic use case.  And
other metadata is "other".  But you are right: sometimes it can fit into a
different model, though it is up to the implementation to choose one.

Thanks,
Roman.



Re: new VFS method sync_page and stacking

2000-05-01 Thread Roman V. Shaposhnick

On Mon, May 01, 2000 at 02:33:58PM -0400, Benjamin C.R. LaHaise wrote:
> On Mon, 1 May 2000, Roman V. Shaposhnick wrote:
> 
> >   I see what you mean. And I completely agree with only one exception: we
> > should not use and we should not think of address_space as a generic cache
> > till we implement the interface for address_space_operations that:
> >  
> > 1. can work with *any* type of host object
> > 2. at the same time can work with stackable or derived ( in C++
> >terminology ) host objects like file->dentry->inode->phis.
> > 3. and has a reasonable and kludge-free specification.
> > 
> >   I agree that providing such interface to the address_space will simplify 
> > things a lot, but for now I see no evidence of somebody doing this. 
> 
> If we remove the struct file *'s from the interface, it becomes quite
> trivial to do this.

  It is not hard to remove the struct file *'s from the interface, but please
specify what we will get instead.

> > > Hmm. Take ->readpage() for example. It is used to fill a page which "belongs"
> > > to an object. That object is referenced in page->mapping->host. For inode
> > > data, the host is the inode structure. When should readpage() ever need to
> > > see anything other than the object to which the pages belong? It doesn't make
> > > sense (to me). 
> > 
> >I disagree; and mainly because sometimes it is hard to tell where object 
> > "begins" and where it "ends". Complex objects  are often  derived from 
> > simple  and it is a very common practice to use the highest level available
> > for the sake of optimization. E.g. in Linux userspace file is built around 
> > dentry which in turn is built around inode etc. Can we say that file do not
> > include inode directly ? Yes. Can we say that file do not include inode at
> > all? No. 
> > 
> >  Let me show you an example. Consider, that:
> >1. you have a network-based filesystem with a state oriented protocol. 
> >2. to do something with a file you need a special pointer to it called 
> >   file_id.
> >3. before reading/writing to/from a file you should open its file_id
> >   for reading, writing or both. After you open file_id you can only
> >   read/write from/to it but you can not do nothing more.
> >
> >  I guess you will agree with me that  the best place for "opened" file_id
> > is a file->private_data ? Ok. Now the question is, how can I access opened
> > file_id if all methods ( readpage, writepage, prepare_write and commit_write ) 
> > get the first argument of type inode ?
> 
> Holding the file_id in file->private_data is just plain wrong in the
> presence of delayed writes (especially on NFS).  Consider the case where a
> temporary file is opened and deleted but then used for either read/write
> or backing an mmap. Does it not make sense that when the file is finally
> close()'d/munmap()'d by the last user that the contents should be merely
> discarded?  

   I guess what you suggest is something like:

  fd = open("file", O_RDWR);
  unlink("file");
  read(fd, buff, count);
  ...
  ...
  close(fd).

If yes, then it does not make sense to discard the file contents, because
in any case we are *bound* to transfer them to the server side, since the
file may have other users on that side too.

> If you're tieing the ability to complete write()'s to the
> struct file * that was opened, you'll never be able to detect this case
> (which is trivial if the file_id token is placed in the inode or address
> space).  

Yes.  I store a special "clean" file_id there, and if there is no better
way I can repeat the damn procedure (clone the file_id, open the copy,
read/write, remove the copy), but from my point of view this is definitely a
waste of network capacity.  NFS ("no file security") handles those issues
because it is stateless.  And I cannot, because it would be a waste of
resources to pretend to be quasi-stateless.

> An address_space should be able to do everything without
> requiring a struct file.  Granted, this raises the problem that Al pointed
> out about signing RPC requests, but if a copy of the authentication token
> is made at open() time into the lower layer for the filesystem, this
> problem goes away.
> This is important because you can't have a struct file when getting calls
> from the page reclaim action in shrink_mmap.

  That's right.  Did you read my previous e-mail?  I mean that the interface
should provide the highest possible level; that's it.  If shrink_mmap can only
work at some lower level, that's OK.  But that should be the exception, not the rule.

> > > Inode data pages are per-inode, not per-dentry or per-file-struct.
> > 
> >   Frankly, inode data pages are file pages, because it is userspace files we
> > care of. Nothing more, nothing less. 
> 
> No, they are not.  address_spaces are a generic way of caching pages of a
> vm object.  Files happen to fit nicely into vm objects.

  If all you have is a h

Re: new VFS method sync_page and stacking

2000-05-01 Thread Hans Reiser

"Roman V. Shaposhnick" wrote:
:
> 
>  1. In UNIX everything is a file. Thus we need just one type of a
> cache -- cache for files that can be cached.

Directories and other metadata are not necessarily implemented as files.  

Hans



Re: new VFS method sync_page and stacking

2000-05-01 Thread Alexander Viro



On Mon, 1 May 2000, Roman V. Shaposhnick wrote:

> On Mon, May 01, 2000 at 01:50:58PM -0400, Alexander Viro wrote:
> > 
> > 
> > On Mon, 1 May 2000, Roman V. Shaposhnick wrote:
> > 
> > > 2. at the same time can work with stackable or derived ( in C++
> > >terminology ) host objects like file->dentry->inode->phis.
> > 
> > These are _not_ derived in C++ sense. Sorry.
> 
>   Ooops! I mean, aggregation, of course, but nevertheless I just try to 
> express the idea using less possible number of words. 
>   Ok, what about the idea ?

Tokens. Opaque tokens. Interface should not care about them. Callers must
know what they are dealing with, indeed.

> > > > Inode data pages are per-inode, not per-dentry or per-file-struct.
> > > 
> > >   Frankly, inode data pages are file pages, because it is userspace files we
> > > care of. Nothing more, nothing less. 
> > 
> > You've missed the point here. We cache the data on the client
> > side. _All_ openers share that cache. 
> 
>You mean local, client side openers ? 

Yep.

> > IOW, we have a chance that data submitted by one of them will be sent with 
> > credentials of another. Nothing to do here.
> 
>For now? Yes. But if we would be able to use "struct file *" than we could 
> store all meta information on a per opener basis. That's the idea. And we
> could always separate one opener from another by syncing each and every
> operation. Yes?

All meta information? Consider mmap(). You have a dirty page. It might be
dirtied by a process that has it mapped. It might be dirtied by direct
write() by another opener. You end up with the need to generate Rwrite.
And no way to separate the data from different openers. So either you use
a new cache for each opener (_big_ waste of memory + hell of coherency
problems) or you end up with the described effect.




Re: new VFS method sync_page and stacking

2000-05-01 Thread Roman V. Shaposhnick

On Mon, May 01, 2000 at 01:50:58PM -0400, Alexander Viro wrote:
> 
> 
> On Mon, 1 May 2000, Roman V. Shaposhnick wrote:
> 
> > 2. at the same time can work with stackable or derived ( in C++
> >terminology ) host objects like file->dentry->inode->phis.
> 
> These are _not_ derived in C++ sense. Sorry.

  Oops! I meant aggregation, of course, but nevertheless I am just trying to
express the idea in as few words as possible.
  OK, what about the idea?

> > > Inode data pages are per-inode, not per-dentry or per-file-struct.
> > 
> >   Frankly, inode data pages are file pages, because it is userspace files we
> > care of. Nothing more, nothing less. 
> 
>   You've missed the point here. We cache the data on the client
> side. _All_ openers share that cache. 

   You mean local, client-side openers?

> IOW, we have a chance that data submitted by one of them will be sent with 
> credentials of another. Nothing to do here.

   For now?  Yes.  But if we were able to use "struct file *" then we could
store all meta-information on a per-opener basis.  That's the idea.  And we
could always separate one opener from another by syncing each and every
operation.  Yes?

Thanks,
Roman.

P.S. I hope we are talking about the same thing?  If not, an example, please.



Re: new VFS method sync_page and stacking

2000-05-01 Thread Benjamin C.R. LaHaise

On Mon, 1 May 2000, Roman V. Shaposhnick wrote:

>   I see what you mean. And I completely agree with only one exception: we
> should not use and we should not think of address_space as a generic cache
> till we implement the interface for address_space_operations that:
>  
> 1. can work with *any* type of host object
> 2. at the same time can work with stackable or derived ( in C++
>terminology ) host objects like file->dentry->inode->phis.
> 3. and has a reasonable and kludge-free specification.
> 
>   I agree that providing such interface to the address_space will simplify 
> things a lot, but for now I see no evidence of somebody doing this. 

If we remove the struct file *'s from the interface, it becomes quite
trivial to do this.

> > Hmm. Take ->readpage() for example. It is used to fill a page which "belongs"
> > to an object. That object is referenced in page->mapping->host. For inode
> > data, the host is the inode structure. When should readpage() ever need to
> > see anything other than the object to which the pages belong? It doesn't make
> > sense (to me). 
> 
>I disagree; and mainly because sometimes it is hard to tell where object 
> "begins" and where it "ends". Complex objects  are often  derived from 
> simple  and it is a very common practice to use the highest level available
> for the sake of optimization. E.g. in Linux userspace file is built around 
> dentry which in turn is built around inode etc. Can we say that file do not
> include inode directly ? Yes. Can we say that file do not include inode at
> all? No. 
> 
>  Let me show you an example. Consider, that:
>1. you have a network-based filesystem with a state oriented protocol. 
>2. to do something with a file you need a special pointer to it called 
>   file_id.
>3. before reading/writing to/from a file you should open its file_id
>   for reading, writing or both. After you open file_id you can only
>   read/write from/to it but you can not do nothing more.
>
>  I guess you will agree with me that  the best place for "opened" file_id
> is a file->private_data ? Ok. Now the question is, how can I access opened
> file_id if all methods ( readpage, writepage, prepare_write and commit_write ) 
> get the first argument of type inode ?

Holding the file_id in file->private_data is just plain wrong in the
presence of delayed writes (especially on NFS).  Consider the case where a
temporary file is opened and deleted but then used for either read/write
or backing an mmap.  Does it not make sense that when the file is finally
close()'d/munmap()'d by the last user that the contents should be merely
discarded?  If you're tieing the ability to complete write()'s to the
struct file * that was opened, you'll never be able to detect this case
(which is trivial if the file_id token is placed in the inode or address
space).  An address_space should be able to do everything without
requiring a struct file.  Granted, this raises the problem that Al pointed
out about signing RPC requests, but if a copy of the authentication token
is made at open() time into the lower layer for the filesystem, this
problem goes away.
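
Something along these lines, say -- a hedged sketch with made-up names (the
real credential type and where exactly it lives would be up to the
filesystem):

    struct myfs_cred {
        uid_t uid;
        gid_t gid;              /* or an opaque RPC signing token */
    };

    struct myfs_inode_info {
        struct myfs_cred rpc_cred;
    };

    extern struct myfs_inode_info *myfs_i(struct inode *inode);

    static int myfs_open(struct inode *inode, struct file *file)
    {
        struct myfs_inode_info *mi = myfs_i(inode);

        /* Copy, don't reference: the struct file may be long gone by
         * the time the delayed writes actually go out. */
        mi->rpc_cred.uid = current->fsuid;
        mi->rpc_cred.gid = current->fsgid;
        return 0;
    }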

This is important because you can't have a struct file when getting calls
from the page reclaim action in shrink_mmap.

> > Inode data pages are per-inode, not per-dentry or per-file-struct.
> 
>   Frankly, inode data pages are file pages, because it is userspace files we
> care of. Nothing more, nothing less. 

No, they are not.  address_spaces are a generic way of caching pages of a
vm object.  Files happen to fit nicely into vm objects.

-ben




Re: fs changes in 2.3

2000-05-01 Thread Roman V. Shaposhnick

On Sun, Apr 30, 2000 at 10:09:45PM -0700, Hans Reiser wrote:
> I think that improving support from folks who change VFS code for folks who are
> affected is needed.

  Hans, what are you talking about?  Are you talking about a large solution
provider who should change its attitude and be more customer-friendly, or
are you talking about people who spend their *own* time hacking on the
sources?
 
> I don't much care for the screaming at people who don't track your changes by

  Again, Hans, what. are. you. talking. about. ? 

> reading ext2 code updates development model that is currently in place.  
> (E.g. the amiga FS emails seen on this list)
> Perhaps a set of comments in the VFS code saying here is what you do to
> interface with me would be better.
> 
> Perhaps in 2.5 the ReiserFS team will start contributing towards improving VFS
 
  I guess you were unable to do this for 2.3 due to some fundamental reason?
If not, just do it.  Do not make proposals -- just do it.

> in this direction as we start trying to move linux VFS towards something richer 
> in functionality.:)

  Oh, no! "richer in functionality" -- where are we?  At a trade show?
 
> 
> VFS code needs to be maintained by persons with supportive personalities.

  Hans, I do not want to be unpleasant, but you behave like a second-level
manager who has not been able to get to the first level for quite a long time.
Stop ranting.  Read the sources.  Write good code.  Discuss reasonable things.
And please, let us know what the message of this message was.

Roman.



Re: new VFS method sync_page and stacking

2000-05-01 Thread Alexander Viro



On Mon, 1 May 2000, Roman V. Shaposhnick wrote:

> 2. at the same time can work with stackable or derived ( in C++
>terminology ) host objects like file->dentry->inode->phis.

These are _not_ derived in C++ sense. Sorry.

> > Inode data pages are per-inode, not per-dentry or per-file-struct.
> 
>   Frankly, inode data pages are file pages, because it is userspace files we
> care of. Nothing more, nothing less. 

You've missed the point here. We cache the data on the client
side. _All_ openers share that cache. IOW, we have a chance that data
submitted by one of them will be sent with credentials of another. Nothing
to do here.




Re: new VFS method sync_page and stacking

2000-05-01 Thread Roman V. Shaposhnick

On Mon, May 01, 2000 at 01:35:52PM +0100, Steve Dodd wrote:
> On Mon, May 01, 2000 at 01:41:43AM +0400, Roman V. Shaposhnick wrote:
> > On Sun, Apr 30, 2000 at 03:28:18PM +0100, Steve Dodd wrote:
> 
> > > But an address_space is (or could be) a completely generic cache. It might
> > > never be associated with an inode, let alone a dentry or file structure.
> > 
> >   Ok, ok, hold on, it is filemap.c where all this stuff is defined, I guess
> > the name gives a hint about that it definitely is associated with some kind 
> > of file machinery. But I understand what you mean. See my comments below.
> 
> OK, more precisely, an address_space could (and IMO *should*) be a generic
> /page/ cache. Anything that's caching multiple pages of data indexed by an
> offset could use it. Then the page cache can sync pages, steal pages, etc.,
> etc., without having to know what the pages are being used for. At the moment,
> it's only used for inode data pages, but with some small changes it can be
> much more useful - and I believe this was Al's original intention.

  I see what you mean. And I completely agree with only one exception: we
should not use and we should not think of address_space as a generic cache
till we implement the interface for address_space_operations that:
 
1. can work with *any* type of host object
2. at the same time can work with stackable or derived ( in C++
   terminology ) host objects like file->dentry->inode->phis.
3. and has a reasonable and kludge-free specification.

  I agree that providing such interface to the address_space will simplify 
things a lot, but for now I see no evidence of somebody doing this. 

 As for Al, I guess only he can tell us what he is doing.  But he keeps
silent.

> > 
> > > For example, I've got some experimental NTFS code which caches all metadata
> > > in the page cache using the address_space stuff.
> [..]
> 
> >   IMHO, you are going wrong direction here. My point is that sometimes  
> > we would like to see address_space stuff as generic cache but it is not.
> 
> I disagree; at the moment the interface is not generic, but my belief is
> that it can be made so with only small changes to existing code. However,
> I haven't looked at all existing code yet. NFS is doing something odd, and I
> also should look at your code if that's possible (is it available somewhere?)
 
  No, but I hope you will be able to see it in 2.5.  BTW, if you would like
to see some parts that you are interested in, I can send you an early pre-alpha.

> > Or being more precise it is a generic cache from the generic perspective,
> > but when you try to use generic functions from filemap.c you are bound to
> > treat this beast as a file cache.
> 
> The page cache at the moment is mostly used for caching inode data, hence
> the name "filemap.c".

   Steve, please read my previous e-mail about what I consider a file cache and
what a generic cache.

> [..]
> >   Thus my opinion is that address_space_operations should remain 
> > file-oriented ( and if there are no good contras take the first argument 
> > of "struct file *" type ). At the same time it is possible to have completely
> > different set of methods around  the same address_space stuff, but from my 
> > point of view this story has nothing in common with how an *existing* 
> > file-oriented interface should work.
> 
> Hmm. Take ->readpage() for example. It is used to fill a page which "belongs"
> to an object. That object is referenced in page->mapping->host. For inode
> data, the host is the inode structure. When should readpage() ever need to
> see anything other than the object to which the pages belong? It doesn't make
> sense (to me). 

   I disagree, mainly because sometimes it is hard to tell where an object
"begins" and where it "ends".  Complex objects are often derived from simple
ones, and it is very common practice to use the highest level available
for the sake of optimization.  E.g., in Linux a userspace file is built around
a dentry, which in turn is built around an inode, etc.  Can we say that a file
does not include an inode directly?  Yes.  Can we say that a file does not
include an inode at all?  No.

 Let me show you an example.  Consider that:
   1. you have a network-based filesystem with a state-oriented protocol.
   2. to do something with a file you need a special pointer to it called
  file_id.
   3. before reading/writing to/from a file you must open its file_id
  for reading, writing or both.  After you open the file_id you can only
  read/write from/to it; you cannot do anything more.

 I guess you will agree with me that the best place for the "opened" file_id
is file->private_data?  OK.  Now the question is: how can I access the opened
file_id if all the methods (readpage, writepage, prepare_write and commit_write)
get a first argument of type inode?
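
  To make the shape of the problem concrete, here is a sketch with invented
names (it assumes an inode-only prototype, as described above; these are not
the real declarations):

    struct myfs_handle { unsigned long file_id; int mode; };

    struct myfs_inode_info {
        struct myfs_handle clean;       /* per-inode, shared by all openers */
    };

    extern struct myfs_inode_info *myfs_i(struct inode *inode);
    extern int myfs_read_via_handle(struct myfs_handle *h, struct page *page);

    static int myfs_readpage(struct inode *inode, struct page *page)
    {
        /* No struct file here, so file->private_data -- the "opened"
         * file_id -- is simply out of reach; all we can use is the
         * shared per-inode handle. */
        return myfs_read_via_handle(&myfs_i(inode)->clean, page);
    }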

  Now imagine that we have an interface with all the properties described above;
then it would be up to each method to decide how to handle th

Re: fs changes in 2.3

2000-05-01 Thread Peter Schneider-Kamp

Hans Reiser wrote:
> 
> I think that improving support from folks who change VFS code
> for folks who are affected is needed.

Hei Hans!

I second that.  I had to stop maintaining the steganographic file
system around 2.3.7 because I did not have that much time to
find out where my fs was "broken" and needed to be "fixed".

I think the VFS needs better documentation.  Correct me if I am
mistaken in this matter, but I cannot find anything that is
nearly up to date, and I don't want to read and understand the
sources of two or three file systems I do not have any strong
affection for.

One should be able to interface to the VFS without having to
know or learn intimacies of the second extended file system.

Peter



Re: new VFS method sync_page and stacking

2000-05-01 Thread Alexander Viro



On Mon, 1 May 2000, Steve Dodd wrote:

> On Mon, May 01, 2000 at 01:41:43AM +0400, Roman V. Shaposhnick wrote:
> > On Sun, Apr 30, 2000 at 03:28:18PM +0100, Steve Dodd wrote:
> 
> > > But an address_space is (or could be) a completely generic cache. It might
> > > never be associated with an inode, let alone a dentry or file structure.
> > 
> >   Ok, ok, hold on, it is filemap.c where all this stuff is defined, I guess
> > the name gives a hint about that it definitely is associated with some kind 
> > of file machinery. But I understand what you mean. See my comments below.
> 
> OK, more precisely, an address_space could (and IMO *should*) be a generic
> /page/ cache. Anything that's caching multiple pages of data indexed by an
> offset could use it. Then the page cache can sync pages, steal pages, etc.,
> etc., without having to know what the pages are being used for. At the moment,
> it's only used for inode data pages, but with some small changes it can be
> much more useful - and I believe this was Al's original intention.

It was and it still is. However, completely losing the damn struct file
may be tricky.

Theory: different clients may want to share the cache for a remote file.
They use some authentication to sign the RPC requests, i.e. you need to
have some token passed to the methods.  struct file * is bogus here - it's
a void * as far as the interface cares, and it should be ignored by almost
every address_space.

generic_file_write() has every reason to expect that the address_space it
deals with accepts a file as a token (if it needs the thing at all), but
that's generic_file_write().  mm/filemap.c contains a lot of stuff that is
completely generic _and_ some file-related pieces.  Yes, it needs further
cleanup and separation.

> > > For example, I've got some experimental NTFS code which caches all metadata
> > > in the page cache using the address_space stuff.
> [..]
> 
> >   IMHO, you are going wrong direction here. My point is that sometimes  
> > we would like to see address_space stuff as generic cache but it is not.
> 
> I disagree; at the moment the interface is not generic, but my belief is
> that it can be made so with only small changes to existing code. However,
> I haven't looked at all existing code yet. NFS is doing something odd, and I
> also should look at your code if that's possible (is it available somewhere?)

RPC signing...




Re: Multiple devfs mounts

2000-05-01 Thread Alexander Viro




On Mon, 1 May 2000, Richard Gooch wrote:

> Alexander Viro writes:
> > 
> > 
> > 
> > On Mon, 1 May 2000, Richard Gooch wrote:
> > 
> > > Eric W. Biederman writes:
> > > > Richard Gooch <[EMAIL PROTECTED]> writes:
> > > > 
> > > > >   Hi, Al. You've previously stated that you consider the multiple
> > > > > mount feature of devfs broken. I agree that there are some races in
> > > > > there. However, I'm not clear on whether you're saying that the entire
> > > > > concept is broken, or that it can be fixed with appropriate locking.
> > > > > I've asked this before, but haven't had a response.
> > > > 
> > > > Last I saw it was his complaint that you varied what you
> > > > showed at different mount points, and that doing that all in 
> > > > one dcache tree was fundamentally broken.
> > > 
> > > But it's not one dcache tree: there is a separate dcache tree for each
> > > mount of devfs. So I don't understand that complaint.
> > 
> > There is a lot of places where we do serialization using semaphores in
> > struct inode. You don't duplicate it all. Think what happens if two
> > instances operate on the same directory. You've got no locking here.
> 
> Correct, I have no locking now. But adding locking is on my list of
> things to do. So, for example, I could put a semaphore in my "struct
> directory_type" (amongst other places). That would serialise changes.
> 
> I get the impression you think no amount of locking added to devfs
> will solve the races, but I don't see why that would be the case.
> 
> Or are you just saying that it would be a lot of work to get all the
> locking correct? If that's what bothers you, don't worry. That's my
> job ;-)

There are dragons. It _is_ tricky and ordering is going to be a bitch
to deal with - just look what do_rename()/vfs_rename() have to do and
consider the effects of second system of locks interacting with them.
Besides, basic assumption is that ->i_sem on parent blocks any changes
of dentry status (coming from other threads, that is). Making your
revalidate() deal with that nicely... Not an easy thing. And it will
be very vulnerable to pretty subtle changes in VFS, ones that would
never be visible to any other fs. I'm not saying that it can't be done -
just that it will be either full of holes _or_ hell to maintain. Essentially,
the code will be so tightly bound to VFS details that for all practical
purposes it _will_ be a part of VFS...

> > > > > If you feel that it's fundamentally impossible to mount a FS multiple
> > > > > times, please explain your reasoning.
> > > > 
> > > > At this point it would make sense to just use the generic multiple
> > > > mount features in the VFS that Alexander has been putting in.
> > > 
> > > The generic multi-mount patch is good, but it doesn't solve the
> > > particular problem of mounting with selective exposure.
> > 
> > There is one case when you can safely have multiple trees - when
> > directories in each tree are read-only. If you union-mount such
> > trees you get some selective exposure.
> 
> Yes, I've been thinking about this approach. I'll like to table it for
> now and sort the locking issue out first.
> 
> BTW: when are we likely to have union mounts available?

Umm... Well, let's see. I've got lookup code switched to the new linkage
(as of pre7-1) and the only major offend^Wuser of old linkage is autofs4.
Fixable, and I know how to fix it (knfsd is another, but that one is utterly
trivial). So the next steps are
* remove the last traces of ->d_mounts/->d_covers
* remove the checks preventing multiple mounts (they are there to
avoid screwing the old linkage)
* add FS_SINGLE
* add phony superblocks for pipe and socket inodes.
* switch shmfs and possibly devpts to FS_SINGLE stuff.
* union-mounts.
It's not that far, considering that at the beginning of April the list was
several times longer.

Keep in mind that userland may need to be taught how to deal with getdents()
returning duplicates - there is no reasonable way to serve that in the kernel.




Re: new VFS method sync_page and stacking

2000-05-01 Thread Roman V. Shaposhnick

On Sun, Apr 30, 2000 at 06:40:40PM +0100, Steve Dodd wrote:

[ ... ]

> I'd like to see the address_space methods /lose/ the struct file /
> struct dentry pointer, but it may be there are situations which require
> it.

  And can you suggest a viable alternative that can be truly generic?
You've told us your needs and I've told you mine.  Your filesystem is local and
mine is network-based.  We *need* different address_space_operations, but
I feel very uncomfortable when somebody tries to speculate on this subject
without reasonable proposals.  For example, it cost me some time to resurrect
struct file * as the first argument of 'writepage', and I hope I will be able
to cast that kind of spell on 'readpage' too.

> I presume it's been added to replace explicit tq_disk runs, which will allow
> finer-grained syncing as described above, and also to ensure correctness in
> nfs and other similar filesystems.

  From my point of view, this method is like a sequence point in C
expressions.  NFS has a good description of what this method should do.
Finally, I guess we are speaking about the same thing, only using different
words.

Thanks,
Roman.



Re: new VFS method sync_page and stacking

2000-05-01 Thread Roman V. Shaposhnick

On Sun, Apr 30, 2000 at 05:58:22PM -0400, Erez Zadok wrote:
> In message <[EMAIL PROTECTED]>, "Roman V. Shaposhnick" writes:
> > On Sun, Apr 30, 2000 at 03:28:18PM +0100, Steve Dodd wrote:
> [...]
> > > But an address_space is (or could be) a completely generic cache. It
> > > might never be associated with an inode, let alone a dentry or file
> > > structure.
> 
> [...]
> >   Thus my opinion is that address_space_operations should remain 
> > file-oriented ( and if there are no good contras take the first argument 
> > of "struct file *" type ). At the same time it is possible to have completely
> > different set of methods around  the same address_space stuff, but from my 
> > point of view this story has nothing in common with how an *existing* 
> > file-oriented interface should work.
> >   
> > Thanks, 
> > Roman.
> 
> If you look at how various address_space ops are called, you'll see enough
> evidence of an attempt to make this interface both a file-based interface
> and a generic cache one (well, at least as far as I understood the code):

  What do you mean by "generic cache"?  From what I see in the kernel code I have
a feeling that this kind of evidence is caused by an abuse of the interface.
  Why?  IMHO, because:

 1. In UNIX everything is a file.  Thus we need just one type of
cache -- a cache for files that can be cached.
 2. A file is not an atomic beast, and there are four kinds of objects
that a userspace file brings into existence:
 file->dentry->inode->[some physical entity].
 3. Different kernel subsystems work on different layers of this
picture, and every relation in it is many-to-one.

Thus, considering 1, 2 and 3, we can clearly see that the methods of
address_space_operations should be polymorphic in the C++ sense.  Depending on
which subsystem calls them, they should accept the right type of parameters; or,
putting it another way, they should know at what level of the picture from
para 2 they are supposed to work.  Do you agree with that?  If yes, then
the question of how to implement that kind of polymorphism is a simple
matter of coding style.  Yes?
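
  A sketch of the kind of polymorphism I mean, with simplified, made-up
structures (not the real 2.3 definitions):

    struct page;                        /* kernel page descriptor */

    struct my_a_ops {
        int (*readpage)(void *host, struct page *page);
    };

    struct my_address_space {
        void            *host;          /* inode, dentry, or anything else */
        struct my_a_ops *a_ops;         /* the only code that knows what it is */
    };

    static int cache_readpage(struct my_address_space *mapping, struct page *page)
    {
        /* The generic caller never interprets host; the method does,
         * at whatever level of para 2 it was written for. */
        return mapping->a_ops->readpage(mapping->host, page);
    }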


> (1) generic_file_write (mm/filemap.c) can call ->commit_write with a normal
> non-NULL file.
> (2) block_symlink (fs/buffer.c) calls ->commit_write with NULL for the file
> arg.

   But that's only one example.  And moreover it proves that 'block_symlink',
the only function that uses address_space_operations directly,
suffers from the problem I've mentioned above.  Let's look at a typical use case:

 fs/ext2/namei.c:

   ext2_symlink(struct inode *dir, struct dentry *dentry, const char *symname) {
   ...
   err = block_symlink(inode, symname, l);
   ...
   }

What we have here is a subsystem that works at the inode->... level -- that's it.
It is not an example of generic caching, but just an example of the fact that we
cannot do without a polymorphic interface.

> So perhaps to satisfy the various needs, all address_space ops should be
  
  Steve, Erez, please show me an example of where you need generic caching
that fits into the address_space infrastructure.  I can agree that we do need
some caches, but they are not generic at all.  For example, in every
network-based filesystem (nfs, smbfs, ncpfs, mine) there is a directory cache.
But all of them use the address_space infrastructure with their own methods.
See nfs/dir.c and smbfs/cache.c for amusement.

> passed a struct file which may be NULL; the individual f/s will have to
> check for it being NULL and deal with it.  (My stacking code already treats
> commit_write this way.)

   Don't get me wrong, but my strong persuasion here is that passing NULLs
when we cannot pass a meaningful value is yet another example of a huge
abuse of the interface.
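
   For reference, the convention Erez describes would look something like the
sketch below (invented function names) -- and that is exactly the pattern I
consider an abuse:

    extern int myfs_commit_as_opener(struct file *, struct inode *,
                                     struct page *, unsigned, unsigned);
    extern int myfs_commit_anonymous(struct inode *, struct page *,
                                     unsigned, unsigned);

    static int myfs_commit_write(struct file *file, struct page *page,
                                 unsigned from, unsigned to)
    {
        struct inode *inode = page->mapping->host;

        if (file)
            /* generic_file_write(): a real opener, so per-open state
             * (credentials, handles) is available */
            return myfs_commit_as_opener(file, inode, page, from, to);

        /* block_symlink() and friends: no opener at all, fall back
         * to per-inode state only */
        return myfs_commit_anonymous(inode, page, from, to);
    }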

Thanks,
Roman.



Re: Multiple devfs mounts

2000-05-01 Thread Richard Gooch

Alexander Viro writes:
> 
> 
> 
> On Mon, 1 May 2000, Richard Gooch wrote:
> 
> > Eric W. Biederman writes:
> > > Richard Gooch <[EMAIL PROTECTED]> writes:
> > > 
> > > >   Hi, Al. You've previously stated that you consider the multiple
> > > > mount feature of devfs broken. I agree that there are some races in
> > > > there. However, I'm not clear on whether you're saying that the entire
> > > > concept is broken, or that it can be fixed with appropriate locking.
> > > > I've asked this before, but haven't had a response.
> > > 
> > > Last I saw it was his complaint that you varied what you
> > > showed at different mount points, and that doing that all in 
> > > one dcache tree was fundamentally broken.
> > 
> > But it's not one dcache tree: there is a separate dcache tree for each
> > mount of devfs. So I don't understand that complaint.
> 
> There are a lot of places where we do serialization using semaphores in
> struct inode. You don't duplicate it all. Think what happens if two
> instances operate on the same directory. You've got no locking here.

Correct, I have no locking now. But adding locking is on my list of
things to do. So, for example, I could put a semaphore in my "struct
directory_type" (amongst other places). That would serialise changes.

I get the impression you think no amount of locking added to devfs
will solve the races, but I don't see why that would be the case.

Or are you just saying that it would be a lot of work to get all the
locking correct? If that's what bothers you, don't worry. That's my
job ;-)

> > > > If you feel that it's fundamentally impossible to mount a FS multiple
> > > > times, please explain your reasoning.
> > > 
> > > At this point it would make sense to just use the generic multiple
> > > mount features in the VFS that Alexander has been putting in.
> > 
> > The generic multi-mount patch is good, but it doesn't solve the
> > particular problem of mounting with selective exposure.
> 
> There is one case when you can safely have multiple trees - when
> directories in each tree are read-only. If you union-mount such
> trees you get some selective exposure.

Yes, I've been thinking about this approach. I'll like to table it for
now and sort the locking issue out first.

BTW: when are we likely to have union mounts available?

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: ext2 feature request

2000-05-01 Thread Stephen C. Tweedie

Hi,

On Fri, Apr 28, 2000 at 08:09:21PM +1000, Andrew Clausen wrote:
> 
> Is it possible to have a gap between the super-block and the
> start of group 0's metadata?

Yes.  It's called the "s_first_data_block" field in the ext2
superblock, and lets you offset the data zone from the start
of the filesystem.

The mke2fs code doesn't currently set up superblocks which use
this, but the kernel support should already work.

Ted, has anybody actually used/tested this code yet, to your
knowledge?

--Stephen



Re: Multiple devfs mounts

2000-05-01 Thread Alexander Viro




On Mon, 1 May 2000, Richard Gooch wrote:

> Eric W. Biederman writes:
> > Richard Gooch <[EMAIL PROTECTED]> writes:
> > 
> > >   Hi, Al. You've previously stated that you consider the multiple
> > > mount feature of devfs broken. I agree that there are some races in
> > > there. However, I'm not clear on whether you're saying that the entire
> > > concept is broken, or that it can be fixed with appropriate locking.
> > > I've asked this before, but haven't had a response.
> > 
> > Last I saw it was his complaint that you varied what you
> > showed at different mount points, and that doing that all in 
> > one dcache tree was fundamentally broken.
> 
> But it's not one dcache tree: there is a separate dcache tree for each
> mount of devfs. So I don't understand that complaint.

There are a lot of places where we do serialization using semaphores in
struct inode. You don't duplicate it all. Think what happens if two
instances operate on the same directory. You've got no locking here.

> > > If you feel that it's fundamentally impossible to mount a FS multiple
> > > times, please explain your reasoning.
> > 
> > At this point it would make sense to just use the generic multiple
> > mount features in the VFS that Alexander has been putting in.
> 
> The generic multi-mount patch is good, but it doesn't solve the
> particular problem of mounting with selective exposure.

There is one case when you can safely have multiple trees - when directories
in each tree are read-only. If you union-mount such trees you get some
selective exposure.




Re: Multiple devfs mounts

2000-05-01 Thread Richard Gooch

Eric W. Biederman writes:
> Richard Gooch <[EMAIL PROTECTED]> writes:
> 
> >   Hi, Al. You've previously stated that you consider the multiple
> > mount feature of devfs broken. I agree that there are some races in
> > there. However, I'm not clear on whether you're saying that the entire
> > concept is broken, or that it can be fixed with appropriate locking.
> > I've asked this before, but haven't had a response.
> 
> Last I saw it was his complaint that you varied what you
> showed at different mount points, and that doing that all in 
> one dcache tree was fundamentally broken.

But it's not one dcache tree: there is a separate dcache tree for each
mount of devfs. So I don't understand that complaint.

> > If you feel that it's fundamentally impossible to mount a FS multiple
> > times, please explain your reasoning.
> 
> At this point it would make sense to just use the generic multiple
> mount features in the VFS that Alexander has been putting in.

The generic multi-mount patch is good, but it doesn't solve the
particular problem of mounting with selective exposure.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: fs changes in 2.3

2000-05-01 Thread Hans Reiser

I think that improving support from folks who change VFS code for folks who are
affected is needed.

I don't much care for the "scream at people who don't track your changes by
reading ext2 code updates" development model that is currently in place.
(E.g. the Amiga FS emails seen on this list.)
Perhaps a set of comments in the VFS code saying "here is what you do to
interface with me" would be better.

Perhaps in 2.5 the ReiserFS team will start contributing towards improving the
VFS in this direction as we start trying to move the Linux VFS towards something
richer in functionality. :)

VFS code needs to be maintained by persons with supportive personalities.

Hans



Re: new VFS method sync_page and stacking

2000-05-01 Thread Erez Zadok

In message <[EMAIL PROTECTED]>, Steve Dodd writes:
> On Sun, Apr 30, 2000 at 04:46:37AM -0400, Erez Zadok wrote:
> 
> > Background: my stacking code for linux is minimal.  I only stack on
> > things I absolutely have to.  By "stack on" I mean that I save a
> > link/pointer to a lower-level object in the private data field of an
> > upper-level object.  I do so for struct file, inode, dentry, etc.  But I
> > do NOT stack on pages.  Doing so would complicate stacking considerably.
> > So far I was able to avoid this b/c every function that deals with pages
> > also passes a struct file/dentry to it so I can find the correct lower
> > page.
> 
> You shouldn't need to "stack on" pages anyway, I wouldn't have thought.
> For each page you can reference mapping->host, which should point to the
> hosting structure (at the moment always an inode, but this may change).
> 
> > The new method, sync_page() is only passed a struct page.  So I cannot
> > stack on it!  If I have to stack on it, I'll have to either
> > 
> > (1) complicate my stacking code considerably by stacking on pages.  This is
> > impossible for my stackable compression file system, b/c the mapping of
> > upper and lower pages is not 1-to-1.
> 
> Why can your sync_page implementation not grab the inode from mapping->host
> and then call sync_page on the underlying fs' page(s) that hold the data?

I can, and I do (at least now :-).  I tried it last night and so far it seems
to work just fine.
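
Roughly like this -- a sketch with invented names, assuming a 1-to-1
upper/lower page mapping (which the compression file system does not have),
and with the page lookup helper standing in for whatever the stacking layer
really uses:

    extern struct inode *wrapfs_lower_inode(struct inode *upper);

    static int wrapfs_sync_page(struct page *page)
    {
        struct inode *upper = page->mapping->host;
        struct inode *lower = wrapfs_lower_inode(upper);
        struct address_space *lmap = lower->i_mapping;
        struct page *lpage;
        int err = 0;

        lpage = find_get_page(lmap, page->index);       /* assumed helper */
        if (lpage) {
            if (lmap->a_ops->sync_page)
                err = lmap->a_ops->sync_page(lpage);
            page_cache_release(lpage);
        }
        return err;
    }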

> > (2) change the kernel so that every instance of sync_page is passed the
> > corresponding struct file.  This isn't pretty either.
> 
> > I'd like to see the address_space methods /lose/ the struct file /
> struct dentry pointer, but it may be there are situations which require
> it.

I took a closer look at my address_space ops for stacking.  We don't do
anything special with the struct file/dentry that we get.  We just pass
those along (or their lower unstacked counterparts) to other address_space
ops which require them.  We get corresponding lower pages using the
mapping->host inode.

I also agree that pages should be associated with the inode, not the
file/dentry.

So I'm now leaning more towards losing the struct file/dentry from the
address_space ops.  Furthermore, since the address_space structure showed up
relatively recently, we might consider cleaning up this API before 2.4.  I
believe my stacking code would work fine w/o these struct file/dentry being
passed around (Ion, can you verify this please?)

Thanks for the info Steve.

Erez.



Re: Multiple devfs mounts

2000-05-01 Thread Eric W. Biederman

Richard Gooch <[EMAIL PROTECTED]> writes:

>   Hi, Al. You've previously stated that you consider the multiple
> mount feature of devfs broken. I agree that there are some races in
> there. However, I'm not clear on whether you're saying that the entire
> concept is broken, or that it can be fixed with appropriate locking.
> I've asked this before, but haven't had a response.

Last I saw it was his complaint that you varied what you
showed at different mount points, and that doing that all in 
one dcache tree was fundamentally broken.

> 
> If you feel that it's fundamentally impossible to mount a FS multiple
> times, please explain your reasoning.

At this point it would make sense to just use the generic multiple
mount features in the VFS that Alexander has been putting in.

Eric



Re: new VFS method sync_page and stacking

2000-05-01 Thread Steve Dodd

On Mon, May 01, 2000 at 01:41:43AM +0400, Roman V. Shaposhnick wrote:
> On Sun, Apr 30, 2000 at 03:28:18PM +0100, Steve Dodd wrote:

> > But an address_space is (or could be) a completely generic cache. It might
> > never be associated with an inode, let alone a dentry or file structure.
> 
>   Ok, ok, hold on, it is filemap.c where all this stuff is defined, I guess
> the name gives a hint about that it definitely is associated with some kind 
> of file machinery. But I understand what you mean. See my comments below.

OK, more precisely, an address_space could (and IMO *should*) be a generic
/page/ cache. Anything that's caching multiple pages of data indexed by an
offset could use it. Then the page cache can sync pages, steal pages, etc.,
etc., without having to know what the pages are being used for. At the moment,
it's only used for inode data pages, but with some small changes it can be
much more useful - and I believe this was Al's original intention.

> 
> > For example, I've got some experimental NTFS code which caches all metadata
> > in the page cache using the address_space stuff.
[..]

>   IMHO, you are going wrong direction here. My point is that sometimes  
> we would like to see address_space stuff as generic cache but it is not.

I disagree; at the moment the interface is not generic, but my belief is
that it can be made so with only small changes to existing code. However,
I haven't looked at all existing code yet. NFS is doing something odd, and I
also should look at your code if that's possible (is it available somewhere?)

> Or being more precise it is a generic cache from the generic perspective,
> but when you try to use generic functions from filemap.c you are bound to
> treat this beast as a file cache.

The page cache at the moment is mostly used for caching inode data, hence
the name "filemap.c".

[..]
>   Thus my opinion is that address_space_operations should remain 
> file-oriented ( and if there are no good contras take the first argument 
> of "struct file *" type ). At the same time it is possible to have completely
> different set of methods around  the same address_space stuff, but from my 
> point of view this story has nothing in common with how an *existing* 
> file-oriented interface should work.

Hmm. Take ->readpage() for example. It is used to fill a page which "belongs"
to an object. That object is referenced in page->mapping->host. For inode
data, the host is the inode structure. When should readpage() ever need to
see anything other than the object to which the pages belong? It doesn't make
sense (to me). If you think you need to access other data, it should either
be accessible through the host object, or you have used the wrong host object.

Inode data pages are per-inode, not per-dentry or per-file-struct.




Re: fs changes in 2.3

2000-05-01 Thread Steve Dodd

On Sun, Apr 30, 2000 at 10:59:57PM -0700, Ani Joshi wrote:

> hello, can someone point me to a rough list (if there is one) of all the
> things that need to be changed in each fs driver for 2.3?

Not off the top of my head; one thing that might be instructive is to compare
the implementations of a couple of "simple" filesystems (minix, romfs, ext2)
in 2.2 and 2.3.

> i'm not subscribed to this list so if this was posted before, please point
> me to a list archive if there is one.

Try <http://www.kernelnotes.org/lnxlists> (from memory, that may not be
exactly right); anything from Al Viro with [RFC] or similar in the subject
is likely to be relevant.



Re: new VFS method sync_page and stacking

2000-05-01 Thread Steve Dodd

On Sun, Apr 30, 2000 at 03:57:26PM -0400, Erez Zadok wrote:

> It sounds like different people have possibly conflicting needs.  I think
> any major changes should wait for 2.5.

Almost certainly, though there is an argument for cleaning up these APIs
now (before we go to 2.4) so that we don't have to change them again too
much in 2.5/6. OTOH, if it's going to cause problems for existing code (nfs,
and external modules like the stacking stuff) then we should wait.

> I would also suggest that such
> significant VFS changes be discussed on this list so we can ensure that we
> can all get what we need out of the VFS.  Thanks.

Of course :)



Re: fs changes in 2.3

2000-05-01 Thread Amit S. Kale

On Mon, 01 May 2000, Ani Joshi wrote:
> hello, can someone point me to a rough list (if there is one) of all the
> things that need to be changed in each fs driver for 2.3?  i'm not subscribed 
> to this list so if this was posted before, please point me to a list archive
> if there is one.

I am afraid the changes in the VFS from 2.2.x to 2.3.99-x are too many to list.
There have been many changes in the VM also.

If you are porting a fs which currently works on 2.2.x, you may want to
look at the 2.2.x and 2.3.99-x ext2 sources.

> 
> 
> thanks,
> 
> ani
> 
> 
> please CC me to any replies
-- 
Amit Kale
Veritas Software ( http://www.veritas.com )



Re: new VFS method sync_page and stacking

2000-05-01 Thread Steve Dodd

On Sun, Apr 30, 2000 at 04:46:37AM -0400, Erez Zadok wrote:

> Background: my stacking code for linux is minimal.  I only stack on things I
> absolutely have to.  By "stack on" I mean that I save a link/pointer to a
> lower-level object in the private data field of an upper-level object.  I do
> so for struct file, inode, dentry, etc.  But I do NOT stack on pages.  Doing
> so would complicate stacking considerably.  So far I was able to avoid this
> b/c every function that deals with pages also passes a struct file/dentry to
> it so I can find the correct lower page.

You shouldn't need to "stack on" pages anyway, I wouldn't have thought.
For each page you can reference mapping->host, which should point to the
hosting structure (at the moment always an inode, but this may change).

> The new method, sync_page() is only passed a struct page.  So I cannot stack
> on it!  If I have to stack on it, I'll have to either
> 
> (1) complicate my stacking code considerably by stacking on pages.  This is
> impossible for my stackable compression file system, b/c the mapping of
> upper and lower pages is not 1-to-1.

Why can your sync_page implementation not grab the inode from mapping->host
and then call sync_page on the underlying fs' page(s) that hold the data?

> (2) change the kernel so that every instance of sync_page is passed the
> corresponding struct file.  This isn't pretty either.

I'd like to see the address_space methods /lose/ the struct file /
struct dentry pointer, but it may be that there are situations which require
it.

> Luckily, sync_page isn't used too much.  Only nfs seems to use it at the
> moment.  All other file systems which define ->sync_page use
> block_sync_page() which is defined as:
> 
> int block_sync_page(struct page *page)
> {
>   run_task_queue(&tq_disk);
>   return 0;
> }
> 
> This is confusing.  Why would block_sync_page ignore the page argument and
> call something else.  The name "block_sync_page" might be misleading.  The
> only thing I can think of is that block_sync_page is a placeholder for a
> time when it would actually do something with the page.

No, this looks OK to me. sync_page seems to be designed to ensure that any
async I/O on a page is complete. For normal "block-mapped" filesystems (which
is what the block_* helpers in buffer.c are for), pages with "in flight" I/O
will have requests on the appropriate block device's queue. Running tq_disk
should ensure these are completed. If tq_disk is ever split up (I believe Jens
Axboe is working on this), block_sync_page will need to look at
page->mapping->host to determine the device, I assume.

Non block-mapped filesystems (e.g. nfs) will need to do something different.

> Anyway, since sync_page appears to be an optional function, I've tried my
> stacking without defining my own ->sync_page.  Preliminary results show it
> seems to work.  However, if at any point I'd have to define ->sync_page page
> and have to call the lower file system's ->sync_page, I'd urge a change in
> the prototype of this method that would make it possible for me to stack
> this operation.
> 
> Also, I don't understand what's ->sync_page for in the first place.  The
> name of the fxn implies it might be something like a commit_write.

I presume it's been added to replace explicit tq_disk runs, which will allow
finer-grained syncing as described above, and also to ensure correctness in
nfs and other similar filesystems.