Re: [FSAIO][PATCH 6/8] Enable asynchronous wait page and lock page

2007-01-02 Thread Christoph Hellwig
On Thu, Dec 28, 2006 at 08:17:17PM +0530, Suparna Bhattacharya wrote:
 I am really bad with names :(  I tried using the _wq suffixes earlier and
 that seemed confusing to some, but if no one else objects I'm happy to use
 that. I thought aio_lock_page() might be misleading because it is
 synchronous if a regular wait queue entry is passed in, but again it may not
 be too bad.
 
 What's your preference ? Does anything more intuitive come to mind ?

Being bad at naming seems to be a disease; at least I suffer from it
as well.  I wouldn't mind either the _wq or aio_ naming: _wq describes
the way it's called, and aio_ describes that it's a special case for aio,
similarly to how ->aio_read/->aio_write can be used for synchronous I/O
as well.
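For illustration, a minimal sketch of the kind of primitive being named
(hypothetical code, not the actual patch; is_sync_wait() and -EIOCBRETRY
are from the existing aio retry infrastructure):

	/*
	 * Sketch only: one entry point that is synchronous or asynchronous
	 * depending on the wait-queue entry passed in -- the property that
	 * makes both the _wq and aio_ namings defensible.
	 */
	int lock_page_wq(struct page *page, wait_queue_t *wait)
	{
		if (!TestSetPageLocked(page))
			return 0;		/* uncontended: lock acquired */
		if (is_sync_wait(wait)) {
			__lock_page(page);	/* block, like plain lock_page() */
			return 0;
		}
		/* async caller: assume the retry machinery has registered
		 * the wait-queue entry; report that we would have blocked */
		return -EIOCBRETRY;
	}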



Re: [FSAIO][PATCH 7/8] Filesystem AIO read

2007-01-02 Thread Christoph Hellwig
On Thu, Dec 28, 2006 at 08:48:30PM +0530, Suparna Bhattacharya wrote:
 Yes, we can do that -- how about aio_restarted() as an alternate name ?

Sounds fine to me.

  Plus possible naming updates discussed in the last mail.  Also do we
  really need to pass current->io_wait here?  Isn't the waitqueue in
  the kiocb always guaranteed to be the same?  Now that all pagecache
 
 We don't have the kiocb available to this routine. Using current->io_wait
 avoids the need to pass the iocb down to deeper levels just for the sync vs
 async checks, also allowing such routines to be shared by other code which
 does not use iocbs (e.g. generic_file_sendfile->do_generic_file_read
 ->do_generic_mapping_read) without having to set up dummy iocbs.

We really want to switch sendfile to kiocbs btw, for one thing to
allow an aio_sendfile implementation, and second to make it more
consistent with all the other I/O path code so we can avoid special
cases in the fs code.  So I'm not convinced by that argument.  But again
we don't need to put the io_wait removal into your patchkit.  I'll try
to hack on it once I get a little spare time.



Re: Finding hardlinks

2007-01-02 Thread Pavel Machek
Hi!

> > > > It seems like the posix idea of unique st_dev, st_ino doesn't
> > > > hold water for modern file systems
> > >
> > > are you really sure?
> >
> > Well Jan's example was of Coda that uses 128-bit internal file ids.
> >
> > > and if so, why don't we fix *THAT* instead
> >
> > Hmm, sometimes you can't fix the world, especially if the filesystem
> > is exported over NFS and has a problem with fitting its file IDs uniquely
> > into a 64-bit identifier.
>
> Note, it's pretty easy to fit _anything_ into a 64-bit identifier with
> the use of a good hash function.  The chance of an accidental
> collision is infinitesimally small.  For a set of
>
>          100 files: 0.03%
>    1,000,000 files: 0.03%

I do not think we want to play with probability like this. I mean...
imagine 4G files, 1KB each. That's 4TB disk space, not _completely_
unreasonable, and collision probability is going to be ~100% due to
birthday paradox.
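
For reference, the standard birthday approximation for n files hashed
into a 64-bit space is

	p \approx 1 - e^{-n^2/2^{65}}

and n = 4e9 ~ 2^32 gives p \approx 1 - e^{-1/2} \approx 39% -- not
literally ~100%, but far too likely to bet a backup on.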

You'll still want to back up your 4TB server...

Pavel
-- 
Thanks for all the (sleeping) penguins.


Re: Finding hardlinks

2007-01-02 Thread Miklos Szeredi
> > > > > It seems like the posix idea of unique st_dev, st_ino doesn't
> > > > > hold water for modern file systems
> > > >
> > > > are you really sure?
> > >
> > > Well Jan's example was of Coda that uses 128-bit internal file ids.
> > >
> > > > and if so, why don't we fix *THAT* instead
> > >
> > > Hmm, sometimes you can't fix the world, especially if the filesystem
> > > is exported over NFS and has a problem with fitting its file IDs uniquely
> > > into a 64-bit identifier.
> >
> > Note, it's pretty easy to fit _anything_ into a 64-bit identifier with
> > the use of a good hash function.  The chance of an accidental
> > collision is infinitesimally small.  For a set of
> >
> >          100 files: 0.03%
> >    1,000,000 files: 0.03%
>
> I do not think we want to play with probability like this. I mean...
> imagine 4G files, 1KB each. That's 4TB disk space, not _completely_
> unreasonable, and collision probability is going to be ~100% due to
> birthday paradox.
>
> You'll still want to back up your 4TB server...

Certainly, but tar isn't going to remember all the inode numbers.
Even if you solve the storage requirements (not impossible), it would
have to do (4e9)^2/2 = 8e18 comparisons, which computers don't have
enough CPU power for just yet.

It doesn't matter if there are collisions within the filesystem, as
long as there are no collisions between the set of files an
application is working on at the same time.

Miklos


Re: Finding hardlinks

2007-01-02 Thread Mikulas Patocka



On Tue, 2 Jan 2007, Miklos Szeredi wrote:


> > > > > > It seems like the posix idea of unique st_dev, st_ino doesn't
> > > > > > hold water for modern file systems
> > > > >
> > > > > are you really sure?
> > > >
> > > > Well Jan's example was of Coda that uses 128-bit internal file ids.
> > > >
> > > > > and if so, why don't we fix *THAT* instead
> > > >
> > > > Hmm, sometimes you can't fix the world, especially if the filesystem
> > > > is exported over NFS and has a problem with fitting its file IDs uniquely
> > > > into a 64-bit identifier.
> > >
> > > Note, it's pretty easy to fit _anything_ into a 64-bit identifier with
> > > the use of a good hash function.  The chance of an accidental
> > > collision is infinitesimally small.  For a set of
> > >
> > >          100 files: 0.03%
> > >    1,000,000 files: 0.03%
> >
> > I do not think we want to play with probability like this. I mean...
> > imagine 4G files, 1KB each. That's 4TB disk space, not _completely_
> > unreasonable, and collision probability is going to be ~100% due to
> > birthday paradox.
> >
> > You'll still want to back up your 4TB server...
>
> Certainly, but tar isn't going to remember all the inode numbers.
> Even if you solve the storage requirements (not impossible) it would
> have to do (4e9)^2/2 = 8e18 comparisons, which computers don't have
> enough CPU power for just yet.


It does remember all inode numbers with nlink > 1, and many other tools
remember all directory inode numbers (see my other post on this topic).
Of course it doesn't compare each number against all the others; it uses
hashing.
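
A minimal user-space sketch of that technique (illustrative, not tar's
actual code): remember (st_dev, st_ino) for every file with st_nlink > 1
and report a hardlink when the pair repeats.

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/stat.h>

	struct seen { dev_t dev; ino_t ino; struct seen *next; };
	#define NBUCKETS 65536
	static struct seen *table[NBUCKETS];

	/* return 1 if (dev, ino) was already recorded, else remember it */
	static int seen_before(dev_t dev, ino_t ino)
	{
		unsigned h = (unsigned)(ino ^ dev) % NBUCKETS;
		struct seen *s;

		for (s = table[h]; s; s = s->next)
			if (s->dev == dev && s->ino == ino)
				return 1;
		s = malloc(sizeof(*s));
		s->dev = dev; s->ino = ino;
		s->next = table[h]; table[h] = s;
		return 0;
	}

	void archive_file(const char *path)
	{
		struct stat st;

		if (lstat(path, &st))
			return;
		if (st.st_nlink > 1 && seen_before(st.st_dev, st.st_ino)) {
			printf("%s: hardlink to already-stored file\n", path);
			return;
		}
		/* ... store file contents ... */
	}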



> It doesn't matter if there are collisions within the filesystem, as
> long as there are no collisions between the set of files an
> application is working on at the same time.

--- and in the case of a backup, that set is all files.

> Miklos


Mikulas


RE: [nfsv4] RE: Finding hardlinks

2007-01-02 Thread Trond Myklebust
On Sun, 2006-12-31 at 16:25 -0500, Halevy, Benny wrote:
 Trond Myklebust wrote:
   
  On Thu, 2006-12-28 at 15:07 -0500, Halevy, Benny wrote:
   Mikulas Patocka wrote:
  
   BTW. how does (or how should?) NFS client deal with cache coherency if 
   filehandles for the same file differ?
   
   
   Trond can probably answer this better than me...
   As I read it, currently the nfs client matches both the fileid and the
   filehandle (in nfs_find_actor). This means that different filehandles
   for the same file would result in different inodes :(.
   Strictly following the nfs protocol, comparing only the fileid should
   be enough IF fileids are indeed unique within the filesystem.
    Comparing the filehandle works as a workaround when the exported
    filesystem (or the nfs server) violates that.  From a user standpoint
    I think that this should be configurable, probably per mount point.
  
  Matching files by fileid instead of filehandle is a lot more trouble
  since fileids may be reused after a file has been deleted. Every time
  you look up a file, and get a new filehandle for the same fileid, you
  would at the very least have to do another GETATTR using one of the
  'old' filehandles in order to ensure that the file is the same object as
  the one you have cached. Then there is the issue of what to do when you
  open(), read() or write() to the file: which filehandle do you use, are
  the access permissions the same for all filehandles, ...
  
  All in all, much pain for little or no gain.
 
 See my answer to your previous reply.  It seems like the current
 implementation is in violation of the nfs protocol and the extra pain
 is required.
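
For reference, the matching rule under discussion, modeled loosely on
nfs_find_actor() (simplified sketch, not the exact kernel code): an
inode matches only if both the fileid and the filehandle agree, which
is why a second filehandle for the same file yields a second inode.

	static int nfs_match_inode(struct inode *inode,
				   struct nfs_fh *fh, struct nfs_fattr *fattr)
	{
		if (NFS_FILEID(inode) != fattr->fileid)
			return 0;	/* different fileid: different file */
		if (nfs_compare_fh(NFS_FH(inode), fh) != 0)
			return 0;	/* same fileid, different filehandle */
		return 1;
	}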

...and we should care because...?

Trond



RE: Finding hardlinks

2007-01-02 Thread Trond Myklebust
On Sun, 2006-12-31 at 16:19 -0500, Halevy, Benny wrote:

 Even for NFSv3 (which doesn't have the unique_handles attribute) I think
 that the linux nfs client can do a better job.  If you had a filehandle
 cache that points at inodes you could maintain a many-to-one relationship
 from multiple filehandles into one inode.  When you discover a new filehandle
 you can look up the inode cache for the same fileid and if one is found you
 can do a getattr on the old filehandle (without loss of generality you should
 always use the latest filehandle that was returned for that filesystem object,
 although any filehandle that refers to it can be used).
 If the getattr succeeded then the filehandles refer to the same fs object and
 you can create a new entry in the filehandle cache pointing at that inode.
 Otherwise, if getattr says that the old filehandle is stale, I think you should
 mark the inode as stale and keep it around so that applications can get an
 appropriate error until last close, before you clean up the fh cache from the
 stale filehandles.  A new inode structure should be created for the new
 filehandle.
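
A sketch of the lookup being proposed (all helper names are hypothetical,
for illustration only):

	/* Filehandle cache mapping many filehandles to one inode, keyed
	 * by fileid; old handles are revalidated with a GETATTR. */
	struct inode *fhcache_lookup(struct fh_cache *cache,
				     struct nfs_fh *fh, u64 fileid)
	{
		struct inode *inode = fhcache_find_fh(cache, fh);

		if (inode)
			return inode;			/* known filehandle */

		inode = fhcache_find_fileid(cache, fileid);
		if (inode) {
			if (getattr_ok(inode)) {	/* old fh still valid? */
				fhcache_add(cache, fh, inode);	/* same object */
				return inode;
			}
			mark_inode_stale(inode);	/* keep until last close */
		}
		return new_inode_for_fh(cache, fh, fileid);
	}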

There are, BTW, other reasons why the above is a bad idea: it breaks on
a bunch of well known servers. Look back at the 2.2.x kernels and the
kind of hacks they had in order to deal with crap like the Netapp
'.snapshot' directories which contain files with duplicate fileids that
do not represent hard links, but rather represent previous revisions of
the same file.

That kind of hackery was part of the reason why I ripped out that code.
The other reasons were
 - that you end up playing unnecessary getattr games like the
   above for little gain.
 - the only servers that implemented the above were broken pieces
   of crap that encoded parent directories in the filehandle, and
   which end up breaking anyway under cross-directory renames.
 - the world is filled with non-posix filesystems that frequently
   don't have real fileids.  They are often just generated on the
   fly and can change at the drop of a hat.

Trond



Re: Finding hardlinks

2007-01-02 Thread Mikulas Patocka

On Wed, 3 Jan 2007, Trond Myklebust wrote:


> On Sat, 2006-12-30 at 02:04 +0100, Mikulas Patocka wrote:
> > On Fri, 29 Dec 2006, Trond Myklebust wrote:
> > > On Thu, 2006-12-28 at 19:14 +0100, Mikulas Patocka wrote:
> > > > Why don't you rip off the support for colliding inode number from the
> > > > kernel at all (i.e. remove iget5_locked)?
> > > >
> > > > It's reasonable to have either no support for colliding ino_t or full
> > > > support for that (including syscalls that userspace can use to work with
> > > > such filesystem) --- but I don't see any point in having half-way support
> > > > in kernel as is right now.
> > >
> > > What would ino_t have to do with inode numbers? It is only used as a
> > > hash table lookup. The inode number is set in the ->getattr() callback.
> >
> > The question is: why does the kernel contain the iget5 function that
> > looks up according to a callback, if the filesystem cannot have more
> > than a 64-bit inode identifier?
>
> Huh? The filesystem can have as large a damned identifier as it likes.
> NFSv4 uses 128-byte filehandles, for instance.


But then it needs some other syscall to let applications determine 
hardlinks --- which was the initial topic in this thread.
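
To make the iget5 point concrete, a sketch of how a filesystem with
128-bit file ids would use it (the my_* names and MY_I() are
hypothetical; iget5_locked() itself is the real interface):

	struct my_fid { u64 hi, lo; };

	static int my_test(struct inode *inode, void *data)
	{
		struct my_fid *fid = data;
		/* disambiguate hash collisions on the full 128-bit id */
		return MY_I(inode)->fid.hi == fid->hi &&
		       MY_I(inode)->fid.lo == fid->lo;
	}

	static int my_set(struct inode *inode, void *data)
	{
		MY_I(inode)->fid = *(struct my_fid *)data;
		return 0;
	}

	struct inode *my_iget(struct super_block *sb, struct my_fid *fid)
	{
		/* hash on a folded value; test() compares the full id */
		return iget5_locked(sb, (unsigned long)(fid->lo ^ fid->hi),
				    my_test, my_set, fid);
	}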



> POSIX filesystems are another matter. They can only have 64-bit
> identifiers thanks to the requirement that inode numbers be 64-bit
> unique and permanently stored; however, Linux caters for a whole
> truckload of filesystems which will never fit that label: look at all
> those users of iunique(), for one...


I see them. The bad thing is that many programmers read POSIX, write
programs as if the POSIX specification were true, and these programs
break randomly on non-POSIX filesystems. Each non-POSIX filesystem
invents st_ino on its own, trying to minimize hash collisions, making
the failure even less probable and harder to find.


The current situation is (for example) that cp does stat(), open(),
fstat() and compares st_ino/st_dev --- if they mismatch, it prints an
error and doesn't copy the file --- so if the kernel drops the inode
from its cache between stat() and open() and the filesystem uses
iunique(), cp will fail.
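
In code, the check that trips (simplified; not cp's literal source):

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/stat.h>

	/* Open path and verify it is still the same object that stat()
	 * saw: if st_dev/st_ino changed between stat() and open(), give up. */
	int open_checked(const char *path)
	{
		struct stat st1, st2;
		int fd;

		if (stat(path, &st1) < 0)
			return -1;
		fd = open(path, O_RDONLY);
		if (fd < 0)
			return -1;
		if (fstat(fd, &st2) < 0 ||
		    st1.st_dev != st2.st_dev || st1.st_ino != st2.st_ino) {
			fprintf(stderr, "%s: changed while being copied\n", path);
			close(fd);
			return -1;
		}
		return fd;	/* relies on st_ino being stable across the gap */
	}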


What utilities should the user use on those non-POSIX filesystems, if not 
cp?


Probably some file-handling guidelines should be specified and written
down in Documentation/ as a form of standard that application
programmers can use.


Mikulas


> Trond




Re: [RFC] Heads up on a series of AIO patchsets

2007-01-02 Thread Zach Brown

Sorry for the delay, I'm finally back from the holiday break :)

> (1) The filesystem AIO patchset attempts to address one part of the
> problem, which is to make regular file IO (without O_DIRECT)
> asynchronous (mainly the case of reads of uncached or partially
> cached files, and O_SYNC writes).


One of the properties of the currently implemented EIOCBRETRY aio
path is that ->mm is the only field in current which matches the
submitting task_struct while inside the retry path.


It looks like a retry-based aio write path would be broken because of  
this.  generic_write_checks() could run in the aio thread and get its  
task_struct instead of that of the submitter.  The wrong rlimit will  
be tested and SIGXFSZ won't be raised.  remove_suid() could check the  
capabilities of the aio thread instead of those of the submitter.
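
For reference, the kind of current reference at issue (a paraphrase of
the 2.6-era generic_write_checks() rlimit test, not its exact source):

	/* Consults current's RLIMIT_FSIZE and signals current on
	 * overflow.  Run from an aio worker thread, "current" is the
	 * worker, not the submitting task: wrong limit, wrong signal
	 * recipient. */
	static int check_fsize(loff_t *pos)
	{
		unsigned long limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;

		if (limit != RLIM_INFINITY && *pos >= limit) {
			send_sig(SIGXFSZ, current, 0);	/* the wrong task! */
			return -EFBIG;
		}
		return 0;
	}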


I don't think EIOCBRETRY is the way to go because of this increased  
(and subtle!) complexity.  What are the chances that we would have  
ever found those bugs outside code review?  How do we make sure that  
current references don't sneak back in after having initially audited  
the paths?


Take the io_cmd_epoll_wait patch...

> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for
> native linux aio in a simple manner.

It's simple looking, sure.  This current flipping didn't even occur
to me while throwing the patch together!


But that patch ends up calling ->poll (and poll_table->qproc) and
writing to userspace (so potentially calling ->nopage) from the aio
threads.  Are we sure that none of them will behave surprisingly
because current changed under them?


It might be safe now, but that isn't really the point.  I'd rather we  
didn't have yet one more subtle invariant to audit and maintain.


At the risk of making myself vulnerable to the charge of mentioning  
vapourware, I will admit that I've been working on a (slightly mad)  
implementation of async syscalls.  I've been quiet about it because I  
don't want to whip up complicated discussion without being able to  
show code that works, even if barely.  I mention it now only to make  
it clear that I want to be constructive, not just critical :).


- z


Re: [RFC] Heads up on a series of AIO patchsets

2007-01-02 Thread Kent Overstreet

> > Any details?
>
> Well, one path I tried I couldn't help but post a blog entry about
> for my friends.  I'm not sure it's the direction I'll take with
> linux-kernel, but the fundamentals are there: the api should be the
> syscall interface, and there should be no difference between sync and
> async behaviour.
>
> http://www.zabbo.net/?p=72


Any code you're willing to let people play with? That way I could at
least put together real test cases, and a library to go along with it
as it gets finished.

Another pie-in-the-sky idea:
One thing that's been bugging me lately (working on a 9p server) is
that sendfile is hard to use in practice, because you need packet
headers and such, and they need to go out at the same time.
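
(The usual workaround today is TCP_CORK: cork the socket, write() the
header, sendfile() the body, then uncork so everything leaves together.
A sketch, with error handling trimmed:)

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/sendfile.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int send_with_header(int sock, const void *hdr, size_t hdrlen,
			     int filefd, off_t *off, size_t count)
	{
		int one = 1, zero = 0;

		/* cork: queue the header instead of sending it right away */
		setsockopt(sock, IPPROTO_TCP, TCP_CORK, &one, sizeof(one));
		if (write(sock, hdr, hdrlen) < 0)
			return -1;
		if (sendfile(sock, filefd, off, count) < 0)
			return -1;
		/* uncork: header and file body are flushed together */
		setsockopt(sock, IPPROTO_TCP, TCP_CORK, &zero, sizeof(zero));
		return 0;
	}

This serializes the submitting thread, though, which is exactly what an
atomic batch of async ops would avoid.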

Sendfile listio support would fix this, but it's not a general
solution. What would be really useful is a way to say that a certain
batch of async ops either all succeed or all fail, and happen
atomically; i.e., transactions for syscalls.

Probably even harder to do than general async syscalls, but it'd be
the best thing since sliced bread... and hey, it seems the logical
next step.


Re: [RFC] Heads up on a series of AIO patchsets

2007-01-02 Thread Suparna Bhattacharya
On Tue, Jan 02, 2007 at 03:56:09PM -0800, Zach Brown wrote:
 Sorry for the delay, I'm finally back from the holiday break :)

Welcome back !

 
> > (1) The filesystem AIO patchset attempts to address one part of the
> > problem, which is to make regular file IO (without O_DIRECT)
> > asynchronous (mainly the case of reads of uncached or partially
> > cached files, and O_SYNC writes).
>
> One of the properties of the currently implemented EIOCBRETRY aio
> path is that ->mm is the only field in current which matches the
> submitting task_struct while inside the retry path.

Yes, and that, as I guess you know, is to enable the aio worker thread
to operate on the caller's address space for copy_from/to_user().

The actual io setup and associated checks are expected to have been
handled at submission time.
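
Roughly, the retry-path pattern (modeled on the use_mm()/unuse_mm()
helpers in fs/aio.c; simplified sketch, not the exact code):

	/* The worker adopts the submitter's mm so copy_to/from_user()
	 * resolve against the right address space -- but everything
	 * else in "current" still belongs to the worker thread. */
	static void aio_run_one(struct kiocb *iocb)
	{
		struct mm_struct *mm = iocb->ki_ctx->mm;

		use_mm(mm);		/* current->mm = submitter's mm */
		iocb->ki_retry(iocb);
		unuse_mm(mm);
	}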

 
> It looks like a retry-based aio write path would be broken because of
> this.  generic_write_checks() could run in the aio thread and get its
> task_struct instead of that of the submitter.  The wrong rlimit will
> be tested and SIGXFSZ won't be raised.  remove_suid() could check the
> capabilities of the aio thread instead of those of the submitter.

generic_write_checks() are done in the submission path, not repeated during
retries, so such types of checks are not intended to run in the aio thread.

Did I miss something here?

 
> I don't think EIOCBRETRY is the way to go because of this increased
> (and subtle!) complexity.  What are the chances that we would have
> ever found those bugs outside code review?  How do we make sure that
> current references don't sneak back in after having initially audited
> the paths?

The EIOCBRETRY route is not something that is intended to be used
blindly.  It is just one alternative to implement an aio operation by
splitting up responsibility between the submitter and aio threads,
where the aio threads can run in the caller's address space.

 
> Take the io_cmd_epoll_wait patch...
>
> > issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> > modifications from Jeff Moyer and me) addresses this problem for
> > native linux aio in a simple manner.
>
> It's simple looking, sure.  This current flipping didn't even occur
> to me while throwing the patch together!
 
> But that patch ends up calling ->poll (and poll_table->qproc) and
> writing to userspace (so potentially calling ->nopage) from the aio

Yes of course, but why is that a problem?
The copy_from/to_user/put_user constructs are designed to handle soft
failures, and we are already using the caller's ->mm. Do you see a need
for any additional asserts()?

If there is something that is needed by ->nopage etc. which is not
abstracted out within the ->mm, then we would need to fix that instead,
for correctness anyway, isn't that so?

Now it is possible that there are minor blocking points in the code, and
the effect of these would be to hold up / delay subsequent queued aio
operations, which is an efficiency issue, but not a correctness concern.

> threads.  Are we sure that none of them will behave surprisingly
> because current changed under them?

My take is that we should fix the problems that we see. It is likely that
what manifests relatively more easily with AIO is also a subtle problem
in other cases.

 
> It might be safe now, but that isn't really the point.  I'd rather we
> didn't have yet one more subtle invariant to audit and maintain.
>
> At the risk of making myself vulnerable to the charge of mentioning
> vapourware, I will admit that I've been working on a (slightly mad)
> implementation of async syscalls.  I've been quiet about it because I
> don't want to whip up complicated discussion without being able to
> show code that works, even if barely.  I mention it now only to make
> it clear that I want to be constructive, not just critical :).

That is great and I look forward to it :) I am, however, assuming that
whatever implementation you come up with will have a different interface
from current linux aio -- i.e. a next-generation aio model that will be
easily integratable with kevents etc.

Which takes me back to Ingo's point: let's have the new evolve in
parallel with the old, if we can, and not hold up the patches for POSIX
AIO to start using kernel AIO, or for epoll to integrate with AIO.

OK, I just took a quick look at your blog and I see that you are
basically implementing Linus' microthreads scheduling approach, one
year after we had that discussion. Glad to see that you found a way to
make it workable ... (I'm guessing that you are copying over the part
of the stack in use at the time of every switch, is that correct? At
what point do you do the allocation of the saved stacks? Sorry, I
should hold off on all these questions till your patch comes out.)

Regards
Suparna

 
> - z

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
