Re: NFSv4/pNFS possible POSIX I/O API standards

Rob Ross Mon, 04 Dec 2006 17:00:28 -0800

Hi all,

I don't think that the group intended that there be an opendirplus(); rather, readdirplus() would simply be called instead of the usual readdir(). We should clarify that.

Regarding Peter Staubach's comments about no one ever using the readdirplus() call; well, if people weren't performing this workload in the first place, we wouldn't *need* this sort of call! This call is specifically targeted at improving "ls -l" performance on large directories, and Sage has pointed out quite nicely how that might work.

In our case (PVFS), we would essentially perform three phases of communication with the file system for a readdirplus that was obtaining full statistics: first grabbing the directory entries, then obtaining metadata from servers on all objects in bulk, then gathering file sizes in bulk. The reduction in control message traffic is enormous, and the concurrency is much greater than in a readdir()+stat()s workload. We'd never perform this sort of optimization optimistically, as the cost of guessing wrong is just too high. We would want to see the call as a proper VFS operation that we could act upon.

The entire readdirplus() operation wasn't intended to be atomic, and in fact the returned structure has space for an error associated with the stat() on a particular entry, to allow for implementations that stat() subsequently and get an error because the object was removed between when the entry was read out of the directory and when the stat was performed. I think this fits well with what Andreas and others are thinking. We should clarify the description appropriately.
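The per-entry error semantics can be illustrated with a user-space emulation built from existing POSIX calls (opendir(), readdir(), fstatat()). This is a sketch only: struct dirent_plus_sketch and readdirplus_sketch() are hypothetical names, not the structure or call in the proposal, and the loop issues exactly the one-stat-per-entry traffic that a real readdirplus() is meant to avoid.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical result record: only the general shape described above. */
struct dirent_plus_sketch {
    char        name[NAME_MAX + 1]; /* entry name copied from readdir() */
    struct stat st;                 /* meaningful only when stat_err == 0 */
    int         stat_err;           /* 0, or the errno from this entry's stat */
};

/* Fill up to `cap` records from directory `path`.
 * Returns the number of records written, or -1 if the directory
 * cannot be opened. A failed stat on one entry does not fail the call. */
static int readdirplus_sketch(const char *path,
                              struct dirent_plus_sketch *out, int cap)
{
    assert(out != NULL && cap >= 0);

    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    int n = 0;
    struct dirent *de;
    while (n < cap && (de = readdir(dir)) != NULL) {
        struct dirent_plus_sketch *e = &out[n++];
        snprintf(e->name, sizeof e->name, "%s", de->d_name);

        /* The entry may have been removed since readdir() returned it;
         * record that as a per-entry error rather than aborting. */
        if (fstatat(dirfd(dir), de->d_name, &e->st, AT_SYMLINK_NOFOLLOW) == 0) {
            e->stat_err = 0;
        } else {
            e->stat_err = errno;
            memset(&e->st, 0, sizeof e->st);
        }
    }
    closedir(dir);
    return n;
}
```

A caller would check stat_err on each record before trusting st; an entry removed in the window between the directory read and the stat would typically show up with stat_err set to ENOENT.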

I don't think that we have a readdirpluslite() variation documented yet? Gary? It would make a lot of sense. Except that it should probably have a better name...

Regarding Andreas's note that he would prefer the statlite() flags to mean "valid", that makes good sense to me (and would obviously apply to the so-far even more hypothetical readdirpluslite()). I don't think there's a lot of value in returning possibly-inaccurate values?


Thanks everyone,

Rob

Trond Myklebust wrote:

On Mon, 2006-12-04 at 00:32 -0700, Andreas Dilger wrote:

I'm wondering if a corresponding opendirplus() (or similar) would also be appropriate to inform the kernel/filesystem that readdirplus() will follow, and stat information should be gathered/buffered. Or do most implementations wait for the first readdir() before doing any actual work anyway?
I'm not sure what some filesystems might do here.  I suppose NFS has weak
enough cache semantics that it _might_ return stale cached data from the
client in order to fill the readdirplus() data, but it is just as likely
that it ships the whole thing to the server and returns everything in
one shot.  That would imply everything would be at least as up-to-date
as the opendir().


Whether or not the posix committee decides on readdirplus, I propose
that we implement this sort of thing in the kernel via a readdir
equivalent to posix_fadvise(). That can give exactly the barrier
semantics that they are asking for, and only costs 1 extra syscall as
opposed to 2 (opendirplus() and readdirplus()).

Cheers
  Trond



Re: NFSv4/pNFS possible POSIX I/O API standards
