Re: O_DIRECT architecture (was Re: info point on linux hdr)

2000-04-18 Thread Steve Lord

 Hi,
 
 On Mon, Apr 17, 2000 at 05:58:48PM -0500, Steve Lord wrote:
  
  O_DIRECT on Linux XFS is still a work in progress, we only have
  direct reads so far. A very basic implementation was made available
  this weekend.
 
 Care to elaborate on how you are doing O_DIRECT?


XFS is using the pagebuf code we wrote (or I should say are writing - it
needs a lot of work yet). This uses kiobufs to represent data in a set of
pages. So, we have the infrastructure to take a kiobuf and read or write
it from disk (OK, it uses buffer heads under the covers). I glued this
together with the map_user_kiobuf() and unmap_kiobuf() calls from your raw
I/O driver and that was about it.

We only build these kiobufs for data which is sequential on disk, not for
the whole user request. The sequence we do things in is a bit different;
basically:

while data left to copy

    obtain bmap from filesystem representing location of next
    chunk of data (sequential on disk)

    for buffered I/O

        go find pages covering this range - create if they
        do not exist.

        issue blocking read for pages which are not uptodate

        copy out to user space

    for direct I/O

        map user pages into a kiobuf

        issue blocking read for pages

        unmap pages
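
Roughly, in terms of the existing kiobuf calls, the direct branch comes
down to something like this (a sketch only, not the actual pagebuf code;
the chunk parameters are assumed to come from the filesystem bmap call
above, and the error handling is simplified):

    /* Sketch of the direct-read branch for one disk-contiguous chunk.
     * Not the real pagebuf code; assumes the filesystem bmap already
     * told us the chunk starts at first_block and is len bytes long. */
    static int direct_read_chunk(kdev_t dev, unsigned long user_addr,
                                 size_t len, unsigned long first_block,
                                 int blocksize)
    {
            struct kiobuf *iobuf;
            unsigned long blocks[KIO_MAX_SECTORS];
            int i, nr_blocks, err;

            err = alloc_kiovec(1, &iobuf);
            if (err)
                    return err;

            /* Pin the user pages and describe them with the kiobuf. */
            err = map_user_kiobuf(READ, iobuf, user_addr, len);
            if (err)
                    goto out_free;

            /* Build the block list for this sequential-on-disk chunk. */
            nr_blocks = len / blocksize;
            for (i = 0; i < nr_blocks; i++)
                    blocks[i] = first_block + i;

            /* Blocking read straight into the pinned user pages;
             * returns bytes transferred or a negative error (and yes,
             * it still uses buffer heads under the covers). */
            err = brw_kiovec(READ, 1, &iobuf, dev, blocks, blocksize);

            unmap_kiobuf(iobuf);
    out_free:
            free_kiovec(1, &iobuf);
            return err < 0 ? err : 0;
    }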

I said basic implementation because it is currently paying no attention
to cached data. The Irix approach to this was to flush or toss cached
data which overlapped a direct I/O; I am leaning towards keeping those
pages as part of the I/O.

Other future possibilities I see are:

  o using caching to remove the alignment restrictions on direct I/O by
    doing unaligned head and tail processing via buffered I/O.

  o automatically switching to direct I/O under conditions where the
    I/O would flush too much cache.



 
 It's something I've been thinking about in the general case.  Basically
 what I want to do is this:
 
 Augment the inode operations with a new operation, "rw_kiovec" which
 performs reads and writes on vectors of kiobufs.  

You should probably take a look at what we have been doing to the ops,
although our extensions are really biased towards extent based filesystems.
Rather than using getblock to identify individual blocks of file data, we
added a bmap interface to return a larger range - this requires different
locking semantics than getblock, since the mapping we return covers multiple
pages. I suspect that any approach which assembles multiple pages in advance
is going to have similar issues.
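
To give a feel for the shape of it, the extent-style mapping is roughly
along these lines (field and function names here are illustrative, not
the actual XFS/pagebuf declarations):

    /* Illustrative only - not the real XFS/pagebuf interface.  One
     * call maps a whole file range instead of one block per getblock
     * call. */
    struct fs_extent_map {
            loff_t  offset;         /* file offset the mapping starts at */
            size_t  length;         /* bytes covered by this mapping     */
            long    start_block;    /* on-disk block, or -1 for a hole   */
            int     flags;          /* hole / delayed alloc / written    */
    };

    /* Return up to *nmaps extents covering [offset, offset + count).
     * The caller has to hold the mapping stable across a multi-page
     * I/O, which is where the locking differs from per-block getblock. */
    int fs_bmap_extents(struct inode *inode, loff_t offset, size_t count,
                        struct fs_extent_map *maps, int *nmaps);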


 
 Provide a generic_rw_kiovec() function which uses the existing page-
 oriented IO vectors to set up page mappings much as generic_file_{read,
 write} do, but honouring the following flags in the file descriptor:
 
  * O_ALIAS
 
       Allows the write function to install the page in the kiobuf
       into the page cache if the data is correctly aligned and there is
       not already a page in the page cache.
 
       For read, the meaning is different: it allows existing pages in
       the page cache to be installed into the kiobuf.
 
  * O_UNCACHE
 
       If the IO created a new page in the page cache, then attempt to
       unlink the page after the IO completes.
 
  * O_SYNC
 
       Usual meaning: wait for synchronous write IO completion.
 
 O_DIRECT becomes no more than a combination of these options.

So if O_ALIAS allows user pages to be put in the cache (provided you use
O_UNCACHE with it), you can do this. However, O_DIRECT would be a bit more
than this - since if there already was cached data for part of the I/O
you still need to copy those pages up into the user pages which did not
get into cache. 
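
For what it's worth, I imagine the proposed operation and flags hanging
together something like this (the prototype is a guess at the shape, not
a settled interface; O_ALIAS and O_UNCACHE are the new flags from the
proposal above):

    /* Guessed shape of the proposed "rw_kiovec" inode operation - not
     * an existing kernel interface.  'flags' carries the per-call
     * O_ALIAS / O_UNCACHE / O_SYNC behaviour. */
    typedef int (*rw_kiovec_t)(struct file *file, int rw,
                               int nr, struct kiobuf *iovec[],
                               int flags, loff_t offset);

    /* With those flags, O_DIRECT is just a combination: alias user
     * pages into the page cache where alignment allows, drop them when
     * the I/O completes, and wait for completion. */
    #define O_DIRECT_FLAGS  (O_ALIAS | O_UNCACHE | O_SYNC)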


 
 Furthermore, by implementing this mechanism with kiobufs, we can go
 one step further and perform things like Larry's splice operations by
 performing reads and writes in kiobufs.  Using O_ALIAS kiobuf reads and
 writes gives us copies between regular files entirely in kernel space
 with the minimum possible memory copies.  sendfile() between regular
 files can be optimised to use this mechanism.  The data never has to
 hit user space.
 
 As an example of the flexibility of the interface, you can perform
 an O_ALIAS, O_UNCACHE sendfile to copy one file to another, with full
 readahead still being performed on the input file but with no memory 
 copies at all.  You can also choose not to have O_UNCACHE and O_SYNC
 on the writes, in which case you have both readahead and writebehind
 with zero copy.
 
 This is all fairly easy to implement (at least for ext2), and gives
 us much more than just O_DIRECT for no extra work.
 
 --Stephen

We (SGI) really need to get better hooked in on stuff like this - I really
don't want to see us going off in one direction (pagebuf) and all the other
filesystems going off in a different direction.

Steve

p.s. did you know we also cache meta data in pages directly?

Re: O_DIRECT architecture (was Re: info point on linux hdr)

2000-04-18 Thread Stephen C. Tweedie

Hi,

On Tue, Apr 18, 2000 at 07:56:04AM -0500, Steve Lord wrote:
 
 XFS is using the pagebuf code we wrote (or I should say are writing - it
 needs a lot of work yet). This uses kiobufs to represent data in a set of
 pages. So, we have the infrastructure to take a kiobuf and read or write
 it from disk (OK, it uses buffer heads under the covers).

That's fine, and in fact is exactly what kiobufs were designed for:
to abstract out the storage of the buffer from whatever construction
you happen to use to do the IO.  (Raw IO also uses buffer_heads 
internally but passes data around in kiobufs.)

 I said basic implementation because it is currently paying no attention
 to cached data. The Irix approach to this was to flush or toss cached
 data which overlapped a direct I/O; I am leaning towards keeping those
 pages as part of the I/O.

The big advantage of the scheme where I map the kiobuf pages into the
real page cache before the I/O, and unmap after, is that cache
coherency at the beginning of the I/O and all the way through it is
guaranteed.  The cost is that the direct I/O may end up doing copies
if there is other I/O going on at the same time to the same page, but
I don't see that as a problem!

   o using caching to remove the alignment restrictions on direct I/O by
 doing unaligned head and tail processing via buffered I/O.

I'm just planning on doing a copy for any unaligned I/O.  Raw character
devices simply reject unaligned I/O for now, but O_DIRECT will be a 
bit more forgiving.

  It's something I've been thinking about in the general case.  Basically
  what I want to do is this:
  
  Augment the inode operations with a new operation, "rw_kiovec" which
  performs reads and writes on vectors of kiobufs.  
 
 You should probably take a look at what we have been doing to the ops,
 although our extensions are really biased towards extent based filesystems.
 Rather than using getblock to identify individual blocks of file data, we
 added a bmap interface to return a larger range - this requires different
 locking semantics than getblock, since the mapping we return covers multiple
 pages. I suspect that any approach which assembles multiple pages in advance
 is going to have similar issues.

OK.  These are probably orthogonal for now, but doing extent bmaps is
an important optimisation.  

Ultimately we are going to have to review the whole device driver 
interface.  We need that both to do things like 2TB block devices, and
also to achieve better efficiency than we can attain right now with a
separate buffer_head for every single block in the I/O.  It's just using
too much CPU; being able to pass kiobufs directly to ll_rw_block along
with a block address list would be much more efficient.
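
For reference, the kiobuf block I/O entry point that raw I/O already
uses is roughly this shape (quoting from memory, so check fs/buffer.c
for the exact prototype); the point is that one call covers many blocks
instead of one buffer_head per block through ll_rw_block(), and pushing
something of this shape further down towards the drivers is the win:

    /* Existing kiobuf block I/O call, prototype from memory.
     * Returns bytes transferred or a negative error. */
    int brw_kiovec(int rw, int nr, struct kiobuf *iovec[],
                   kdev_t dev, unsigned long blocks[], int size);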

 So if O_ALIAS allows user pages to be put in the cache (provided you use
 O_UNCACHE with it), you can do this.

Yes.

 However, O_DIRECT would be a bit more
 than this - since if there already was cached data for part of the I/O
 you still need to copy those pages up into the user pages which did not
 get into cache. 

That's the intention --- O_ALIAS _allows_ the user page to be mapped 
into the cache, but if existing cached data or alignment constraints
prevent that, it will fall back to doing a copy.

One consequence is that O_DIRECT I/O from a file which is already cached
will always result in copies, but I don't mind that too much.

 We (SGI) really need to get better hooked in on stuff like this - I really
 don't want to see us going off in one direction (pagebuf) and all the other
 filesystems going off in a different direction.

The pagebuf stuff sounds like it is fairly specialised for now.  As
long as all of the components that we are talking about can pass kiobufs
between themselves, we should be able to make them interoperate pretty
easily.

Is the pagebuf code intended to be core VFS functionality or do you
see it being an XFS library component for the foreseeable future?
 
 p.s. did you know we also cache meta data in pages directly?

That was one of the intentions in the new page cache structure, and we
may actually end up moving ext2's metadata caching to use the page 
cache too in the future.  

--Stephen



Re: O_DIRECT architecture (was Re: info point on linux hdr)

2000-04-18 Thread Steve Lord

 Hi,
 
 On Tue, Apr 18, 2000 at 07:56:04AM -0500, Steve Lord wrote:
 
  I said basic implementation because it is currently paying no attention
  to cached data. The Irix approach to this was to flush or toss cached
  data which overlapped a direct I/O; I am leaning towards keeping those
  pages as part of the I/O.
 
 The big advantage of the scheme where I map the kiobuf pages into the
 real page cache before the I/O, and unmap after, is that cache
 coherency at the beginning of the I/O and all the way through it is
 guaranteed.  The cost is that the direct I/O may end up doing copies
 if there is other I/O going on at the same time to the same page, but
 I don't see that as a problem!

I was thinking along these lines.

So I guess the question here is how do you plan on keeping track of the
origin of the pages? Which ones were originally part of the kernel cache
and thus need copying up to user space? It does not seem hard, just wondering
what you had in mind. Also, I presume, if the page was already present
and up to date then on a read you would not refill it from disk - since it
may be more recent than the on-disk data; existing buffer heads would
give you this information.

 

 
 Ultimately we are going to have to review the whole device driver 
 interface.  We need that both to do things like 2TB block devices, and
 also to achieve better efficiency than we can attain right now with a
 separate buffer_head for every single block in the I/O.  It's just using
 too much CPU; being able to pass kiobufs directly to ll_rw_block along
 with a block address list would be much more efficient.

Agreed, XFS was getting killed by this (and the fixed block size requirement
of the interface). We have 512 byte I/O requests we need to do for some
meta-data; having to impose this on all I/O and create 8 buffer heads for
each 4K page was just nasty.

 
  So if O_ALIAS allows user pages to be put in the cache (provided you use
  O_UNCACHE with it), you can do this.
 
 Yes.
 
  However, O_DIRECT would be a bit more
  than this - since if there already was cached data for part of the I/O
  you still need to copy those pages up into the user pages which did not
  get into cache. 
 
 That's the intention --- O_ALIAS _allows_ the user page to be mapped 
 into the cache, but if existing cached data or alignment constraints
 prevent that, it will fall back to doing a copy.
 
 One consequence is that O_DIRECT I/O from a file which is already cached
 will always result in copies, but I don't mind that too much.

So maybe an O_CLEANCACHE (or something similar) could be used to indicate
that anything which is found cached should be moved out of the way (flushed
to disk or tossed depending on what is happening). Some other sort of API
such as an fsync variant or that fadvise call which was mentioned recently
could be used to clean the cache for a file. This would let those apps which
really want direct disk-to-user-memory I/O get what they want.
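
From the application side that would look roughly like this, assuming a
posix_fadvise() along the lines of the POSIX proposal (whether and when
Linux grows one is a separate question):

    /* User space sketch: push dirty data out and ask the kernel to
     * drop the cached pages before starting direct disk-to-user-memory
     * I/O.  Assumes posix_fadvise() exists; error handling omitted. */
    #include <fcntl.h>
    #include <unistd.h>

    static void drop_file_cache(int fd)
    {
            fsync(fd);
            /* offset 0, len 0 means "from the start to end of file". */
            posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }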

 
 The pagebuf stuff sounds like it is fairly specialised for now.  As
 long as all of the components that we are talking about can pass kiobufs
 between themselves, we should be able to make them interoperate pretty
 easily.
 
 Is the pagebuf code intended to be core VFS functionality or do you
 see it being an XFS library component for the foreseeable future?

We had talked about trying to use it on some other filesystem to see what
happened, but we don't really have the bandwidth to do that. We don't see
it as being just there for XFS - although, for existing Linux filesystems,
there may not be benefits to switching over to it.

 
 --Stephen


Steve





Re: O_DIRECT architecture (was Re: info point on linux hdr)

2000-04-18 Thread Stephen C. Tweedie

Hi,

On Tue, Apr 18, 2000 at 01:17:52PM -0500, Steve Lord wrote:

 So I guess the question here is how do you plan on keeping track of the
 origin of the pages?

You don't have to.

 Which ones were originally part of the kernel cache
 and thus need copying up to user space?

If the caller requested O_ALIAS, then the IO routine is allowed to
free any page in the kiobuf and alias the existing page cache page into
the kiobuf.  It doesn't actually matter whether the original page in
the page cache was there when we started, or whether it was a kiobuf
page which we mapped into the page cache for a direct IO.  The whole
point is that as long as the IO is in progress, it is a perfectly 
legal page in the page cache.

There is one nastiness.  This use of the kiobuf effectively results
in a temporary mmap() of the file while the IO is in progress.  If 
another thread happens to write to the page undergoing an O_DIRECT
read, then we end up modifying the page cache.  That's bad.

So, we need to make sure that while the IO is in progress for a true
raw IO, we keep the page locked during the IO and mark the page not-
uptodate after the IO completes.  That way, even if another process
looks up the page while the IO is in progress, the data that was
undergoing IO will be kept private, and we get a second chance once
the IO has completed to see that the page is now shared and we have
to do a copy.

That is slightly messy, but it nicely hides the transient mmap while
still preserving zero-copy for all of the important cases.  It's 
about the cleanest solution I can see which preserves complete
cache coherency at all times, because it guarantees that the IO is
always done inside the page cache itself.
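
Per page, the sequence I have in mind is roughly (helper names are
approximate and all the surrounding page cache bookkeeping is elided -
this is a sketch of the idea, not working code):

    /* Transient-alias sketch for an O_DIRECT/raw read of one user
     * page: keep the page locked for the whole I/O and leave it
     * not-uptodate afterwards, so no other lookup ever trusts its
     * contents.  Helper names approximate the current page cache. */
    static void direct_read_page(struct page *page)
    {
            lock_page(page);        /* held for the entire I/O */

            /* ... insert the page into the page cache at the right
             *     index, issue the read into it, wait for it ... */

            /* Deliberately do NOT mark the page uptodate: anyone who
             * found the page while the I/O was in flight must not
             * believe the data, and we get a second chance afterwards
             * to see the page is shared and fall back to a copy. */

            /* ... remove the transient page from the page cache ... */

            UnlockPage(page);
    }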

 It does not seem hard, just wondering
 what you had in mind. Also, I presume, if the page was already present
 and up to date then on a read you would not refill it from disk - since it
 may be more recent than the on-disk data; existing buffer heads would
 give you this information. 

It's not a physical buffer_head lookup, it's a logical page cache 
lookup that we would do, but yes, we'd read from the page cache in 
this case and just do a copy.

  One consequence is that O_DIRECT I/O from a file which is already cached
  will always result in copies, but I don't mind that too much.
 
 So maybe an O_CLEANCACHE (or something similar) could be used to indicate
 that anything which is found cached should be moved out of the way (flushed
 to disk or tossed depending on what is happening).

That's an orthogonal issue: posix_fadvise() should be the mechanism for
that if the application truly wants to do explicit cache eviction.

--Stephen