Re: [Fwd: [linux-audio-dev] info point on linux hdr]
>> 2) Why am I not having any of these problems ? Unlike Benno's code, I
>> have a working application that runs just fine. I get smooth
>> throughput from the disk subsystem too.
>
>What do you mean exactly with "unlike Benno's code" ?
>My code just tries to simulate the operation of a busy harddisk recorder
>using the sorting algorithm to support variable speed.
>I read in 256kb chunks too, therefore I don't see a big difference between
>my code and yours from a disk IO subsystem POV.

In your own words, your code "simulates the operations of a busy
harddisk recorder". Mine *is* a busy harddisk recorder. There's a lot
of stuff in my code that isn't in yours because I have a bunch of real
world stuff like a tape transport mechanism, extra intra-thread
communication, MTC delivery, an audio thread event/play list, and more.

Yet, despite all this, I have never run into the problems you
describe. As Stephen has noted, this may very well be because of my
use of SCSI h/w.

>Therefore it would be useful if you could run my benchmark on your
>disk to see if your (or my) approach gets better performance out of the
>disk, and with how much buffer utilization.

When I get a minute or 30, I will.

>>In both cases, without O_SYNC, or anything else but preallocation
>>and careful design, I seem to be able to get smooth disk throughput
>>at significantly above the rate I need (9MB/sec; I get up to
>>17MB/sec from the UltraStar)
>
>17MB/sec using hdparm or linear large reads/writes (large cat / cp etc) or
>17MB/sec within your harddisk recording app where
>num_tracks * datarate_of_each_track = 17MB/sec
>(if it's the latter then I doubt it because seek kills some of the
>throughput, that's almost unavoidable, at least on my EIDE UDMA disks)

No, I do mean the latter: 17MB/sec from within ardour. You can doubt
it all you want, but I get it regularly.

Actually, the real numbers look more like (from memory, each line is
one iteration of disk i/o across all tracks, so 24*256kB of data):

    15MB/sec
    450MB/sec
    10MB/sec
    14MB/sec
    567MB/sec
    16MB/sec
    19MB/sec
    8MB/sec
    378MB/sec

The super-high numbers, I assume, are because of the read-ahead being
done by the kernel, which helps us out every so often. Remember, these
files are as contiguous as I can make 'em with ext2. And keep in mind
that my disks have a maximum transfer rate of 35MB/sec (nothing to do
with U2W - just that they are just about the latest disks).

I have a very small, standalone single-threaded test app that gets
similar rates, even though it does random sized seeks across the whole
disk. I've posted that program before on LAD, and so I trust the
numbers.

--p
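For anyone who wants to reproduce this sort of measurement, a rough
single-threaded sketch in C follows. It is a hypothetical illustration,
not the test program referred to above: it seeks to random
chunk-aligned offsets and times 256kB reads, printing MB/sec per
iteration.

    /* throughput_test.c - hypothetical sketch of a single-threaded disk
     * throughput test: random seeks followed by 256kB reads, timed per
     * iteration.  Not the program referred to in the mail.
     * Build with: gcc -O2 throughput_test.c -o throughput_test */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/time.h>

    #define CHUNK (256 * 1024)

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <file-or-device>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t size = lseek(fd, 0, SEEK_END);   /* size of file/device */
        if (size < CHUNK) { fprintf(stderr, "target too small\n"); return 1; }

        char *buf = malloc(CHUNK);
        struct timeval t0, t1;

        for (int i = 0; i < 100; i++) {
            /* seek to a random, chunk-aligned offset within the target */
            off_t where = ((off_t)rand() % (size / CHUNK)) * CHUNK;

            gettimeofday(&t0, NULL);
            if (pread(fd, buf, CHUNK, where) != CHUNK) { perror("pread"); break; }
            gettimeofday(&t1, NULL);

            double secs = (t1.tv_sec - t0.tv_sec) +
                          (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("%3d: %.1f MB/sec\n", i, (CHUNK / 1e6) / secs);
        }

        free(buf);
        close(fd);
        return 0;
    }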
Re: O_DIRECT architecture (was Re: info point on linux hdr)
Hi,

On Tue, Apr 18, 2000 at 01:17:52PM -0500, Steve Lord wrote:

> So I guess the question here is how do you plan on keeping track of the
> origin of the pages?

You don't have to.

> Which ones were originally part of the kernel cache
> and thus need copying up to user space?

If the caller requested O_ALIAS, then the IO routine is allowed to
free any page in the kiobuf and alias the existing page cache page
into the kiobuf. It doesn't actually matter whether the original page
in the page cache was there when we started, or whether it was a
kiobuf page which we mapped into the page cache for a direct IO. The
whole point is that as long as the IO is in progress, it is a
perfectly legal page in the page cache.

There is one nastiness. This use of the kiobuf effectively results in
a temporary mmap() of the file while the IO is in progress. If another
thread happens to write to the page undergoing an O_DIRECT read, then
we end up modifying the page cache. That's bad.

So, we need to make sure that while the IO is in progress for a true
raw IO, we keep the page locked during the IO and mark the page not-
uptodate after the IO completes. That way, even if another process
looks up the page while the IO is in progress, the data that was
undergoing IO will be kept private, and we get a second chance once
the IO has completed to see that the page is now shared and we have to
do a copy.

That is slightly messy, but it nicely hides the transient mmap while
still preserving zero-copy for all of the important cases. It's about
the cleanest solution I can see which preserves complete cache
coherency at all times, because it guarantees that the IO is always
done inside the page cache itself.

> It does not seem hard, just wondering
> what you had in mind. Also, I presume, if the page was already present
> and up to date then on a read you would not refill it from disk - since it
> may be more recent than the on disk data, existing buffer heads would
> give you this information.

It's not a physical buffer_head lookup, it's a logical page cache
lookup that we would do, but yes, we'd read from the page cache in
this case and just do a copy.

> > One consequence is that O_DIRECT I/O from a file which is already cached
> > will always result in copies, but I don't mind that too much.
>
> So maybe an O_CLEANCACHE (or something similar) could be used to indicate
> that anything which is found cached should be moved out of the way (flushed
> to disk or tossed depending on what is happening).

That's an orthogonal issue: posix_fadvise() should be the mechanism
for that if the application truly wants to do explicit cache eviction.

--Stephen
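For reference, here is a minimal user-space sketch of the explicit
cache eviction Stephen alludes to, assuming the posix_fadvise()
interface as it was later standardised (it was not available in Linux
at the time of this thread); it is an illustration, not anything
proposed in the mail itself.

    /* Hypothetical sketch: ask the kernel to drop cached pages for a
     * file after writing it, using posix_fadvise() as later
     * standardised.  This interface did not exist in Linux when the
     * mail above was written. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    int drop_file_cache(int fd)
    {
        /* Get dirty pages onto disk first; DONTNEED only discards
         * clean pages. */
        if (fdatasync(fd) != 0)
            return -1;

        /* Advise that the whole file (offset 0, len 0 == to EOF) will
         * not be needed again, so its cached pages can be evicted. */
        return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }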
Re: O_DIRECT architecture (was Re: info point on linux hdr)
> Hi,
>
> On Tue, Apr 18, 2000 at 07:56:04AM -0500, Steve Lord wrote:
>
> > I said basic implementation because it is currently paying no attention
> > to cached data. The Irix approach to this was to flush or toss cached
> > data which overlapped a direct I/O, I am leaning towards keeping them
> > as part of the I/O.
>
> The big advantage of the scheme where I map the kiobuf pages into the
> real page cache before the I/O, and unmap after, is that cache
> coherency at the beginning of the I/O and all the way through it is
> guaranteed. The cost is that the direct I/O may end up doing copies
> if there is other I/O going on at the same time to the same page, but
> I don't see that as a problem!

I was thinking along these lines. So I guess the question here is how
do you plan on keeping track of the origin of the pages? Which ones
were originally part of the kernel cache and thus need copying up to
user space? It does not seem hard, just wondering what you had in
mind. Also, I presume, if the page was already present and up to date
then on a read you would not refill it from disk - since it may be
more recent than the on disk data, existing buffer heads would give
you this information.

> Ultimately we are going to have to review the whole device driver
> interface. We need that both to do things like >2TB block devices, and
> also to achieve better efficiency than we can attain right now with a
> separate buffer_head for every single block in the I/O. It's just using
> too much CPU; being able to pass kiobufs directly to ll_rw_block along
> with a block address list would be much more efficient.

Agreed, XFS was getting killed by this (and the fixed block size
requirement of the interface); we have 512 byte I/O requests we need
to do for some meta-data, and having to impose this on all I/O and
create 8 buffer heads for each 4K page was just nasty.

> > So if O_ALIAS allows user pages to be put in the cache (provided you use
> > O_UNCACHE with it), you can do this.
>
> Yes.
>
> > However, O_DIRECT would be a bit more
> > than this - since if there already was cached data for part of the I/O
> > you still need to copy those pages up into the user pages which did not
> > get into cache.
>
> That's the intention --- O_ALIAS _allows_ the user page to be mapped
> into the cache, but if existing cached data or alignment constraints
> prevent that, it will fall back to doing a copy.
>
> One consequence is that O_DIRECT I/O from a file which is already cached
> will always result in copies, but I don't mind that too much.

So maybe an O_CLEANCACHE (or something similar) could be used to
indicate that anything which is found cached should be moved out of
the way (flushed to disk or tossed depending on what is happening).
Some other sort of API, such as an fsync variant or that fadvise call
which was mentioned recently, could be used to clean cache for a file.
This would let those apps which really want direct disk <-> user
memory I/O get what they wanted.

> The pagebuf stuff sounds like it is fairly specialised for now. As
> long as all of the components that we are talking about can pass kiobufs
> between themselves, we should be able to make them interoperate pretty
> easily.
>
> Is the pagebuf code intended to be core VFS functionality or do you
> see it being an XFS library component for the forseeable future?

We had talked about trying to use it on some other filesystem to see
what happened, but we don't really have the bandwidth to do that.
We don't see it as being just there for XFS - although, for existing
Linux filesystems, there may not be benefits to switching over to it.

> --Stephen

Steve
Re: O_DIRECT architecture (was Re: info point on linux hdr)
Hi,

On Tue, Apr 18, 2000 at 07:56:04AM -0500, Steve Lord wrote:

> XFS is using the pagebuf code we wrote (or I should say are writing - it
> needs a lot of work yet). This uses kiobufs to represent data in a set of
> pages. So, we have the infrastructure to take a kiobuf and read or write
> it from disk (OK, it uses buffer heads under the covers).

That's fine, and in fact is exactly what kiobufs were designed for: to
abstract out the storage of the buffer from whatever construction you
happen to use to do the IO. (Raw IO also uses buffer_heads internally
but passes data around in kiobufs.)

> I said basic implementation because it is currently paying no attention
> to cached data. The Irix approach to this was to flush or toss cached
> data which overlapped a direct I/O, I am leaning towards keeping them
> as part of the I/O.

The big advantage of the scheme where I map the kiobuf pages into the
real page cache before the I/O, and unmap after, is that cache
coherency at the beginning of the I/O and all the way through it is
guaranteed. The cost is that the direct I/O may end up doing copies if
there is other I/O going on at the same time to the same page, but I
don't see that as a problem!

> o using caching to remove the alignment restrictions on direct I/O by
>   doing unaligned head and tail processing via buffered I/O.

I'm just planning on doing a copy for any unaligned I/O. Raw character
devices simply reject unaligned I/O for now, but O_DIRECT will be a
bit more forgiving.

> > It's something I've been thinking about in the general case. Basically
> > what I want to do is this:
> >
> > Augment the inode operations with a new operation, "rw_kiovec" which
> > performs reads and writes on vectors of kiobufs.
>
> You should probably take a look at what we have been doing to the ops,
> although our extensions are really biased towards extent based filesystems;
> rather than using getblock to identify individual blocks of file data we
> added a bmap interface to return a larger range - this requires different
> locking semantics than getblock, since the mapping we return covers multiple
> pages. I suspect that any approach which assembles multiple pages in advance
> is going to have similar issues.

OK. These are probably orthogonal for now, but doing extent bmaps is
an important optimisation.

Ultimately we are going to have to review the whole device driver
interface. We need that both to do things like >2TB block devices, and
also to achieve better efficiency than we can attain right now with a
separate buffer_head for every single block in the I/O. It's just
using too much CPU; being able to pass kiobufs directly to ll_rw_block
along with a block address list would be much more efficient.

> So if O_ALIAS allows user pages to be put in the cache (provided you use
> O_UNCACHE with it), you can do this.

Yes.

> However, O_DIRECT would be a bit more
> than this - since if there already was cached data for part of the I/O
> you still need to copy those pages up into the user pages which did not
> get into cache.

That's the intention --- O_ALIAS _allows_ the user page to be mapped
into the cache, but if existing cached data or alignment constraints
prevent that, it will fall back to doing a copy.

One consequence is that O_DIRECT I/O from a file which is already
cached will always result in copies, but I don't mind that too much.

> We (SGI) really need to get better hooked in on stuff like this - I really
> don't want to see us going off in one direction (pagebuf) and all the other
> filesystems going off in a different direction.

The pagebuf stuff sounds like it is fairly specialised for now. As
long as all of the components that we are talking about can pass
kiobufs between themselves, we should be able to make them
interoperate pretty easily.

Is the pagebuf code intended to be core VFS functionality or do you
see it being an XFS library component for the foreseeable future?

> p.s. did you know we also cache meta data in pages directly?

That was one of the intentions in the new page cache structure, and we
may actually end up moving ext2's metadata caching to use the page
cache too in the future.

--Stephen
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
Hi,

On Tue, Apr 18, 2000 at 10:57:25AM -0400, Paul Barton-Davis wrote:

> >> 1) pre-allocation takes a *long* time. Allocating 24 203MB files on a
> >>    clean ext2 partition of 18GB takes many, many minutes, for example.
> >>    Presumably, the same overhead is being incurred when block
> >>    allocation happens "on the fly".
> >
> > It is not the allocation which is taking ages, it's the actual
> > writing of the data.
>
> Except that for preallocation, I only write one byte in every block,
> so for a 203MB file, I only write 52K approximately (ext2 4K blocks).

Umm, how is that going to make _any_ difference at all? The filesystem
works in blocks, not bytes. You still end up with 203MB of disk IO.

--Stephen
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
>> 1) pre-allocation takes a *long* time. Allocating 24 203MB files on a
>>    clean ext2 partition of 18GB takes many, many minutes, for example.
>>    Presumably, the same overhead is being incurred when block
>>    allocation happens "on the fly".
>
>It is not the allocation which is taking ages, it's the actual
>writing of the data.

Except that for preallocation, I only write one byte in every block,
so for a 203MB file, I only write 52K approximately (ext2 4K blocks).

--p
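For illustration, here is a rough sketch of the one-byte-per-block
preallocation being described, assuming 4K ext2 blocks; this is a
hypothetical example, not the actual ardour code. As Stephen points
out elsewhere in the thread, every touched byte still dirties a whole
block, so the total amount of disk IO is not reduced.

    /* Hypothetical sketch of one-byte-per-block preallocation: touch
     * one byte in every filesystem block so that ext2 allocates all
     * blocks up front.  Each dirtied byte still forces a full block
     * writeback, so roughly the whole file size still hits the disk. */
    #include <unistd.h>
    #include <fcntl.h>

    #define BLOCK_SIZE 4096          /* ext2 block size assumed here */

    int preallocate(int fd, off_t length)
    {
        char zero = 0;

        for (off_t pos = 0; pos < length; pos += BLOCK_SIZE) {
            if (lseek(fd, pos, SEEK_SET) < 0)
                return -1;
            if (write(fd, &zero, 1) != 1)   /* one byte per block */
                return -1;
        }
        return 0;
    }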
Re: O_DIRECT architecture (was Re: info point on linux hdr)
> Hi,
>
> On Mon, Apr 17, 2000 at 05:58:48PM -0500, Steve Lord wrote:
>
> > O_DIRECT on Linux XFS is still a work in progress, we only have
> > direct reads so far. A very basic implementation was made available
> > this weekend.
>
> Care to elaborate on how you are doing O_DIRECT?

XFS is using the pagebuf code we wrote (or I should say are writing -
it needs a lot of work yet). This uses kiobufs to represent data in a
set of pages. So, we have the infrastructure to take a kiobuf and read
or write it from disk (OK, it uses buffer heads under the covers). I
glued this together with the map_user_kiobuf() and unmap_kiobuf()
calls from your raw I/O driver and that was about it.

We only build these kiobufs for data which is sequential on disk, not
for the whole user request; the sequence we do things in is a bit
different, basically:

    while data left to copy
        obtain bmap from filesystem representing location of next
            chunk of data (sequential on disk)
        for buffered I/O
            go find pages covering this range - create if they do not exist.
            issue blocking read for pages which are not uptodate
            copy out to user space
        for direct I/O
            map user pages into a kiobuf
            issue blocking read for pages
            unmap pages

I said basic implementation because it is currently paying no
attention to cached data. The Irix approach to this was to flush or
toss cached data which overlapped a direct I/O, I am leaning towards
keeping them as part of the I/O.

Other future possibilities I see are:

 o using caching to remove the alignment restrictions on direct I/O by
   doing unaligned head and tail processing via buffered I/O.

 o Automatically switching to direct I/O under conditions where the
   I/O would flush too much cache.

> It's something I've been thinking about in the general case. Basically
> what I want to do is this:
>
> Augment the inode operations with a new operation, "rw_kiovec" which
> performs reads and writes on vectors of kiobufs.

You should probably take a look at what we have been doing to the ops,
although our extensions are really biased towards extent based
filesystems; rather than using getblock to identify individual blocks
of file data we added a bmap interface to return a larger range - this
requires different locking semantics than getblock, since the mapping
we return covers multiple pages. I suspect that any approach which
assembles multiple pages in advance is going to have similar issues.

> Provide a generic_rw_kiovec() function which uses the existing page-
> oriented IO vectors to set up page mappings much as generic_file_{read,
> write} do, but honouring the following flags in the file descriptor:
>
>  * O_ALIAS
>
>    Allows the write function to install the page in the kiobuf
>    into the page cache if the data is correctly aligned and there is
>    not already a page in the page cache.
>
>    For read, the meaning is different: it allows existing pages in
>    the page cache to be installed into the kiobuf.
>
>  * O_UNCACHE
>
>    If the IO created a new page in the page cache, then attempt to
>    unlink the page after the IO completes.
>
>  * O_SYNC
>
>    Usual meaning: wait for synchronous write IO completion.
>
> O_DIRECT becomes no more than a combination of these options.

So if O_ALIAS allows user pages to be put in the cache (provided you
use O_UNCACHE with it), you can do this. However, O_DIRECT would be a
bit more than this - since if there already was cached data for part
of the I/O you still need to copy those pages up into the user pages
which did not get into cache.

> Furthermore, by implementing this mechanism with kiobufs, we can go
> one step further and perform things like Larry's splice operations by
> performing reads and writes in kiobufs. Using O_ALIAS kiobuf reads and
> writes gives us copies between regular files entirely in kernel space
> with the minimum possible memory copies. sendfile() between regular
> files can be optimised to use this mechanism. The data never has to
> hit user space.
>
> As an example of the flexibility of the interface, you can perform
> an O_ALIAS, O_UNCACHE sendfile to copy one file to another, with full
> readahead still being performed on the input file but with no memory
> copies at all. You can also choose not to have O_UNCACHE and O_SYNC
> on the writes, in which case you have both readahead and writebehind
> with zero copy.
>
> This is all fairly easy to implement (at least for ext2), and gives
> us much more than just O_DIRECT for no extra work.
>
> --Stephen

We (SGI) really need to get better hooked in on stuff like this - I
really don't want to see us going off in one direction (pagebuf) and
all the other filesystems going off in a different direction.

p.s. did you know we also cache meta data in pages directly?
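For orientation, here is a schematic C sketch of the direct-I/O leg of
the loop described above, written against the 2.3/2.4-era kiobuf calls
named in the mail (map_user_kiobuf, unmap_kiobuf) together with
alloc_kiovec/free_kiovec and brw_kiovec from the raw I/O driver. The
extent_map structure and the direct_read_chunk function are inventions
for illustration only, not the XFS/pagebuf code.

    /* Schematic sketch (2.3/2.4-era kernel, not the actual XFS pagebuf
     * code): map the user pages into a kiobuf, issue a blocking read
     * for one on-disk extent returned by the filesystem's bmap, then
     * unmap.  extent_map and direct_read_chunk are assumptions made
     * here for illustration. */
    #include <linux/fs.h>
    #include <linux/iobuf.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    struct extent_map {            /* hypothetical bmap result */
        kdev_t        dev;
        unsigned long first_block; /* in units of 'blocksize' */
        int           nblocks;
        int           blocksize;
    };

    static int direct_read_chunk(unsigned long user_addr, size_t len,
                                 struct extent_map *map)
    {
        struct kiobuf *iobuf;
        unsigned long *blocks;
        int i, err;

        err = alloc_kiovec(1, &iobuf);
        if (err)
            return err;

        blocks = kmalloc(map->nblocks * sizeof(*blocks), GFP_KERNEL);
        if (!blocks) {
            err = -ENOMEM;
            goto out_kiovec;
        }

        /* Pin the user pages and describe them in the kiobuf. */
        err = map_user_kiobuf(READ, iobuf, user_addr, len);
        if (err)
            goto out_blocks;

        /* Block list for this extent, which is sequential on disk. */
        for (i = 0; i < map->nblocks; i++)
            blocks[i] = map->first_block + i;

        /* Blocking read of the extent straight into the user pages. */
        err = brw_kiovec(READ, 1, &iobuf, map->dev, blocks, map->blocksize);

        unmap_kiobuf(iobuf);
    out_blocks:
        kfree(blocks);
    out_kiovec:
        free_kiovec(1, &iobuf);
        return err < 0 ? err : 0;
    }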
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
Hi,

On Mon, Apr 17, 2000 at 07:10:43PM +0200, Martin Schenk wrote:

> If you are interested in a more efficient fsync (and a real fdatasync),
> I have some patches that provide better performance for very large
> files (where fsync is mostly busy scanning the page cache for changes),
> and a fdatasync that eliminates writing the inode if not necessary.
> (at the moment these patches are only for 2.3.4?, and I don't have
> the time to keep them up to date - especially as nobody was interested
> the last time I posted them)

Please post them, I will have a look at them.

--Stephen
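As a user-space aside, this is why a hard-disk recorder cares about a
real fdatasync(): flushing audio data with fdatasync() rather than
fsync() can avoid a metadata (inode) write per capture block. A small
hypothetical example, unrelated to the kernel patches themselves:

    /* Hypothetical sketch: flush each captured audio block without
     * forcing an inode write every time. */
    #include <unistd.h>

    int flush_capture_block(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;

        /* fdatasync() waits for the file data to reach disk but,
         * unlike fsync(), may skip writing the inode when only
         * timestamps have changed. */
        return fdatasync(fd);
    }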
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
Hi,

On Mon, Apr 17, 2000 at 01:05:12PM -0400, Paul Barton-Davis wrote:

> Acknowledging your much greater wisdom in this area than me, I don't
> understand the above given that, in my experience:
>
> 1) pre-allocation takes a *long* time. Allocating 24 203MB files on a
>    clean ext2 partition of 18GB takes many, many minutes, for example.
>    Presumably, the same overhead is being incurred when block
>    allocation happens "on the fly".

It is not the allocation which is taking ages, it's the actual writing
of the data. Even if you preallocate, you still have the data writes
to worry about! Preallocation just can't gain that much. (It can gain
some, but not as much as you would like.)

> 2) Why am I not having any of these problems ? Unlike Benno's code, I
>    have a working application that runs just fine.

Umm, you are running a different application on different hardware,
and you are wondering why you don't see the same effects... not
surprising!

In particular, it's really the use of multiple streams which causes
big problems. The difference between SCSI and IDE is also significant.
(That difference is hopefully improved by the elevator patches.)

--Stephen
O_DIRECT architecture (was Re: info point on linux hdr)
Hi,

On Mon, Apr 17, 2000 at 05:58:48PM -0500, Steve Lord wrote:

> O_DIRECT on Linux XFS is still a work in progress, we only have
> direct reads so far. A very basic implementation was made available
> this weekend.

Care to elaborate on how you are doing O_DIRECT?

It's something I've been thinking about in the general case. Basically
what I want to do is this:

Augment the inode operations with a new operation, "rw_kiovec" which
performs reads and writes on vectors of kiobufs.

Provide a generic_rw_kiovec() function which uses the existing page-
oriented IO vectors to set up page mappings much as generic_file_{read,
write} do, but honouring the following flags in the file descriptor:

 * O_ALIAS

   Allows the write function to install the page in the kiobuf
   into the page cache if the data is correctly aligned and there is
   not already a page in the page cache.

   For read, the meaning is different: it allows existing pages in
   the page cache to be installed into the kiobuf.

 * O_UNCACHE

   If the IO created a new page in the page cache, then attempt to
   unlink the page after the IO completes.

 * O_SYNC

   Usual meaning: wait for synchronous write IO completion.

O_DIRECT becomes no more than a combination of these options.

Furthermore, by implementing this mechanism with kiobufs, we can go
one step further and perform things like Larry's splice operations by
performing reads and writes in kiobufs. Using O_ALIAS kiobuf reads and
writes gives us copies between regular files entirely in kernel space
with the minimum possible memory copies. sendfile() between regular
files can be optimised to use this mechanism. The data never has to
hit user space.

As an example of the flexibility of the interface, you can perform an
O_ALIAS, O_UNCACHE sendfile to copy one file to another, with full
readahead still being performed on the input file but with no memory
copies at all. You can also choose not to have O_UNCACHE and O_SYNC on
the writes, in which case you have both readahead and writebehind with
zero copy.

This is all fairly easy to implement (at least for ext2), and gives us
much more than just O_DIRECT for no extra work.

--Stephen
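To make the proposal concrete, here is one illustrative guess at how
the proposed interface could be declared. None of this existed in the
kernel at the time of the mail; rw_kiovec, generic_rw_kiovec, O_ALIAS
and O_UNCACHE are all part of the proposal above, and the flag values
and prototypes shown are placeholders, not Stephen's code.

    /* Illustrative sketch only - declarations guessed from the
     * proposal above, not an existing kernel interface. */
    #include <linux/fs.h>
    #include <linux/iobuf.h>

    /* Proposed per-fd flags (placeholder values). */
    #define O_ALIAS   0x1000000  /* allow aliasing pages between kiobuf
                                    and page cache                     */
    #define O_UNCACHE 0x2000000  /* try to drop pages the IO added to
                                    the page cache when it completes   */

    /* Proposed addition to the inode operations: read or write a
     * vector of kiobufs at a given file offset, honouring O_ALIAS,
     * O_UNCACHE and O_SYNC from the file descriptor. */
    struct inode_operations_ext {
        int (*rw_kiovec)(struct file *file, int rw,
                         int nr, struct kiobuf *iovec[],
                         unsigned long flags, loff_t offset);
    };

    /* Generic implementation that filesystems could share: sets up
     * page-cache mappings much as generic_file_read/write do, aliasing
     * or copying pages as the flags permit. */
    int generic_rw_kiovec(struct file *file, int rw,
                          int nr, struct kiobuf *iovec[],
                          unsigned long flags, loff_t offset);

    /* Under the proposal, O_DIRECT is simply expressed as a
     * combination of these flags plus O_SYNC for writes. */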