Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Going through all the discussions once again and trying to look at this
from the point of view of just the basic requirements for data structures
and mechanisms that they imply:

1. Should have a data structure that represents a memory chain, which may
   not be contiguous in physical memory, and which can be passed down as a
   single unit all the way through to the lowest level drivers - e.g. for
   direct i/o to/from a contiguous virtual address range in user space
   (without any intermediate copies).

   (Networking and block i/o may require different optimizations in the
   design of such a data structure, due to differences in the kind of
   patterns expected, as is apparent from the zero-copy networking
   fragments vs raw i/o kiobuf/kiovec patches. There are situations when
   such a data structure may be passed between subsystems, as in the i2o
   example.)

   This data structure could be part of an I/O container.

2. I/O containers may get split or merged as they pass through various
   layers --- so any completion mechanism and i/o container design should
   be able to account for both cases. At any point, a request could be
   - a collection of several higher level requests, or
   - one among several sub-requests of a single higher level request.

   (Just as appropriate "clustering" could happen at each level,
   appropriate "splitting" may also take place depending on the situation.
   It may make sense to delay splitting as far down the chain as possible
   in many situations, where the higher level is only interested in the
   i/o in its entirety and not in partial completion.)

   When caching/buffers are involved, sometimes the sub-requests of a
   single higher level request may have individual completion requirements
   (even when no merges were involved), because the sub-request buffers
   may be used to service other requests alongside. With raw i/o that
   might not be the case.

3. It is desirable that layers which process the requests along the way
   without splitting/merging be able to pass along the same I/O container
   without any duplication or cloning, and intercept async i/o completions
   for post processing.

4. (Optional) It would be nice if different kinds of I/O containers or
   buffer structures could be used at different levels, without having
   explicit linkage fields (like bh --> page, for example), and in a way
   that intermediate drivers or layers can work with transparently.

3 & 4 are more of layering related items, which get a little specific, but
do 1 and 2 cover the general things we are looking for?

Regards
Suparna

Suparna Bhattacharya
Systems Software Group, IBM Global Services, India
E-mail : [EMAIL PROTECTED]
Phone : 91-80-5267117, Extn : 2525
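A minimal sketch of what requirements 1 and 2 above could look like as C
structures. All names here are invented for illustration -- this is not
the kiobuf API or any existing kernel interface -- and locking around the
error field is omitted for brevity:

	#include <asm/atomic.h>

	/* One physically contiguous piece of the memory chain. */
	struct mem_frag {
		void *addr;              /* kernel mapping of the piece  */
		size_t len;
		struct mem_frag *next;   /* chain need not be contiguous */
	};

	/* Requirement 1: the chain travels as a single unit.
	 * Requirement 2: 'pending' accounts for splits and merges. */
	struct io_container {
		struct mem_frag *frags;
		atomic_t pending;        /* live sub-requests            */
		int error;               /* first error wins (unlocked)  */
		void (*complete)(struct io_container *, int error);
		struct io_container *parent;  /* set on split children   */
	};

	/* A layer that splits takes one reference per child... */
	static void io_split_ref(struct io_container *c)
	{
		atomic_inc(&c->pending);
	}

	/* ...and each child completion drops one; the last completion
	 * fires the callback, so a higher level that only cares about
	 * the i/o in its entirety sees exactly one event. */
	static void io_complete(struct io_container *c, int err)
	{
		if (err && !c->error)
			c->error = err;
		if (atomic_dec_and_test(&c->pending))
			c->complete(c, c->error);
	}

A layer that passes the container through unchanged (requirement 3) would
just save and replace the 'complete' pointer to intercept the completion,
with no cloning needed.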
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi!

> > So you consider inability to select() on regular files _feature_?
>
> select on files is unimplementable. You can't do background file IO the
> same way you do background receiving of packets on socket. Filesystem is
> synchronous. It can block.

You can use helper threads if the VFS layer is not able to handle
background IO. Then we can do it right in linux-4.4.

								Pavel
--
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Linus Torvalds wrote:
> Absolutely. This is exactly what I mean by saying that low-level drivers
> may not actually be able to handle new cases that they've never been
> asked to do before - they just never saw anything like a 64kB request
> before or something that crossed its own alignment.
>
> But the _higher_ levels are there. And there's absolutely nothing in the
> design that is a real problem. But there's no question that you might
> need to fix up more than one or two low-level drivers.
>
> (The only drivers I know better are the IDE ones, and as far as I can
> tell they'd have no trouble at all with any of this. Most other normal
> drivers are likely to be in this same situation. But because I've not had
> a reason to test, I certainly won't guarantee even that).

PCI has dma_mask, which distinguishes different device capabilities. This
nice interface handles 64-bit capable devices, 32-bit ones, ISA
limitations (the old 16MB limit) and some other strange devices. This mask
appears in block devices one way or another so that bounce buffers are
used for high addresses.

How about a mask for block devices which indicates the kinds of alignment
and lengths that the driver can handle? For old drivers that can't be
thoroughly tested, we assume the worst. Some devices have hardware
limitations. Newer, tested drivers can relax the limits.

It's probably not difficult to say, "this 64k request can't be handled so
split it into 1k requests". It integrates naturally with the decision to
use bounce buffers -- alignment restrictions cause copying just as high
addresses cause copying.

-- Jamie
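The mask Jamie describes might look something like this -- a sketch with
invented names, not an existing interface:

	#include <linux/types.h>

	/* Per-queue limits; conservative defaults for untested drivers,
	 * relaxed by drivers that are known to handle more. */
	struct blk_limits {
		unsigned int max_len;      /* largest single request    */
		unsigned int align_mask;   /* required buffer alignment */
		u64 dma_mask;              /* highest DMA-able address  */
	};

	static const struct blk_limits conservative_limits = {
		.max_len    = 1024,        /* "split it into 1k requests" */
		.align_mask = 511,         /* sector-aligned buffers only */
		.dma_mask   = 0x00ffffff,  /* ISA-style 16MB limit        */
	};

	/* Mirrors the bounce-buffer decision: an oversized, misaligned
	 * or too-high request gets split or copied, just as high
	 * addresses already do today. */
	static int request_needs_fixup(const struct blk_limits *lim,
				       u64 addr, unsigned int len)
	{
		return len > lim->max_len ||
		       (addr & lim->align_mask) != 0 ||
		       (addr + len - 1) > lim->dma_mask;
	}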
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Linus Torvalds wrote:
>
> On Thu, 8 Feb 2001, Rik van Riel wrote:
> > On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > > You need aio_open.
> > > > Could you explain this?
> > >
> > > If the server is sending many small files, disk spends huge
> > > amount time walking directory tree and seeking to inodes. Maybe
> > > opening the file is even slower than reading it
> >
> > Not if you have a big enough inode_cache and dentry_cache.
> >
> > OTOH ... if you have enough memory the whole async IO argument
> > is moot anyway because all your files will be in memory too.
>
> Note that this _is_ an important point.
>
> You should never _ever_ think about pure IO speed as the most important
> thing. Even if you get absolutely perfect IO streaming off the fastest
> disk you can find, I will beat you every single time with a cached setup
> that doesn't need to do IO at all.
>
> 90% of the VFS layer is all about caching, and trying to avoid IO. Of
> the rest, about 9% is about trying to avoid even calling down to the
> low-level filesystem, because it's faster if we can handle it at a high
> level without any need to even worry about issues like physical disk
> addresses. Even if those addresses are cached.
>
> The remaining 1% is about actually getting the IO done. At that point we
> end up throwing our hands in the air and saying "ok, this will be slow".
>
> So if you design your system for disk load, you are missing a big
> portion of the picture.
>
> There are cases where IO really matters. The most notable one being
> databases, certainly _not_ web or ftp servers. For web- or ftp-servers
> you buy more memory if you want high performance, and you tend to be
> limited by the network speed anyway (if you have multiple gigabit
> networks and network speed isn't an issue, then I can also tell you that
> buying a few gigabytes of RAM isn't an issue, because you are obviously
> working for something like the DoD and have very little regard for the
> cost of the thing ;)
>
> For databases (and for file servers that you want to be robust over a
> crash), IO throughput is an issue mainly because you need to put the
> damn requests in stable memory somewhere. Which tends to mean that
> _write_ speed is what really matters, because the reads you can still
> try to cache as efficiently as humanly possible (and the issue of
> database design then turns into trying to find every single piece of
> locality you can, so that the read caching works as well as possible).
>
> Short and sweet: "aio_open()" is basically never supposed to be an
> issue. If it is, you've misdesigned something, or you're trying too damn
> hard to single-thread everything (and "hiding" the threading that _does_
> happen by just calling it "AIO" instead - lying to yourself, in short).

Right - I agree with you that an AIO design is basically hiding an
inherently multi-threaded program flow. This argument is indeed very
catchy. And looking at it from another point of view, one will see that
most of the AIO designs are from times when multi-threading in
applications wasn't as common as it is now. Most prominently, coprocesses
in a shell come to my mind as a very good example of how to handle AIO
(sort of)...
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote:
> > How do you write high-performance ftp server without threads if select
> > on regular file always returns "ready"?
>
> No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

Ahh, but even VMS SYS$QIO is synchronous at doing opens, allocation of the
IO request packets, and mapping file location to disk blocks. Only the
data IO is ever async (and Ben's async IO stuff for Linux provides that
too).

--Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Rik van Riel wrote:
> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > You need aio_open.
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge
> > amount time walking directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.
>
> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Note that this _is_ an important point.

You should never _ever_ think about pure IO speed as the most important
thing. Even if you get absolutely perfect IO streaming off the fastest
disk you can find, I will beat you every single time with a cached setup
that doesn't need to do IO at all.

90% of the VFS layer is all about caching, and trying to avoid IO. Of the
rest, about 9% is about trying to avoid even calling down to the low-level
filesystem, because it's faster if we can handle it at a high level
without any need to even worry about issues like physical disk addresses.
Even if those addresses are cached.

The remaining 1% is about actually getting the IO done. At that point we
end up throwing our hands in the air and saying "ok, this will be slow".

So if you design your system for disk load, you are missing a big portion
of the picture.

There are cases where IO really matters. The most notable one being
databases, certainly _not_ web or ftp servers. For web- or ftp-servers you
buy more memory if you want high performance, and you tend to be limited
by the network speed anyway (if you have multiple gigabit networks and
network speed isn't an issue, then I can also tell you that buying a few
gigabytes of RAM isn't an issue, because you are obviously working for
something like the DoD and have very little regard for the cost of the
thing ;)

For databases (and for file servers that you want to be robust over a
crash), IO throughput is an issue mainly because you need to put the damn
requests in stable memory somewhere. Which tends to mean that _write_
speed is what really matters, because the reads you can still try to cache
as efficiently as humanly possible (and the issue of database design then
turns into trying to find every single piece of locality you can, so that
the read caching works as well as possible).

Short and sweet: "aio_open()" is basically never supposed to be an issue.
If it is, you've misdesigned something, or you're trying too damn hard to
single-thread everything (and "hiding" the threading that _does_ happen by
just calling it "AIO" instead - lying to yourself, in short).

		Linus
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
>
> On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
> >
> > > How do you write high-performance ftp server without threads if
> > > select on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue
> if its full in Linus mind.

Not necessarily.

I said that "READA/WRITEA" are only worth exporting inside the kernel -
because the latencies and complexities are low-level enough that it should
not be exported to user space as such.

But I could imagine a kernel aio package that does the equivalent of

	bh->b_end_io = completion_handler;
	generic_make_request(WRITE, bh);	/* this may block */
	bh = bh->b_next;

	/* Now, fill it up as much as we can.. */
	current->state = TASK_INTERRUPTIBLE;
	while (more data to be written) {
		if (generic_make_request(WRITEA, bh) < 0)
			break;
		bh = bh->b_next;
	}
	return;

and then you make the _completion handler_ thing continue to feed more
requests. Yes, you may block at some points (because you need to always
have at least _one_ request in-flight in order to have the state machine
active), but you can basically try to avoid blocking more than necessary.

But do you see why the above can't be done from user space?

It requires that the completion handler (which runs in an interrupt
context) be able to continue to feed requests and keep the queue filled.
If you don't do that, you'll never have good throughput, because it takes
too long to send signals, re-schedule or whatever to user mode.

And do you see how it has to block _sometimes_? If people do hundreds of
AIO requests, we can't let memory just fill up with pending writes..

		Linus
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Martin Dalecki wrote:
>
> > But you'll have a bitch of a time trying to merge multiple
> > threads/processes reading from the same area on disk at roughly the
> > same time. Your higher levels won't even _know_ that there is merging
> > to be done until the IO requests hit the wall in waiting for the disk.
>
> Merging is a hardware-tied optimization, so it should happen where you
> really have full "knowledge" and control of the hardware -> namely the
> device driver.

Or, in many cases, the device itself.

There are valid reasons for not doing merging in the driver, but they all
tend to boil down to "even lower layers can do a better job of it". They
basically _never_ boil down to "upper layers already did it for us".

That said, there tend to be advantages to doing "appropriate" clustering
at each level. Upper layers can (and do) use read-ahead to help the lower
levels. The write-out can (and currently does not) try to sort the
requests for better elevator behaviour. The driver level can (and does)
further cluster the requests - even if the low-level device does a perfect
job of ordering and merging on its own, it's usually advantageous to have
fewer (and bigger) commands in-flight in order to have fewer completion
interrupts and less command traffic on the bus.

So it's obviously not entirely black-and-white. Upper layers can help, but
it's a mistake to think that they should "do the work".

(Note: a lot of people seem to think that "layering" means that the
complexity is in upper layers, and that lower layers should be simple and
"stupid". This is not true. A well-balanced layering would have all layers
doing potentially equally complex things - but the complexity should be
_independent_. Complex interactions are bad. But it's also bad to think
that lower levels shouldn't be allowed to optimize because they should be
"simple".)

		Linus
Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
On Thu, 8 Feb 2001, Pavel Machek wrote:
>
> > There are currently no other alternatives in user space. You'd have to
> > create whole new interfaces for aio_read/write, and ways for the kernel
> > to inform user space that "now you can re-try submitting your IO".
>
> Why is current select() interface not good enough?

Ehh..

One major reason is rather simple: disk request wait times tend to be on
the order of sub-millisecond (remember: if we run out of requests, that
means that we have 256 of them already queued, which means that it's very
likely that several of them will be freed up in the very near future due
to completion).

The fact is, that if you start doing write/select loops, you're going to
waste a _large_ portion of your CPU speed on it. Especially considering
that the select() call would have to go all the way down to the ll_rw_blk
layer to figure out whether there are more requests etc.

So there are (a) historical reasons that say that regular files can never
wait and EAGAIN is not an acceptable return value, and (b) practical
reasons for why such an interface would be a bad one.

There are better ways to do it. Either using threads, or just having a
better aio-like interface.

		Linus
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Rik van Riel wrote:
> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > You need aio_open.
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge
> > amount time walking directory tree and seeking to inodes. Maybe
> > opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.

Eh? However big the caches are, you can still get misses which will
require multiple (blocking) disk accesses to handle...

> OTOH ... if you have enough memory the whole async IO argument
> is moot anyway because all your files will be in memory too.

Only for cache hits. If you're doing a Mindcraft benchmark or something
with everything in RAM, you're fine - for real world servers, that's not
really an option ;-)

Really, you want/need cache MISSES to be handled without blocking.
However big the caches, short of running EVERYTHING from a ramdisk, these
will still happen!

James.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > You need aio_open.
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge
> amount time walking directory tree and seeking to inodes. Maybe
> opening the file is even slower than reading it

Not if you have a big enough inode_cache and dentry_cache.

OTOH ... if you have enough memory the whole async IO argument
is moot anyway because all your files will be in memory too.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://www.conectiva.com/
http://distro.conectiva.com/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > The problem is that aio_read and aio_write are pretty useless for ftp
> > > or http server. You need aio_open.
> >
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge amount time
> walking directory tree and seeking to inodes. Maybe opening the file is
> even slower than reading it - read is usually sequential but open needs
> to seek at few areas of disk.
>
> And if you have one-threaded server using open, close, aio_read and
> aio_write, you actually block the whole server while it is opening a
> single file. This is not how async io is supposed to work.

Ok but this is not the point of the discussion.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
> > The problem is that aio_read and aio_write are pretty useless for ftp
> > or http server. You need aio_open.
>
> Could you explain this?

If the server is sending many small files, disk spends huge amount time
walking directory tree and seeking to inodes. Maybe opening the file is
even slower than reading it - read is usually sequential but open needs to
seek at few areas of disk.

And if you have one-threaded server using open, close, aio_read and
aio_write, you actually block the whole server while it is opening a
single file. This is not how async io is supposed to work.

Mikulas
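The pattern being criticized, sketched with the POSIX aio calls of the
day. Everything below is standard except the missing piece itself -- there
is no aio_open(), which is exactly Mikulas' point:

	#include <aio.h>
	#include <fcntl.h>
	#include <string.h>

	/* 'cb' must outlive the async request; static only for brevity. */
	static struct aiocb cb;

	int start_sending(const char *path, char *buf, size_t len)
	{
		/* BLOCKS: the directory-tree walk and inode seeks happen
		 * synchronously, stalling every connection served by this
		 * one thread. */
		int fd = open(path, O_RDONLY);
		if (fd < 0)
			return -1;

		/* Only from here on is the io actually asynchronous. */
		memset(&cb, 0, sizeof cb);
		cb.aio_fildes = fd;
		cb.aio_buf = buf;
		cb.aio_nbytes = len;
		cb.aio_offset = 0;
		return aio_read(&cb);
	}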
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, Feb 08 2001, Mikulas Patocka wrote:
> > Even async IO (ie aio_read/aio_write) should block on the request
> > queue if its full in Linus mind.
>
> This is not problem (you can create queue big enough to handle the load).

Well in theory, but in practice this isn't a very good idea. At some point
throwing yet more requests in there doesn't make a whole lot of sense. You
are basically _always_ going to be able to empty the request list by
dirtying lots of data.

--
Jens Axboe
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > How do you write high-performance ftp server without threads if
> > > > select on regular file always returns "ready"?
> > >
> > > Select can work if the access is sequential, but async IO is a more
> > > general solution.
> >
> > Even async IO (ie aio_read/aio_write) should block on the request
> > queue if its full in Linus mind.
>
> This is not problem (you can create queue big enough to handle the load).

The point is that you want to be able to not block if the queue is full
(and the queue size has nothing to do with that).

> The problem is that aio_read and aio_write are pretty useless for ftp or
> http server. You need aio_open.

Could you explain this?
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
> > > How do you write high-performance ftp server without threads if
> > > select on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue
> if its full in Linus mind.

This is not problem (you can create queue big enough to handle the load).

The problem is that aio_read and aio_write are pretty useless for ftp or
http server. You need aio_open.

Mikulas
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
>
> On Thu, 8 Feb 2001, Ben LaHaise wrote:
> > > (besides, latency would suck. I bet you're better off waiting for
> > > the requests if they are all used up. It takes too long to get deep
> > > into the kernel from user space, and you cannot use the exclusive
> > > waiters with its anti-herd behaviour etc).
> >
> > Ah, but no. In fact for some things, the wait queue extensions I'm
> > using will be more efficient as things like test_and_set_bit for
> > obtaining a lock gets executed without waking up a task.
>
> The latency argument is somewhat bogus because there is no problem to
> check the request queue, in the aio syscalls, and simply fail if its
> full.

Ugh, I forgot to say: check the request queue before doing any filesystem
work.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Ben LaHaise wrote:
> > (besides, latency would suck. I bet you're better off waiting for the
> > requests if they are all used up. It takes too long to get deep into
> > the kernel from user space, and you cannot use the exclusive waiters
> > with its anti-herd behaviour etc).
>
> Ah, but no. In fact for some things, the wait queue extensions I'm using
> will be more efficient as things like test_and_set_bit for obtaining a
> lock gets executed without waking up a task.

The latency argument is somewhat bogus because there is no problem to
check the request queue, in the aio syscalls, and simply fail if its full.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Linus Torvalds wrote:
> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel
> to inform user space that "now you can re-try submitting your IO".
>
> Could be done. But that's a big thing.

Has been done. Still needs some work, but it works pretty well. As for
throttling io, having ios submitted does not have to correspond to them
being queued in the lower layers. The main issue with async io is
limiting the amount of pinned memory for ios; if that's taken care of, I
don't think it matters how many ios are in flight.

> > An application which sets non blocking behavior and busy waits for a
> > request (which seems to be your argument) is just stupid, of course.
>
> Tell me what else it could do at some point? You need something like
> select() to wait on it. There are no such interfaces right now...
>
> (besides, latency would suck. I bet you're better off waiting for the
> requests if they are all used up. It takes too long to get deep into the
> kernel from user space, and you cannot use the exclusive waiters with
> its anti-herd behaviour etc).

Ah, but no. In fact for some things, the wait queue extensions I'm using
will be more efficient as things like test_and_set_bit for obtaining a
lock gets executed without waking up a task.

> Simple rule: if you want to optimize concurrency and avoid waiting - use
> several processes or threads instead. At which point you can get real
> work done on multiple CPU's, instead of worrying about what happens when
> you have to wait on the disk.

There do exist plenty of cases where threads are not efficient enough.
Just the stack overhead alone with 8000 threads makes things really suck.
Event based io completion means that server processes don't need to have
the overhead of select/poll. Add in NT style completion ports for waking
up the right number of worker threads off of the completion queue, and

That said, I don't expect all devices to support async io. But given
support for files, raw and sockets all the important cases are covered.
The remainder can be supported via userspace helpers.

		-ben
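For readers unfamiliar with the completion-port pattern Ben refers to,
here is a rough user-space sketch with invented types (this is not the
API of his patch). One queue collects completion events; a small pool of
workers consumes them, so a server needs neither a thread per request nor
a select/poll scan:

	#include <pthread.h>

	struct io_event {
		void *cookie;   /* which request finished      */
		long result;    /* bytes transferred or -errno */
	};

	struct completion_q {
		pthread_mutex_t lock;
		pthread_cond_t wait;
		struct io_event ring[256];  /* assumes <= 256 in flight */
		unsigned head, tail;        /* head == tail means empty */
	};

	/* Completion side: publish one event and wake exactly one
	 * worker -- the "right number of worker threads" idea. */
	void cq_push(struct completion_q *q, struct io_event ev)
	{
		pthread_mutex_lock(&q->lock);
		q->ring[q->head++ % 256] = ev;
		pthread_cond_signal(&q->wait);
		pthread_mutex_unlock(&q->lock);
	}

	/* Worker side: sleep until a completion arrives. */
	struct io_event cq_pop(struct completion_q *q)
	{
		struct io_event ev;
		pthread_mutex_lock(&q->lock);
		while (q->head == q->tail)
			pthread_cond_wait(&q->wait, &q->lock);
		ev = q->ring[q->tail++ % 256];
		pthread_mutex_unlock(&q->lock);
		return ev;
	}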
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi! > So you consider inability to select() on regular files _feature_? select on files is unimplementable. You can't do background file IO the same way you do background receiving of packets on socket. Filesystem is synchronous. It can block. > It can be a pretty serious problem with slow block devices > (floppy). It also hurts when you are trying to do high-performance > reads/writes. [I know it hurt in userspace sherlock search engine -- > kind of small altavista.] > > How do you write high-performance ftp server without threads if select > on regular file always returns "ready"? No, it's not really possible on Linux. Use SYS$QIO call on VMS :-) You can emulate asynchronous IO with kernel threads like FreeBSD and some commercial Unices do, but you still need as many (possibly kernel) threads as many requests you are servicing. > > Remember: in the end you HAVE to wait somewhere. You're always going to be > > able to generate data faster than the disk can take it. SOMETHING > > Userspace wants to _know_ when to stop. It asks politely using > "select()". And how do you want to wait for other select()ed events if you are blocked in wait_for_buffer in get_block (former bmap)? Making real async IO would require to rewrite all filesystems and whole VFS _from_scratch_. It won't happen. Mikulas - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
On Thu, 8 Feb 2001, Pavel Machek wrote:
> Hi!
>
> > > Its arguing against making a smart application block on the disk
> > > while its able to use the CPU for other work.
> >
> > There are currently no other alternatives in user space. You'd have to
> > create whole new interfaces for aio_read/write, and ways for the
> > kernel to inform user space that "now you can re-try submitting your
> > IO".
>
> Why is current select() interface not good enough?

Think of random disk io scattered across the disk. Think about aio_write
providing a means to perform zero copy io without needing to resort to
playing mm tricks write protecting pages in the user's page tables. It's
also a means for dealing efficiently with thousands of outstanding
requests for network io. Using a select based interface is going to be an
ugly kludge that still has all the overhead of select/poll.

		-ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
> > How do you write high-performance ftp server without threads if select
> > on regular file always returns "ready"?
>
> Select can work if the access is sequential, but async IO is a more
> general solution.

Even async IO (ie aio_read/aio_write) should block on the request queue if
its full in Linus mind.
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote:
> > EAGAIN is _not_ a valid return value for block devices or for regular
> > files. And in fact it _cannot_ be, because select() is defined to
> > always return 1 on them - so if a write() were to return EAGAIN, user
> > space would have nothing to wait on. Busy waiting is evil.
>
> So you consider inability to select() on regular files _feature_?

Select might make some sort of sense for sequential access to files, and
for random access via lseek/read, but it makes no sense at all for pread
and pwrite, where select() has no idea _which_ part of the file the user
is going to want to access next.

> How do you write high-performance ftp server without threads if select
> on regular file always returns "ready"?

Select can work if the access is sequential, but async IO is a more
general solution.

Cheers,
 Stephen
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Linus Torvalds wrote:
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> > On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> > > The whole point of the post was that it is merging, not splitting,
> > > which is troublesome. How are you going to merge requests without
> > > having chains of scatter-gather entities each with their own
> > > completion callbacks?
> >
> > Let me just emphasize what Stephen is pointing out: if requests are
> > properly merged at higher layers, then merging is neither required nor
> > desired.
>
> I will claim that you CANNOT merge at higher levels and get good
> performance.
>
> Sure, you can do read-ahead, and try to get big merges that way at a
> high level. Good for you.
>
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Merging is a hardware-tied optimization, so it should happen where you
really have full "knowledge" and control of the hardware -> namely the
device driver.

> Quite frankly, this whole discussion sounds worthless. We have solved
> this problem already: it's called a "buffer head". Deceptively simple at
> higher levels, and lower levels can easily merge them together into
> chains and do fancy scatter-gather structures of them that can be
> dynamically extended at any time.
>
> The buffer heads together with "struct request" do a hell of a lot more
> than just a simple scatter-gather: it's able to create ordered lists of
> independent sg-events, together with full call-backs etc. They are
> low-cost, fairly efficient, and they have worked beautifully for years.
>
> The fact that kiobufs can't be made to do the same thing is somebody
> elses problem. I _know_ that merging has to happen late, and if others
> are hitting their heads against this issue until they turn silly, then
> that's their problem. You'll eventually learn, or you'll hit your heads
> into a pulp.

Amen.

--
- phone: +49 214 8656 283
- job:   STOCK-WORLD Media AG, LEV .de (MY OPPINNIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort: ru_RU.KOI8-R
select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
Hi!

> > Its arguing against making a smart application block on the disk while
> > its able to use the CPU for other work.
>
> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel
> to inform user space that "now you can re-try submitting your IO".

Why is current select() interface not good enough?

Defining that select may say a regular file is not ready should be enough.
Okay, maybe you'd want a new fcntl() flag saying "I _really_ want this
regular file to be non-blocking". No need for new interfaces.

								Pavel
--
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
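Concretely, the proposal amounts to something like the loop below. The
O_FILE_NONBLOCK flag is invented here to illustrate the shape of the
interface -- neither it nor non-blocking semantics for regular files
exist:

	#include <fcntl.h>
	#include <sys/select.h>
	#include <unistd.h>
	#include <errno.h>

	#define O_FILE_NONBLOCK 0x40000000	/* hypothetical flag */

	ssize_t proposed_write(int fd, const void *buf, size_t len)
	{
		fcntl(fd, F_SETFL, O_NONBLOCK | O_FILE_NONBLOCK);

		for (;;) {
			ssize_t n = write(fd, buf, len);
			if (n >= 0 || errno != EAGAIN)
				return n;

			/* Under the proposal, select() would sleep until
			 * the request queue drains instead of claiming
			 * regular files are always ready. */
			fd_set w;
			FD_ZERO(&w);
			FD_SET(fd, &w);
			select(fd + 1, NULL, &w, NULL, NULL);
		}
	}

This is precisely the write/select loop Linus objects to elsewhere in the
thread: with sub-millisecond request wait times, it burns a large share
of CPU on system calls.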
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi!

> > > Reading write(2):
> > >
> > >    EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and
> > >           there was no room in the pipe or socket connected to fd to
> > >           write the data immediately.
> > >
> > > I see no reason why "aio function have to block waiting for
> > > requests".
> >
> > That was my reasoning too with READA etc, but Linus seems to want that
> > we can block while submitting the I/O (as throttling, Linus?) just not
> > until completion.
>
> Note the "in the pipe or socket" part.
>           ^^^^^^^^^^^^^^^^^^^^^
>
> EAGAIN is _not_ a valid return value for block devices or for regular
> files. And in fact it _cannot_ be, because select() is defined to always
> return 1 on them - so if a write() were to return EAGAIN, user space
> would have nothing to wait on. Busy waiting is evil.

So you consider inability to select() on regular files _feature_?

It can be a pretty serious problem with slow block devices (floppy). It
also hurts when you are trying to do high-performance reads/writes. [I
know it hurt in userspace sherlock search engine -- kind of small
altavista.]

How do you write high-performance ftp server without threads if select on
regular file always returns "ready"?

> Remember: in the end you HAVE to wait somewhere. You're always going to
> be able to generate data faster than the disk can take it. SOMETHING

Userspace wants to _know_ when to stop. It asks politely using
"select()".

								Pavel
--
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 10:14:21AM -0800, Linus Torvalds wrote:
> I will claim that you CANNOT merge at higher levels and get good
> performance.
>
> Sure, you can do read-ahead, and try to get big merges that way at a
> high level. Good for you.
>
> But you'll have a bitch of a time trying to merge multiple
> threads/processes reading from the same area on disk at roughly the same
> time. Your higher levels won't even _know_ that there is merging to be
> done until the IO requests hit the wall in waiting for the disk.

Hi,

I've tried to experimentally check this statement. I instrumented a kernel
with the following patch. It keeps a counter for every merge between
unrelated requests. An unrelated merge is defined as one between requests
allocated by different values of current.

I did various tests and surprisingly I was not able to trigger a single
unrelated merge on my IDE system with various IO loads (dbench, news
expire, news sort, kernel compile, swapping ...)

So either my patch is wrong (if yes, what is wrong?), or they simply do
not happen in usual IO loads. I know that it has a few holes (like it
doesn't count unrelated merges that happen from the same process, or if a
process quits and another one gets its kernel stack and IO of both is
merged it'll be counted as a related merge), but if unrelated merges were
relevant, more should still show up, no?

My pet theory is that the page and buffer cache filter most unrelated
merges out. I haven't tried to use raw IO to avoid this problem, but I
expect that anything that does raw IO will do some intelligent IO
scheduling on its own anyways.

If anyone is interested: it would be interesting if other people are able
to trigger unrelated merges in real loads.

Here is the patch. Display statistics using:

	(echo print unrelated_merge ; echo print related_merge) | \
		gdb vmlinux /proc/kcore

--- linux/drivers/block/ll_rw_blk.c-REQSTAT	Tue Jan 30 13:33:25 2001
+++ linux/drivers/block/ll_rw_blk.c	Thu Feb  8 01:13:57 2001
@@ -31,6 +31,9 @@
 
 #include <linux/module.h>
 
+int unrelated_merge;
+int related_merge;
+
 /*
  * MAC Floppy IWM hooks
  */
@@ -478,6 +481,7 @@
 		rq->rq_status = RQ_ACTIVE;
 		rq->special = NULL;
 		rq->q = q;
+		rq->originator = current;
 	}
 
 	return rq;
@@ -668,6 +672,11 @@
 	if (!q->merge_requests_fn(q, req, next, max_segments))
 		return;
 
+	if (next->originator != req->originator)
+		unrelated_merge++;
+	else
+		related_merge++;
+
 	q->elevator.elevator_merge_req_fn(req, next);
 	req->bhtail->b_reqnext = next->bh;
 	req->bhtail = next->bhtail;
--- linux/include/linux/blkdev.h-REQSTAT	Tue Jan 30 17:17:01 2001
+++ linux/include/linux/blkdev.h	Wed Feb  7 23:33:35 2001
@@ -45,6 +45,8 @@
 	struct buffer_head * bh;
 	struct buffer_head * bhtail;
 	request_queue_t *q;
+
+	struct task_struct *originator;
 };
 
 #include <linux/elevator.h>

-Andi
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 10:14:21AM -0800, Linus Torvalds wrote: I will claim that you CANNOT merge at higher levels and get good performance. Sure, you can do read-ahead, and try to get big merges that way at a high level. Good for you. But you'll have a bitch of a time trying to merge multiple threads/processes reading from the same area on disk at roughly the same time. Your higher levels won't even _know_ that there is merging to be done until the IO requests hit the wall in waiting for the disk. Hi, I've tried to experimentally check this statement. I instrumented a kernel with the following patch. It keeps a counter for every merge between unrelated requests. An unrelated requests is defined as the requests getting allocated from different currents. I did various tests and suprisingly I was not able to trigger a single unrelated merge on my IDE system with various IO loads (dbench, news expire, news sort, kernel compile, swapping ...) So either my patch is wrong (if yes, what is wrong?), or they do simply not happen in usual IO loads. I know that it has a few holes (like it doesn't count unrelated merges that happen from the same process, or if a process quits and another one gets its kernel stack and IO of both is merged it'll be counted as related merge), but if unrelated merges were relevant there should still show up more, no? My pet theory is that page and buffer cache filters most unrelated merges out. I haven't tried to use raw IO to avoid this problem, but I expect that anything that does raw IO will do some intelligent IO scheduling on its own anyways. If anyone is interested: it would be interesting if other people are able to trigger unrelated merges in real loads. Here is a patch. Display statistics using: (echo print unrelated_merge ; print related_merge ) | gdb vmlinux /proc/kcore --- linux/drivers/block/ll_rw_blk.c-REQSTAT Tue Jan 30 13:33:25 2001 +++ linux/drivers/block/ll_rw_blk.c Thu Feb 8 01:13:57 2001 @@ -31,6 +31,9 @@ #include linux/module.h +int unrelated_merge; +int related_merge; + /* * MAC Floppy IWM hooks */ @@ -478,6 +481,7 @@ rq-rq_status = RQ_ACTIVE; rq-special = NULL; rq-q = q; + rq-originator = current; } return rq; @@ -668,6 +672,11 @@ if (!q-merge_requests_fn(q, req, next, max_segments)) return; + if (next-originator != req-originator) + unrelated_merge++; + else + related_merge++; + q-elevator.elevator_merge_req_fn(req, next); req-bhtail-b_reqnext = next-bh; req-bhtail = next-bhtail; --- linux/include/linux/blkdev.h-REQSTATTue Jan 30 17:17:01 2001 +++ linux/include/linux/blkdev.hWed Feb 7 23:33:35 2001 @@ -45,6 +45,8 @@ struct buffer_head * bh; struct buffer_head * bhtail; request_queue_t *q; + + struct task_struct *originator; }; #include linux/elevator.h -Andi - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi! Reading write(2): EAGAIN Non-blocking I/O has been selected using O_NONBLOCK and there was no room in the pipe or socket connected to fd to write the data immediately. I see no reason why "aio function have to block waiting for requests". That was my reasoning too with READA etc, but Linus seems to want that we can block while submitting the I/O (as throttling, Linus?) just not until completion. Note the "in the pipe or socket" part. ^^ EAGAIN is _not_ a valid return value for block devices or for regular files. And in fact it _cannot_ be, because select() is defined to always return 1 on them - so if a write() were to return EAGAIN, user space would have nothing to wait on. Busy waiting is evil. So you consider inability to select() on regular files _feature_? It can be a pretty serious problem with slow block devices (floppy). It also hurts when you are trying to do high-performance reads/writes. [I know it hurt in userspace sherlock search engine -- kind of small altavista.] How do you write high-performance ftp server without threads if select on regular file always returns "ready"? Remember: in the end you HAVE to wait somewhere. You're always going to be able to generate data faster than the disk can take it. SOMETHING Userspace wants to _know_ when to stop. It asks politely using "select()". Pavel -- I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
Hi! Its arguing against making a smart application block on the disk while its able to use the CPU for other work. There are currently no other alternatives in user space. You'd have to create whole new interfaces for aio_read/write, and ways for the kernel to inform user space that "now you can re-try submitting your IO". Why is current select() interface not good enough? Defining that select may say regular file is not ready should be enough. Okay, maybe you'd want new fcntl() flag saying "I _really_ want this regular file to be non-blocking". No need for new interfaces. Pavel -- I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care." Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Thu, Feb 08, 2001 at 12:15:13AM +0100, Pavel Machek wrote: EAGAIN is _not_ a valid return value for block devices or for regular files. And in fact it _cannot_ be, because select() is defined to always return 1 on them - so if a write() were to return EAGAIN, user space would have nothing to wait on. Busy waiting is evil. So you consider inability to select() on regular files _feature_? Select might make some sort of sense for sequential access to files, and for random access via lseek/read but it makes no sense at all for pread and pwrite where select() has no idea _which_ part of the file the user is going to want to access next. How do you write high-performance ftp server without threads if select on regular file always returns "ready"? Select can work if the access is sequential, but async IO is a more general solution. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:

<snip>

> > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
>
> Select can work if the access is sequential, but async IO is a more general solution.

Even async IO (ie aio_read/aio_write) should block on the request queue if it's full, in Linus' mind.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
On Thu, 8 Feb 2001, Pavel Machek wrote:
> Hi!
>
> > > Its arguing against making a smart application block on the disk while its able to use the CPU for other work.
> > >
> > > There are currently no other alternatives in user space. You'd have to create whole new interfaces for aio_read/write, and ways for the kernel to inform user space that "now you can re-try submitting your IO".
>
> Why is current select() interface not good enough?

Think of random disk io scattered across the disk. Think about aio_write providing a means to perform zero-copy io without needing to resort to playing mm tricks such as write-protecting pages in the user's page tables. It's also a means for dealing efficiently with thousands of outstanding requests for network io. Using a select-based interface is going to be an ugly kludge that still has all the overhead of select/poll.

-ben
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
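For flavour, this is what the submission/completion split looks like through the POSIX AIO interface (glibc's aio_write(), not Ben's kernel patch -- shown only because the patch's user-visible API does not appear in this thread):

#include <aio.h>
#include <errno.h>
#include <string.h>

/* Queue an async write; returns immediately instead of blocking
 * until the data reaches the request queue. */
static int submit_write(int fd, char *buf, size_t len, off_t off,
			struct aiocb *cb)
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_fildes = fd;
	cb->aio_buf    = buf;
	cb->aio_nbytes = len;
	cb->aio_offset = off;
	return aio_write(cb);
}

/* Later: harvest the result without ever having slept in write(). */
static ssize_t harvest(struct aiocb *cb)
{
	const struct aiocb *const list[1] = { cb };

	while (aio_error(cb) == EINPROGRESS)
		aio_suspend(list, 1, NULL);	/* sleep until done */
	return aio_return(cb);
}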
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi!

> > > So you consider inability to select() on regular files _feature_?
> >
> > select on files is unimplementable. You can't do background file IO the same way you do background receiving of packets on socket. Filesystem is synchronous. It can block.
>
> It can be a pretty serious problem with slow block devices (floppy). It also hurts when you are trying to do high-performance reads/writes. [I know it hurt in userspace sherlock search engine -- kind of small altavista.] How do you write high-performance ftp server without threads if select on regular file always returns "ready"?

No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

You can emulate asynchronous IO with kernel threads like FreeBSD and some commercial Unices do, but you still need as many (possibly kernel) threads as requests you are servicing.

> > Remember: in the end you HAVE to wait somewhere. You're always going to be able to generate data faster than the disk can take it. SOMETHING
>
> Userspace wants to _know_ when to stop. It asks politely using "select()".

And how do you want to wait for other select()ed events if you are blocked in wait_for_buffer in get_block (former bmap)? Making real async IO would require rewriting all filesystems and the whole VFS _from_scratch_. It won't happen.

Mikulas
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
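The emulation Mikulas mentions amounts to pinning one thread per outstanding request. A bare-bones sketch of the idea (illustrative names only, not FreeBSD's actual implementation):

#include <pthread.h>
#include <unistd.h>

struct io_req {
	int	fd;
	void	*buf;
	size_t	len;
	off_t	off;
	ssize_t	result;
	int	notify_fd;	/* write end of a pipe the main
				 * loop can select() on */
};

static void *io_worker(void *arg)
{
	struct io_req *r = arg;

	r->result = pread(r->fd, r->buf, r->len, r->off);  /* blocks */
	write(r->notify_fd, "x", 1);	/* wake the select() loop */
	return NULL;
}

/* One thread per request in flight -- the scaling problem:
 * 8000 outstanding requests means 8000 stacks. */
static int submit(struct io_req *r)
{
	pthread_t t;
	int err = pthread_create(&t, NULL, io_worker, r);

	if (!err)
		pthread_detach(t);
	return err;
}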
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Linus Torvalds wrote:
> > There are currently no other alternatives in user space. You'd have to create whole new interfaces for aio_read/write, and ways for the kernel to inform user space that "now you can re-try submitting your IO".
>
> Could be done. But that's a big thing.

Has been done. Still needs some work, but it works pretty well. As for throttling io, having ios submitted does not have to correspond to them being queued in the lower layers. The main issue with async io is limiting the amount of pinned memory for ios; if that's taken care of, I don't think it matters how many ios are in flight.

> An application which sets non blocking behavior and busy waits for a request (which seems to be your argument) is just stupid, of course.

Tell me what else it could do at some point? You need something like select() to wait on it. There are no such interfaces right now...

> (besides, latency would suck. I bet you're better off waiting for the requests if they are all used up. It takes too long to get deep into the kernel from user space, and you cannot use the exclusive waiters with its anti-herd behaviour etc).

Ah, but no. In fact for some things, the wait queue extensions I'm using will be more efficient, as things like test_and_set_bit for obtaining a lock get executed without waking up a task.

> Simple rule: if you want to optimize concurrency and avoid waiting - use several processes or threads instead. At which point you can get real work done on multiple CPU's, instead of worrying about what happens when you have to wait on the disk.

There do exist plenty of cases where threads are not efficient enough. Just the stack overhead alone with 8000 threads makes things really suck. Event based io completion means that server processes don't need to have the overhead of select/poll. Add in NT style completion ports for waking up the right number of worker threads off of the completion queue, and...

That said, I don't expect all devices to support async io. But given support for files, raw and sockets all the important cases are covered. The remainder can be supported via userspace helpers.

-ben
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Ben LaHaise wrote:

<snip>

> > (besides, latency would suck. I bet you're better off waiting for the requests if they are all used up. It takes too long to get deep into the kernel from user space, and you cannot use the exclusive waiters with its anti-herd behaviour etc).
>
> Ah, but no. In fact for some things, the wait queue extensions I'm using will be more efficient, as things like test_and_set_bit for obtaining a lock get executed without waking up a task.

The latency argument is somewhat bogus, because it is no problem to check the request queue in the aio syscalls and simply fail if it's full.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
> On Thu, 8 Feb 2001, Ben LaHaise wrote:
>
> <snip>
>
> > > (besides, latency would suck. I bet you're better off waiting for the requests if they are all used up. It takes too long to get deep into the kernel from user space, and you cannot use the exclusive waiters with its anti-herd behaviour etc).
> >
> > Ah, but no. In fact for some things, the wait queue extensions I'm using will be more efficient, as things like test_and_set_bit for obtaining a lock get executed without waking up a task.
>
> The latency argument is somewhat bogus, because it is no problem to check the request queue, in the aio syscalls, and simply fail if it's full.

Ugh, I forgot to say: check the request queue before doing any filesystem work.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
> > > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue if its full in Linus mind.

This is not a problem (you can create a queue big enough to handle the load).

The problem is that aio_read and aio_write are pretty useless for ftp or http server. You need aio_open.

Mikulas
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
> > >
> > > Select can work if the access is sequential, but async IO is a more general solution.
> >
> > Even async IO (ie aio_read/aio_write) should block on the request queue if its full in Linus mind.
>
> This is not a problem (you can create a queue big enough to handle the load).

The point is that you want to be able to not block if the queue is full (and the queue size has nothing to do with that).

> The problem is that aio_read and aio_write are pretty useless for ftp or http server. You need aio_open.

Could you explain this?

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, Feb 08 2001, Mikulas Patocka wrote:
> > Even async IO (ie aio_read/aio_write) should block on the request queue if its full in Linus mind.
>
> This is not a problem (you can create a queue big enough to handle the load).

Well in theory, but in practice this isn't a very good idea. At some point throwing yet more requests in there doesn't make a whole lot of sense. You are basically _always_ going to be able to empty the request list by dirtying lots of data.

-- 
Jens Axboe
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > The problem is that aio_read and aio_write are pretty useless for ftp or http server. You need aio_open.
> >
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge amount time walking directory tree and seeking to inodes. Maybe opening the file is even slower than reading it - read is usually sequential but open needs to seek at few areas of disk.
>
> And if you have one-threaded server using open, close, aio_read and aio_write, you actually block the whole server while it is opening a single file. This is not how async io is supposed to work.

Ok, but this is not the point of the discussion.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
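Absent an aio_open(), the only way a single-threaded server avoids the stall Mikulas describes is to farm the open() out to a helper thread -- a purely illustrative workaround, not anything proposed in the thread:

#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>

struct open_req {
	const char *path;
	int	   result_fd;
	int	   notify_fd;	/* pipe back to the select() loop */
};

static void *open_worker(void *arg)
{
	struct open_req *r = arg;

	/* This is the part that can block on directory-tree walks and
	 * inode seeks -- now off the server's main thread. */
	r->result_fd = open(r->path, O_RDONLY);
	write(r->notify_fd, "x", 1);
	return NULL;
}

The submission side is the same one-thread-per-outstanding-request pattern as in the read sketch earlier, with the same scaling cost.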
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > You need aio_open.
> >
> > Could you explain this?
>
> If the server is sending many small files, disk spends huge amount time walking directory tree and seeking to inodes. Maybe opening the file is even slower than reading it

Not if you have a big enough inode_cache and dentry_cache.

OTOH ... if you have enough memory the whole async IO argument is moot anyway because all your files will be in memory too.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml
Virtual memory is like a game you can't win; However, without VM there's truly nothing to lose... http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Rik van Riel wrote:
> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > You need aio_open.
> > >
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge amount time walking directory tree and seeking to inodes. Maybe opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.

Eh? However big the caches are, you can still get misses which will require multiple (blocking) disk accesses to handle...

> OTOH ... if you have enough memory the whole async IO argument is moot anyway because all your files will be in memory too.

Only for cache hits. If you're doing a Mindcraft benchmark or something with everything in RAM, you're fine - for real world servers, that's not really an option ;-)

Really, you want/need cache MISSES to be handled without blocking. However big the caches, short of running EVERYTHING from a ramdisk, these will still happen!

James.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait]
On Thu, 8 Feb 2001, Pavel Machek wrote:
> > There are currently no other alternatives in user space. You'd have to create whole new interfaces for aio_read/write, and ways for the kernel to inform user space that "now you can re-try submitting your IO".
>
> Why is current select() interface not good enough?

Ehh.. One major reason is rather simple: disk request wait times tend to be on the order of sub-millisecond (remember: if we run out of requests, that means that we have 256 of them already queued, which means that it's very likely that several of them will be freed up in the very near future due to completion).

The fact is, that if you start doing write/select loops, you're going to waste a _large_ portion of your CPU speed on it. Especially considering that the select() call would have to go all the way down to the ll_rw_blk layer to figure out whether there are more requests etc.

So there are (a) historical reasons that say that regular files can never wait and EAGAIN is not an acceptable return value, and (b) practical reasons for why such an interface would be a bad one. There are better ways to do it. Either using threads, or just having a better aio-like interface.

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Martin Dalecki wrote:
> > But you'll have a bitch of a time trying to merge multiple threads/processes reading from the same area on disk at roughly the same time. Your higher levels won't even _know_ that there is merging to be done until the IO requests hit the wall in waiting for the disk.
>
> Merging is a hardware tighted optimization, so it should happen, there we you really have full "knowlendge" and controll of the hardware - namely the device driver.

Or, in many cases, the device itself. There are valid reasons for not doing merging in the driver, but they all tend to boil down to "even lower layers can do a better job of it". They basically _never_ boil down to "upper layers already did it for us".

That said, there tend to be advantages to doing "appropriate" clustering at each level. Upper layers can (and do) use read-ahead to help the lower levels. The write-out can (and currently does not) try to sort the requests for better elevator behaviour. The driver level can (and does) further cluster the requests - even if the low-level device does a perfect job of ordering and merging on its own, it's usually advantageous to have fewer (and bigger) commands in-flight in order to have fewer completion interrupts and less command traffic on the bus.

So it's obviously not entirely black-and-white. Upper layers can help, but it's a mistake to think that they should "do the work".

(Note: a lot of people seem to think that "layering" means that the complexity is in upper layers, and that lower layers should be simple and "stupid". This is not true. A well-balanced layering would have all layers doing potentially equally complex things - but the complexity should be _independent_. Complex interactions are bad. But it's also bad to think that lower levels shouldn't be allowed to optimize because they should be "simple".)

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
> On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
>
> <snip>
>
> > > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
> >
> > Select can work if the access is sequential, but async IO is a more general solution.
>
> Even async IO (ie aio_read/aio_write) should block on the request queue if its full in Linus mind.

Not necessarily.

I said that "READA/WRITEA" are only worth exporting inside the kernel - because the latencies and complexities are low-level enough that it should not be exported to user space as such. But I could imagine a kernel aio package that does the equivalent of

	bh->b_end_io = completion_handler;
	generic_make_request(WRITE, bh);	/* this may block */
	bh = bh->b_next;

	/* Now, fill it up as much as we can.. */
	current->state = TASK_INTERRUPTIBLE;
	while (more data to be written) {
		if (generic_make_request(WRITEA, bh) < 0)
			break;
		bh = bh->b_next;
	}
	return;

and then you make the _completion handler_ thing continue to feed more requests. Yes, you may block at some points (because you need to always have at least _one_ request in-flight in order to have the state machine active), but you can basically try to avoid blocking more than necessary.

But do you see why the above can't be done from user space? It requires that the completion handler (which runs in an interrupt context) be able to continue to feed requests and keep the queue filled. If you don't do that, you'll never have good throughput, because it takes too long to send signals, re-schedule or whatever to user mode.

And do you see how it has to block _sometimes_? If people do hundreds of AIO requests, we can't let memory just fill up with pending writes..

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
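To make the state machine concrete: the completion handler Linus refers to would run at interrupt time and keep refilling the queue. In the sketch below the bookkeeping is invented (struct aio_state and its fields are hypothetical, locking elided); only generic_make_request() and the bh fields come from the message above.

/* Hypothetical bookkeeping shared by all bh's of one AIO. */
struct aio_state {
	struct buffer_head *next;	/* not yet submitted */
	int pending;			/* submitted, not completed */
	int error;
	wait_queue_head_t waiters;
};

static void completion_handler(struct buffer_head *bh, int uptodate)
{
	struct aio_state *s = bh->b_private;

	if (!uptodate)
		s->error = 1;

	/* Feed the queue without blocking: WRITEA fails instead of
	 * sleeping when no request slots are free. */
	while (s->next) {
		if (generic_make_request(WRITEA, s->next) < 0)
			break;		/* queue full; retry on the
					 * next completion */
		s->next = s->next->b_reqnext;
		s->pending++;
	}

	if (--s->pending == 0 && !s->next)
		wake_up(&s->waiters);	/* whole AIO finished */
}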
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Thu, 8 Feb 2001, Rik van Riel wrote:
> On Thu, 8 Feb 2001, Mikulas Patocka wrote:
> > > > You need aio_open.
> > >
> > > Could you explain this?
> >
> > If the server is sending many small files, disk spends huge amount time walking directory tree and seeking to inodes. Maybe opening the file is even slower than reading it
>
> Not if you have a big enough inode_cache and dentry_cache.
>
> OTOH ... if you have enough memory the whole async IO argument is moot anyway because all your files will be in memory too.

Note that this _is_ an important point.

You should never _ever_ think about pure IO speed as the most important thing. Even if you get absolutely perfect IO streaming off the fastest disk you can find, I will beat you every single time with a cached setup that doesn't need to do IO at all.

90% of the VFS layer is all about caching, and trying to avoid IO. Of the rest, about 9% is about trying to avoid even calling down to the low-level filesystem, because it's faster if we can handle it at a high level without any need to even worry about issues like physical disk addresses. Even if those addresses are cached.

The remaining 1% is about actually getting the IO done. At that point we end up throwing our hands in the air and saying "ok, this will be slow".

So if you design your system for disk load, you are missing a big portion of the picture. There are cases where IO really matters. The most notable one being databases, certainly _not_ web or ftp servers. For web- or ftp-servers you buy more memory if you want high performance, and you tend to be limited by the network speed anyway (if you have multiple gigabit networks and network speed isn't an issue, then I can also tell you that buying a few gigabytes of RAM isn't an issue, because you are obviously working for something like the DoD and have very little regard for the cost of the thing ;)

For databases (and for file servers that you want to be robust over a crash), IO throughput is an issue mainly because you need to put the damn requests in stable memory somewhere. Which tends to mean that _write_ speed is what really matters, because the reads you can still try to cache as efficiently as humanly possible (and the issue of database design then turns into trying to find every single piece of locality you can, so that the read caching works as well as possible).

Short and sweet: "aio_open()" is basically never supposed to be an issue. If it is, you've misdesigned something, or you're trying too damn hard to single-thread everything (and "hiding" the threading that _does_ happen by just calling it "AIO" instead - lying to yourself, in short).

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Thu, Feb 08, 2001 at 03:52:35PM +0100, Mikulas Patocka wrote:
> > How do you write high-performance ftp server without threads if select on regular file always returns "ready"?
>
> No, it's not really possible on Linux. Use SYS$QIO call on VMS :-)

Ahh, but even VMS SYS$QIO is synchronous at doing opens, allocation of the IO request packets, and mapping file location to disk blocks. Only the data IO is ever async (and Ben's async IO stuff for Linux provides that too).

--Stephen
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wednesday February 7, [EMAIL PROTECTED] wrote:
>
> On Wed, 7 Feb 2001, Christoph Hellwig wrote:
> > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> > >
> > > Actually, they really aren't.
> > >
> > > They kind of _used_ to be, but more and more they've moved away from that historical use. Check in particular the page cache, and as a really extreme case the swap cache version of the page cache.
> >
> > Yes. And that exactly why I think it's ugly to have the left-over caching stuff in the same data structure as the IO buffer.
>
> I do agree.
>
> I would not be opposed to factoring out the "pure block IO" part from the bh struct. It should not even be very hard. You'd do something like
>
>	struct block_io {
>		.. here is the stuff needed for block IO ..
>	};
>
>	struct buffer_head {
>		struct block_io io;
>		.. here is the stuff needed for hashing etc ..
>	}
>
> and then you make "generic_make_request()" and everything lower down take just the "struct block_io".

I was just thinking the same, or a similar thing. I wanted to do

	struct io_head {
		stuff
	};

	struct buffer_head {
		struct io_head;
		more stuff;
	}

so that, as an unnamed substructure, the content of the struct io_head would automagically be promoted to appear to be content of buffer_head. However I then remembered (when it didn't work) that unnamed substructures are a feature of the Plan-9 C compiler, not the GNU Compiler Collection. (Any gcc coders out there think this would be a good thing to add? http://plan9.bell-labs.com/sys/doc/compiler.html )

Anyway, I produced the same result in a rather ugly way with #defines and modified raid5 to use 32byte block_io structures instead of the 80+ byte buffer_heads, and it ... doesn't quite work :-( it boots fine, but raid5 dies and the Oops message is a few kilometers away. Anyway, I think the concept is fine. Patch is below for your inspection.

It occurs to me that Stephen's desire to pass lots of requests through make_request all at once isn't a bad idea and could be done by simply linking the io_heads together with b_reqnext (see the sketch after the patch). This would require:
 1/ all callers of generic_make_request (there are 3) to initialise b_reqnext
 2/ all registered make_request_fn functions (there are again 3 I think) to cope with following b_reqnext
It shouldn't be too hard to make the elevator code take advantage of any ordering that it finds in the list. I don't have a patch which does this.

NeilBrown

--- ./include/linux/fs.h	2001/02/07 22:45:37	1.1
+++ ./include/linux/fs.h	2001/02/07 23:09:05
@@ -207,6 +207,7 @@
 #define BH_Protected	6	/* 1 if the buffer is protected */
 
 /*
+ * THIS COMMENT NO-LONGER CORRECT.
  * Try to keep the most commonly used fields in single cache lines (16
  * bytes) to improve performance.  This ordering should be
  * particularly beneficial on 32-bit processors.
@@ -217,31 +218,43 @@
  * The second 16 bytes we use for lru buffer scans, as used by
  * sync_buffers() and refill_freelist(). -- sct
  */
+
+/*
+ * io_head is all that is needed by device drivers.
+ */
+#define io_head_fields \
+	unsigned long b_state;		/* buffer state bitmap (see above) */ \
+	struct buffer_head *b_reqnext;	/* request queue */ \
+	unsigned short b_size;		/* block size */ \
+	kdev_t b_rdev;			/* Real device */ \
+	unsigned long b_rsector;	/* Real buffer location on disk */ \
+	char * b_data;			/* pointer to data block (512 byte) */ \
+	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */ \
+	void *b_private;		/* reserved for b_end_io */ \
+	struct page *b_page;		/* the page this bh is mapped to */ \
+	/* this line intensionally left blank */
+struct io_head {
+	io_head_fields
+};
+
+/* buffer_head adds all the stuff needed by the buffer cache */
 struct buffer_head {
-	/* First cache line: */
+	io_head_fields
+
 	struct buffer_head *b_next;	/* Hash queue list */
 	unsigned long b_blocknr;	/* block number */
-	unsigned short b_size;		/* block size */
 	unsigned short b_list;		/* List that this buffer appears */
 	kdev_t b_dev;			/* device (B_FREE = free) */
 	atomic_t b_count;		/* users using this block */
-	kdev_t b_rdev;			/* Real device */
-	unsigned long b_state;		/* buffer state bitmap (see above) */
 	unsigned long b_flushtime;	/* Time when (dirty) buffer should be written */
 	struct buffer_head *b_next_free;/* lru/free list linkage */
 	struct buffer_head *b_prev_free;/* doubly linked list
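The field-sharing trick in the patch, reduced to a compilable minimum (field set shortened; only the technique matches the patch, none of the surrounding kernel code is reproduced):

/* Shared leading fields: any buffer_head can be treated as an io_head
 * by the layers that only do IO. */
#define IO_HEAD_FIELDS \
	unsigned long b_state; \
	void *b_reqnext; \
	unsigned short b_size; \
	char *b_data;

struct io_head {
	IO_HEAD_FIELDS
};

struct buffer_head {
	IO_HEAD_FIELDS			/* must stay first */
	unsigned long b_blocknr;	/* buffer-cache-only fields follow */
	int b_count;
};

/* Drivers see only the small struct: */
void submit_io(struct io_head *io);

/* The buffer cache passes its bigger struct via the common prefix; this
 * works because both structs share an identical leading layout. */
static inline void submit_bh_compat(struct buffer_head *bh)
{
	submit_io((struct io_head *)bh);
}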
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Wed, Feb 07, 2001 at 12:12:44PM -0700, Richard Gooch wrote: > Stephen C. Tweedie writes: > > > > Sorry? I'm not sure where communication is breaking down here, but > > we really don't seem to be talking about the same things. SGI's > > kiobuf request patches already let us pass a large IO through the > > request layer in a single unit without having to split it up to > > squeeze it through the API. > > Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you > don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB > buffer_head is effectively the same thing. kiobufs let you encode _any_ contiguous region of user VA or of an inode's page cache contents in one kiobuf, no matter how many pages there are in it. A write of a megabyte to a raw device can be encoded as a single kiobuf if we want to pass the entire 1MB IO down to the block layers untouched. That's what the page vector in the kiobuf is for. Doing the same thing with buffer_heads would still require a couple of hundred of them, and you'd have to submit each such buffer_head to the IO subsystem independently. And then the IO layer will just have to reassemble them on the other side (and it may have to scan the device's entire request queue once for every single buffer_head to do so). > But an API extension to allow passing a pre-built chain would be even > better. Yep. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
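For readers without the patches at hand, the object Stephen is describing has roughly this shape (field names modelled on the 2.4 kiobuf but simplified; treat it as an illustration, not the exact definition). A megabyte raw write is then one such object with 256 maplist entries, instead of 256 separately submitted buffer_heads.

struct page;	/* kernel page descriptor */

struct kiobuf_sketch {
	int		nr_pages;	/* pages in maplist[] */
	int		offset;		/* byte offset into the first page */
	int		length;		/* total bytes covered */
	struct page	**maplist;	/* the page vector */
	/* completion callback: fires once for the whole IO */
	void		(*end_io)(struct kiobuf_sketch *kio, int uptodate);
};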
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Stephen C. Tweedie writes: > Hi, > > On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote: > > Absolutely. And this is independent of what kind of interface we end up > > using, whether it be kiobuf of just plain "struct buffer_head". In that > > respect they are equivalent. > > Sorry? I'm not sure where communication is breaking down here, but > we really don't seem to be talking about the same things. SGI's > kiobuf request patches already let us pass a large IO through the > request layer in a single unit without having to split it up to > squeeze it through the API. Isn't Linus saying that you can use (say) 4 kiB buffer_heads, so you don't need kiobufs? IIRC, kiobufs are page containers, so a 4 kiB buffer_head is effectively the same thing. > If you really don't mind the size of the buffer_head as a sg fragment > header, then at least I'd like us to be able to submit a pre-built > chain of bh's all at once without having to go through the remap/merge > cost for each single bh. Even if you are limited to feeding one buffer_head at a time, the merge costs should be somewhat mitigated, since you'll decrease your calls into the API by a factor of 8 or 16. But an API extension to allow passing a pre-built chain would be even better. Hopefully I haven't missed the point. I've got the flu so I'm not running on all 4 cylinders :-( Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 10:36:47AM -0800, Linus Torvalds wrote: > > > On Wed, 7 Feb 2001, Christoph Hellwig wrote: > > > On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote: > > > > > > Actually, they really aren't. > > > > > > They kind of _used_ to be, but more and more they've moved away from that > > > historical use. Check in particular the page cache, and as a really > > > extreme case the swap cache version of the page cache. > > > > Yes. And that exactly why I think it's ugly to have the left-over > > caching stuff in the same data sctruture as the IO buffer. > > I do agree. > > I would not be opposed to factoring out the "pure block IO" part from the > bh struct. It should not even be very hard. You'd do something like > > struct block_io { > .. here is the stuff needed for block IO .. > }; > > struct buffer_head { > struct block_io io; > .. here is the stuff needed for hashing etc .. > } > > and then you make "generic_make_request()" and everything lower down take > just the "struct block_io". Yep. (besides the name block_io sucks :)) > You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's, > because they knoa about bh semantics (ie things like scaling the sector > number to the bh size etc). Which means that pretty much all the code > outside the block layer wouldn't even _notice_. Which is a sign of good > layering. Yep. > If you want to do this, please do go ahead. I'll take a look at it. > But do realize that this is not exactly a 2.4.x thing ;) Sure. Christoph -- Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Christoph Hellwig wrote:
> On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> >
> > Actually, they really aren't.
> >
> > They kind of _used_ to be, but more and more they've moved away from that historical use. Check in particular the page cache, and as a really extreme case the swap cache version of the page cache.
>
> Yes. And that exactly why I think it's ugly to have the left-over caching stuff in the same data structure as the IO buffer.

I do agree.

I would not be opposed to factoring out the "pure block IO" part from the bh struct. It should not even be very hard. You'd do something like

	struct block_io {
		.. here is the stuff needed for block IO ..
	};

	struct buffer_head {
		struct block_io io;
		.. here is the stuff needed for hashing etc ..
	}

and then you make "generic_make_request()" and everything lower down take just the "struct block_io".

You'd still leave "ll_rw_block()" and "submit_bh()" operating on bh's, because they know about bh semantics (ie things like scaling the sector number to the bh size etc). Which means that pretty much all the code outside the block layer wouldn't even _notice_. Which is a sign of good layering.

If you want to do this, please do go ahead. But do realize that this is not exactly a 2.4.x thing ;)

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 09:35:58PM +0100, Ingo Molnar wrote: > caching bmap() blocks was a recent addition around 2.3.20, and i suggested > some time ago to cache pagecache blocks via explicit entries in struct > page. That would be one solution - but it creates overhead. > > but there isnt anything wrong with having the bhs around to cache blocks - > think of it as a 'cached and recycled IO buffer entry, with the block > information cached'. I was not talking about caching physical blocks but the remaining buffer-cache support stuff. Christoph -- Of course it doesn't work. We've performed a software upgrade. Whip me. Beat me. Make me maintain AIX. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 12:59:02PM -0800, Linus Torvalds wrote:
> On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> >
> > The second is that bh's are two things:
> >  - a cacheing object
> >  - an io buffer
>
> Actually, they really aren't.
>
> They kind of _used_ to be, but more and more they've moved away from that historical use. Check in particular the page cache, and as a really extreme case the swap cache version of the page cache.

Yes. And that is exactly why I think it's ugly to have the left-over caching stuff in the same data structure as the IO buffer.

> It certainly _used_ to be true that "bh"s were actually first-class memory management citizens, and actually had a data buffer and a cache associated with them. And because of that historical baggage, that's how many people still think of them.

I do even know that the pagecache is our primary cache now :) Anyway, having that caching cruft still in is ugly.

> > This is not really an clean appropeach, and I would really like to get away from it.
>
> Trust me, you really _can_ get away from it. It's not designed into the bh's at all. You can already just allocate a single (or multiple) "struct buffer_head" and just use them as IO objects, and give them your _own_ pointers to the IO buffer etc.

So true. Exactly because of that the data structures should become separated also.

Christoph
--
Of course it doesn't work. We've performed a software upgrade.
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Tue, Feb 06, 2001 at 06:37:41PM -0800, Linus Torvalds wrote: > > > However, I really _do_ want to have the page cache have a bigger > granularity than the smallest memory mapping size, and there are always > special cases that might be able to generate IO in bigger chunks (ie > in-kernel services etc) No argument there. > > Yes. We still have this fundamental property: if a user sends in a > > 128kB IO, we end up having to split it up into buffer_heads and doing > > a separate submit_bh() on each single one. Given our VM, PAGE_SIZE > > (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in > > this case. > > Absolutely. And this is independent of what kind of interface we end up > using, whether it be kiobuf of just plain "struct buffer_head". In that > respect they are equivalent. Sorry? I'm not sure where communication is breaking down here, but we really don't seem to be talking about the same things. SGI's kiobuf request patches already let us pass a large IO through the request layer in a single unit without having to split it up to squeeze it through the API. > > THAT is the overhead that I'm talking about: having to split a large > > IO into small chunks, each of which just ends up having to be merged > > back again into a single struct request by the *make_request code. > > You could easily just generate the bh then and there, if you wanted to. In the current 2.4 tree, we already do: brw_kiovec creates the temporary buffer_heads on demand to feed them to the IO layers. > Your overhead comes from the fact that you want to gather the IO together. > And I'm saying that you _shouldn't_ gather the IO. There's no point. I don't --- the underlying layer does. And that is where the overhead is: for every single large IO being created by the higher layers, make_request is doing a dozen or more merges because I can only feed the IO through make_request in tiny pieces. > The > gathering is sufficiently done by the low-level code anyway, and I've > tried to explain why the low-level code _has_ to do that work regardless > of what upper layers do. I know. The problem is the low-level code doing it a hundred times for a single injected IO. > You need to generate a separate sg entry for each page anyway. So why not > just use the existing one? The "struct buffer_head". Which already > _handles_ all the issues that you have complained are hard to handle. Two issues here. First is that the buffer_head is an enormously heavyweight object for a sg-list fragment. It contains a ton of fields of interest only to the buffer cache. We could mitigate this to some extent by ensuring that the relevant fields for IO (rsector, size, req_next, state, data, page etc) were in a single cache line. Secondly, the cost of adding each single buffer_head to the request list is O(n) in the number of requests already on the list. We end up walking potentially the entire request queue before finding the request to merge against, and we do that again and again, once for every single buffer_head in the list. We do this even if the caller went in via a multi-bh ll_rw_block() call in which case we know in advance that all of the buffer_heads are contiguous on disk. There is a side problem: right now, things like raid remapping occur during generic_make_request, before we have a request built. 
That means that all of the raid0 remapping or raid1/5 request expanding is being done on a per-buffer_head, not per-request, basis, so again we're doing a whole lot of unnecessary duplicate work when an IO larger than a buffer_head is submitted. If you really don't mind the size of the buffer_head as a sg fragment header, then at least I'd like us to be able to submit a pre-built chain of bh's all at once without having to go through the remap/merge cost for each single bh. Cheers, Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
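At the interface level, the extension Stephen is asking for might look like the sketch below. generic_make_request_chain() and its helpers are hypothetical names invented for illustration; nothing like this entry point existed in the tree at the time.

/* Hypothetical entry point: take a chain of bh's already known to be
 * contiguous on disk, pay the O(n) queue walk once, then append the
 * rest in O(1) each. (Locking and failure paths elided.) */
void generic_make_request_chain(int rw, struct buffer_head *chain)
{
	struct buffer_head *bh;
	struct request *req;

	req = find_or_create_request(rw, chain);	/* one queue scan */
	for (bh = chain; bh != NULL; bh = bh->b_reqnext)
		append_bh_to_request(req, bh);		/* no rescans */
	elevator_queue(req);
}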
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi, On Wed, Feb 07, 2001 at 09:10:32AM +, David Howells wrote: > > I presume that correct_size will always be a power of 2... Yes. --Stephen - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Linus Torvalds <[EMAIL PROTECTED]> wrote: > Actually, I'd rather leave it in, but speed it up with the saner and > faster > > if (bh->b_size & (correct_size-1)) { I presume that correct_size will always be a power of 2... David - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
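The identity behind the substitution: for n a power of two, (x % n) == (x & (n - 1)), which is what lets the AND replace a division. A standalone check (illustrative):

#include <assert.h>

int main(void)
{
	unsigned int n = 4096;	/* power of two, like a block size */
	unsigned int x;

	for (x = 0; x < 3 * n; x += 511)
		assert((x % n) == (x & (n - 1)));
	return 0;
}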
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > "struct buffer_head" can deal with pretty much any size: the only thing it cares about is bh->b_size.
>
> Right now, anything larger than a page is physically non-contiguous, and sorry if I didn't make that explicit, but I thought that was obvious enough that I didn't need to. We were talking about raw IO, and as long as we're doing IO out of user anonymous data allocated from individual pages, buffer_heads are limited to that page size in this context.

Sure. That's obviously also one of the reasons why the IO layer has never seen bigger requests anyway - the data _does_ tend to be fundamentally broken up into page-size entities, if for no other reason than that is how user-space sees memory.

However, I really _do_ want to have the page cache have a bigger granularity than the smallest memory mapping size, and there are always special cases that might be able to generate IO in bigger chunks (ie in-kernel services etc)

> Yes. We still have this fundamental property: if a user sends in a 128kB IO, we end up having to split it up into buffer_heads and doing a separate submit_bh() on each single one. Given our VM, PAGE_SIZE (*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in this case.

Absolutely. And this is independent of what kind of interface we end up using, whether it be kiobuf or just plain "struct buffer_head". In that respect they are equivalent.

> THAT is the overhead that I'm talking about: having to split a large IO into small chunks, each of which just ends up having to be merged back again into a single struct request by the *make_request code.

You could easily just generate the bh then and there, if you wanted to.

Your overhead comes from the fact that you want to gather the IO together. And I'm saying that you _shouldn't_ gather the IO. There's no point. The gathering is sufficiently done by the low-level code anyway, and I've tried to explain why the low-level code _has_ to do that work regardless of what upper layers do.

You need to generate a separate sg entry for each page anyway. So why not just use the existing one? The "struct buffer_head". Which already _handles_ all the issues that you have complained are hard to handle.

Linus
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06 2001, Linus Torvalds wrote:
> > > [...] so I would be _really_ nervous about just turning it on
> > > silently. This is all very much a 2.5.x-kind of thing ;)
> >
> > Then you might want to apply this :-)
> >
> > --- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
> > +++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
> > @@ -1048,7 +1048,7 @@
> >  	/* Verify requested block sizes. */
> >  	for (i = 0; i < nr; i++) {
> >  		struct buffer_head *bh = bhs[i];
> > -		if (bh->b_size % correct_size) {
> > +		if (bh->b_size != correct_size) {
> >  			printk(KERN_NOTICE "ll_rw_block: device %s: "
> >  			       "only %d-char blocks implemented (%u)\n",
> >  			       kdevname(bhs[0]->b_dev),
>
> Actually, I'd rather leave it in, but speed it up with the saner and
> faster
>
> 	if (bh->b_size & (correct_size-1)) {
> 		...
>
> That way people who _want_ to test the odd-size thing can do so. And
> normal code (that never generates requests on any other size than the
> "native" size) won't ever notice either way.

Fine, as I said I didn't spot anything bad so that's why it was changed.

> (Oh, we'll eventually need to move to "correct_size == hardware
> blocksize", not the "virtual blocksize" that it is now. As it is, a
> tester needs to set the soft-blk size by hand now).

Exactly, wrt earlier mail about submitting < hw block size requests to
the lower levels.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests but relaxing that
> > requirement is no big deal). Buffer_heads can't deal with data which
> > spans more than a page right now.
>
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.

Right now, anything larger than a page is physically non-contiguous,
and sorry if I didn't make that explicit, but I thought that was
obvious enough that I didn't need to. We were talking about raw IO,
and as long as we're doing IO out of user anonymous data allocated
from individual pages, buffer_heads are limited to that page size in
this context.

> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.

Yes. We still have this fundamental property: if a user sends in a
128kB IO, we end up having to split it up into buffer_heads and doing
a separate submit_bh() on each single one. Given our VM, PAGE_SIZE
(*not* PAGE_CACHE_SIZE) is the best granularity we can hope for in
this case.

THAT is the overhead that I'm talking about: having to split a large
IO into small chunks, each of which just ends up having to be merged
back again into a single struct request by the *make_request code.

A constructed IO request basically doesn't care about anything in the
buffer_head except for the data pointer and size, and the completion
status info and callback. All of the physical IO description is in the
struct request by this point. The chain of buffer_heads is carrying
around a huge amount of information which isn't used by the IO, and if
the caller is something like the raw IO driver which isn't using the
buffer cache, that extra buffer_head data is just overhead.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Jens Axboe wrote:
>
> > [...] so I would be _really_ nervous about just turning it on
> > silently. This is all very much a 2.5.x-kind of thing ;)
>
> Then you might want to apply this :-)
>
> --- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
> +++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
> @@ -1048,7 +1048,7 @@
>  	/* Verify requested block sizes. */
>  	for (i = 0; i < nr; i++) {
>  		struct buffer_head *bh = bhs[i];
> -		if (bh->b_size % correct_size) {
> +		if (bh->b_size != correct_size) {
>  			printk(KERN_NOTICE "ll_rw_block: device %s: "
>  			       "only %d-char blocks implemented (%u)\n",
>  			       kdevname(bhs[0]->b_dev),

Actually, I'd rather leave it in, but speed it up with the saner and
faster

	if (bh->b_size & (correct_size-1)) {
		...

That way people who _want_ to test the odd-size thing can do so. And
normal code (that never generates requests on any other size than the
"native" size) won't ever notice either way.

(Oh, we'll eventually need to move to "correct_size == hardware
blocksize", not the "virtual blocksize" that it is now. As it is, a
tester needs to set the soft-blk size by hand now).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
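The reason the masked test is "saner and faster": for any power-of-two correct_size, (x % correct_size) and (x & (correct_size - 1)) compute the same value, and the AND avoids a division. A standalone demonstration, not kernel code:

	#include <assert.h>

	int main(void)
	{
		unsigned int n = 4096;	/* must be a power of two */
		unsigned int sizes[] = { 512, 1024, 4096, 4608, 8192 };
		unsigned int i;

		/* the two tests agree for every size, aligned or not */
		for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
			assert((sizes[i] % n) == (sizes[i] & (n - 1)));
		return 0;
	}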
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > The fact is, if you have problems like the above, then you don't
> > understand the interfaces. And it sounds like you designed kiobuf support
> > around the wrong set of interfaces.
>
> They used the only interfaces available at the time...

Ehh.. "generic_make_request()" goes back a _loong_ time. It used to be
called just "make_request()", but all my points still stand. It's even
exported to modules.

As far as I know, the raid code has always used this interface exactly
because raid needed to feed back the remapped stuff and get around the
blocksizing in ll_rw_block(). This really isn't anything new. I _know_
it's there in 2.2.x, and I would not be surprised if it was there even
in 2.0.x.

> > If you want to get at the _sector_ level, then you do
> ...
> > which doesn't look all that complicated to me. What's the problem?
>
> Doesn't this break nastily as soon as the IO hits an LVM or soft raid
> device? I don't think we are safe if we create a larger-sized
> buffer_head which spans a raid stripe: the raid mapping is only
> applied once per buffer_head.

Absolutely. This is exactly what I mean by saying that low-level
drivers may not actually be able to handle new cases that they've never
been asked to do before - they just never saw anything like a 64kB
request before or something that crossed its own alignment.

But the _higher_ levels are there. And there's absolutely nothing in
the design that is a real problem. But there's no question that you
might need to fix up more than one or two low-level drivers.

(The only drivers I know better are the IDE ones, and as far as I can
tell they'd have no trouble at all with any of this. Most other normal
drivers are likely to be in this same situation. But because I've not
had a reason to test, I certainly won't guarantee even that).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06 2001, Linus Torvalds wrote:
> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine, noone
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request.
>
> It's really easy to get this wrong when going forward in the request list:
> you need to make sure that you update "request->current_nr_sectors" each
> time you move on to the next bh.
>
> I would not be surprised if some of them have been seriously buggered.

Maybe have been, but it looks good at least with the general drivers
that I mentioned.

> [...] so I would be _really_ nervous about just turning it on
> silently. This is all very much a 2.5.x-kind of thing ;)

Then you might want to apply this :-)

--- drivers/block/ll_rw_blk.c~	Wed Feb  7 02:38:31 2001
+++ drivers/block/ll_rw_blk.c	Wed Feb  7 02:38:42 2001
@@ -1048,7 +1048,7 @@
 	/* Verify requested block sizes. */
 	for (i = 0; i < nr; i++) {
 		struct buffer_head *bh = bhs[i];
-		if (bh->b_size % correct_size) {
+		if (bh->b_size != correct_size) {
 			printk(KERN_NOTICE "ll_rw_block: device %s: "
 			       "only %d-char blocks implemented (%u)\n",
 			       kdevname(bhs[0]->b_dev),

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Tue, Feb 06, 2001 at 04:41:21PM -0800, Linus Torvalds wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size.
>
> Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
> traditional block device setup, where "b_blocknr" is the "virtual
> blocknumber" and that indeed is tied in to the block size.
>
> The fact is, if you have problems like the above, then you don't
> understand the interfaces. And it sounds like you designed kiobuf support
> around the wrong set of interfaces.

They used the only interfaces available at the time...

> If you want to get at the _sector_ level, then you do
...
> which doesn't look all that complicated to me. What's the problem?

Doesn't this break nastily as soon as the IO hits an LVM or soft raid
device? I don't think we are safe if we create a larger-sized
buffer_head which spans a raid stripe: the raid mapping is only
applied once per buffer_head.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Ingo Molnar wrote:
>
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I'm not sure. If I was a driver writer (and I'm happy those days are
mostly behind me ;), I would not be totally disinclined to check for
various limits and things. There can be hardware out there that simply
has trouble with non-native alignment, ie be unhappy about getting a
1kB request that is aligned in memory at a 512-byte boundary.

So there are real reasons why drivers might need updating. Don't
dismiss the concerns out-of-hand.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Jens Axboe wrote:
>
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine, noone
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request.

It's really easy to get this wrong when going forward in the request
list: you need to make sure that you update
"request->current_nr_sectors" each time you move on to the next bh.

I would not be surprised if some of them have been seriously buggered.

On the other hand, I would _also_ not be surprised if we've actually
fixed a lot of them: one of the things that the RAID code and loopback
testing does is exactly getting these kinds of issues right (not this
exact one, but similar ones). And let's remember things like the old
ultrastor driver that was totally unable to handle anything but 1kB
devices etc.

I would not be _totally_ surprised if it turns out that there are still
drivers out there that remember the time when Linux only ever had 1kB
buffers. Even if it is 7 years ago or so ;)

(Also, there might be drivers that are "optimized" - they set the IO
length once per request, and just never set it again as they do partial
end_io() calls.)

None of those kinds of issues would ever be found under normal load, so
I would be _really_ nervous about just turning it on silently. This is
all very much a 2.5.x-kind of thing ;)

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
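The bookkeeping Linus is warning about looks roughly like this in a driver's completion path. A hedged sketch modeled on 2.4-era end_request() handling (names and details simplified); the bug is skipping the refresh marked below when the bhs in one request have different sizes:

	/* Sketch: finish the current bh and advance to the next one
	 * in the same request. */
	static void advance_to_next_bh(struct request *req)
	{
		struct buffer_head *bh = req->bh;

		bh->b_end_io(bh, 1 /* uptodate */);

		req->bh = bh->b_reqnext;
		if (req->bh) {
			req->sector += bh->b_size >> 9;
			/* the next bh may have a different b_size -
			 * this refresh is what a buggy driver skips */
			req->current_nr_sectors = req->bh->b_size >> 9;
			req->buffer = req->bh->b_data;
		}
	}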
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 02:06:27AM +0100, Ingo Molnar wrote:
>
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
>
> > > I don't see anything that would break doing this, in fact you can
> > > do this as long as the buffers are all at least a multiple of the
> > > block size. All the drivers I've inspected handle this fine, noone
> > > assumes that rq->bh->b_size is the same in all the buffers attached
> > > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > > handle it as well.
> > >
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
>
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.
>
> Ingo

Oops was in my code, but was caused by these drivers. The Adaptec
driver did have an oops at its own code address, AIC7XXX crashed in my
code.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Jens Axboe wrote:
>
> > > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
> >
> > most likely some coding error on your side. buffer-size mismatches should
> > show up as filesystem corruption or random DMA scribble, not in-driver
> > oopses.
>
> I would suspect so, aic7xxx shouldn't care about anything except the
> sg entries and I would seriously doubt that it makes any such
> assumptions on them :-)

yep - and not a single reference to b_size in aic7xxx.c.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 02:08:53AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
>
> Do you still have this oops?

I can recreate. Will work on it tomorrow. SCI testing today.

Jeff

> --
> Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.

Do you still have this oops?

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07 2001, Ingo Molnar wrote:
> > > So I would appreciate pointers to these devices that break so we
> > > can inspect them.
> > >
> > > --
> > > Jens Axboe
> >
> > Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.
>
> most likely some coding error on your side. buffer-size mismatches should
> show up as filesystem corruption or random DMA scribble, not in-driver
> oopses.

I would suspect so, aic7xxx shouldn't care about anything except the
sg entries and I would seriously doubt that it makes any such
assumptions on them :-)

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
> > I don't see anything that would break doing this, in fact you can
> > do this as long as the buffers are all at least a multiple of the
> > block size. All the drivers I've inspected handle this fine, noone
> > assumes that rq->bh->b_size is the same in all the buffers attached
> > to the request. This includes SCSI (scsi_lib.c builds sg tables),
> > IDE, and the Compaq array + Mylex driver. This mostly leaves the
> > "old-style" drivers using CURRENT etc, the kernel helpers for these
> > handle it as well.
> >
> > So I would appreciate pointers to these devices that break so we
> > can inspect them.
> >
> > --
> > Jens Axboe
>
> Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.

most likely some coding error on your side. buffer-size mismatches
should show up as filesystem corruption or random DMA scribble, not
in-driver oopses.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 02:02:21AM +0100, Jens Axboe wrote:
> On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> > I remember Linus asking to try this variable buffer head chaining
> > thing 512-1024-512 kind of stuff several months back, and mixing them to
> > see what would happen -- result. About half the drivers break with it.
> > The interface allows you to do it, I've tried it, (works on Andre's
> > drivers, but a lot of SCSI drivers break) but a lot of drivers seem to
> > have assumptions about these things all being the same size in a
> > buffer head chain.
>
> I don't see anything that would break doing this, in fact you can
> do this as long as the buffers are all at least a multiple of the
> block size. All the drivers I've inspected handle this fine, noone
> assumes that rq->bh->b_size is the same in all the buffers attached
> to the request. This includes SCSI (scsi_lib.c builds sg tables),
> IDE, and the Compaq array + Mylex driver. This mostly leaves the
> "old-style" drivers using CURRENT etc, the kernel helpers for these
> handle it as well.
>
> So I would appreciate pointers to these devices that break so we
> can inspect them.
>
> --
> Jens Axboe

Adaptec drivers had an oops. Also, AIC7XXX also had some oops with it.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 02:01:54AM +0100, Ingo Molnar wrote:
>
> On Tue, 6 Feb 2001, Jeff V. Merkey wrote:
>
> > I remember Linus asking to try this variable buffer head chaining
> > thing 512-1024-512 kind of stuff several months back, and mixing them
> > to see what would happen -- result. About half the drivers break with
> > it. [...]
>
> time to fix them then - instead of rewriting the rest of the kernel ;-)
>
> Ingo

I agree.

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Jeff V. Merkey wrote:

> I remember Linus asking to try this variable buffer head chaining
> thing 512-1024-512 kind of stuff several months back, and mixing them
> to see what would happen -- result. About half the drivers break with
> it. [...]

time to fix them then - instead of rewriting the rest of the kernel ;-)

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06 2001, Jeff V. Merkey wrote:
> I remember Linus asking to try this variable buffer head chaining
> thing 512-1024-512 kind of stuff several months back, and mixing them to
> see what would happen -- result. About half the drivers break with it.
> The interface allows you to do it, I've tried it, (works on Andre's
> drivers, but a lot of SCSI drivers break) but a lot of drivers seem to
> have assumptions about these things all being the same size in a
> buffer head chain.

I don't see anything that would break doing this, in fact you can
do this as long as the buffers are all at least a multiple of the
block size. All the drivers I've inspected handle this fine, noone
assumes that rq->bh->b_size is the same in all the buffers attached
to the request. This includes SCSI (scsi_lib.c builds sg tables),
IDE, and the Compaq array + Mylex driver. This mostly leaves the
"old-style" drivers using CURRENT etc, the kernel helpers for these
handle it as well.

So I would appreciate pointers to these devices that break so we
can inspect them.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
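The reason sg-building drivers cope naturally with mixed sizes: each bh contributes its own length to the scatter-gather table, so nothing in the loop assumes a uniform b_size. A simplified sketch of the kind of loop Jens says scsi_lib.c performs; struct my_sg stands in for the real scatterlist type and the function name is hypothetical:

	struct my_sg {
		char		*address;
		unsigned int	length;
	};

	/* Sketch: walk the bh chain of a request, one sg entry per bh. */
	static int build_sg_table(struct request *req, struct my_sg *sg, int max)
	{
		struct buffer_head *bh;
		int nseg = 0;

		for (bh = req->bh; bh && nseg < max; bh = bh->b_reqnext) {
			sg[nseg].address = bh->b_data;
			sg[nseg].length  = bh->b_size;	/* per-bh, not per-request */
			nseg++;
		}
		return nseg;	/* number of segments built */
	}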
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, Feb 06, 2001 at 04:50:19PM -0800, Linus Torvalds wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> > enforces a single blocksize on all requests but relaxing that
> > requirement is no big deal). Buffer_heads can't deal with data which
> > spans more than a page right now.
>
> Stephen, you're so full of shit lately that it's unbelievable. You're
> batting a clear 0.000 so far.
>
> "struct buffer_head" can deal with pretty much any size: the only thing it
> cares about is bh->b_size.
>
> It so happens that if you have highmem support, then "create_bounce()"
> will work on a per-page thing, but that just means that you'd better have
> done your bouncing into low memory before you call generic_make_request().
>
> Have you ever spent even just 5 minutes actually _looking_ at the block
> device layer, before you decided that you think it needs to be completely
> re-done some other way? It appears that you never bothered to.
>
> Sure, I would not be surprised if some device driver ends up being
> surprised if you start passing it different request sizes than it is used
> to. But that's a driver and testing issue, nothing more.
>
> (Which is not to say that "driver and testing" issues aren't important as
> hell: it's one of the more scary things in fact, and it can take a long
> time to get right if you start doing something that historically has never
> been done and thus has historically never gotten any testing. So I'm not
> saying that it should work out-of-the-box. But I _am_ saying that there's
> no point in trying to re-design upper layers that already do ALL of this
> with no problems at all).
>
> Linus
>

I remember Linus asking to try this variable buffer head chaining thing
512-1024-512 kind of stuff several months back, and mixing them to see
what would happen -- result. About half the drivers break with it. The
interface allows you to do it, I've tried it, (works on Andre's
drivers, but a lot of SCSI drivers break) but a lot of drivers seem to
have assumptions about these things all being the same size in a buffer
head chain.

:-)

Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests but relaxing that
> requirement is no big deal). Buffer_heads can't deal with data which
> spans more than a page right now.

Stephen, you're so full of shit lately that it's unbelievable. You're
batting a clear 0.000 so far.

"struct buffer_head" can deal with pretty much any size: the only thing
it cares about is bh->b_size.

It so happens that if you have highmem support, then "create_bounce()"
will work on a per-page thing, but that just means that you'd better
have done your bouncing into low memory before you call
generic_make_request().

Have you ever spent even just 5 minutes actually _looking_ at the block
device layer, before you decided that you think it needs to be
completely re-done some other way? It appears that you never bothered
to.

Sure, I would not be surprised if some device driver ends up being
surprised if you start passing it different request sizes than it is
used to. But that's a driver and testing issue, nothing more.

(Which is not to say that "driver and testing" issues aren't important
as hell: it's one of the more scary things in fact, and it can take a
long time to get right if you start doing something that historically
has never been done and thus has historically never gotten any testing.
So I'm not saying that it should work out-of-the-box. But I _am_ saying
that there's no point in trying to re-design upper layers that already
do ALL of this with no problems at all).

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07, 2001 at 12:36:29AM +0000, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
> >
> > On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
> >
> > > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > > be aligned on disk at a multiple of their buffer size. Under the Unix
> > > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > > 512 bytes into a device.
> >
> > then we should either fix this limitation, or the raw IO code should split
> > the request up into several, variable-size bhs, so that the range is
> > filled out optimally with aligned bhs.
>
> That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
> enforces a single blocksize on all requests but relaxing that
> requirement is no big deal). Buffer_heads can't deal with data which
> spans more than a page right now.

I can handle requests larger than a page (64K) but I am not using the
buffer cache in Linux. We really need an NT/NetWare like model to
support the non-Unix FS's properly. i.e. a disk request should be
and get rid of this fixed block stuff with buffer heads. :-)

I understand that the way the elevator is implemented in Linux makes
this very hard at this point to support, since it's very troublesome
to handle requests that overlap sector boundaries.

Jeff

> --Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Ingo Molnar wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size. Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
>
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

As mentioned, no such limitation exists if you just use the right
interfaces.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
> >
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
>
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size.

Ehh.. True of ll_rw_block() and submit_bh(), which are meant for the
traditional block device setup, where "b_blocknr" is the "virtual
blocknumber" and that indeed is tied in to the block size.

That's the whole _point_ of ll_rw_block() and friends - they show the
device at a different "virtual blocking" level than the low-level
physical accesses necessarily are. Which very much means that if you
have a 4kB "view" of the device, you get a stream of 4kB blocks. Not
4kB sized blocks at 512-byte offsets (or whatever the hardware blocking
size is).

This way the interfaces are independent of the hardware blocksize.
Which is logical and what you'd expect. You need to go to a lower level
to see those kinds of blocking issues.

But it is _not_ true of "generic_make_request()" and the block IO layer
in general. It obviously _cannot_ be true, because the block I/O layer
has always had the notion of merging consecutive blocks together -
regardless of whether the end result is even a power of two or anything
like that in size. You can make an IO request for pretty much any size,
as long as it's a multiple of the hardware blocksize (normally 512
bytes, but there are certainly devices out there with other
blocksizes).

The fact is, if you have problems like the above, then you don't
understand the interfaces. And it sounds like you designed kiobuf
support around the wrong set of interfaces.

If you want to get at the _sector_ level, then you do

	lock_bh();
	bh->b_rdev = device;
	bh->b_rsector = sector-number (where linux defines "sector" to be 512 bytes)
	bh->b_size = size in bytes (must be a multiple of 512);
	bh->b_data = pointer;
	bh->b_end_io = callback;
	generic_make_request(rw, bh);

which doesn't look all that complicated to me. What's the problem?

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
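Linus's recipe, gathered into one hypothetical helper. His "lock_bh()" is shorthand, so this sketch just sets BH_Lock directly; the allocation and error handling are elided, and submit_sectors itself is an assumed name, not an existing kernel function:

	/* Sketch: submit an arbitrary sector-aligned range directly to
	 * the block layer, bypassing ll_rw_block()'s virtual blocking. */
	static void submit_sectors(int rw, kdev_t dev, unsigned long sector,
				   char *data, unsigned int bytes,
				   void (*end_io)(struct buffer_head *, int))
	{
		struct buffer_head *bh = kmalloc(sizeof(*bh), GFP_KERNEL);

		memset(bh, 0, sizeof(*bh));
		bh->b_state   = 1 << BH_Lock;	/* stands in for "lock_bh()" */
		bh->b_rdev    = dev;
		bh->b_rsector = sector;		/* 512-byte sectors */
		bh->b_size    = bytes;		/* any multiple of 512 */
		bh->b_data    = data;
		bh->b_end_io  = end_io;		/* called on completion */

		generic_make_request(rw, bh);
	}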
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Tue, Feb 06, 2001 at 07:25:19PM -0500, Ingo Molnar wrote:
>
> On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:
>
> > No, it is a problem of the ll_rw_block interface: buffer_heads need to
> > be aligned on disk at a multiple of their buffer size. Under the Unix
> > raw IO interface it is perfectly legal to begin a 128kB IO at offset
> > 512 bytes into a device.
>
> then we should either fix this limitation, or the raw IO code should split
> the request up into several, variable-size bhs, so that the range is
> filled out optimally with aligned bhs.

That gets us from 512-byte blocks to 4k, but no more (ll_rw_block
enforces a single blocksize on all requests but relaxing that
requirement is no big deal). Buffer_heads can't deal with data which
spans more than a page right now.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, Feb 07 2001, Stephen C. Tweedie wrote:
> > [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> > the raw IO code.]
>
> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size. Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

Submitting buffers to lower layers that are not hw sector aligned can't
be supported below ll_rw_blk anyway (they can, but look at the problems
this has always created), and I would much rather see stuff like this
handled outside of there.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hi,

On Tue, Feb 06, 2001 at 08:57:13PM +0100, Ingo Molnar wrote:
>
> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

No, it is a problem of the ll_rw_block interface: buffer_heads need to
be aligned on disk at a multiple of their buffer size. Under the Unix
raw IO interface it is perfectly legal to begin a 128kB IO at offset
512 bytes into a device.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
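Why the ll_rw_block()-style interface imposes that alignment: it addresses the device in units of b_size, so the sector a buffer lands on is derived from the block number, roughly b_rsector = b_blocknr * (b_size >> 9) in 2.4-era submit_bh(). That arithmetic can only ever produce positions that are a multiple of the buffer size. A toy illustration (values assumed for the example):

	/* ll_rw_block()-style addressing: the block number is in units
	 * of the block size, so a 4kB bh can only start at 4kB-aligned
	 * sectors - a 128kB IO at byte offset 512 is unreachable. */
	unsigned long  b_blocknr = 5;		/* in b_size units */
	unsigned short b_size    = 4096;	/* bytes           */
	unsigned long  b_rsector = b_blocknr * (b_size >> 9);	/* = 40 */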
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Wed, 7 Feb 2001, Stephen C. Tweedie wrote:

> No, it is a problem of the ll_rw_block interface: buffer_heads need to
> be aligned on disk at a multiple of their buffer size. Under the Unix
> raw IO interface it is perfectly legal to begin a 128kB IO at offset
> 512 bytes into a device.

then we should either fix this limitation, or the raw IO code should
split the request up into several, variable-size bhs, so that the range
is filled out optimally with aligned bhs.

	Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
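The split Ingo suggests can be computed greedily: cover the range with the largest chunk that is both naturally aligned and no bigger than what remains, capped at a page. A standalone demonstration for the thread's 128kB-at-offset-512 example (user-space C, not kernel code):

	#include <stdio.h>

	#define MAX_CHUNK 8	/* 8 sectors * 512 = one 4kB page */

	int main(void)
	{
		unsigned long sector = 1;	/* 128kB IO at byte offset 512 */
		unsigned long left = 256;	/* 128kB in 512-byte sectors   */

		while (left) {
			unsigned long chunk = MAX_CHUNK;

			/* shrink until the chunk is aligned and fits:
			 * yields 512, 1024, 2048, then 4kB bhs here */
			while (chunk > left || (sector & (chunk - 1)))
				chunk >>= 1;

			printf("bh at sector %3lu, b_size %lu\n",
			       sector, chunk * 512);
			sector += chunk;
			left -= chunk;
		}
		return 0;
	}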
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Marcelo Tosatti wrote:
>
> It's arguing against making a smart application block on the disk while
> it's able to use the CPU for other work.

There are currently no other alternatives in user space. You'd have to
create whole new interfaces for aio_read/write, and ways for the kernel
to inform user space that "now you can re-try submitting your IO".

Could be done. But that's a big thing.

> An application which sets non blocking behavior and busy waits for a
> request (which seems to be your argument) is just stupid, of course.

Tell me what else it could do at some point? You need something like
select() to wait on it. There are no such interfaces right now...

(besides, latency would suck. I bet you're better off waiting for the
requests if they are all used up. It takes too long to get deep into
the kernel from user space, and you cannot use the exclusive waiters
with their anti-herd behaviour etc).

Simple rule: if you want to optimize concurrency and avoid waiting -
use several processes or threads instead. At which point you can get
real work done on multiple CPU's, instead of worrying about what
happens when you have to wait on the disk.

		Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
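Linus's "simple rule" in user-space terms: instead of waiting for an async-IO interface, hand the blocking read to a worker thread and keep the main thread busy on the CPU. A minimal POSIX sketch (error handling omitted; the file name is arbitrary):

	#define _XOPEN_SOURCE 500
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	struct job { int fd; off_t off; size_t len; };

	static void *io_worker(void *arg)
	{
		struct job *j = arg;
		char *buf = malloc(j->len);

		/* this thread blocks in the kernel; the rest of the
		 * program keeps using the CPU meanwhile */
		ssize_t n = pread(j->fd, buf, j->len, j->off);

		printf("read %ld bytes\n", (long)n);
		free(buf);
		return NULL;
	}

	int main(void)
	{
		struct job j = { open("/etc/hosts", O_RDONLY), 0, 4096 };
		pthread_t t;

		pthread_create(&t, NULL, io_worker, &j);
		/* ... useful CPU-bound work happens here ... */
		pthread_join(&t, NULL);
		return 0;
	}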