Re: [RFC] Heads up on a series of AIO patchsets

2007-01-04 Thread Zach Brown
> generic_write_checks() are done in the submission path, not repeated
> during retries, so such types of checks are not intended to run in the
> aio thread.

Ah, I see, I was missing the short-cut which avoids re-running parts of
the write path if we're in a retry.


	if (!is_sync_kiocb(iocb) && kiocbIsRestarted(iocb)) {
		/* nothing to transfer, may just need to sync data */
		return ocount;
	}
It's pretty subtle that this has to be placed before the first
significant current reference, and that nothing in the path can return
-EIOCBRETRY until after all of the significant current references.
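A toy userspace model of that ordering constraint (all names illustrative, not the actual kernel code): the restart test must run before anything that reads the submitting task's state, because on a retry "current" is an aio worker thread, not the submitter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the retry short-circuit discussed above.
 * Names are made up for illustration; this is not kernel API. */
struct kiocb_model {
	bool restarted;       /* stands in for kiocbIsRestarted() */
	long transferred;
};

static int checks_run;        /* counts "generic_write_checks()" calls */

static long model_write(struct kiocb_model *iocb, long count)
{
	if (iocb->restarted)
		return iocb->transferred;  /* skip per-task checks on retry */

	checks_run++;                      /* stand-in for the rlimit/suid
					      checks that read "current" */
	iocb->transferred = count;
	iocb->restarted = true;            /* pretend we hit -EIOCBRETRY */
	return count;
}
```

Running the model twice shows the per-task checks fire only on the original submission, never in the retry path.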


In totally unrelated news, I noticed that current->io_wait is set to
NULL instead of &current->__wait after having run the iocb.  I wonder
if it shouldn't be saved and restored instead.  And maybe update the
comment over is_sync_wait()?  Just an observation.



> That is great and I look forward to it :) I am, however, assuming that
> whatever implementation you come up with will have a different interface
> from current linux aio -- i.e. a next generation aio model that will be
> easily integratable with kevents etc.


Yeah, that's the hope.

> Which takes me back to Ingo's point - let's have the new evolve in
> parallel with the old, if we can, and not hold up the patches for POSIX
> AIO to start using kernel AIO, or for epoll to integrate with AIO.


Sure, though there are only so many hours in a day :).


> OK, I just took a quick look at your blog and I see that you
> are basically implementing Linus' microthreads scheduling approach -
> one year since we had that discussion.


Yeah.  I wanted to see what it would look like.


> Glad to see that you found a way to make it workable ...


Well, that remains to be seen.  If nothing else we'll at least have
code to point at when discussing it.  If we all agree it's not the
right way and dismiss the notion, fine, that's progress :).



> (I'm guessing that you are copying over the part
> of the stack in use at the time of every switch, is that correct ?


That was my first pass, yeah.  It turned the knob a little too far
towards the "invasive but efficient" direction for my taste.  I'm now
giving it a try by having full stacks for each blocked op, we'll see
how that goes.



> At what
> point do you do the allocation of the saved stacks ?


I was allocating at block-time to keep memory consumption down, but I
think my fiddling around with it convinced me that isn't workable.


- z
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on a series of AIO patchsets

2007-01-04 Thread Pavel Machek
On Tue 2007-01-02 16:18:40, Kent Overstreet wrote:
> >> Any details?
> >
> >Well, one path I tried I couldn't help but post a blog 
> >entry about
> >for my friends.  I'm not sure it's the direction I'll 
> >take with linux-
> >kernel, but the fundamentals are there:  the api should 
> >be the
> >syscall interface, and there should be no difference 
> >between sync and
> >async behaviour.
> >
> >http://www.zabbo.net/?p=72
> 
> Any code you're willing to let people play with? I could 
> at least have
> real test cases, and a library to go along with it as it 
> gets
> finished.
> 
> Another pie in the sky idea:
> One thing that's been bugging me lately (working on a 9p 
> server), is
> sendfile is hard to use in practice because you need 
> packet headers
> and such, and they need to go out at the same time.

splice()?
Pavel

-- 
Thanks for all the (sleeping) penguins.


Re: [RFC] Heads up on a series of AIO patchsets

2007-01-03 Thread Evgeniy Polyakov
On Tue, Jan 02, 2007 at 02:38:13PM -0700, Dan Williams ([EMAIL PROTECTED]) 
wrote:
> Would you have time to comment on the approach I have taken to
> implement a standard asynchronous memcpy interface?  It seems it would
> be a good complement to what you are proposing.  The entity that
> describes the aio operation could take advantage of asynchronous
> engines to carry out copies or other transforms (maybe an acrypto tie
> in as well).
> 
> Here is the posting for 2.6.19.  There has since been updates for
> 2.6.20, but the overall approach remains the same.
> intro: http://marc.theaimsgroup.com/?l=linux-raid&m=116491661527161&w=2
> async_tx: http://marc.theaimsgroup.com/?l=linux-raid&m=116491753318175&w=2

My first impression is that it has too many lists :)

Looks good, but IMHO there are further steps to implement.
I have not found any kind of scheduler there - what if a system has two
async engines? What if a sync engine is faster than an async one in some
cases (and it is indeed the case for small buffers), and should be selected
at that time? What if you want to add additional transformations for some
devices, like crypto processing or checksumming?

I would just create a driver for the low-level engine and export its
functionality - iop3_async_copy(), iop3_async_checksum(), iop3_async_crypto_1(),
iop3_async_crypto_2() and so on.

There will be a lot of potential users of exactly that functionality,
but not strictly hardcoded higher-layer operations like raidX.

A more generic solution must be used to select the appropriate device.
We had a very brief discussion about an asynchronous crypto layer (acrypto)
and how its ideas could be used for async dma engines - the user should not
even know how his data has been transferred - he calls async_copy(),
which selects the appropriate device (and sync copy is just an additional
usual device in that case) from the list of devices that export this
functionality; the selection can be done in millions of different ways, from
getting the first one from the list (this is essentially how your
approach is implemented right now), to using special (including run-time
updated) heuristics (like it is done in acrypto).
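As a purely illustrative sketch of that selection idea (this is not the async_tx or acrypto API; engine names and thresholds are made up), a list of copy "engines" can each advertise a minimum profitable size, with plain sync memcpy as the always-available fallback device:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical copy-engine descriptor: each engine advertises the
 * smallest buffer for which using it beats a plain memcpy(). */
struct copy_engine {
	const char *name;
	size_t min_len;
	void (*copy)(void *dst, const void *src, size_t len);
};

static void sync_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

/* Stand-in for a hardware DMA engine; here it just copies too. */
static void fake_dma_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
}

static struct copy_engine engines[] = {
	{ "iop3-dma", 4096, fake_dma_copy },  /* only worth it when large */
	{ "memcpy",   0,    sync_copy },      /* sync fallback device */
};

/* First-fit selection; a real implementation could plug in run-time
 * updated heuristics here instead. */
static const struct copy_engine *select_engine(size_t len)
{
	for (size_t i = 0; i < sizeof(engines) / sizeof(engines[0]); i++)
		if (len >= engines[i].min_len)
			return &engines[i];
	return NULL;
}
```

Small buffers fall through to the sync device, large ones pick the async engine; swapping the first-fit loop for a smarter policy changes nothing for callers of select_engine().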

Thinking further, async_copy() is just a usual member of the async class of
operations. So the same logic above must be applied on that layer too.

But 'layers are the way to design protocols, not implement them' -
David Miller, on netchannels.

So, the user should not even know about layers - he should just say 'copy
data from pointer A to pointer B', or 'copy data from pointer A to
socket B', or even 'copy it from file "/tmp/file" to "192.168.0.1:80:tcp"',
without ever knowing that there are sockets and/or memcpy() calls
underneath; and if the user requests to perform it asynchronously, he must
be notified later (one might expect that I will prefer to use kevent :)
The same approach can thus be used by NFS/SAMBA/CIFS and other users.

That is how I plan to start implementing AIO (it looks like it is becoming
popular):
1. the system exports the set of operations it supports (send, receive,
copy, crypto, ...)
2. each operation has a subsequent set of suboptions (different crypto
types, for example)
3. each operation has a set of low-level drivers which support it (with
optional performance or any other parameters)
4. each driver, when loaded, publishes its capabilities (async copy with
speed A, XOR and so on)

From the user's point of view, aio_sendfile() or async_copy() will look
like the following:
1. call aio_schedule_pointer(source='0xaabbccdd', dest='0x123456578')
1. call aio_schedule_file_socket(source='/tmp/file', dest='socket')
1. call aio_schedule_file_addr(source='/tmp/file',
dest='192.168.0.1:80:tcp')

or any other similar call,

then wait for the returned descriptor in kevent_get_events(), or provide
your own cookie in each call.

Each request is then converted into a FIFO of smaller requests like 'open
file', 'open socket', 'get in user pages' and so on, each of which is
handled on an appropriate device (hardware or software); completion of
each request starts processing of the next one.
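A toy model of that FIFO (stub handlers and invented step names, just to show the shape of completion-driven chaining):

```c
#include <assert.h>

/* Hypothetical decomposition of a compound request such as
 * "sendfile /tmp/file -> socket" into an ordered list of steps.
 * In the real design each step would run on an appropriate
 * hardware or software device; here they are no-op stubs. */
enum step {
	STEP_OPEN_FILE,
	STEP_OPEN_SOCKET,
	STEP_GET_USER_PAGES,
	STEP_TRANSFER,
	STEP_DONE,
};

struct compound_req {
	enum step next;
	int steps_completed;
};

/* Called from each device's completion path: completing one
 * sub-request schedules the next one in the FIFO. */
static void step_complete(struct compound_req *r)
{
	r->steps_completed++;
	r->next++;
}

static void submit_compound(struct compound_req *r)
{
	while (r->next != STEP_DONE)
		step_complete(r);  /* stand-in for queueing to a device */
}
```

The caller only ever submits the compound request; the open/read/send ordering is enforced by the chain itself, not by the user.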

Reading the microthreading design notes, I recall a comparison of the NPTL
and Erlang threading models on the Debian site - they are _completely_
different models. NPTL creates real threads, which is what I suppose
(I hope NOT) the microthreading design does too. It is slow.
(Or is it not, Zach? We are intrigued :)
It's damn bloody slow to create a thread compared to a correct non-blocking
state machine. The TUX state machine is similar to what I had in my first
kevent based FS and network AIO patchset, and to what I will use for the
current async processing work.


A bit of empty words actually, but it can provide some food for
thought.

> Regards,
> 
> Dan

-- 
Evgeniy Polyakov


Re: [RFC] Heads up on a series of AIO patchsets

2007-01-02 Thread Suparna Bhattacharya
On Tue, Jan 02, 2007 at 03:56:09PM -0800, Zach Brown wrote:
> Sorry for the delay, I'm finally back from the holiday break :)

Welcome back !

> 
> >(1) The filesystem AIO patchset, attempts to address one part of
> >the problem
> >which is to make regular file IO, (without O_DIRECT)
> >asynchronous (mainly
> >the case of reads of uncached or partially cached files, and
> >O_SYNC writes).
> 
> One of the properties of the currently implemented EIOCBRETRY aio
> path is that ->mm is the only field in current which matches the
> submitting task_struct while inside the retry path.

Yes, and that, as I guess you know, is to enable the aio worker thread to
operate on the caller's address space for copy_from/to_user.

The actual io setup and associated checks are expected to have been
handled at submission time.

> 
> It looks like a retry-based aio write path would be broken because of
> this.  generic_write_checks() could run in the aio thread and get its
> task_struct instead of that of the submitter.  The wrong rlimit will
> be tested and SIGXFSZ won't be raised.  remove_suid() could check the
> capabilities of the aio thread instead of those of the submitter.

generic_write_checks() are done in the submission path, not repeated during
retries, so such types of checks are not intended to run in the aio thread.

Did I miss something here ?

> 
> I don't think EIOCBRETRY is the way to go because of this increased
> (and subtle!) complexity.  What are the chances that we would have
> ever found those bugs outside code review?  How do we make sure that
> current references don't sneak back in after having initially audited
> the paths?

The EIOCBRETRY route is not something that is intended to be used blindly.
It is just one alternative for implementing an aio operation by splitting up
responsibility between the submitter and aio threads, where aio threads
can run in the caller's address space.

> 
> Take the io_cmd_epoll_wait patch..
> 
> >issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach
> >Brown with
> >modifications from Jeff Moyer and me) addresses this problem
> >for native
> >linux aio in a simple manner.
> 
> It's simple looking, sure.  This current flipping didn't even occur
> to me while throwing the patch together!
> 
> But that patch ends up calling ->poll (and poll_table->qproc) and
> writing to userspace (so potentially calling ->nopage) from the aio

Yes of course, but why is that a problem ?
The copy_from/to_user/put_user constructs are designed to handle soft failures,
and we are already using the caller's ->mm. Do you see a need for any
additional asserts() ?

If there is something that is needed by ->nopage etc which is not abstracted
out within the ->mm, then we would need to fix that instead, for correctness
anyway, isn't that so ?

Now it is possible that there are minor blocking points in the code and the
effect of these would be to hold up / delay subsequent queued aio operations;
which is an efficiency issue, but not a correctness concern.

> threads.  Are we sure that none of them will behave surprisingly
> because current changed under them?

My take is that we should fix the problems that we see. It is likely that
what manifests relatively more easily with AIO is also a subtle problem
in other cases.

> 
> It might be safe now, but that isn't really the point.  I'd rather we
> didn't have yet one more subtle invariant to audit and maintain.
> 
> At the risk of making myself vulnerable to the charge of mentioning
> vapourware, I will admit that I've been working on a (slightly mad)
> implementation of async syscalls.  I've been quiet about it because I
> don't want to whip up complicated discussion without being able to
> show code that works, even if barely.  I mention it now only to make
> it clear that I want to be constructive, not just critical :).

That is great and I look forward to it :) I am, however, assuming that
whatever implementation you come up with will have a different interface
from current linux aio -- i.e. a next generation aio model that will be
easily integratable with kevents etc.

Which takes me back to Ingo's point - let's have the new evolve in parallel
with the old, if we can, and not hold up the patches for POSIX AIO to
start using kernel AIO, or for epoll to integrate with AIO.

OK, I just took a quick look at your blog and I see that you
are basically implementing Linus' microthreads scheduling approach -
one year since we had that discussion. Glad to see that you found a way
to make it workable ... (I'm guessing that you are copying over the part
of the stack in use at the time of every switch, is that correct ? At what
point do you do the allocation of the saved stacks ? Sorry I should hold
off all these questions till your patch comes out)

Regards
Suparna

> 
> - z

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India


Re: [RFC] Heads up on a series of AIO patchsets

2007-01-02 Thread Kent Overstreet

>> Any details?
>
> Well, one path I tried I couldn't help but post a blog entry about
> for my friends.  I'm not sure it's the direction I'll take with linux-
> kernel, but the fundamentals are there:  the api should be the
> syscall interface, and there should be no difference between sync and
> async behaviour.
>
> http://www.zabbo.net/?p=72


Any code you're willing to let people play with? I could at least have
real test cases, and a library to go along with it as it gets
finished.

Another pie in the sky idea:
One thing that's been bugging me lately (working on a 9p server), is
sendfile is hard to use in practice because you need packet headers
and such, and they need to go out at the same time.
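For what it's worth, the usual workaround on Linux today is TCP_CORK: cork the socket, write() the headers, sendfile() the body, then uncork so the kernel coalesces both into full-sized segments. A sketch with error handling trimmed:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send a small header followed by a file's contents on a TCP socket,
 * letting TCP_CORK merge them into the same segments (Linux-specific). */
static int send_with_header(int sock, const char *hdr, size_t hlen,
			    int filefd, size_t flen)
{
	int on = 1, off = 0;
	off_t offset = 0;

	if (setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on)) < 0)
		return -1;
	if (write(sock, hdr, hlen) != (ssize_t)hlen)
		return -1;
	if (sendfile(sock, filefd, &offset, flen) != (ssize_t)flen)
		return -1;
	/* uncorking flushes anything still queued */
	return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}
```

This gets the "go out at the same time" property for a single header+file pair, though it is still two syscalls and says nothing about batching arbitrary async ops.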

Sendfile listio support would fix this, but it's not a general
solution. What would be really useful is a way to say that a certain
batch of async ops either all succeed or all fail, and happen
atomically; i.e., transactions for syscalls.

Probably even harder to do than general async syscalls, but it'd be
the best thing since sliced bread... and hey, it seems the logical
next step.


Re: [RFC] Heads up on a series of AIO patchsets

2007-01-02 Thread Zach Brown

Sorry for the delay, I'm finally back from the holiday break :)

> (1) The filesystem AIO patchset, attempts to address one part of the
> problem which is to make regular file IO, (without O_DIRECT)
> asynchronous (mainly the case of reads of uncached or partially cached
> files, and O_SYNC writes).


One of the properties of the currently implemented EIOCBRETRY aio  
path is that ->mm is the only field in current which matches the  
submitting task_struct while inside the retry path.


It looks like a retry-based aio write path would be broken because of  
this.  generic_write_checks() could run in the aio thread and get its  
task_struct instead of that of the submitter.  The wrong rlimit will  
be tested and SIGXFSZ won't be raised.  remove_suid() could check the  
capabilities of the aio thread instead of those of the submitter.


I don't think EIOCBRETRY is the way to go because of this increased  
(and subtle!) complexity.  What are the chances that we would have  
ever found those bugs outside code review?  How do we make sure that  
current references don't sneak back in after having initially audited  
the paths?


Take the io_cmd_epoll_wait patch..

> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for native
> linux aio in a simple manner.


It's simple looking, sure.  This current flipping didn't even occur  
to me while throwing the patch together!


But that patch ends up calling ->poll (and poll_table->qproc) and  
writing to userspace (so potentially calling ->nopage) from the aio  
threads.  Are we sure that none of them will behave surprisingly  
because current changed under them?


It might be safe now, but that isn't really the point.  I'd rather we  
didn't have yet one more subtle invariant to audit and maintain.


At the risk of making myself vulnerable to the charge of mentioning  
vapourware, I will admit that I've been working on a (slightly mad)  
implementation of async syscalls.  I've been quiet about it because I  
don't want to whip up complicated discussion without being able to  
show code that works, even if barely.  I mention it now only to make  
it clear that I want to be constructive, not just critical :).


- z


Re: [RFC] Heads up on a series of AIO patchsets

2007-01-02 Thread Dan Williams

On 12/28/06, Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

> [ I'm only subscribed to linux-fsdevel@ from above Cc list, please keep this
> list in Cc: for AIO related stuff. ]
>
> On Wed, Dec 27, 2006 at 04:25:30PM +, Christoph Hellwig ([EMAIL PROTECTED])
> wrote:
> > (1) note that there is another problem with the current kevent interface,
> >   and that is that it duplicates the event infrastructure for its
> >   underlying subsystems instead of reusing existing code (e.g.
> >   inotify, epoll, dio-aio).  If we want kevent to be _the_ unified
> >   event system for Linux we need people to help out with straightening
> >   out these event providers as Evgeny seems to be unwilling/unable to
> >   do the work himself and the duplication is simply not acceptable.
>
> I would rewrite inotify/epoll to use kevent, but I would strongly prefer
> that it be done by the people who created the original interfaces - it is
> a political decision, not a technical one - I do not want to be blamed on
> each corner that I killed other people's work :)
>
> FS and network AIO kevent based stuff was dropped from the kevent tree in
> favour of an upcoming project (description below).
>
> Regarding AIO - my personal opinion is that AIO should be designed
> asynchronously in all aspects. Here is a brief note on how I plan to
> implement it (I plan to start in about a week, after the New Year vacations).
>
> ===
>
> All existing AIO - both mainline and kevent based - lacks a major feature:
> it is not fully asynchronous, i.e. it requires a synchronous set of steps,
> some of which could be asynchronous. For example aio_sendfile() [1]
> requires an open of the file descriptor and only then the aio_sendfile()
> call. The same applies to mainline AIO and read/write calls.
>
> My idea is to create real asynchronous IO - i.e. some entity which
> describes a set of tasks which should be performed asynchronously (from
> the user's point of view, although read and write obviously must be done
> after open and before close); for example, a syscall which gets as
> parameters a destination socket and a local filename (with optional offset
> and length fields), which will, asynchronously from the user's point of
> view, open a file and transfer the requested part to the destination
> socket and then return the opened file descriptor (or it can be closed if
> requested). A similar mechanism can be provided for read/write calls.
>
> This approach, as asynchronous IO in general, requires access to user
> memory from a kernel thread or even an interrupt handler (that is where
> kevent based AIO completes its requests) - it can be done in a way similar
> to the existing kevent ring buffer implementation, and can also use a
> dedicated kernel thread or workqueue to copy data into process memory.


Would you have time to comment on the approach I have taken to
implement a standard asynchronous memcpy interface?  It seems it would
be a good complement to what you are proposing.  The entity that
describes the aio operation could take advantage of asynchronous
engines to carry out copies or other transforms (maybe an acrypto tie
in as well).

Here is the posting for 2.6.19.  There has since been updates for
2.6.20, but the overall approach remains the same.
intro: http://marc.theaimsgroup.com/?l=linux-raid&m=116491661527161&w=2
async_tx: http://marc.theaimsgroup.com/?l=linux-raid&m=116491753318175&w=2

Regards,

Dan


Re: [RFC] Heads up on a series of AIO patchsets

2006-12-28 Thread Evgeniy Polyakov
[ I'm only subscribed to linux-fsdevel@ from above Cc list, please keep this
list in Cc: for AIO related stuff. ]

On Wed, Dec 27, 2006 at 04:25:30PM +, Christoph Hellwig ([EMAIL PROTECTED]) 
wrote:
> (1) note that there is another problem with the current kevent interface,
>   and that is that it duplicates the event infrastructure for its
>   underlying subsystems instead of reusing existing code (e.g.
>   inotify, epoll, dio-aio).  If we want kevent to be _the_ unified
>   event system for Linux we need people to help out with straightening
>   out these event providers as Evgeny seems to be unwilling/unable to
>   do the work himself and the duplication is simply not acceptable.

I would rewrite inotify/epoll to use kevent, but I would strongly prefer
that it be done by the people who created the original interfaces - it is
a political decision, not a technical one - I do not want to be blamed on
each corner that I killed other people's work :)

FS and network AIO kevent based stuff was dropped from the kevent tree in
favour of an upcoming project (description below).

Regarding AIO - my personal opinion is that AIO should be designed
asynchronously in all aspects. Here is a brief note on how I plan to
implement it (I plan to start in about a week, after the New Year vacations).

===

All existing AIO - both mainline and kevent based - lacks a major feature:
it is not fully asynchronous, i.e. it requires a synchronous set of steps,
some of which could be asynchronous. For example aio_sendfile() [1]
requires an open of the file descriptor and only then the aio_sendfile()
call. The same applies to mainline AIO and read/write calls.

My idea is to create real asynchronous IO - i.e. some entity which
describes a set of tasks which should be performed asynchronously (from
the user's point of view, although read and write obviously must be done
after open and before close); for example, a syscall which gets as
parameters a destination socket and a local filename (with optional offset
and length fields), which will, asynchronously from the user's point of
view, open a file and transfer the requested part to the destination
socket and then return the opened file descriptor (or it can be closed if
requested). A similar mechanism can be provided for read/write calls.

This approach, as asynchronous IO in general, requires access to user
memory from a kernel thread or even an interrupt handler (that is where
kevent based AIO completes its requests) - it can be done in a way similar
to the existing kevent ring buffer implementation, and can also use a
dedicated kernel thread or workqueue to copy data into process memory.

It is a very interesting task and should greatly speed up workloads of
busy web/ftp and other servers, which can work with a huge number of
files and a huge number of clients.
I've put it into my TODO list.

Someone, please stop the time for several days, so I could create some
really good things for the universe.

1. Network AIO
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

-- 
Evgeniy Polyakov


Re: [RFC] Heads up on a series of AIO patchsets

2006-12-27 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> > unified event system for Linux we need people to help out with 
> > straightening out these event providers as Evgeny seems to be 
> > unwilling/unable to do the work himself and the duplication is 
> > simply not acceptable.
> 
> yeah. The internal machinery should be as unified as possible - but 
> different sets of APIs can be offered, to make it easy for people to 
> extend their existing apps in the most straightforward way.

just to expand on this: i dont think this should be an impediment to the 
POSIX AIO patches. We should get some movement into this and should give 
the capability to glibc and applications. Kernel-internal unification is 
something we are pretty good at doing after the fact. (and if any of the 
APIs dies or gets very uncommon we know in which direction to unify)

Ingo


Re: [RFC] Heads up on a series of AIO patchsets

2006-12-27 Thread Ingo Molnar

* Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> The real question here is which interface we want people to use for 
> these "combined" applications.  Evgeny is heavily pushing kevent for 
> this while other seem to prefer integration epoll into the aio 
> interface. (1)
> 
> I must admit that kevent seems to be the cleaner way to support this, 
> although I see some advantages for the aio variant.  I do think 
> however that we should not actively promote two different interfaces 
> long term.

i see no fundamental disadvantage from doing both. That way the 'market' 
of applications will vote. (we have 2 other fundamental types available 
as well: sync IO and poll() based IO - so it's not like we have the 
choice between 2 or 1 variant, we have the choice between 4 or 3 
variants)

> (1) note that there is another problem with the current kevent
> interface, and that is that it duplicates the event infrastructure 
> for its underlying subsystems instead of reusing existing code 
> (e.g. inotify, epoll, dio-aio).  If we want kevent to be _the_ 
> unified event system for Linux we need people to help out with 
> straightening out these event providers as Evgeny seems to be 
> unwilling/unable to do the work himself and the duplication is 
> simply not acceptable.

yeah. The internal machinery should be as unified as possible - but 
different sets of APIs can be offered, to make it easy for people to 
extend their existing apps in the most straightforward way.

(In fact i'd like to see all the 'poll table' code to be unified into 
this as well, if possible - it does not really "poll" anything, it's an 
event infrastructure as well, used via the naive select() and poll() 
syscalls. We should fix that naming mistake.)

Ingo


Re: [RFC] Heads up on a series of AIO patchsets

2006-12-27 Thread Christoph Hellwig
On Wed, Dec 27, 2006 at 09:08:56PM +0530, Suparna Bhattacharya wrote:
> (2) Most of these other applications need the ability to process both
> network events (epoll) and disk file AIO in the same loop. With POSIX AIO
> they could at least sort of do this using signals (yeah, and all 
> associated
> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for native
> linux aio in a simple manner. Tridge has written a test harness to 
> try out the Samba4 event library modifications to use this. Jeff Moyer
> has a modified version of pipetest for comparison.

The real question here is which interface we want people to use for these
"combined" applications.  Evgeny is heavily pushing kevent for this while
other seem to prefer integration epoll into the aio interface. (1)

I must admit that kevent seems to be the cleaner way to support this,
although I see some advantages for the aio variant.  I do think however
that we should not actively promote two different interfaces long term.


(1) note that there is another problem with the current kevent interface,
and that is that it duplicates the event infrastructure for its
underlying subsystems instead of reusing existing code (e.g.
inotify, epoll, dio-aio).  If we want kevent to be _the_ unified
event system for Linux we need people to help out with straightening
out these event providers as Evgeny seems to be unwilling/unable to
do the work himself and the duplication is simply not acceptable.



[RFC] Heads up on a series of AIO patchsets

2006-12-27 Thread Suparna Bhattacharya

Here is a quick attempt to summarize where we are heading with a bunch of
AIO patches that I'll be posting over the next few days. Because a few of
these patches have been hanging around for a while, and have gone through
bursts of iterations from time to time while falling dormant in other
phases, the intent of this note is to help pull things together into a
coherent picture, so that folks can comment on the patches and we can
arrive at a decision of some sort.

Native linux aio (i.e using libaio) is properly supported (in the sense of
being asynchronous) only for files opened with O_DIRECT, which actually
suffices for a major (and most visible) user of AIO, i.e. databases.

However, for other types of users, e.g. Samba and other applications which
use POSIX AIO, there have been several issues outstanding for a while:

(1) The filesystem AIO patchset, attempts to address one part of the problem
which is to make regular file IO, (without O_DIRECT) asynchronous (mainly
the case of reads of uncached or partially cached files, and O_SYNC writes).

(2) Most of these other applications need the ability to process both
network events (epoll) and disk file AIO in the same loop. With POSIX AIO
they could at least sort of do this using signals (yeah, and all associated
issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
modifications from Jeff Moyer and me) addresses this problem for native
linux aio in a simple manner. Tridge has written a test harness to 
try out the Samba4 event library modifications to use this. Jeff Moyer
has a modified version of pipetest for comparison.

(3) For glibc POSIX AIO to switch to using native AIO (instead of simulation
with threads) kernel changes are needed to ensure aio sigevent notification
and efficient listio support. Sebestian Dugue's patches for aio sigevent
notifications have undergone several review iterations and seem to be
in good shape now. His patch for lio_listio is pending discussion
on whether to implement it as a separate syscall rather than an additional
iocb command. Bharata B Rao has posted a patch with the syscall variation
for review.

(4) If glibc POSIX AIO switches completely to using native AIO then it
would need basic AIO support for various file types - including sockets,
pipes etc. Since it no longer will be simulating asynchronous behaviour
with threads, it expects the underlying implementation to be asynchronous.
Which is still an issue with native linux AIO, but I now think the problem
to be tractable without a lot of additional work. While (1) helps the case
for regular files, (2) now provides us an alternative infrastructure to
simulate this in kernel using async epoll and O_NONBLOCK for all pollable
fds, i.e. sockets, pipes etc. This should be good enough for working
POSIX AIO.
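A minimal userspace sketch of the simulation described in (4) (illustrative only; the in-kernel version would queue the iocb and retry on wakeup rather than block in epoll_wait()): mark the pollable fd O_NONBLOCK and drive the actual IO from epoll readiness, with a pipe standing in for a socket.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register a pollable fd with epoll and switch it to non-blocking mode,
 * so the IO itself never sleeps -- only the readiness wait does. */
static int watch_fd(int epfd, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };

	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* One "async style" read: wait for readiness, then do the actual
 * non-blocking read on whichever fd signalled. */
static ssize_t ready_read(int epfd, void *buf, size_t len)
{
	struct epoll_event ev;

	if (epoll_wait(epfd, &ev, 1, 1000) != 1)
		return -1;
	return read(ev.data.fd, buf, len);
}
```

The same readiness-then-retry loop is what an in-kernel async-epoll backend could perform on behalf of a queued POSIX AIO request for sockets and pipes.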

(5) That leaves just one more todo - implementing aio_fsync() in kernel.

Please note that all of this work is not in conflict with kevent development.
In fact it is my hope that progress made in getting these pieces of the
puzzle in place would also help us along the long term goal of eventual
convergence.

Regards
Suparna

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
