Re: [patch 06/11] syslets: core, documentation
On 2/14/07, Benjamin LaHaise <[EMAIL PROTECTED]> wrote:
> My opinion of this whole thread is that it implies that our thread
> creation and/or context switch is too slow. If that is the case, improve
> those elements first. At least some of those optimizations have to be done
> in hardware on x86, while on other platforms they are probably unnecessary.

Not necessarily too slow, but too opaque in terms of system-wide impact and global flow control. Here are the four practical use cases that I have seen come up in this discussion:

1) Databases that want to parallelize I/O storms, with an emphasis on getting results that are already cache-hot immediately (not least so they don't get evicted by other I/O results); there is also room to push some of the I/O clustering and sequencing logic down into the kernel.

2) Static-content-intensive network servers, with an emphasis on servicing those requests that can be serviced promptly (to avoid a ballooning connection backlog) and avoiding duplication of I/O effort when many clients suddenly want the same cold content; the real win may be in "smart prefetch" initiated from outside the network server proper.

3) Network information gathering GUIs, which want to harvest as much information as possible for immediate display and then switch to an event-based delivery mechanism for tardy responses; these need throttling of concurrent requests (ideally, in-kernel traffic shaping by request group and destination class) and efficient cancellation of no-longer-interesting requests.

4) Document search facilities, which need all of the above (big surprise there) as well as a rich diagnostic toolset, including a practical snooping and profiling facility to guide tuning for application responsiveness.

Even if threads were so cheap that you could just fire off one per I/O request, they're a poor solution to the host of flow control issues raised in these use cases.
A sequential thread of execution per I/O request may be the friendliest mental model for the individual delayed I/Os, but the global traffic shaping and scheduling is a data structure problem. The right question to be asking is: what are the operations that need to be supported on the system-wide pool of pending AIOs, and on what data structure can they be implemented efficiently?

For instance, can we provide an RCU priority queue implementation (perhaps based on splay trees) so that userspace can scan a coherent read-only snapshot of the structure and select candidates for cancellation, etc., without interfering with kernel completions? Or is it more important to have a three-sided query operation (characteristic of priority search trees), or perhaps a lower amortized cost bulk delete?

Once you've thought through the data structure manipulation, you'll know what AIO submission / cancellation / reprioritization interfaces are practical. Then you can work on a programming model for application-level "I/O completions" that is library-friendly and allows a "fast path" optimization for the fraction of requests that can be served synchronously. Then and only then does it make sense to code-bum the asynchronous path. Not that it isn't interesting to think in advance about what stack space completions will run in and which bits of the task struct needn't be in a coherent condition; but that's probably not going to guide you to the design that meets the application needs.

I know I'm teaching my grandmother to suck eggs here. But there are application programmers high up the code stack whose code makes implicit use of asynchronous I/O continuations. In addition to the GUI example I blathered about a few days ago, I have in mind Narrative Javascript's "blocking operator" and Twisted Python's Deferred. Those folks would be well served by an async I/O interface to the kernel which mates well to their language's closure/continuation facilities.
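To make the "pool of pending AIOs" question concrete, here is a minimal user-space sketch of the operations being discussed: submit with a priority, pick the most urgent request, and cancel a no-longer-interesting one. It is a plain binary min-heap, not the RCU/splay-tree structure proposed above, and all names (`aio_pool`, `aio_submit`, etc.) are illustrative inventions, not any kernel API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical pending-AIO pool: a binary min-heap keyed by priority,
 * supporting submit, pick-next (smallest key = most urgent), and
 * cancellation by request id.  Names are illustrative, not a kernel API. */
struct aio_req { int id; int prio; };

struct aio_pool { struct aio_req heap[64]; int n; };

static void req_swap(struct aio_req *a, struct aio_req *b)
{ struct aio_req t = *a; *a = *b; *b = t; }

static void sift_up(struct aio_pool *p, int i)
{
    while (i > 0 && p->heap[(i - 1) / 2].prio > p->heap[i].prio) {
        req_swap(&p->heap[(i - 1) / 2], &p->heap[i]);
        i = (i - 1) / 2;
    }
}

static void sift_down(struct aio_pool *p, int i)
{
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < p->n && p->heap[l].prio < p->heap[m].prio) m = l;
        if (r < p->n && p->heap[r].prio < p->heap[m].prio) m = r;
        if (m == i) return;
        req_swap(&p->heap[i], &p->heap[m]);
        i = m;
    }
}

int aio_submit(struct aio_pool *p, int id, int prio)
{
    if (p->n == 64) return -1;
    p->heap[p->n].id = id;
    p->heap[p->n].prio = prio;
    sift_up(p, p->n++);
    return 0;
}

/* Pop the most urgent request; returns its id, or -1 if empty. */
int aio_pick_next(struct aio_pool *p)
{
    if (p->n == 0) return -1;
    int id = p->heap[0].id;
    p->heap[0] = p->heap[--p->n];
    sift_down(p, 0);
    return id;
}

/* Cancel a request by id: O(n) scan plus O(log n) heap fix-up. */
int aio_cancel(struct aio_pool *p, int id)
{
    for (int i = 0; i < p->n; i++) {
        if (p->heap[i].id == id) {
            p->heap[i] = p->heap[--p->n];
            sift_down(p, i);
            sift_up(p, i);
            return 0;
        }
    }
    return -1;
}
```

The linear-scan cancel is exactly the kind of cost the three-sided-query and bulk-delete alternatives above are meant to avoid once the pool gets large.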
If it's usable from C, that's nice too. :-) Cheers, - Michael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 06/11] syslets: core, documentation
On Wed, 14 Feb 2007, Benjamin LaHaise wrote:
> On Wed, Feb 14, 2007 at 03:17:59PM -0800, Davide Libenzi wrote:
> > > That's an incorrect assumption. Every task/thread in the system has FPU
> > > state associated with it, in part due to the fact that glibc has to change
> > > some of the rounding mode bits, making them different than the default from
> > > a freshly initialized state.
> >
> > IMO I still believe this is not a huge problem. FPU state propagation/copy
> > can be done in a clever way, once we detect the in-async condition.
>
> Show me. clts() and stts() are expensive hardware operations which there
> is no means of avoiding, as control register writes impact the CPU in a
> non-trivial manner. I've spent far too much time staring at profiles of what
> goes on in the context switch code in the process of looking for optimizations
> on this very issue to be ignored on this point.

The trivial case is the cache-hit case. Everything flows as usual, since we don't swap threads. If we're going to sleep, __async_schedule has to save/copy (depending on whether TS_USEDFPU is set) the current FPU state to the newly selected service thread (the return-to-userspace thread). When a fault eventually happens in the new userspace thread, the context is restored.

- Davide
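Davide's lazy hand-off can be modelled in user space: pay for the FPU-state copy only when the departing task actually touched the FPU (the TS_USEDFPU idea), and do nothing at all on the cache-hit path. This is a toy simulation under made-up types (`toy_task`, `TOY_USEDFPU`), not the real task struct or flag handling.

```c
#include <assert.h>
#include <string.h>

/* Toy model of the lazy FPU hand-off: on an async schedule-out, copy the
 * FPU image to the service thread only when the departing task touched
 * the FPU.  Structs and the flag are illustrative, not kernel types. */
#define TOY_USEDFPU 0x1

struct toy_task {
    unsigned int status;
    unsigned char fpu_state[512];   /* stand-in for an FXSAVE area */
};

static int fpu_copies;              /* counts how often we pay for a copy */

void toy_async_schedule(struct toy_task *prev, struct toy_task *next)
{
    if (prev->status & TOY_USEDFPU) {
        /* Expensive path: taken only when FPU state is live. */
        memcpy(next->fpu_state, prev->fpu_state, sizeof(next->fpu_state));
        next->status |= TOY_USEDFPU;
        fpu_copies++;
    }
    /* Cache-hit path: no schedule-out happened earlier, nothing to do. */
}
```

Ben's counterpoint, of course, is that glibc dirties the FPU state in practice, so the cheap branch may be rare; the sketch only shows where the conditional would sit.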
Re: [patch 06/11] syslets: core, documentation
On Wed, Feb 14, 2007 at 03:17:59PM -0800, Davide Libenzi wrote:
> > That's an incorrect assumption. Every task/thread in the system has FPU
> > state associated with it, in part due to the fact that glibc has to change
> > some of the rounding mode bits, making them different than the default from
> > a freshly initialized state.
>
> IMO I still believe this is not a huge problem. FPU state propagation/copy
> can be done in a clever way, once we detect the in-async condition.

Show me. clts() and stts() are expensive hardware operations which there is no means of avoiding, as control register writes impact the CPU in a non-trivial manner. I've spent far too much time staring at profiles of what goes on in the context switch code in the process of looking for optimizations on this very issue to be ignored on this point.

-ben
--
"Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 06/11] syslets: core, documentation
On Wed, 14 Feb 2007, Benjamin LaHaise wrote:
> On Wed, Feb 14, 2007 at 01:06:59PM -0800, Davide Libenzi wrote:
> > Bear with me Ben, and let's follow this up :) If you are in the middle of
> > an MMX copy operation, inside the syscall, you are:
> >
> > - Userspace, on task A, calls sys_async_exec
> >
> > - Userspace is _not_ doing any MMX stuff before the call
>
> That's an incorrect assumption. Every task/thread in the system has FPU
> state associated with it, in part due to the fact that glibc has to change
> some of the rounding mode bits, making them different than the default from
> a freshly initialized state.

IMO I still believe this is not a huge problem. FPU state propagation/copy can be done in a clever way, once we detect the in-async condition.

- Davide
Re: [patch 06/11] syslets: core, documentation
On Wed, Feb 14, 2007 at 01:06:59PM -0800, Davide Libenzi wrote:
> Bear with me Ben, and let's follow this up :) If you are in the middle of
> an MMX copy operation, inside the syscall, you are:
>
> - Userspace, on task A, calls sys_async_exec
>
> - Userspace is _not_ doing any MMX stuff before the call

That's an incorrect assumption. Every task/thread in the system has FPU state associated with it, in part due to the fact that glibc has to change some of the rounding mode bits, making them different than the default from a freshly initialized state.

> - We wake task B that will return to userspace

At which point task B has to touch the FPU in userspace as part of the cleanup, which adds back in an expensive operation to the whole process. The whole context switch mechanism is a zero-sum game -- everything that occurs does so because it *must* be done. If you remove something at one point, then it has to occur somewhere else.

My opinion of this whole thread is that it implies that our thread creation and/or context switch is too slow. If that is the case, improve those elements first. At least some of those optimizations have to be done in hardware on x86, while on other platforms they are probably unnecessary.

Fwiw, there are patches floating around that did AIO via kernel threads for file descriptors that didn't implement AIO (and remember: kernel thread context switches are cheaper than userland thread context switches). At least take a stab at measuring what the performance differences are and what optimizations are possible before prematurely introducing a new "fast" way of doing things that adds a bunch more to maintain.

-ben
--
"Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 06/11] syslets: core, documentation
On Wed, 14 Feb 2007, Benjamin LaHaise wrote:
> On Wed, Feb 14, 2007 at 12:14:29PM -0800, Davide Libenzi wrote:
> > I think you may have mis-interpreted my words. *When* a schedule would
> > block a synchronous execution try, then you do have a context switch. No one
> > argues that, and the code is clear. The sys_async_exec thread will block,
> > and a newly woken-up thread will re-emerge from sys_async_exec with a NULL
> > returned to userspace. But in a "cache-hit" case (no schedule happens
> > during the syscall/*let execution), there is no context switch at all.
> > That is the whole point of the optimization.
>
> And I will repeat myself: that cannot be done. Tell me how the following
> what-if scenario works: you're in an MMX-optimized memory copy and you take
> a page fault. How does returning to the submitter of the async operation
> get the correct MMX state restored? It doesn't.

Bear with me, Ben, and let's follow this up :) If you are in the middle of an MMX copy operation, inside the syscall, you are:

- Userspace, on task A, calls sys_async_exec
- Userspace is _not_ doing any MMX stuff before the call
- We execute the syscall
- Task A, executing the syscall and inside an MMX copy operation, gets a page fault
- We get a schedule
- Task A's MMX state will *follow* task A, which will be put to sleep
- We wake task B, which will return to userspace

So if the MMX work happens inside the syscall execution, we're fine, because its context will follow the same task being put to sleep. The problem would be preserving the *caller* (userspace) context. But that can be done in a lazy way (detecting whether task A used the FPU), like we're currently doing it, once we detect a schedule-out condition. That wouldn't be the most common case for many userspace programs in any case.
- Davide
Re: [patch 06/11] syslets: core, documentation
On Wed, Feb 14, 2007 at 12:14:29PM -0800, Davide Libenzi wrote:
> I think you may have mis-interpreted my words. *When* a schedule would
> block a synchronous execution try, then you do have a context switch. No one
> argues that, and the code is clear. The sys_async_exec thread will block,
> and a newly woken-up thread will re-emerge from sys_async_exec with a NULL
> returned to userspace. But in a "cache-hit" case (no schedule happens
> during the syscall/*let execution), there is no context switch at all.
> That is the whole point of the optimization.

And I will repeat myself: that cannot be done. Tell me how the following what-if scenario works: you're in an MMX-optimized memory copy and you take a page fault. How does returning to the submitter of the async operation get the correct MMX state restored? It doesn't.

-ben
--
"Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 06/11] syslets: core, documentation
On Wed, 14 Feb 2007, Benjamin LaHaise wrote:
> On Wed, Feb 14, 2007 at 11:45:23AM -0800, Davide Libenzi wrote:
> > Sort of, except that the whole thing can complete synchronously w/out
> > context switches. The real point of the whole fibrils/syslets solution is
> > that kind of optimization. The solution is as good as it is now, for
>
> Except that You Can't Do That (tm). Try to predict beforehand if the code
> path being followed will touch the FPU or SSE state, and you can't. There is
> no way to avoid the context switch overhead, as you have to preserve things
> so that whatever state is being returned to the user is as it was. Unless
> you plan on resetting the state beforehand, but then you have to call into
> arch-specific code that ends up with a comparable overhead to the context
> switch.

I think you may have mis-interpreted my words. *When* a schedule would block a synchronous execution try, then you do have a context switch. No one argues that, and the code is clear. The sys_async_exec thread will block, and a newly woken-up thread will re-emerge from sys_async_exec with a NULL returned to userspace. But in a "cache-hit" case (no schedule happens during the syscall/*let execution), there is no context switch at all. That is the whole point of the optimization.

- Davide
Re: [patch 06/11] syslets: core, documentation
On Wed, Feb 14, 2007 at 11:45:23AM -0800, Davide Libenzi wrote:
> Sort of, except that the whole thing can complete synchronously w/out
> context switches. The real point of the whole fibrils/syslets solution is
> that kind of optimization. The solution is as good as it is now, for

Except that You Can't Do That (tm). Try to predict beforehand if the code path being followed will touch the FPU or SSE state, and you can't. There is no way to avoid the context switch overhead, as you have to preserve things so that whatever state is being returned to the user is as it was. Unless you plan on resetting the state beforehand, but then you have to call into arch-specific code that ends up with a comparable overhead to the context switch.

-ben
--
"Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 06/11] syslets: core, documentation
On Wed, 14 Feb 2007, Benjamin LaHaise wrote:
> On Wed, Feb 14, 2007 at 09:52:20AM -0800, Davide Libenzi wrote:
> > That'd be, instead of passing a chain of atoms, with the kernel
> > interpreting conditions, and parameter lists, etc..., we let gcc
> > do this stuff for us, and we pass the "clet" :) pointer to sys_async_exec,
> > which execs the above under the same schedule-trapped environment, but in
> > userspace. We set up a special userspace ad-hoc frame (ala signal), and we
> > trap the underlying task schedule attempt in the same way we do now.
> > We set up the frame and when we return from sys_async_exec, we basically
> > enter the "clet", which will return to a ret_from_async, which will return
> > to userspace. Or, maybe we can support both. A simple single-syscall exec
> > in the way we do now, and a clet way for the ones that require chains and
> > conditions. Hmmm?
>
> Which is just the same as using threads. My argument is that once you
> look at all the details involved, what you end up arriving at is the
> creation of threads. Threads are relatively cheap; it's just that the
> hardware currently has several performance bugs with them on x86 (and more
> on x86-64 with the MSR fiddling that hits the hot path). Architectures
> like powerpc are not going to benefit anywhere near as much from this
> exercise, as the state involved is processed much more sanely. IA64 as
> usual is simply doomed by way of having too many registers to switch.

Sort of, except that the whole thing can complete synchronously w/out context switches. The real point of the whole fibrils/syslets solution is that kind of optimization. The solution is as good as it is now for single syscalls (modulo the sys_async_cancel implementation), but for multiple chained submission it kinda stinks IMHO. Once you have to build chains, and conditions, and new syscalls to implement userspace variable increments, and so on..., at that point it's better to have the chain coded in C ala a thread proc.
Yes, it requires a frame setup and another entry to the kernel, but IMO that will be amortized in the cost of the multiple syscalls inside the "clet".

- Davide
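Davide's point about having the chain "coded in C ala thread proc" can be illustrated with an ordinary user-space function: the open -> read -> close chain, conditions included, is just C control flow. In the proposal this function would be handed to sys_async_exec; since that syscall is hypothetical here, the sketch simply calls the clet directly, which is exactly what the cache-hit path would do. The names (`clet_ctx`, `my_clet`) are invented for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* A "clet" written as plain C: gcc compiles the chaining and the error
 * conditions, so no atom lists or kernel-interpreted condition flags are
 * needed.  Context goes in through a single pointer, ala a thread proc. */
struct clet_ctx {
    const char *file;
    char buf[64];
    ssize_t len;
};

long my_clet(void *arg)
{
    struct clet_ctx *c = arg;

    int fd = open(c->file, O_RDONLY);
    if (fd == -1)
        return -1;                  /* the "condition" is just an if() */
    c->len = read(fd, c->buf, sizeof(c->buf));
    close(fd);
    return c->len < 0 ? -1 : 0;
}
```

The trade-off discussed above is visible here: this form needs a user-space frame and one kernel entry per syscall in the chain, whereas the atom form submits the whole chain in one shot.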
Re: [patch 06/11] syslets: core, documentation
On Wed, Feb 14, 2007 at 09:52:20AM -0800, Davide Libenzi wrote:
> That'd be, instead of passing a chain of atoms, with the kernel
> interpreting conditions, and parameter lists, etc..., we let gcc
> do this stuff for us, and we pass the "clet" :) pointer to sys_async_exec,
> which execs the above under the same schedule-trapped environment, but in
> userspace. We set up a special userspace ad-hoc frame (ala signal), and we
> trap the underlying task schedule attempt in the same way we do now.
> We set up the frame and when we return from sys_async_exec, we basically
> enter the "clet", which will return to a ret_from_async, which will return
> to userspace. Or, maybe we can support both. A simple single-syscall exec
> in the way we do now, and a clet way for the ones that require chains and
> conditions. Hmmm?

Which is just the same as using threads. My argument is that once you look at all the details involved, what you end up arriving at is the creation of threads. Threads are relatively cheap; it's just that the hardware currently has several performance bugs with them on x86 (and more on x86-64 with the MSR fiddling that hits the hot path). Architectures like powerpc are not going to benefit anywhere near as much from this exercise, as the state involved is processed much more sanely. IA64 as usual is simply doomed by way of having too many registers to switch.

If people really want to go down this path, please make an effort to compare threads on a properly tuned platform. This means that things like the kernel and userland stacks must take into account cache alignment (we do some of this already, but there are some very definite L1 cache colour collisions between commonly hit data structures amongst threads). The existing AIO ringbuffer suffers from this, as important data is always at the beginning of the first page. Yes, these might be micro-optimizations, but accumulated changes of this nature have been known to buy 100%+ improvements in performance.
-ben
--
"Time is of no importance, Mr. President, only life is important." Don't Email: <[EMAIL PROTECTED]>.
Re: [patch 06/11] syslets: core, documentation
On Wed, 14 Feb 2007, Russell King wrote:
> Let me spell it out, since you appear to have completely missed my point.
>
> At the moment, SKIP_TO_NEXT_ON_STOP is specified to "jump a full
> syslet_uatom number of bytes".
>
> If we end up with a system call being added which requires more than
> the currently allowed number of arguments (and it _has_ happened before)
> then either those syscalls are not accessible to syslets, or you need
> to increase the arg_ptr array.

I was thinking about this yesterday, since I honestly thought that this whole chaining, and conditions, and parameter lists, and arguments passed by pointers, etc... was in the end a little clumsy IMO. Wouldn't a syslet look better like:

long syslet(void *ctx)
{
	struct sctx *c = ctx;

	if (open(c->file, ...) == -1)
		return -1;
	read();
	send();
	blah();
	...
	return 0;
}

That'd be, instead of passing a chain of atoms, with the kernel interpreting conditions, and parameter lists, etc..., we let gcc do this stuff for us, and we pass the "clet" :) pointer to sys_async_exec, which execs the above under the same schedule-trapped environment, but in userspace. We set up a special userspace ad-hoc frame (ala signal), and we trap the underlying task schedule attempt in the same way we do now. We set up the frame and when we return from sys_async_exec, we basically enter the "clet", which will return to a ret_from_async, which will return to userspace. Or, maybe we can support both. A simple single-syscall exec in the way we do now, and a clet way for the ones that require chains and conditions. Hmmm?

- Davide
Re: [patch 06/11] syslets: core, documentation
On Wed, Feb 14, 2007 at 11:50:39AM +0100, Ingo Molnar wrote:
> * Russell King <[EMAIL PROTECTED]> wrote:
> > On Tue, Feb 13, 2007 at 03:20:42PM +0100, Ingo Molnar wrote:
> > > +Arguments to the system call are implemented via pointers to arguments.
> > > +This not only increases the flexibility of syslet atoms (multiple syslets
> > > +can share the same variable for example), but is also an optimization:
> > > +copy_uatom() will only fetch syscall parameters up until the point it
> > > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > > +parameters (and 90% of all syscalls have 4 or less parameters).
> > > +
> > > + [ Note: since the argument array is at the end of the atom, and the
> > > +   kernel will not touch any argument beyond the final NULL one, atoms
> > > +   might be packed more tightly. (the only special case exception to
> > > +   this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> > > +   jump a full syslet_uatom number of bytes.) ]
> >
> > What if you need to increase the number of arguments passed to a
> > system call later? That would be an API change since the size of
> > syslet_uatom would change?
>
> the syslet_uatom has a constant size right now, and space for a maximum
> of 6 arguments. /If/ the user knows that a specific atom (which for
> example does a sys_close()) takes only 1 argument, it could shrink the
> size of the atom down by 4 arguments.
>
> [ i'd not actually recommend doing this, because it's generally a
>   volatile thing to play such tricks - i guess i shouldn't have written
>   that side-note in the header file :-) ]
>
> there should be no new ABI issues: the existing syscall ABI never
> changes, it's only extended. New syslets can rely on new properties of
> new system calls. This is quite parallel to how glibc handles system
> calls.

Let me spell it out, since you appear to have completely missed my point.
At the moment, SKIP_TO_NEXT_ON_STOP is specified to "jump a full syslet_uatom number of bytes".

If we end up with a system call being added which requires more than the currently allowed number of arguments (and it _has_ happened before) then either those syscalls are not accessible to syslets, or you need to increase the arg_ptr array. That makes syslet_uatom larger. If syslet_uatom is larger, SKIP_TO_NEXT_ON_STOP increments the syslet_uatom pointer by a greater number of bytes. If we're running a set of userspace syslets built for an older kernel on such a newer kernel, that is an incompatible change which will break.

> > How do you propose syslet users know about these kinds of ABI issues
> > (including the endian-ness of 64-bit arguments) ?
>
> syslet users would preferably be libraries like glibc - not applications
> - i'm not sure the raw syslet interface should be exposed to
> applications. Thus my current thinking is that syslets ought to be
> per-arch structures - no need to pad them out to 64 bits on 32-bit
> architectures - it's per-arch userspace that makes use of them anyway.
> system call encodings are fundamentally per-arch anyway - every arch
> does various fixups and has its own order of system calls.
>
> but ... i'd not be against having a 'generic syscall layer' though, and
> syslets might be a good starting point for that. But that would
> necessitate a per-arch table translating syscall numbers into this
> 'generic' numbering, at minimum - or a separate sys_async_call_table[].

Okay - I guess the userspace library approach is fine, but it needs to be documented that applications which build syslets directly are going to be non-portable.
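Russell's argument can be made concrete with two hypothetical atom layouts: because SKIP_TO_NEXT_ON_STOP advances by a full sizeof(struct syslet_uatom), growing the argument array changes the stride, and a binary compiled against the old stride lands in the middle of a new-style atom. The structs below are illustrative stand-ins, not the actual patch layout.

```c
#include <assert.h>
#include <stddef.h>

/* Two hypothetical versions of the atom: v2 grows the argument array.
 * A SKIP_TO_NEXT_ON_STOP implemented as "advance by sizeof(atom)" bakes
 * the stride into the compiled binary. */
struct uatom_v1 {
	unsigned long flags, nr;
	long *ret_ptr;
	struct uatom_v1 *next;
	unsigned long *arg_ptr[6];
};

struct uatom_v2 {
	unsigned long flags, nr;
	long *ret_ptr;
	struct uatom_v2 *next;
	unsigned long *arg_ptr[8];	/* two more args -> larger stride */
};

/* Byte offset reached when skipping n atoms with a given stride. */
size_t skip_offset(size_t stride, int n)
{
	return stride * (size_t)n;
}
```

A userspace binary stepping by the v1 stride over an array of v2 atoms reaches an offset that is not a multiple of the v2 size, i.e. it points mid-struct: exactly the incompatibility described above.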
--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
Re: [patch 06/11] syslets: core, documentation
* Russell King <[EMAIL PROTECTED]> wrote:

> On Tue, Feb 13, 2007 at 03:20:42PM +0100, Ingo Molnar wrote:
> > +Arguments to the system call are implemented via pointers to arguments.
> > +This not only increases the flexibility of syslet atoms (multiple syslets
> > +can share the same variable for example), but is also an optimization:
> > +copy_uatom() will only fetch syscall parameters up until the point it
> > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > +parameters (and 90% of all syscalls have 4 or less parameters).
> > +
> > + [ Note: since the argument array is at the end of the atom, and the
> > +   kernel will not touch any argument beyond the final NULL one, atoms
> > +   might be packed more tightly. (the only special case exception to
> > +   this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> > +   jump a full syslet_uatom number of bytes.) ]
>
> What if you need to increase the number of arguments passed to a
> system call later? That would be an API change since the size of
> syslet_uatom would change?

the syslet_uatom has a constant size right now, and space for a maximum of 6 arguments. /If/ the user knows that a specific atom (which for example does a sys_close()) takes only 1 argument, it could shrink the size of the atom down by 4 arguments.

[ i'd not actually recommend doing this, because it's generally a volatile thing to play such tricks - i guess i shouldn't have written that side-note in the header file :-) ]

there should be no new ABI issues: the existing syscall ABI never changes, it's only extended. New syslets can rely on new properties of new system calls. This is quite parallel to how glibc handles system calls.

> How do you propose syslet users know about these kinds of ABI issues
> (including the endian-ness of 64-bit arguments) ?

syslet users would preferably be libraries like glibc - not applications - i'm not sure the raw syslet interface should be exposed to applications.
Thus my current thinking is that syslets ought to be per-arch structures - no need to pad them out to 64 bits on 32-bit architectures - it's per-arch userspace that makes use of them anyway. system call encodings are fundamentally per-arch anyway - every arch does various fixups and has its own order of system calls.

but ... i'd not be against having a 'generic syscall layer' though, and syslets might be a good starting point for that. But that would necessitate a per-arch table translating syscall numbers into this 'generic' numbering, at minimum - or a separate sys_async_call_table[].

	Ingo
Re: [patch 06/11] syslets: core, documentation
On Tue, Feb 13, 2007 at 03:20:42PM +0100, Ingo Molnar wrote:
> +Arguments to the system call are implemented via pointers to arguments.
> +This not only increases the flexibility of syslet atoms (multiple syslets
> +can share the same variable for example), but is also an optimization:
> +copy_uatom() will only fetch syscall parameters up until the point it
> +meets the first NULL pointer. 50% of all syscalls have 2 or less
> +parameters (and 90% of all syscalls have 4 or less parameters).
> +
> + [ Note: since the argument array is at the end of the atom, and the
> +   kernel will not touch any argument beyond the final NULL one, atoms
> +   might be packed more tightly. (the only special case exception to
> +   this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> +   jump a full syslet_uatom number of bytes.) ]

What if you need to increase the number of arguments passed to a system call later? That would be an API change since the size of syslet_uatom would change?

Also, what if you have an ABI such that:

	sys_foo(int fd, long long a)

where:

	arg[0] <= fd
	arg[1] <= unused
	arg[2] <= low 32 bits of a
	arg[3] <= high 32 bits of a

it seems you need to point arg[1] to some valid but dummy variable.

How do you propose syslet users know about these kinds of ABI issues (including the endian-ness of 64-bit arguments)?

--
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
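The sys_foo(int fd, long long a) layout above can be sketched in user space: the 64-bit argument is split across two 32-bit slots (low half in arg[2], high half in arg[3], with arg[1] as alignment padding), and the receiving side glues the halves back together. Which half lands in which slot is precisely the per-arch endianness question being raised; the helper names here are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit argument across 32-bit ABI slots, following the
 * arg[0..3] layout from the message above (low word in arg[2], high
 * word in arg[3], arg[1] reserved as alignment padding). */
void split_u64(uint64_t a, uint32_t arg[4], uint32_t fd)
{
	arg[0] = fd;
	arg[1] = 0;                            /* unused padding slot */
	arg[2] = (uint32_t)(a & 0xffffffffu);  /* low 32 bits */
	arg[3] = (uint32_t)(a >> 32);          /* high 32 bits */
}

/* What the kernel-side stub would do to reassemble the value. */
uint64_t join_u64(const uint32_t arg[4])
{
	return (uint64_t)arg[2] | ((uint64_t)arg[3] << 32);
}
```

On an arch with the opposite register-pair convention the low/high assignments would swap, which is why this has to be documented per arch (or hidden inside a library) rather than left to applications.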
Re: [patch 06/11] syslets: core, documentation
On Tue, 13 Feb 2007, Davide Libenzi wrote:

> > > I can understand that chaining syscalls requires variable sharing, but
> > > the majority of the parameters passed to syscalls are just direct
> > > ones. Maybe a smart method that allows you to know if a parameter is a
> > > direct one or a pointer to one? An "unsigned int pmap" where bit N is
> > > 1 if param N is an indirection? Hmm?
> >
> > adding such things tends to slow down atom parsing.
>
> I really think it simplifies it. You simply *copy* the parameter (I'd say
> that 70+% of cases fall inside here), and if the current "pmap" bit is
> set, then you do all the indirection copy-from-userspace stuff.
> It also simplifies userspace a lot, since you can now pass arrays and
> structure pointers directly, w/out saving them in a temporary variable.

Very rough sketch below ...

---
struct syslet_uatom {
	unsigned long flags;
	unsigned int nr;
	unsigned short nparams;
	unsigned short pmap;
	long __user *ret_ptr;
	struct syslet_uatom __user *next;
	unsigned long __user args[6];
	void __user *private;
};

long copy_uatom(struct syslet_atom *atom, struct syslet_uatom __user *uatom)
{
	unsigned short i, pmap;
	unsigned long __user *arg_ptr;
	long ret = 0;

	if (!access_ok(VERIFY_WRITE, uatom, sizeof(*uatom)))
		return -EFAULT;
	ret = __get_user(atom->nr, &uatom->nr);
	ret |= __get_user(atom->nparams, &uatom->nparams);
	ret |= __get_user(pmap, &uatom->pmap);
	ret |= __get_user(atom->ret_ptr, &uatom->ret_ptr);
	ret |= __get_user(atom->flags, &uatom->flags);
	ret |= __get_user(atom->next, &uatom->next);
	if (unlikely(atom->nparams > 6))
		return -EINVAL;
	for (i = 0; i < atom->nparams; i++, pmap >>= 1) {
		ret |= __get_user(atom->args[i], &uatom->args[i]);
		if (unlikely(pmap & 1)) {
			arg_ptr = (unsigned long __user *) atom->args[i];
			if (!access_ok(VERIFY_WRITE, arg_ptr, sizeof(*arg_ptr)))
				return -EFAULT;
			ret |= __get_user(atom->args[i], arg_ptr);
		}
	}
	return ret;
}

void init_uatom(struct syslet_uatom *ua, unsigned long flags, unsigned int nr,
		long *ret_ptr, struct syslet_uatom *next, void *private,
		int nparams, ...)
{
	int i, mode;
	va_list args;

	ua->flags = flags;
	ua->nr = nr;
	ua->ret_ptr = ret_ptr;
	ua->next = next;
	ua->private = private;
	ua->nparams = nparams;
	ua->pmap = 0;
	va_start(args, nparams);
	for (i = 0; i < nparams; i++) {
		mode = va_arg(args, int);
		ua->args[i] = va_arg(args, unsigned long);
		if (mode == UASYNC_INDIR)
			ua->pmap |= 1 << i;
	}
	va_end(args);
}

#define UASYNC_IMM	0
#define UASYNC_INDIR	1

#define UAPD(a)	UASYNC_IMM, (unsigned long) (a)
#define UAPI(a)	UASYNC_INDIR, (unsigned long) (a)

void foo(void)
{
	int fd;
	long res;
	struct stat stb;
	struct syslet_uatom ua;

	init_uatom(&ua, 0, __NR_fstat, &res, NULL, NULL, 2,
		   UAPI(&fd), UAPD(&stb));
	...
}
---

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
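The direct-vs-indirect copy at the heart of the sketch can be exercised in plain user-space C. The following is only a model: names like `sim_uatom` and `sim_copy_uatom` are illustrative, and the kernel's `__get_user()`/`access_ok()` machinery is replaced by ordinary memory reads. It shows how one `pmap` bit per parameter selects between copying the value as-is and following one extra pointer:

```c
#include <stddef.h>

/* User-space model of the proposed atom layout: bit N of pmap set
 * means args[N] holds a pointer to the real argument rather than the
 * argument itself. */
struct sim_uatom {
	unsigned short nparams;
	unsigned short pmap;
	unsigned long args[6];
};

/* Mimics the copy_uatom() parameter loop: direct params are copied
 * verbatim, indirect ones are fetched through one extra dereference
 * (standing in for the kernel's __get_user()). */
static int sim_copy_uatom(unsigned long *dst, const struct sim_uatom *ua)
{
	unsigned short i;
	unsigned short pmap = ua->pmap;

	if (ua->nparams > 6)
		return -1;	/* -EINVAL in the kernel sketch */
	for (i = 0; i < ua->nparams; i++, pmap >>= 1) {
		dst[i] = ua->args[i];
		if (pmap & 1)
			dst[i] = *(unsigned long *)dst[i];
	}
	return 0;
}
```

Note how the common direct-parameter case stays a single copy, which is the simplification Davide is arguing for: the indirection cost is only paid on the bits that actually need it.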
Re: [patch 06/11] syslets: core, documentation
On Tue, 13 Feb 2007, Ingo Molnar wrote:
>
> * Davide Libenzi wrote:
>
> > > +The Syslet Atom:
> > > +----------------
> > > +
> > > +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> > > +user-space memory, which is the basic unit of execution within the syslet
> > > +framework. A syslet represents a single system-call and its arguments.
> > > +In addition it also has condition flags attached to it that allow the
> > > +construction of larger programs (syslets) from these atoms.
> > > +
> > > +Arguments to the system call are implemented via pointers to arguments.
> > > +This not only increases the flexibility of syslet atoms (multiple syslets
> > > +can share the same variable for example), but is also an optimization:
> > > +copy_uatom() will only fetch syscall parameters up until the point it
> > > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > > +parameters (and 90% of all syscalls have 4 or less parameters).
> >
> > Why do you need to have an extra memory indirection per parameter in
> > copy_uatom()? [...]
>
> yes. Try to use them in real programs, and you'll see that most of the
> time the variable an atom wants to access should also be accessed by
> other atoms. For example a socket file descriptor - one atom opens it,
> another one reads from it, a third one closes it. By having the
> parameters in the atoms we'd have to copy the fd to two other places.

Yes, of course we have to support the indirection, otherwise chaining
won't work. But ...

> > I can understand that chaining syscalls requires variable sharing, but
> > the majority of the parameters passed to syscalls are just direct
> > ones. Maybe a smart method that allows you to know if a parameter is a
> > direct one or a pointer to one? An "unsigned int pmap" where bit N is
> > 1 if param N is an indirection? Hmm?
>
> adding such things tends to slow down atom parsing.

I really think it simplifies it. You simply *copy* the parameter (I'd say
that 70+% of cases fall inside here), and if the current "pmap" bit is
set, then you do all the indirection copy-from-userspace stuff.
It also simplifies userspace a lot, since you can now pass arrays and
structure pointers directly, w/out saving them in a temporary variable.

> > Sigh, I really dislike shared userspace/kernel stuff, when we're
> > transferring pointers to userspace. Did you actually bench it against a:
> >
> >   int async_wait(struct syslet_uatom **r, int n);
> >
> > I can fully understand sharing userspace buffers with the kernel, if
> > we're talking about KB transferred during a block or net I/O DMA
> > operation, but for transferring a pointer? Behind each pointer
> > transfer (4/8 bytes) there is a whole syscall execution, [...]
>
> there are three main reasons for this choice:
>
> - firstly, by putting completion events into the user-space ringbuffer
>   the asynchronous contexts are not held up at all, and the threads are
>   available for further syslet use.
>
> - secondly, it was the most obvious and simplest solution to me - it
>   just fits well into the syslet model - which is an execution concept
>   centered around pure user-space memory and system calls, not some
>   kernel resource. Kernel fills in the ringbuffer, user-space clears it.
>   If we had to worry about a handshake between user-space and
>   kernel-space for the completion information to be passed along, that
>   would either mean extra buffering or extra overhead. Extra buffering
>   (in the kernel) would be for no good reason: why not buffer it in the
>   place where the information is destined for in the first place. The
>   ringbuffer of /pointers/ is what makes this really powerful. I never
>   really liked the AIO/etc. method of /event buffer/ rings. With
>   syslets the 'cookie' is the pointer to the syslet atom itself. It
>   doesn't get any more straightforward than that i believe.
>
> - making 'is there more stuff for me to work on' a simple instruction
>   in user-space makes it a no-brainer for user-space to promptly and
>   without thinking complete events. It's also the right thing to do on
>   SMP: if one core is solely dedicated to the asynchronous workload,
>   only running in kernel mode, and the other core is only running
>   user-space, why ever switch between protection domains? [except if
>   any of them is idle] The fastest completion signalling method is the
>   /memory bus/, not an interrupt. User-space could in theory even use
>   MWAIT (in user-space!) to wait for the other core to complete stuff.
>   That makes for a hell of a fast wakeup.

That also makes for a hell of an ugly retrieval API IMO ;) If it were
backed up by considerable performance gains, then it might be OK. But I
believe that won't be the case, and that leaves us with an ugly API.
OTOH, if no one else objects to this, it means that I'm the only weirdo
:) and the API is just fine.

- Davide
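To make the open/read/close sharing example concrete, here is a user-space sketch built on cut-down copies of the atom structure, the atom-initialization helper, and the UAPI/UAPD macros from Davide's sketch earlier in the thread. Everything is illustrative, not a final ABI: the helper is spelled `init_uatom` here, the `private` field is renamed `private_data`, and the syscall numbers are the i386 ones used only as placeholders. One `long fd` variable is written by the open atom's `ret_ptr` and read indirectly, via `UAPI()`, by the two later atoms, so nothing is copied between atoms:

```c
#include <stdarg.h>
#include <stddef.h>

/* Cut-down user-space copy of the atom from the sketch above. */
struct syslet_uatom {
	unsigned long flags;
	unsigned int nr;
	unsigned short nparams;
	unsigned short pmap;
	long *ret_ptr;
	struct syslet_uatom *next;
	unsigned long args[6];
	void *private_data;
};

#define UASYNC_IMM	0
#define UASYNC_INDIR	1
#define UAPD(a)	UASYNC_IMM,   (unsigned long)(a)
#define UAPI(a)	UASYNC_INDIR, (unsigned long)(a)

static void init_uatom(struct syslet_uatom *ua, unsigned long flags,
		       unsigned int nr, long *ret_ptr,
		       struct syslet_uatom *next, void *priv,
		       int nparams, ...)
{
	int i, mode;
	va_list ap;

	ua->flags = flags; ua->nr = nr; ua->ret_ptr = ret_ptr;
	ua->next = next; ua->private_data = priv;
	ua->nparams = nparams; ua->pmap = 0;
	va_start(ap, nparams);
	for (i = 0; i < nparams; i++) {
		mode = va_arg(ap, int);
		ua->args[i] = va_arg(ap, unsigned long);
		if (mode == UASYNC_INDIR)
			ua->pmap |= 1 << i;
	}
	va_end(ap);
}

/* Build an open -> read -> close chain: open writes its result into
 * *fd via ret_ptr; read and close fetch it indirectly via UAPI(fd). */
static void build_chain(struct syslet_uatom a[3], long *fd,
			const char *path, char *buf, long count)
{
	init_uatom(&a[0], 0, 5 /* __NR_open, i386 */, fd, &a[1], NULL,
		   2, UAPD(path), UAPD(0));
	init_uatom(&a[1], 0, 3 /* __NR_read, i386 */, NULL, &a[2], NULL,
		   3, UAPI(fd), UAPD(buf), UAPD(count));
	init_uatom(&a[2], 0, 6 /* __NR_close, i386 */, NULL, NULL, NULL,
		   1, UAPI(fd));
}
```

Only the fd parameter gets its `pmap` bit set; the buffer pointer and count stay direct, which is the 70+% common case Davide mentions.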
Re: [patch 06/11] syslets: core, documentation
* Davide Libenzi wrote:

> > +The Syslet Atom:
> > +----------------
> > +
> > +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> > +user-space memory, which is the basic unit of execution within the syslet
> > +framework. A syslet represents a single system-call and its arguments.
> > +In addition it also has condition flags attached to it that allow the
> > +construction of larger programs (syslets) from these atoms.
> > +
> > +Arguments to the system call are implemented via pointers to arguments.
> > +This not only increases the flexibility of syslet atoms (multiple syslets
> > +can share the same variable for example), but is also an optimization:
> > +copy_uatom() will only fetch syscall parameters up until the point it
> > +meets the first NULL pointer. 50% of all syscalls have 2 or less
> > +parameters (and 90% of all syscalls have 4 or less parameters).
>
> Why do you need to have an extra memory indirection per parameter in
> copy_uatom()? [...]

yes. Try to use them in real programs, and you'll see that most of the
time the variable an atom wants to access should also be accessed by
other atoms. For example a socket file descriptor - one atom opens it,
another one reads from it, a third one closes it. By having the
parameters in the atoms we'd have to copy the fd to two other places.

but i see your point: i actually had it like that in my earlier
versions, only changed it to an indirect method later on, when writing
more complex syslets. And, surprisingly, performance of atom handling
/improved/ on both Intel and AMD CPUs when i added indirection, because
the indirection enables the 'tail NULL' optimization. (which wasn't the
goal of indirection, it was just a side-effect)

> [...] It also forces you to have parameters pointed-to, to be "long"
> (or pointers), instead of their natural POSIX type (like fd being
> "int" for example). [...]

this wasn't a big problem while coding syslets. I'd also not expect
application writers to have to do these things on the syscall level -
this is a system interface after all. But you do have a point.

> I can understand that chaining syscalls requires variable sharing, but
> the majority of the parameters passed to syscalls are just direct
> ones. Maybe a smart method that allows you to know if a parameter is a
> direct one or a pointer to one? An "unsigned int pmap" where bit N is
> 1 if param N is an indirection? Hmm?

adding such things tends to slow down atom parsing.

there's another reason as well: i wanted syslets to be like
'instructions' - i.e. not self-modifying. If the fd parameter is
embedded in the syslet then every syslet has to be replicated.

note that chaining does not necessarily require variable sharing: a
sys_umem_add() atom could be used to modify the next syslet's ->fd
parameter. So for example:

  sys_open()               -> returns 'fd'
  sys_umem_add(&atom1->fd) <= atom1->fd is 0 initially
  sys_umem_add(&atom2->fd) <= the first umem_add returns the value
  atom1 [uses fd]
  atom2 [uses fd]

but i didn't like this approach: this means 1 more atom per indirect
parameter, and quite some trickery to put the right information into
the right place. Furthermore, this makes syslets very much tied to the
'register contents' - instead of them being 'pure instructions/code'.

> > +Completion of asynchronous syslets:
> > +-----------------------------------
> > +
> > +Completion of asynchronous syslets is done via the 'completion ring',
> > +which is a ringbuffer of syslet atom pointers in user-space memory,
> > +provided by user-space in the sys_async_register() syscall. The
> > +kernel fills in the ringbuffer starting at index 0, and user-space
> > +must clear out these pointers. Once the kernel reaches the end of
> > +the ring it wraps back to index 0. The kernel will not overwrite
> > +non-NULL pointers (but will return an error), user-space has to
> > +make sure it completes all events it asked for.
>
> Sigh, I really dislike shared userspace/kernel stuff, when we're
> transferring pointers to userspace. Did you actually bench it against a:
>
>   int async_wait(struct syslet_uatom **r, int n);
>
> I can fully understand sharing userspace buffers with the kernel, if
> we're talking about KB transferred during a block or net I/O DMA
> operation, but for transferring a pointer? Behind each pointer
> transfer (4/8 bytes) there is a whole syscall execution, [...]

there are three main reasons for this choice:

- firstly, by putting completion events into the user-space ringbuffer
  the asynchronous contexts are not held up at all, and the threads are
  available for further syslet use.

- secondly, it was the most obvious and simplest solution to me - it
  just fits well into the syslet model - which is an execution concept
  centered around pure user-space memory and system calls, not some
  kernel resource. Kernel fills in the ringbuffer, user-space clears it.
  If we had to worry about a handshake between user-space and
  kernel-space for the completion information to be passed along, that
  would either mean extra buffering or extra overhead.
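The ring discipline described in the quoted documentation (the kernel stores completed-atom pointers starting at index 0, wraps at the end, and never overwrites a non-NULL slot; user-space must NULL out what it consumes) can be modelled by a small user-space consumer. This is only a sketch under those assumptions - a real consumer would need proper memory barriers and the actual ring size registered via sys_async_register():

```c
#include <stddef.h>

#define RING_SIZE 8	/* illustrative; the real size is set at registration */

/* Shared ring of completed-atom pointers: a producer (the kernel in
 * the real design) stores pointers starting at index 0 and wraps; the
 * consumer hands each slot back by storing NULL into it. */
struct sim_ring {
	void *slots[RING_SIZE];
	unsigned int head;	/* consumer's private cursor */
};

/* Drain up to max currently completed entries into out[]; returns how
 * many were consumed. The 'cookie' is the atom pointer itself, as Ingo
 * describes. Stops at the first NULL slot, i.e. no pending completion. */
static unsigned int ring_drain(struct sim_ring *r, void **out,
			       unsigned int max)
{
	unsigned int n = 0;
	void *p;

	while (n < max && (p = r->slots[r->head]) != NULL) {
		r->slots[r->head] = NULL;	/* give the slot back */
		r->head = (r->head + 1) % RING_SIZE;
		out[n++] = p;
	}
	return n;
}
```

The "is there more stuff for me to work on" check Ingo mentions is exactly the `r->slots[r->head] != NULL` load: a single user-space memory read, with no syscall on the retrieval path.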
Re: [patch 06/11] syslets: core, documentation
Wow! You really helped Zach out ;)

On Tue, 13 Feb 2007, Ingo Molnar wrote:

> +The Syslet Atom:
> +----------------
> +
> +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> +user-space memory, which is the basic unit of execution within the syslet
> +framework. A syslet represents a single system-call and its arguments.
> +In addition it also has condition flags attached to it that allow the
> +construction of larger programs (syslets) from these atoms.
> +
> +Arguments to the system call are implemented via pointers to arguments.
> +This not only increases the flexibility of syslet atoms (multiple syslets
> +can share the same variable for example), but is also an optimization:
> +copy_uatom() will only fetch syscall parameters up until the point it
> +meets the first NULL pointer. 50% of all syscalls have 2 or less
> +parameters (and 90% of all syscalls have 4 or less parameters).

Why do you need to have an extra memory indirection per parameter in
copy_uatom()? It also forces you to have parameters pointed-to, to be
"long" (or pointers), instead of their natural POSIX type (like fd being
"int" for example). Also, you need to have array pointers (think about a
"char buf[];" passed to an async read(2)) to be saved into a pointer
variable, and pass the pointer of the latter to the async system. Same
for all structures (ie. stat(2) "struct stat"). Let them be real
arguments and add an nparams argument to the structure:

struct syslet_atom {
	unsigned long flags;
	unsigned int nr;
	unsigned int nparams;
	long __user *ret_ptr;
	struct syslet_uatom __user *next;
	unsigned long args[6];
};

I can understand that chaining syscalls requires variable sharing, but
the majority of the parameters passed to syscalls are just direct ones.
Maybe a smart method that allows you to know if a parameter is a direct
one or a pointer to one? An "unsigned int pmap" where bit N is 1 if
param N is an indirection? Hmm?

> +Running Syslets:
> +----------------
> +
> +Syslets can be run via the sys_async_exec() system call, which takes
> +the first atom of the syslet as an argument. The kernel does not need
> +to be told about the other atoms - it will fetch them on the fly as
> +execution goes forward.
> +
> +A syslet might either be executed 'cached', or it might generate a
> +'cachemiss'.
> +
> +'Cached' syslet execution means that the whole syslet was executed
> +without blocking. The system-call returns the submitted atom's address
> +in this case.
> +
> +If a syslet blocks while the kernel executes a system-call embedded in
> +one of its atoms, the kernel will keep working on that syscall in
> +parallel, but it immediately returns to user-space with a NULL pointer,
> +so the submitting task can submit other syslets.
> +
> +Completion of asynchronous syslets:
> +-----------------------------------
> +
> +Completion of asynchronous syslets is done via the 'completion ring',
> +which is a ringbuffer of syslet atom pointers in user-space memory,
> +provided by user-space in the sys_async_register() syscall. The
> +kernel fills in the ringbuffer starting at index 0, and user-space
> +must clear out these pointers. Once the kernel reaches the end of
> +the ring it wraps back to index 0. The kernel will not overwrite
> +non-NULL pointers (but will return an error), user-space has to
> +make sure it completes all events it asked for.

Sigh, I really dislike shared userspace/kernel stuff, when we're
transferring pointers to userspace. Did you actually bench it against a:

  int async_wait(struct syslet_uatom **r, int n);

I can fully understand sharing userspace buffers with the kernel, if
we're talking about KB transferred during a block or net I/O DMA
operation, but for transferring a pointer? Behind each pointer transfer
(4/8 bytes) there is a whole syscall execution, which makes the 4/8 byte
transfers have a relative cost of 0.01% *maybe*. A different case is an
O_DIRECT read of 16KB of data, where the memory transfer has a relative
cost, compared to the syscall, that can be pretty high. The
syscall-saving argument is moot too, because syscalls are cheap, and if
there's a lot of async traffic, you'll be fetching lots of completions
per call, keeping your dispatch loop pretty busy for a while. And the
API is *certainly* cleaner.

- Davide
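For comparison, the retrieval call Davide proposes can be mocked in user space to show the dispatch-loop shape it implies: the caller hands in an array, and the call fills it with up to n completed-atom pointers and returns the count. Everything below is hypothetical - the static `queue` stands in for kernel-internal completion state, and a real async_wait() would presumably block when nothing is pending rather than return 0:

```c
#include <stddef.h>

struct syslet_uatom;	/* opaque here; only pointers are moved around */

#define QDEPTH 4	/* illustrative mock-queue depth */

/* Mock kernel-side completion queue. */
static struct syslet_uatom *queue[QDEPTH];
static int qlen;

/* Mock of the proposed async_wait(): copy up to n completion cookies
 * into r[] and return how many were delivered. */
static int async_wait(struct syslet_uatom **r, int n)
{
	int i, got = qlen < n ? qlen : n;

	for (i = 0; i < got; i++)
		r[i] = queue[i];
	/* shift the remainder down (a real kernel would just dequeue) */
	for (i = got; i < qlen; i++)
		queue[i - got] = queue[i];
	qlen -= got;
	return got;
}
```

The batching is the point of Davide's cost argument: one syscall can return many completions, so the per-pointer syscall overhead he worries about in the other direction never exceeds one call per batch.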