OK, having skimmed through Ingo's code once now, I can already see I
have some crow to eat.  But I still have some marginally less stupid
questions.

Cachemiss threads are created with CLONE_VM | CLONE_FS | CLONE_FILES |
CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM.  Does that mean they
share thread-local storage with the userspace thread, have
thread-local storage of their own, or have no thread-local storage
until NPTL asks for it?
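
Just so the question is concrete, here is the flag set again as a
trivial snippet (userspace, glibc assumed); what strikes me is that
CLONE_SETTLS is not among the flags, which is what makes me wonder
whether a cachemiss thread simply inherits its creator's thread
pointer:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* The flag set quoted above.  CLONE_SETTLS is not in it, so -- if I
 * understand clone() semantics -- the new thread keeps the creator's
 * thread pointer and therefore sees the creator's TLS rather than
 * getting TLS of its own. */
#define CACHEMISS_CLONE_FLAGS \
        (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
         CLONE_THREAD | CLONE_SYSVSEM)

int main(void)
{
        printf("CLONE_SETTLS requested: %s\n",
               (CACHEMISS_CLONE_FLAGS & CLONE_SETTLS) ? "yes" : "no");
        return 0;
}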

When the kernel zeroes the userspace stack pointer in
cachemiss_thread(), presumably the allocation of a new userspace stack
page is postponed until that thread needs to resume userspace
execution (after completion of the first I/O that missed cache).  When
do you copy the contents of the threadlet function's stack frame into
this new stack page?
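
To make sure I am asking about the right thing, the hand-off I am
picturing looks roughly like the userspace sketch below; hand_off_frame(),
frame_base and frame_size are all made up, and whether anything like
this copy actually happens -- and when -- is exactly my question.

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#define STACK_SIZE 4096

/* Hypothetical: give the resuming thread a fresh stack page, copy the
 * threadlet's live frame to the same offset from the top, and hand
 * back the new stack pointer to resume on. */
static void *hand_off_frame(const void *frame_base, size_t frame_size)
{
        if (frame_size > STACK_SIZE)
                return NULL;

        void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (stack == MAP_FAILED)
                return NULL;

        memcpy((char *)stack + STACK_SIZE - frame_size, frame_base,
               frame_size);
        return (char *)stack + STACK_SIZE - frame_size;
}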

Is there anything in a struct pt_regs that is expensive to restore
(perhaps because it flushes a pipeline or cache that wasn't already
flushed on syscall entry)?  Is there any reason why the FPU context
has to differ among threadlets that have blocked while executing the
same userspace function with different stacks?  If the TLS pointer
isn't in either of these, where is it, and why doesn't
move_user_context() swap it?
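
For what it's worth, my mental model of "the TLS pointer" -- x86-64
and glibc assumed -- is the %fs segment base that arch_prctl()
manages, i.e. per-thread descriptor/MSR state rather than anything in
the general-purpose register image:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>          /* ARCH_GET_FS */

static __thread int tls_var;

int main(void)
{
        unsigned long fsbase = 0;

        /* The thread pointer that __thread variables are addressed
         * against is the %fs base, read here via arch_prctl() rather
         * than from any register saved in struct pt_regs. */
        syscall(SYS_arch_prctl, ARCH_GET_FS, &fsbase);
        printf("thread pointer (%%fs base): %#lx\n", fsbase);
        printf("&tls_var:                   %p\n", (void *)&tls_var);
        return 0;
}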

If you set out to cancel one of these threadlets, how are you going to
ensure that it isn't holding any locks?  Is there any reasonable way
to implement a userland finally { } block so that you can release
malloc'd memory and clean up application data structures?
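
The closest existing idiom I know of is the pthread cleanup handler;
if threadlet cancellation ends up looking anything like pthread
cancellation, presumably an equivalent of this would be needed:

#include <pthread.h>
#include <stdlib.h>

/* Poor man's finally {}: the handler runs whether the function returns
 * normally (pop with a nonzero argument) or is cancelled while blocked. */
static void release_buffer(void *p)
{
        free(p);
}

static void *worker(void *arg)
{
        char *buf = malloc(4096);

        (void)arg;
        pthread_cleanup_push(release_buffer, buf);
        /* ... blocking I/O that might get cancelled goes here ... */
        pthread_cleanup_pop(1);         /* 1: run the handler now as well */
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        return 0;
}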

If you want to migrate a threadlet to another CPU on syscall entry
and/or exit, what has to travel other than the userspace stack and the
struct pt_regs?  (I am assuming a quiesced FPU and thread(s) at the
destination with compatible FPU flags.)  Does it make sense for the
userspace stack page to have space reserved for a struct pt_regs
before the threadlet stack frame, so that the entire userspace
threadlet state migrates as one page?
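
Something like the following is the layout I have in mind; it is
entirely hypothetical, with x86 assumed for the userspace-visible
struct pt_regs:

#include <asm/ptrace.h>         /* userspace-visible struct pt_regs */

#define THREADLET_PAGE_SIZE 4096

/* Hypothetical: keep the saved register image at the bottom of the
 * threadlet's stack page, so that the page by itself is the complete
 * migratable user-side state. */
struct threadlet_page {
        struct pt_regs regs;
        unsigned char stack[THREADLET_PAGE_SIZE - sizeof(struct pt_regs)];
};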

I now see that an effort is already made to schedule threadlets in
bursts, grouped by PID, when several have unblocked since the last
timeslice.  What is the transition cost from one threadlet to another?
Can that transition cost be made lower by reducing the amount of
state that belongs to the individual threadlet vs. the pool of
cachemiss threads associated with that threadlet entrypoint?
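
Put differently: how small can the truly per-threadlet half be made?
Hypothetically, and with made-up names:

#include <asm/ptrace.h>

/* Hypothetical split, only to phrase the question: the smaller the
 * per-threadlet half, the cheaper a threadlet-to-threadlet switch
 * within one burst ought to be. */
struct threadlet_state {                /* one per blocked threadlet */
        struct pt_regs regs;            /* saved user registers */
        void *user_stack;               /* its private stack page(s) */
};

struct threadlet_pool_state {           /* one per entrypoint/pool */
        long (*entrypoint)(void *);     /* the threadlet function */
        /* FPU flags, signal handlers, fd table, ... -- everything the
         * CLONE_* flags above already make common to the group */
};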

Generally, is there a "contract" that could be made between the
threadlet application programmer and the implementation which would
allow, perhaps in future hardware, the kind of invisible pipelined
coprocessing for AIO that has been so successful for FP?

I apologize for having adopted a hostile tone in a couple of previous
messages in this thread; remind me in the future not to alternate
between thinking about code and about the FSF.  :-)  I do really like
a lot of things about the threadlet model, and would rather not see it
given up on for network I/O and NUMA systems.  So I'm going to
reiterate -- more politely this time -- the need for a
data-structure-centric threadlet pool abstraction that supports
request throttling, reprioritization, bulk cancellation, and migration
of individual threadlets to the node nearest the relevant I/O port.
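
To make that request concrete, an interface shaped roughly like this
would cover all four; every name below is made up, and this is a wish
list rather than a proposal for the actual syscall surface:

/* Entirely hypothetical userspace API, only to pin down what I mean
 * by a data-structure-centric threadlet pool. */
struct threadlet_pool;

struct threadlet_pool *pool_create(long (*entry)(void *),
                                   unsigned int max_outstanding);   /* throttling */
int pool_submit(struct threadlet_pool *p, void *request, int priority);
int pool_reprioritize(struct threadlet_pool *p, void *request, int priority);
int pool_cancel_all(struct threadlet_pool *p);                /* bulk cancellation */
int pool_bind_node(struct threadlet_pool *p, int numa_node);  /* NUMA migration */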

I'm still not sold on syslets as anything userspace-visible, but I
could imagine them enabling a sort of functional syntax for chaining
I/O operations, with most failures handled as inline "Not-a-Pointer"
values or as "AEIOU" (asynchronously executed I/O unit?) exceptions
instead of syscall-test-branch-syscall-test-branch.  Actually working
out the semantics and getting them adopted as an IEEE standard could
even win someone a Turing award.  :-)
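
To spell out the contrast I mean, with everything carrying an aio_
prefix below being imaginary:

#include <fcntl.h>
#include <unistd.h>

/* Today: syscall, test, branch -- at every single step. */
static int slurp_today(const char *path, char *buf, size_t len)
{
        int fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;
        ssize_t n = read(fd, buf, len);
        close(fd);
        return n < 0 ? -1 : 0;
}

/* Imagined: every step takes and returns a value that may be
 * "Not-a-Pointer"; a failure anywhere poisons the rest of the chain
 * and gets tested exactly once at the end. */
typedef struct aio_value *aio_t;
aio_t aio_open(const char *path);
aio_t aio_read(aio_t file, char *buf, size_t len);
int aio_failed(aio_t v);

static int slurp_chained(const char *path, char *buf, size_t len)
{
        aio_t r = aio_read(aio_open(path), buf, len);
        return aio_failed(r) ? -1 : 0;
}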

Cheers,
- Michael