OK, having skimmed through Ingo's code once now, I can already see I have some crow to eat. But I still have some marginally less stupid questions.
Cachemiss threads are created with CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM. Does that mean they share thread-local storage with the userspace thread, have thread-local storage of their own, or have no thread-local storage until NPTL asks for it? (I've written the flag mask out at the end of this mail.)

When the kernel zeroes the userspace stack pointer in cachemiss_thread(), presumably the allocation of a new userspace stack page is postponed until that thread needs to resume userspace execution (after completion of the first I/O that missed cache). When do you copy the contents of the threadlet function's stack frame into this new stack page?

Is there anything in a struct pt_regs that is expensive to restore (perhaps because it flushes a pipeline or cache that wasn't already flushed on syscall entry)? Is there any reason why the FPU context has to differ among threadlets that have blocked while executing the same userspace function with different stacks? If the TLS pointer isn't in either of these, where is it, and why doesn't move_user_context() swap it?

If you set out to cancel one of these threadlets, how are you going to ensure that it isn't holding any locks? Is there any reasonable way to implement a userland finally { } block so that you can release malloc'd memory and clean up application data structures? (There's a sketch of what I mean below.)

If you want to migrate a threadlet to another CPU on syscall entry and/or exit, what has to travel other than the userspace stack and the struct pt_regs? (I am assuming a quiesced FPU and thread(s) at the destination with compatible FPU flags.) Does it make sense for the userspace stack page to have space reserved for a struct pt_regs before the threadlet stack frame, so that the entire userspace threadlet state migrates as one page? (Layout sketched below.)

I now see that an effort is already made to schedule threadlets in bursts, grouped by PID, when several have unblocked since the last timeslice. What is the transition cost from one threadlet to another? Can that transition cost be made lower by reducing the amount of state that belongs to the individual threadlet vs. the pool of cachemiss threads associated with that threadlet entrypoint?

Generally, is there a "contract" that could be made between the threadlet application programmer and the implementation which would allow, perhaps in future hardware, the kind of invisible pipelined coprocessing for AIO that has been so successful for FP?

I apologize for having adopted a hostile tone in a couple of previous messages in this thread; remind me in the future not to alternate between thinking about code and about the FSF. :-) I do really like a lot of things about the threadlet model, and would rather not see it given up on for network I/O and NUMA systems. So I'm going to reiterate -- more politely this time -- the need for a data-structure-centric threadlet pool abstraction that supports request throttling, reprioritization, bulk cancellation, and migration of individual threadlets to the node nearest the relevant I/O port. (A strawman of such an interface is sketched below.)

I'm still not sold on syslets as anything userspace-visible, but I could imagine them enabling a sort of functional syntax for chaining I/O operations, with most failures handled as inline "Not-a-Pointer" values or as "AEIOU" (asynchronously executed I/O unit?) exceptions instead of syscall-test-branch-syscall-test-branch. (Sketch of that below, too.) Actually working out the semantics and getting them adopted as an IEEE standard could even win someone a Turing award. :-)
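For reference, here is the clone mask I'm talking about, written out as userspace would see it. This is only my reading of the cachemiss thread creation path, not a quote of the patch; what stands out to me is that CLONE_SETTLS is not in the set, which is what makes me wonder where the TLS pointer comes from.

#define _GNU_SOURCE
#include <sched.h>

/* my reading of the cachemiss clone mask; note the absence of CLONE_SETTLS,
 * which is what prompts the TLS question above */
#define CACHEMISS_CLONE_FLAGS \
        (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | \
         CLONE_THREAD | CLONE_SYSVSEM)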
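On the finally-block question, this is roughly the contract I'd like to be able to write as an application programmer. It's a minimal sketch with a hypothetical do_request() standing in for the real work, and it shows the hole: the cleanup runs on every exit path the wrapper controls, but not on a forced cancellation.

#include <stdlib.h>

struct request_ctx {
        void *buf;                      /* plus whatever application state */
};

/* hypothetical stand-in for the blocking body of the request */
static long do_request(void *arg, void *buf)
{
        (void)arg; (void)buf;
        return 0;
}

static long my_threadlet_fn(void *arg)
{
        struct request_ctx ctx = { .buf = malloc(65536) };
        long ret = -1;

        if (!ctx.buf)
                goto finally;

        /* blocking syscalls in here may complete in a cachemiss thread */
        ret = do_request(arg, ctx.buf);

finally:
        /* the "finally" block: runs on every path this wrapper controls,
         * but not if the threadlet is cancelled out from under it */
        free(ctx.buf);
        return ret;
}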
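On the one-page-migration idea, the layout I have in mind is nothing more than this; pure speculation on my part, not something in the patches.

#include <linux/ptrace.h>       /* struct pt_regs */
#include <asm/page.h>           /* PAGE_SIZE      */

struct threadlet_user_page {
        struct pt_regs regs;    /* saved user registers for this threadlet */
        unsigned char  stack[PAGE_SIZE - sizeof(struct pt_regs)];
                                /* threadlet frame grows down from the top
                                 * of this array toward regs               */
};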
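As for the pool abstraction, here is a strawman of the interface shape I'm after. Every name is hypothetical; I'm not proposing a syscall surface, just the set of operations the contract ought to cover (throttling, reprioritization, bulk cancellation, NUMA migration).

/* all names hypothetical -- a sketch of the desired operations only */
struct threadlet_pool;

struct threadlet_pool *threadlet_pool_create(unsigned int max_in_flight,
                                             int numa_node);
int  threadlet_pool_submit(struct threadlet_pool *p,
                           long (*fn)(void *), void *arg, int prio);
int  threadlet_pool_reprioritize(struct threadlet_pool *p,
                                 void *arg, int new_prio);
int  threadlet_pool_cancel_all(struct threadlet_pool *p);
int  threadlet_pool_migrate(struct threadlet_pool *p, int numa_node);
void threadlet_pool_destroy(struct threadlet_pool *p);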
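And the chaining idea: something like the kernel's ERR_PTR()/IS_ERR() convention exported to userspace, so that "Not-a-Pointer" error values flow through a chain of stages instead of a test and a branch after every syscall. A sketch only; the stages themselves are left abstract.

#include <stddef.h>

/* userspace stand-ins for the kernel's ERR_PTR()/IS_ERR() idiom */
static inline void *nap(long negative_errno)    /* "Not-a-Pointer" */
{
        return (void *)negative_errno;
}

static inline int is_nap(const void *p)
{
        return (unsigned long)p >= (unsigned long)-4095L;
}

typedef void *(*io_stage_t)(void *);

/* thread a value through the stages; error values short-circuit the chain */
static void *run_chain(void *seed, const io_stage_t *stages, size_t n)
{
        void *v = seed;

        for (size_t i = 0; i < n && !is_nap(v); i++)
                v = stages[i](v);
        return v;
}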
Cheers,
- Michael