On Mon, Nov 4, 2013 at 8:48 AM, Ben Kloosterman <[email protected]> wrote:

> On Mon, Nov 4, 2013 at 11:56 PM, Jonathan S. Shapiro <[email protected]>wrote:
>
>> I really want to see that paper, because it really sounds like their
>> implementation was no good. Here are the things I need to know:
>>
>
> http://courses.cs.vt.edu/cs5204/fall05-gback/papers/capriccio-sosp-2003.pdf
>

Thanks. Now that you point me at it, I remember the paper from when it
first came out. They weren't trying to build a general threading package.


>> What was their stack chunk size?
>>
>
> "During this test, most functions could
> be executed entirely within the initial 4 KB chunk; when
> necessary, though, threads linked a 16 KB chunk in order
> to call a function that has an 8 KB buffer on its stack"
>

Yeah, that's sort of what I suspected they did. I think it's generally
agreed that 4K is too small for a thread stack. 16K is the smallest I've
heard used outside of embedded applications, where there is a known bound on
stack depth. I think Coyotos runs off of a 4K kernel stack, but that's a
very special case, and we went to some lengths to limit the stack depth -
including rejecting the use of recursion as a coding standard.

In the case of Capriccio, though, the 4K choice is more plausible. They
were trying to design a thread package for an event-driven system, and it's
a reasonable hypothesis that such a system, properly written, wouldn't need
deep stacks. It's also a reasonable hypothesis that most threads are idle
most of the time, and that you can usefully reuse stack segments in this
sort of design.
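The linked-chunk scheme they describe can be sketched roughly like this (a hypothetical model with names and structure of my own, not Capriccio's actual code; the 4 KB and 16 KB sizes are the ones quoted from the paper):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical linked stack chunk: a thread starts in a small chunk
 * and links a larger one only when a frame will not fit. */
struct chunk {
    struct chunk *prev;   /* chunk we return to when this one unlinks */
    size_t        size;   /* usable bytes in this chunk */
    size_t        used;   /* bytes consumed by live frames */
    uint8_t       mem[];  /* the stack memory itself */
};

static struct chunk *chunk_new(struct chunk *prev, size_t size) {
    struct chunk *c = malloc(sizeof *c + size);
    c->prev = prev;
    c->size = size;
    c->used = 0;
    return c;
}

/* Called at a compiler-inserted checkpoint: ensure 'need' bytes of
 * stack are available, linking a 16 KB chunk when the current one
 * cannot hold the next frame. */
static struct chunk *ensure_stack(struct chunk *cur, size_t need) {
    if (cur->size - cur->used >= need)
        return cur;                      /* common case: frame fits */
    size_t sz = need > 16384 ? need : 16384;
    return chunk_new(cur, sz);           /* link a bigger chunk */
}
```

A thread would begin with `chunk_new(NULL, 4096)` and call `ensure_stack` at each checkpoint; the interesting engineering is in making the common case branch-free, which this sketch does not attempt.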


>> Why did they do the checkpoints this way?
>>
>> If you are willing to use a guard page, the checkpoint can be done in
>> *zero* marginal instructions. You have to zero the stack frame in any
>> case, so you simply make the first instruction of the procedure zero out
>> the deepest point of the call frame that you will need. If that puts you in
>> the guard page, you get a page fault and you deal with it.
>>
>
> "Our analysis achieves this goal without the use of guard pages, which
> would contribute unnecessary kernel crossings and virtual memory waste."
>
> It's not 100% convincing, but they were running hundreds of threads.
>

For most use cases, if you need to worry about one marginal page per
thread, you are kidding yourself about the capabilities of your hardware
(again excepting some embedded special cases). An event dispatch scheme is
different, so I think their *real* argument is the one they give in section
1.2, where they assert that the per-thread stack size is a significant
factor that limits the number of simultaneous threads you can execute in a
32-bit address space. That assertion is correct, but it is mainly so
because of limitations in the scheduling dispatch mechanisms of the
underlying OS (no scheduler activations).
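To make the zero-marginal-instruction checkpoint concrete, here is a toy model (the names and the offset-based layout are mine, not from any real implementation) of the decision the prologue's first store effectively makes when it zeroes the deepest byte of the new frame:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the guard-page checkpoint: the first store of a procedure
 * targets the deepest byte of its new frame. This predicate answers
 * whether that store lands in the guard page, i.e. whether the
 * prologue would fault and enter the stack-growth handler.
 *
 * 'sp' and 'limit' are byte offsets from the stack base; the stack
 * grows downward toward 'limit', below which lies the guard page. */
static bool probe_faults(uintptr_t sp, size_t frame_size, uintptr_t limit) {
    uintptr_t deepest = sp - frame_size;  /* first byte the frame zeroes */
    return deepest < limit;               /* store landed in the guard page */
}
```

In a real implementation no such predicate exists in the instruction stream: the store itself is the probe, and the MMU evaluates this condition for free, which is the whole point of the scheme.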

It's also the case in event-driven systems that a given thread runs the
same code over and over again. They don't appear to have implemented any
adaptive learning strategy for stack size. It's interesting that they don't
show any demographics on stack size or stack growth faults. I would have
thought that Jeremy would have tried to collect that.
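An adaptive strategy of the sort I have in mind could be as simple as tracking a per-handler high-water mark and sizing the next stack from it (a hypothetical sketch, not anything Capriccio implemented):

```c
#include <stddef.h>

#define PAGE      4096
#define NHANDLERS 64

/* Hypothetical adaptive sizing for an event-driven system: remember
 * the deepest stack each event handler has ever needed, and size its
 * next stack from that high-water mark plus one page of slack. */
static size_t high_water[NHANDLERS];

/* Called when a handler's thread retires, with its observed depth. */
static void record_depth(int handler, size_t depth) {
    if (depth > high_water[handler])
        high_water[handler] = depth;
}

/* Stack size to allocate the next time this handler runs:
 * high-water mark plus a slack page, rounded up to a page. */
static size_t next_stack_size(int handler) {
    size_t want = high_water[handler] + PAGE;
    return (want + PAGE - 1) & ~(size_t)(PAGE - 1);
}
```

Since the same handlers run over and over, the learned sizes would converge quickly, and the stack-growth path would become rare rather than merely cheap.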

> If you're going to go with a stack guard page, you can just pause and
> expand the stack; it should not happen that often... Yes, you can just
> attach a new segment, but swapping in a whole new stack is also attractive,
> as copies are pretty damn fast and you don't need to calculate segments,
> and locality can be better (though it does require precise collection).
> That is what Go is going with now, though their previous implementation had
> checks in every function (as per the LLVM default)... In the case of the
> paper they were using C...
>
>

You may not be able to relocate the stack if there is a C call frame on it,
because people do all kinds of stupid stuff in C. When you *can* copy the
stack, it is fortunate that stacks are allocated as page-aligned large
objects, so you don't actually copy them - you remap them.
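A toy illustration of why C frames pin the stack: C code freely takes raw addresses of locals, and after a copy those interior pointers still reference the old memory; without precise type information there is no way to find and fix them (a hypothetical model, not any runtime's actual code):

```c
#include <string.h>

/* Model a saved C frame that contains a pointer into itself, the
 * kind of thing C code does routinely (e.g. passing &local around). */
struct frame {
    int  value;
    int *p;       /* C code stashed &value somewhere */
};

/* Copy a one-frame "stack" elsewhere and report whether the interior
 * pointer followed it. A precise collector could rewrite 'p'; plain
 * memcpy cannot, so the pointer keeps aiming at the old frame. */
static int pointer_survives_copy(void) {
    struct frame old;
    old.value = 42;
    old.p = &old.value;               /* interior pointer into the stack */

    struct frame copy;
    memcpy(&copy, &old, sizeof old);  /* "relocate" the stack */

    return copy.p == &copy.value;     /* 0: still points at old frame */
}
```

This is exactly why Go's contiguous-stack design requires precise stack maps: the runtime must be able to enumerate and rewrite every pointer into the stack when it moves.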


> Singularity, which used a similar technique to this paper, also records
> that segmented stacks have some cost, since they mention possible HW
> assistance.
>
Singularity made a lot of statements about performance that aren't
credible. Their claim that software and hardware protection have comparable
costs is based on a "conventional OS" implementation that was
*horribly* inefficient.
Given this, you should treat any appeal to hardware support as unproven,
because they haven't tested the notion of operating system support
adequately. I'm disgusted by the lack of integrity in their published
claims, and appalled that the referees let them get away with it.
Concerns about scientific integrity aside, it's just plain unfortunate,
because they claimed a lot of things that it would be useful to actually
*know*, but we have no idea which parts of their claims are scientifically
valid.


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
