(Warning: almost none of the code in this post is tested.)

In C, I miss the kind of higher-order programming I can do in
languages like Ruby and Python, but in its usual form, that kind of
programming seems to require garbage collection (for closures), and
that’s not really compatible with running on a microcontroller.  But
I've thought up some relatively simple enhancements that could
substantially increase the power of C-like languages for
microcontroller-like targets.

I was thinking about the problems of stack overflow that I was
[complaining about on the Arduino back in January][0], and it turns
out John Regehr and the TinyOS folks have done some [interesting work
on bounding stack sizes through static analysis][stacktool].  (Thanks
for telling me, Brandon!)  You feed an AVR executable to their
analyzer, and it tells you what its maximum stack depth is, or the
maximum stack depth of each thread.  This is **awesome**.

[0]: http://lists.canonical.org/pipermail/kragen-tol/2012-January/000943.html
[stacktool]: http://www.cs.utah.edu/~regehr/stacktool/

Stack overflows
---------------

This is important in the first place because, if your stack gets big
enough, it collides with other data, such as the heap, and then
writing to the stack corrupts the other data, and writing to the other
data modifies the stack and very likely crashes your program in a
mysterious way.  This basically never happens in 32-bit operating
systems because, first, the gap between the stack and the heap is
gigabytes in size, and, second, it has other write-protected stuff in
it, so the stack collides with that stuff first, so you get a very
non-mysterious error report: a segfault with, typically, a stack
hundreds of thousands of calls deep.  Or, occasionally, just one, as
Yossi Kreinin hilariously explains in his [The C++ Sucks Series:
Petrifying Functions][1].

[1]: http://www.yosefk.com/blog/the-c-sucks-series-petrifying-functions.html

But it does happen regularly on embedded microcontrollers.

Stack analysis and interrupts
-----------------------------

A lot of their work has to do with interrupts: when an interrupt
fires, the interrupt pushes the current program counter (and, on some
architectures, flags) onto the stack.  Then, the interrupt handler
itself may push more state onto the stack, and if it re-enables
interrupts, it may become a reëntrant interrupt handler, dramatically
complicating the job of stack analysis.

This requires some explanation.  Platforms like the PDP-11 (I think)
allow a higher-priority interrupt to preempt the handler for a
lower-priority interrupt, so that the maximum handling latency for an
interrupt depends only on its handler and the higher-priority
interrupts.  Platforms like the x86 and AVR don’t have multiple
priority levels for executing interrupt handlers; they just have an
"interrupts enabled" flag.  So you have the temptation to make your
interrupt handler reëntrant by reënabling interrupts before the
interrupt handler is done, in order to reduce the latency for
higher-priority interrupts --- but the same original interrupt could
then re-occur and make your stack deeper without limit.

Stacked interrupt handler data is particularly a problem in a system
with a lot of different threads, each of which has a small stack,
since the interrupt handlers could stack up on any of the threads.
This means that the maximum interrupt handler stack depth gets added
to the maximum depth of each thread stack.  A solution to this is for
the interrupt handler to switch to the main stack before pushing
anything else on the stack, so that the cost to each thread stack is
only one word of extra maximum depth.

So that means that if you have, say, a thread that just endlessly
loops in a leaf function with three or four local variables, you could
figure that out at compile-time and **give that thread a stack space
of, say, 10 bytes**.  Which is to say that the entire run-time state
of the thread could comfortably and safely be a local variable of
another function.  Or you could have a global array of them.

Python-style generators
-----------------------

Anyway, though, I was thinking about Python generators.  Python
generators are a limited form of coroutine, which, like [Adam
Dunkels’s protothreads][2], can only yield control to another thread
directly, not from within some other function they’ve called.  So,
although their own local bindings can’t be kept on a call stack
(because some function can instantiate a generator and return it),
functions they call could use the main stack, since those functions
must return before the generator has a chance to yield control back to
another generator or the main thread.

[2]: http://www.sics.se/~adam/pt/

I’ve written before about how [generators simplify problems like Peter
Naur’s telegram problem][3], although [Dave Long demurred][4], but I
haven’t tried to give an explanation of why that is.

[3]: 
http://lists.canonical.org/pipermail/kragen-hacks/2010-September/000514.html
[4]: 
http://lists.canonical.org/pipermail/kragen-discuss/2010-September/001144.html

Consider these three Python generators:

    def each_slice(n, items):
        buf = []

        for item in items:
            buf.append(item)
            if len(buf) == n:
                yield buf
                buf = []

        if buf:
            yield buf

    def words(file, pattern='[a-zA-Z]+'):
        for line in file:
            for word in re.findall(pattern, line):
                yield word

    def ngrams(n, words):
        queue = []
        for word in words:
            queue.append(word)
            if len(queue) == n:
                yield tuple(queue)          # Lists aren’t hashable.
                queue.pop(0)

The above allows you to write expressions like the following:

    ngrams(2, words(open('corpus')))
    each_slice(2, words(open('corpus')))
    ngrams(3, (word.lower() for word in words(open('corpus'))))

Now, you could also write ngrams() or each_slice() to eagerly consume
an already-computed input array, and fill up and return an array of
their results, at a substantial cost in promptness, storage space,
locality of reference, and (if you don’t have garbage collection)
memory management complexity.  So, for efficiency, you could write
each of the expressions above as an explicit loop, with the various
stages fused together; for example, the third one becomes:

    file = open('corpus')
    queue = []
    for line in file:
        for word in re.findall('[a-zA-Z]+', line):
            word = word.lower()
            queue.append(word)
            if len(queue) == 3:
                use(tuple(queue))
                queue.pop(0)

I hope, however, that you will agree that this is considerably harder
to read and change than the one-line expression above, which is to say
that the parts of the algorithm named as `ngrams` and `words` are
mixed together and therefore not reusable.

The way that traditional APL and NumPy work is to be fully eager, as I
described before; if you write an expression like:

    (peak * sin(arange(n_samples) * (2*pi * hz / rate))).astype(Int16)

then Numeric first allocates an array for the output of `arange` and
fills it up, then multiplies each number by `(2*pi*hz/rate)` (in a new
array), then computes the sine of each number (in a new array), then
multiplies each number by `peak` (in a new array), then converts the
result to 16-bit integers (in a new array), making a total of five
arrays.  For sufficiently large values of `n_samples`, it would
certainly be more efficient to do this computation one sample at a
time, although the optimal chunk size is probably somewhere in the
middle.
(I'm obviously no expert here, but you might want to compute at least
one SIMD register’s worth of results, and probably several, before you
go haring off after function pointers and branch target buffers and
mispredicted branches and whatnot.)

You could regain that kind of separation of concerns, without the
generator or coroutine facility, by constructing explicit iterator
objects with `next`; here I use closures to avoid the syntactic tax of
Python classes, but I assume that the iteration is terminated by a
StopIteration exception, which it’s okay to simply pass along to our
caller, as with the normal Python iteration protocol:

    def words(fnext, pattern='[a-zA-Z]+'):
        buf = []
        def next():
            while not buf:
                buf.extend(re.findall(pattern, fnext()))
            return buf.pop(0)
        return next

    def ngrams(n, wnext):
        queue = []
        def next():
            while len(queue) < n:
                queue.append(wnext())
            rv = tuple(queue)
            queue.pop(0)
            return rv

        return next

I’ve done some work here to restructure the code to avoid explicit
“state” variables that encode what was previously implicit in the
program counter, but it seems to me that the code is more complicated
anyway.  Letting each generator track its own state with its program
counter, rather than having to use explicit variables, allows us to
use sequence, alternation, and iteration rather than explicit state
transition functions.

Recasting `each_slice` without generators is a little hairier:

    def each_slice(n, items):
        done = [False]    # boxed, since Python 2 has no `nonlocal`
        def next():
            if done[0]:
                raise StopIteration

            buf = []
            while len(buf) < n:
                try:
                    buf.append(items.next())
                except StopIteration:
                    done[0] = True
                    if not buf:
                        raise
                    return buf
            return buf
        return next

Python generators are pull-driven (“external iterators”), but that's
irrelevant; the benefit of coroutines applies equally to push-driven
(“internal”) iterators.  Here's `ngrams` as a push-driven iterator;
note that it is almost exactly the same as before:

    def ngrams(n, output):
        queue = []
        while True:
            queue.append(yield)
            if len(queue) == n:
                output.send(tuple(queue))
                queue.pop(0)

Python-style generators in an extended C
----------------------------------------

So Python-style generators are a very useful language feature.  Are
they compatible with being added to C, or a language with similar
sensibilities?  In particular, are they still useful without closures
and garbage collection?  Yes; here I sketch an outline of how.

Each generator routine would be a data type, like a struct or union,
of some computable and possibly large size.  The value returned from
the generator each time would be returned in the same way as a normal
function return value --- typically in registers for small return
values, or in a fixed spot on the stack or via a hidden pointer
argument for larger ones --- and similarly for values passed in to the
generator upon initial invocation or resumption.  (If you’re trying to
implement this yourself in C, the hidden pointer may not be so hidden.
The key point is that it doesn't make sense for the iterator to
allocate memory for a large return value; its client should do that.)

Perhaps it would be nice to automatically (statically or at allocation
time) initialize variables of generator type when they are created so
that they contain the proper start address for the first resumption;
but if you’re allocating them on the heap, you probably need some kind
of operation to do the initialization explicitly, and in any case,
generators that take arguments are very useful.  So perhaps the syntax
should look something like this:

    int generator range(int start, int stop, int step)
    {
      for (int ii = start; ii < stop; ii += step) yield(ii);
    }

    // ... and here we create a temporary variable on the stack.
    for (int ii: range(0, 10, 1)) printf("%d\n", ii);

    // But we can also create a variable:
    range rr;
    // Later initialize it.  There is no problem using function-call
    // syntax for this, since C can already do function-call syntax
    // with some types of values, namely functions and function
    // pointers.  It just needs a rule for how to apply it to a new
    // type.
    rr(0, 10, 1);
    // And later still use it:
    for (int ii: rr) printf("%d\n", ii);    

Since they will normally all be invoked from the same place, all of
the yield points in the generator need to take the same set of
arguments, if they can take arguments, and all that are not the first
entry point need to return the same type, as well.  (Or types, if
you’re wandering a bit further afield from C.)

When the generator finally does finish, it needs a way to tell its
caller that it’s finished and should not be resumed again.  The
main alternatives are:

- A second return value, or CPU flag.  This is easy to achieve at the
  machine-code level.

- Twiddling the return address: for example, adding the length of a
  short jump instruction to it.  This avoids a conditional jump in the
  caller, though the caller still needs an unconditional jump.

- A `longjmp()` (or equivalent) to the end of the loop, using a
  `jmp_buf` or equivalent passed as a hidden parameter.

There’s also the question of how to *write* a wider variety of
invocations than `for (foo: somegen) { ... }`.  I mean, to do
Python’s:

    try:
        x = somegen.next()
    except StopIteration:
        y()
    else:
        z(x)

you could do:

    sometype x;
    for (x: somegen) goto ok;
    y();
    if (0) {
    ok:
      z(x);
    }

but that seems rather obscure!

A reasonable, if not very C-like, syntax might be

    z(next(somegen) { y(); goto somewhere; });

That’s only reasonable because you almost always want to handle the
end of an iterator using some kind of nonlocal transfer of control,
even if only a `break;`.

### Generator objects with just the local variables ###

How big is this in-memory generator object?  One approach would be to
make the in-memory generator object contain just the local variables,
program counter, and perhaps temporary variables of the generator
routine, and never to point the stack pointer to it.  (Unless it
happens to be the top thing in the stack frame of some other function,
that is.)  In this case, the generator’s local-variable references
will need to indirect through a context pointer; this is basically the
same thing GCC does with nested functions, and if you want to make the
generator resumption look like a normal function call to the caller,
you could use the same hack GCC uses for that --- as Norm Hardy showed
me one day, GCC wedges this into C by generating a two- or
three-instruction trampoline on the stack that loads the context
pointer and then jumps to the function.

In that case, the generator code itself can execute its “yield” by
stuffing the yielded value into the appropriate place and calling a
yield routine with the address of its generator object.  The yield
routine is something like `savu`/`aretu` in V6 Unix: it sticks what
ought to be its return address into the generator object it’s passed,
and then pops it off the stack before returning --- to the callsite
that had last resumed the generator.

### Generator objects containing entire thread stacks ###

Lua has a more powerful coroutine facility: its coroutines have entire
stacks of their own, so their callees, or callees' callees, can yield
control on their behalf --- so a Lua coroutine is essentially a whole
thread, but one that only runs when you pass control to it explicitly.

Could you do that in C?  `swapcontext()` does more or less that
already, but with the usual risks of stack overflows.  To get rid of
that risk, instead of having a special “generator” kind of function,
with an icky extra indirection on its local variable accesses, you’d
have a “coroutine stack for” type-constructor that could be applied to
any stack-depth-bounded function, like the existing type-constructors
“pointer to” or “array of ints”, whose size was the
statically-computed maximum stack depth for a particular function.

Providing such types would be really hard to do with a normal C
toolchain, since its size would depend on the entire library you were
linking against --- the maximum stack depths and nonrecursivity of the
nonrecursive library functions would become part of their ABI! --- but
it would be entirely reasonable to implement for embedded targets
where the whole linked program might be 16KiB.  It does, however,
require compiling the entire transitive call tree below a function
qwertz in order to figure out how big “coroutine stack for qwertz” was
going to be, so single-pass compilers, as well as separate
compilation, would be pretty much out.

Then the “yield” routine would be more like a normal task switch, like
`swapcontext`, simply switching stacks and returning.

This approach should make it possible to run dozens, if not hundreds,
of cooperative threads, each with full-fledged stacks, within a
microcontroller with 4KiB of RAM.

Ruby’s iterator mechanism, and implementing it in a C-like language
-------------------------------------------------------------------

Ruby’s iterators are almost, but not quite, completely different from
Python iterators.

It occurred to me that Ruby’s iterator mechanism, which I think
originally came from CLU, should *also* be practical to implement in a
language for embedded programming --- in particular, it doesn’t
require garbage collection, but it is much more convenient.  I think
it supports higher-order composition of iterator transformations to
more or less the same extent as Python's mechanism, but I'm still not
quite sure.

So, what are Ruby iterators?

Ruby has several mechanisms for passing around procedures, which is to
say writing higher-order functions, which have the appearance of
having been added one by one over time.  The simplest one is
remarkably effective and remarkably limited:

    def each start, stop, step
      i = start
      while i < stop
        yield i
        i += step
      end
    end

    each 1, 10, 2 do |i|
      puts i
    end

So here we’ve extended the language with a looping construct which
takes a sort of lambda (called, in Ruby, a block) as an argument, but
*without reifying the block as a first-class value*.  Passing
arguments to the block is done with a special-purpose `yield`
construct, which in Ruby is an expression.  `yield` by itself means to
invoke the block with no arguments.  So there is, by construction, no
way to pass the block anywhere else, or do anything other than just
invoke it.  (Other Ruby mechanisms do allow you to do a wider range of
things.)

That doesn’t prevent you from doing the usual downward-funarg things
with it; you just have to eta-expand them:

    def dotimes n
      each 0, n, 1 do |i|
        yield i
      end
    end

It does, of course, prevent you from returning the block or storing it
into a data structure, but in a much more elegant way than Pascal’s
downward-funarg mechanism.  It also prevents the block from being
recursive, that is, directly or indirectly calling itself.

You can use this block construct for more than just iteration; for
example, here it provides a comparator for a sort routine:

    # (yield a, b) <= 0 whenever a is metaphorically less than or equal to b.
    def insertion_sort! array
      (0...array.length).each do |nsorted|
        # Find latest point to insert next value:
        pos, to_insert = nsorted, array[nsorted]
        (nsorted-1).downto 0 do |ii|
          if (yield array[ii], to_insert) <= 0
            break
          end
          pos = ii
        end

        # Move later items over to make space, then insert it
        (nsorted-1).downto pos do |ii|
          array[ii+1] = array[ii]
        end
        array[pos] = to_insert
      end
    end

    string_array = ["yes", "YES", "no", "Yes!", "No!"]
    insertion_sort!(string_array) do |a, b|
      a.downcase <=> b.downcase
    end

    p string_array

Traditionally, however, methods like `insertion_sort` above are still
called “iterators”, even though this is somewhat imprecise.

### Efficient compilation of iterators without a heap ###

It occurred to me that you can also *compile* this in a simpler way
than Pascal’s general downward-funarg mechanism, which GCC also
implements for C, albeit without the type-system restrictions that
make it safe in Pascal.

C’s function-pointer mechanism is very simple to implement.  A
function pointer is a single word, and can be invoked (without
arguments or caller-saves, anyway) with a single instruction:

    call *%ecx         # call the function pointed to by %ecx

Pascal’s mechanism, by contrast, requires two words --- the function
pointer and a context pointer:

    struct fp { int (*fp)(); void *cp; };

The minimal implementation of the call is therefore two instructions,
plus an additional instruction either inside or outside the procedure;
here it is shown outside:

    mov (%eax), %ecx   # load function pointer
    mov 4(%eax), %edx  # load context pointer
    call *%ecx

And then of course you have to allocate memory for this two-word
structure and initialize it.

(As I mentioned above, GCC provides this facility for C, too.)

Then, inside the nested function, some local variable references
indirect through the context pointer, and some do not.  This means
that some variable references are slower, and the compiler is more
complicated.

It occurred to me that, under specific circumstances, you can
implement Ruby-like blocks more simply.  Later I realized that this
was something I had learned and forgotten from Raphael Finkel’s book,
“Advanced Programming Language Design”, although I think I may be
explaining a technique here that isn’t in Finkel.

The key fact is that, since the block cannot be recursive, it doesn’t
need a stack frame for its parameters or local variables.  It can
simply use storage carved out of its caller’s stack frame.  This means
that it only needs one context pointer --- the usual stack frame
pointer --- not two.  And that means that variable references inside
the block can be compiled the same as those outside of it.

(None of the following applies to SPARC or similar register-window
architectures.)

### With a separate frame pointer ###

In runtimes that have a separate frame pointer and stack pointer, and
where functions store their parameters in their stack frame, this is
particularly easy to arrange.  The block is represented as a simple
code pointer, as in C.  The block is invoked by:

1. pushing the iterator’s frame pointer onto the stack;
2. pushing arguments to the block onto the stack or into registers;
3. loading the address for the block code;
4. restoring the iterator’s caller’s frame pointer (which the iterator
   saved on the stack much earlier);
5. calling the block code, pushing a return address onto the stack, or
   wherever your architecture likes to keep it.

For example:

        push %ebp               # save frame pointer
        mov -8(%ebp), %eax      # block parameter 1
        mov -16(%ebp), %ecx     # block parameter 2
        mov -20(%ebp), %edx     # code address of block
        mov 4(%ebp), %ebp       # restore caller’s frame pointer
        call *%edx              # invoke block
        pop %ebp

In the zero-argument case, if we already have the block in a register,
with no caller-saves, this reduces to:

        push %ebp               # save frame pointer
        mov 4(%ebp), %ebp       # restore caller’s frame pointer
        call *%edx              # invoke block
        pop %ebp

Upon entry to the block, the frame pointer is already set up to point
to the stack frame that the block belongs to, so access to local
variables is simply by indexing off the frame pointer.  (And the block
doesn’t need the usual function prologue/epilogue of push %ebp; mov
%esp, %ebp and pop %ebp.)  Block parameters are accessible by indexing
off the stack pointer or in registers.  The block returns control to
the iterator with a single `ret` instruction, and either the `ret` or
the caller pops the parameters, if they were pushed on the stack
instead of passed in registers.

This allows the block to call further functions with abandon; their
stack frames will be pushed on top of the stubby stack frame
containing the block’s return address.

However, nonlocal control flow from the block requires care, as do
nested blocks.  If the block executes a `goto`, `break`, or `return`
that leaves the block, the stack pointer must first be adjusted to
deallocate the iterator’s frame.  In some languages, you also need to
run cleanups such as `finally`, `unwind-protect`, `ensure`, or C++
destructors contained in the iterator.  But, absent cleanups, this can
be handled by storing the desired stack pointer in the frame itself;
it can then be restored with a single indexed load:

        mov 28(%ebp), %esp

(`return` in Ruby, like `^` in Smalltalk, exits the function
containing the block, not just the block itself.  Ruby has a `next`
which exits just the block, returning a value to the iterator.)

Nested blocks are slightly tricky because they may want to access the
parameters of their outer block; for that to work, the outer block
must store its parameters into local variables in the stack frame.

Although it's probably not worth it, this could be made simpler by
giving the iterator a pointer to store those parameters at; instead of
just passing a code address, we pass, Pascal-like, a pointer to a
structure containing the block code address and its parameters, such
as:

    struct twoargblock { void (*code)(); int param1; int param2; };

Then the sequence to invoke the block becomes something like this
instead:

        push %ebp               # save frame pointer
        mov -20(%ebp), %edx     # address of block structure
        mov -8(%ebp), %ecx      # block parameter 1
        mov %ecx, 4(%edx)
        mov -16(%ebp), %ecx     # block parameter 2
        mov %ecx, 8(%edx)
        mov 4(%ebp), %ebp       # restore caller’s frame pointer
        mov (%edx), %edx        # load code address
        call *%edx              # invoke block

This probably simplifies the compiler, but particularly since most
blocks don’t have other blocks nested inside of them, it’s probably a
net loss for performance.  Even blocks that contain nested blocks may
be better off simply storing the parameters in a slot in the stack
frame on entry:

        mov %eax, -40(%ebp)
        mov %ebx, -44(%ebp)

...because that avoids the need to allocate space for, initialize, and
fetch the field containing the block code address.

### Without a separate frame pointer ###

If you access your local variables by indexing off the stack pointer,
the situation is considerably hairier.  It would seem that the
iterator can’t simply restore its caller’s stack pointer before
invoking the block, because then if the block calls any functions,
those functions’ stack frames will overwrite the iterator’s own stack
frame, causing a crash.

But there’s a solution, as found in Finkel’s book, which can even be
implemented unreliably with no compiler support in pure C.  You just
have to locate the iterator’s stack frame far enough away from its
caller’s stack frame that it won’t get overwritten!  Finkel’s
solution, IIRC, is to stack-allocate a big unused array in a
trampoline function, and then have the iterator eventually return to
its caller by longjmp(), avoiding the trashed return address of the
trampoline function (which probably points into the block).  The
iterator also uses longjmp() to invoke the block, and the block uses
longjmp() to jump back down the stack into the iterator each time it
finishes.

If the compiler can statically bound the stack depth of the block, you
can do slightly better.  Before calling the iterator, the caller
allocates enough stack space for the deepest possible call stack that
the block might invoke, and passes both the block’s code address and
the stack pointer to use to call the block with.  Then the iterator’s
invocation of the block looks like this:

        mov %esp, %eax          # save iterator’s stack pointer
        mov %ecx, %esp          # restore block’s stack pointer
        push %eax
        call *%edx              # invoke block
        pop %esp                # restore iterator’s stack pointer

With this approach, you need only pop the single-word block return
address from the stack before a nonlocal transfer of control out of
the block.